Title: Characterisation of language registers using emerging sequential pattern extraction
Abstract:
This PhD thesis aims to characterise language registers automatically. From a linguistic point of view, our contribution is to study the potential of natural language processing techniques to extract new knowledge about the casual, neutral, and formal registers. On the computational side, we propose an unsupervised method that is generic enough to characterise any type of linguistic variation, with registers serving as one use case. The manuscript first surveys the many different definitions of registers found in the literature, against which we position our work. Second, we present the construction of a large, linguistically motivated corpus of French tweets annotated with registers. The annotations result from a semi-supervised process based on a manually annotated seed and a classifier that generalises the annotations to all the tweets. Based on this annotated corpus, we then show that emerging sequential pattern extraction techniques make it possible to extract linguistic peculiarities of the registers under study. Finally, we detail our approach for reducing the number of extracted patterns, which improves the interpretability of the resulting characterisations.