Title: Characterisation of language registers using emerging sequential pattern extraction
This PhD thesis aims at automatically characterising language registers. From a linguistic point of view, our contribution is to study the potential of natural language processing techniques to extract new knowledge about the casual, neutral, and formal registers. On the computational side, we propose an unsupervised method that is generic enough to characterise any type of linguistic variation, with registers serving as one use case. The manuscript first surveys the many different definitions found in the literature, against which we position our work. Second, we present the construction of a large, linguistically motivated corpus of French tweets annotated with registers. The annotations result from a semi-supervised process based on a manually annotated seed and a classifier that generalises the annotations to all the tweets. Based on this annotated corpus, we then show that emerging sequential pattern extraction techniques make it possible to extract linguistic peculiarities of the registers under study. Finally, we detail our approach for reducing the number of extracted patterns, which makes the resulting characterisations easier to interpret.
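To give a rough idea of what an emerging sequential pattern is, here is a minimal Python sketch. It is not the method developed in the thesis: it only illustrates the usual growth-rate definition, where a pattern is "emerging" for a class when its support in that class is much higher than in the contrasting class. The function names (`subsequences`, `support`, `emerging_patterns`), the thresholds, and the toy sentences are all assumptions made up for this illustration, and the brute-force enumeration only suits very short sequences.

```python
from itertools import combinations


def subsequences(tokens, max_len):
    """Return the set of (possibly gapped) subsequences of up to max_len tokens."""
    found = set()
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(tokens)), n):
            found.add(tuple(tokens[i] for i in idx))
    return found


def support(pattern_sets, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(pattern in s for s in pattern_sets) / len(pattern_sets)


def emerging_patterns(target, background, max_len=2,
                      min_support=0.5, min_growth=2.0):
    """Patterns frequent in `target` whose support grows by at least
    `min_growth` compared with `background` (growth-rate criterion)."""
    target_sets = [subsequences(seq, max_len) for seq in target]
    background_sets = [subsequences(seq, max_len) for seq in background]
    results = {}
    for pattern in set().union(*target_sets):
        s_target = support(target_sets, pattern)
        if s_target < min_support:
            continue
        s_background = support(background_sets, pattern)
        # A pattern absent from the background has an infinite growth rate.
        growth = float("inf") if s_background == 0 else s_target / s_background
        if growth >= min_growth:
            results[pattern] = growth
    return results


# Toy, invented data: tokenised sentences labelled by register.
casual = [
    ["je", "sais", "pas"],
    ["il", "sait", "pas"],
    ["je", "veux", "pas", "ça"],
]
formal = [
    ["je", "ne", "sais", "pas"],
    ["il", "ne", "veut", "pas"],
    ["je", "ne", "veux", "pas", "cela"],
]

result = emerging_patterns(formal, casual)
for pattern, growth in sorted(result.items()):
    print(" ".join(pattern), growth)
```

On this toy data, the patterns kept for the formal class all involve the negation particle "ne", which is absent from the casual sentences. Real sequences would more likely mix words with part-of-speech tags or other linguistic features, and a dedicated sequential pattern miner would replace the exhaustive enumeration used here.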
Controllable Paraphrase Generation with Multiple Types of Constraints [paper]
Nazanin Dehghani, Hassan Hajipoor, Jonathan Chevelu, Gwénolé Lecorvé
TREMoLo-Tweets: a Multi-Label Corpus of French Tweets for Language Register Characterization
Jade Mekki, Gwénolé Lecorvé, Delphine Battistelli, and Nicolas Béchet
Style as Sentiment versus Style as Formality: the same or different?
Somayeh Jafaritazehjani, Gwénolé Lecorvé, Damien Lolive, and John D. Kelleher
We are looking for a new collaborator to work on paraphrase generation / natural language generation. Details can be found here. Feel free to apply if you are interested!
The paper “Towards the Automatic Processing of Language Registers: Semi-supervisedly Built Corpus and Classifier for French” has been accepted at the International Conference on Computational Linguistics and Intelligent Text Processing (CICLing). The authors are Gwénolé Lecorvé, Hugo Ayats, Benoît Fournier, Jade Mekki, Jonathan Chevelu, Delphine Battistelli, and Nicolas Béchet. Come talk with us if you attend the conference!
The first work on TREMoLo has just been accepted at CORIA-TALN-RJC 2018, the French NLP conference to be held in Rennes in May.
Paper titles (translated from French):
- Feature identification for register characterization. Jade Mekki, Delphine Battistelli, Gwénolé Lecorvé, Nicolas Béchet.
- Joint building of a corpus and a classifier for language registers in French. Gwénolé Lecorvé, Hugo Ayats, Benoît Fournier, Jade Mekki, Jonathan Chevelu, Delphine Battistelli, Nicolas Béchet.
The kickoff meeting took place this Monday at IRISA Lannion, in the offices of the Expression team.
Let’s get to work now! 🙂
The main objectives of the project are to study linguistic registers per se, and to develop methods for the automatic transformation of linguistic registers across texts, i.e., translating a text from one register to another. This work will rely on the extraction of register-specific linguistic patterns and their integration into an automatic paraphrase generation process. These objectives are enabled by the strong and complementary skills of the consortium members.
The project takes an exploratory research perspective, where the goal is to produce fundamental knowledge for style-specific pattern extraction and automatic natural language generation. Linguistic registers are a well-suited case study for achieving this long-term objective.