Interview with Benoît Sagot
A specialist in natural language processing, Benoît Sagot has headed the ALMAnaCH research team at Inria since 2017. This polytechnician with a passion for linguistics works on the design and training of language models, on linguistic variability, and on the development of resources for French in a field dominated by English. For 2023-2024, he has been invited to hold the annual Computer Science chair at the Collège de France.
How did your interest in science, and in particular computer science, come about?
Benoît Sagot: Generally speaking, my interest in science is a family legacy. Initially I wanted to go into mathematics, and then I moved towards theoretical physics at the École Polytechnique. Throughout my studies, I also developed a keen interest in languages: English and German, thanks to a few months' internship in Austria, Ancient Greek and, later, Slovak. It was this curiosity about languages, but also about their history, that led me to discover the research discipline of natural language processing (NLP)[1], in which I decided to do my PhD at Inria. I had taken computer science courses as a student engineer in the Telecommunications Corps, but I didn't have the full computer science training that most of my colleagues had. Researchers in my field often come from a variety of backgrounds. In my team, for example, some studied science, while others began with literature, languages or linguistics before turning to computing, which they sometimes already pursued as a hobby. But this rich multidisciplinarity has tended to fade in recent years. The reason: the computing tools used to work with textual data have evolved considerably over time. It's not unique to my field, but in this age of data, we're sometimes more interested in processing the data than in the data itself.
You first became interested in formal grammars and syntactic analysis. What are these?
Syntactic analysis is a task akin to the grammatical or logical analysis of sentences as taught in school. It involves determining the part of speech of each word (verb, adjective, noun, etc.) and constructing the grammatical structure of the sentence. This was a useful approach in the field at the time, because it was difficult to analyze a sentence semantically, i.e. to understand its meaning, without first analyzing its structure. While there are several ways of representing the grammatical structure of a sentence, representing its semantic structure is a less well-defined problem. Syntactic analysis therefore long remained a flagship task of the field: sufficiently well defined, yet very difficult. It involves describing the grammar of a language, including as many rare phenomena as possible, in a formal way, and therefore one usable by a computer, and then developing automatic analysis algorithms based on such grammars. This meant generalizing algorithms originally used for compiling programming languages. I also worked on the lexical side, i.e. representing the morphological properties of words (how they conjugate, how they agree in the plural, and so on) and their syntactic properties.
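To make the task concrete, here is a minimal sketch using the spaCy library and its small French model. It illustrates what a syntactic analyzer produces (a part of speech and a grammatical relation for each word), using a present-day statistical parser rather than the grammar-based analyzers discussed above; it assumes spaCy and the fr_core_news_sm model are installed.

```python
# Illustrative sketch: part-of-speech tagging and dependency parsing of a
# French sentence with spaCy (a statistical parser, not a hand-written grammar).
import spacy

nlp = spacy.load("fr_core_news_sm")  # assumes the small French model is installed
doc = nlp("Le petit chat dort sur le canapé.")

for token in doc:
    # word, part of speech, grammatical function, and the word it depends on
    print(f"{token.text:10} {token.pos_:6} {token.dep_:10} -> {token.head.text}")
```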
Your discipline then evolved with the arrival of new approaches such as machine learning and deep learning. What impact did this have on your work?
These new approaches changed the computational and algorithmic tools we use to solve a number of problems. When you've delved into one area of computer science and trained in formal linguistics, that is, in how to describe the properties of words and languages formally, it's stimulating, if difficult, to have to become familiar with a new discipline. Machine learning[2] has indeed sometimes given the impression of obscuring the importance of linguistic expertise. If we take a treebank, i.e. a collection of sentences whose grammatical analysis has been done by hand, and feed it to a machine learning algorithm, we can obtain a syntactic analyzer that often outperforms those based on hand-written grammars. Yet linguistic expertise is still required: whereas it used to be encoded explicitly in grammar rules and in the lexicons associated with each word, in the age of statistical approaches it has shifted into annotated data such as treebanks. These annotations must be consistent from one sentence to the next, and as linguistically correct as possible. About ten years ago, neural learning and deep learning[3] entered our field, and once again we had to familiarize ourselves with a new family of algorithms. Deep neural networks enabled the emergence of so-called "end-to-end" approaches, in which intermediate steps such as syntactic analysis can be dispensed with to perform a task such as machine translation. Finally, language models, which learn directly from raw text without any annotation or information other than the sentences themselves, have further reduced the practical importance of linguistic expertise.
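For readers curious about what such annotated data looks like: treebanks are commonly distributed in the CoNLL-U format of the Universal Dependencies project. The sketch below is purely illustrative (a hand-made toy sentence, not an excerpt from an actual corpus) and simply reads off which word each word depends on.

```python
# Minimal illustration of a treebank entry in CoNLL-U format (Universal
# Dependencies): ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
# The sentence and its annotation are a toy example, not from a real corpus.
conllu = """\
1\tLe\tle\tDET\t_\t_\t2\tdet\t_\t_
2\tchat\tchat\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tdort\tdormir\tVERB\t_\t_\t0\troot\t_\t_
4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""

rows = [line.split("\t") for line in conllu.splitlines()]
forms = {row[0]: row[1] for row in rows}  # map word IDs to word forms

for row in rows:
    word, head, relation = row[1], row[6], row[7]
    governor = "ROOT" if head == "0" else forms[head]
    print(f"{word:6} --{relation}--> {governor}")
```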
In a discipline so dominated by vast quantities of English-language data, how do you go about making room for less dominant languages, and for French in particular?
English is by far the dominant language, both as an object of study and as the language in which models are developed. French has fewer resources, but since I've been in this field I've tried to help bridge this gap, by developing syntactic and semantic lexicons, treebanks and, more recently, large raw corpora with my colleagues. Still, our language is well resourced compared with many others. It is possible to train a model such as GPT-4[4] on English, because all the data required to do so is available: raw text, dialogues and human annotations in enormous quantities. For a more minority language such as Breton, however, its entire history has probably not produced enough text to train this kind of model. The problem is even greater for exclusively oral languages. What's more, when we talk about the multiplicity of languages, especially poorly resourced ones, we often forget that even within a single language there is no such thing as homogeneity. A Wikipedia article, a poem by Baudelaire and comments written on a social network by angry Internet users using creative spelling are all very different. A model trained on one type of text may have difficulty processing another. We are therefore also interested in the issue of linguistic variability. Our aim is to understand how to make models more adaptable to new types of text, and to determine what different kinds of linguistic variation have in common, so as to devise approaches that are as general as possible.
You worked on the design of a French language model[5], CamemBERT. What major difficulties did you have to overcome to achieve this?
First of all, we didn't have enough text to train a good-quality language model. So we built up a huge collection of texts covering almost 180 languages, called OSCAR, relying on Common Crawl, an American organization that regularly harvests large quantities of data from the web and makes them publicly available. We then developed CamemBERT on the French portion of this collection. The second challenge of the project was to think about the following question: to what extent is building a French language model a research task? Fundamentally, once we had our textual data and the code to run, what we were doing, however technically challenging, was at first an engineering project. What was scientifically new about it? Today, NLP and other fields of artificial intelligence are disciplines in which the boundary between engineering and research moves very fast. So we wanted to know how to give this language-model project a genuinely scientific dimension. At the time, it was generally thought that huge amounts of data were required to train a language model of a certain quality. Our work put this assumption into perspective. We trained models on, respectively, the 128 GB of French text in OSCAR, a random selection of 4 GB of that same text, and the 4 GB of text making up all the articles of the French Wikipedia. The result: the model trained on the 4 GB of OSCAR was almost as good as the one fed with all the data, while the model trained on the 4 GB of Wikipedia was significantly worse. We have good reason to believe this is because the encyclopedic language of that platform is fairly homogeneous, whereas OSCAR's data, taken from the web, is more diverse. A model can therefore be trained effectively with less data than previously imagined, provided that data reflects the variability of the language.
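CamemBERT was released publicly, and one way to try it is through the Hugging Face transformers library. The following is a minimal sketch, assuming transformers and a PyTorch backend are installed and using the publicly available camembert-base checkpoint; it is an illustration of how such a masked language model is queried, not part of the research work described above.

```python
# Minimal sketch: querying the publicly released camembert-base checkpoint
# with the Hugging Face "fill-mask" pipeline (downloads the model on first use).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="camembert-base")

# CamemBERT is a masked language model: it predicts the hidden <mask> token.
for prediction in fill_mask("Le camembert est <mask> !"):
    print(f"{prediction['token_str']:>12}  (score: {prediction['score']:.3f})")
```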
How do you navigate this moving boundary between research and engineering?
It's almost an epistemological question. Over the last few years, with ChatGPT now available to everyone, we've seen the path from research results to their deployment in publicly accessible tools accelerate. This compresses not only time, but also the relationships between actors: researchers, start-ups, large companies and public authorities increasingly sit around the same table to talk, which is very interesting. Today, training a language model on a classical architecture is an engineering problem; just a few years ago, it was a research issue. The shift of this boundary therefore raises existential questions: is training a large language model for French, such as CamemBERT, still a matter for a public research team, or rather for companies? And if an object of study that was research yesterday is engineering today, should the laboratory that studied it move with that object and become a research-and-engineering structure? Or should we accept that others take over? The question is made all the more complex by the fact that the current government is pursuing a policy of "research with impact", which aims to ensure that, as far as possible, research results are followed by engineering work leading to a usable outcome. However, as these applications rapidly reach the general public, major issues arise in training society and decision-makers to understand these technologies. Without such training, we may hear nonsense, such as the unfounded fear that systems like ChatGPT pose an existential risk to humanity. Yes, certain professions will evolve, some will disappear and others will emerge, but I think we need to support these changes and reflect on their consequences, rather than fear them.
This year, you've been invited to hold the annual Computer Science chair at the Collège de France. What are your expectations for this experience? What do you hope to get out of it?
It's a great honor, and one I must admit I never expected. It's also a great opportunity, because I'll be facing an audience I'm not used to: the general public. For me, it's a chance to try to demystify and explain these subjects. The accelerating transfer of new technologies from research to their use in society, and their acceptability, are important subjects. I think it's a good thing that the Collège, by inviting me to hold this chair, wants them to be debated, and this is exactly the right moment to do so. The release of ChatGPT and, beyond it, the multiplication of technologies for processing and generating text, speech and images raise considerable issues. It's important to understand how these systems work, and to reflect on their risks and consequences, so as to avoid unfounded concerns. After all, Plato feared writing, denouncing it as an artificial device likely to weaken the mind. The same criticism is levelled at ChatGPT today.
Interview by William Rowe-Pirra
Glossary
[1] Natural language processing (NLP): the research discipline aimed at developing tools for processing, transforming and generating text, with numerous applications such as machine translation, dialogue systems like ChatGPT, and the automatic analysis of large volumes of documents (e.g. in healthcare).
[2] Machine learning: giving machines the ability to learn (knowledge, how to perform a task, etc.) from a set of examples (training data) using dedicated algorithms.
[3] Deep learning: a type of machine learning based on the use of multi-layer neural networks.
[4] GPT-4 (Generative Pre-trained Transformer 4): a language model developed by OpenAI that powers the latest version of the ChatGPT conversational system, capable of holding conversations with human users, answering questions or translating texts.
[5] Language model: in NLP, a statistical model of the distribution of sequences of discrete symbols (letters, phonemes, words) in a natural language, typically used to predict the next words in a text.
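For the mathematically inclined reader, such a model assigns a probability to a sequence of words and, in its autoregressive form, factorizes it into successive next-word predictions (a standard textbook formulation, not specific to any particular system):

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$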