Abstract
The oral modality is the most natural channel for linguistic interaction, yet current natural language processing (NLP) technologies are built mainly on the written word, requiring large quantities of text to train language models. Even voice assistants and speech translation systems use text as an intermediary, which is inefficient and restricts the technology to languages with substantial textual resources. Moreover, it neglects characteristics of speech such as rhythm and intonation. Yet children learn their mother tongue(s) long before they learn to read or write.
In this presentation, we will discuss recent advances in learning audio representations that pave the way for NLP applications operating directly on speech, without any text. These models can capture the nuances of spoken language, including dialogue. We will also discuss the technical challenges that remain to be overcome before such learning can approach that of a human infant.