Meta Platforms Inc.’s artificial intelligence research team said today it has made big progress as it strives to create more realistic AI-generated speech systems.
Its latest advances in what it calls “textless natural language processing” mean it’s now able to model expressive vocalizations, such as laughter, yawning and cries, in addition to “spontaneous chit-chat” in real time.
Meta’s work centers on so-called Generative Spoken Language Models, or GSLMs, a type of natural language processing model that makes it possible to build speech systems without using any transcribed audio data to train them.
In a blog post, Meta AI’s team explained that traditional AI systems remain quite limited in their ability to capture rich, expressive non-verbal signals in speech, such as intonations, emotional expressions, pauses, accents and rhythms, all of which can play a key role in human interactions. That’s because those systems can only learn from written text, which captures what people say but not how they say it.
Meta’s GSLMs are different because they enable natural language processing models to capture the full expressive nature of oral language. Meta said it has been training its GSLMs to exploit those expressive signals, either to build downstream applications or as a generative tool for creating language from an audio prompt.
As a result, Meta says it can now model the expressive vocalizations that are essential to understanding the context of an interaction the way a person would. Such vocalizations allow AI systems to signal their communicative intent or the sentiment behind an utterance, such as boredom, irony or irritation, Meta explained.
In addition, Meta said it’s now able to model spontaneous real-time chit-chat between two AI agents in a highly realistic way. The agents can factor in behavior such as occasional overlaps and pauses, as well as fillers such as “um” and “ah.” Meta said this is an important development because it will allow virtual agents, for example, to recognize more nuanced social cues and signals. AI systems will also be able to interpret whether nonverbal expressions suggest positive or negative feedback, Meta said.
Meta explained that its intent is to create more natural and engaging AI speech systems. For instance, it’s planning to apply textless model training techniques to build more useful downstream applications, such as apps that can answer questions about the weather, that don’t rely on resource-intensive text labels or automatic speech recognition systems.
“We believe prosody in speech can help better parse a sentence, which in turn facilitates understanding the intent and improves the performance of question answering,” Meta said.
Another potential use case is speech-to-speech translation, which might be useful for dubbing movies. Most AI dubbing systems take a roundabout route: The audio is first transcribed into text, the text is translated, and the translation is finally converted back into audio. That pipeline is complicated, and it strips out much of the expressivity of speech, including idiomatic expressions unique to oral language. Meta said its GSLMs remove the need for text-based dubbing, which could lead to far more realistic audio translations.
“Because self-supervised speech representation approaches are able to learn discrete units from raw audio, it’s now possible to remove the need for text and replace it with the pseudo text extracted from each of the target and source languages,” Meta’s AI team explained.
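To make the idea of “pseudo text” more concrete, here is a minimal Python sketch of how raw audio might be turned into a sequence of discrete units. It is an illustration only, not Meta’s implementation: Meta’s quote refers to self-supervised speech representations, whereas this toy version assumes two hand-crafted features per frame (loudness and zero-crossing rate) and a small k-means clustering step. The function names, the synthetic test signal and every parameter value are hypothetical.

```python
import math


def tone(freq, amp, n, rate=8000):
    """Synthetic sine wave standing in for recorded audio (assumption)."""
    return [amp * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


def frame_features(samples, frame_len):
    """Cut the waveform into non-overlapping frames and describe each frame
    with two toy features: RMS energy and zero-crossing rate."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        feats.append((rms, crossings / (frame_len - 1)))
    return feats


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def to_pseudo_text(feats, centroids):
    """Map each frame to its unit ID, then collapse consecutive repeats."""
    units = [nearest(f, centroids) for f in feats]
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


# A loud low tone, then silence, then a quieter high tone.
signal = tone(200, 1.0, 1600) + [0.0] * 1600 + tone(1500, 0.5, 1600)
feats = frame_features(signal, frame_len=400)
centroids = kmeans(feats, k=3)
units = to_pseudo_text(feats, centroids)
print(units)  # prints [0, 1, 2]
```

Each of the three segments ends up as its own unit ID, so the 12 frames collapse into a three-symbol sequence. A sequence like that can stand in for text when training a language or translation model, which is the role the quote assigns to the units extracted from the source and target languages.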
As a final benefit, Meta said the advancement of textless natural language processing would help to make AI more inclusive. Traditional NLP applications need to be trained with enormous text resources, which means they’re available in only a handful of languages. By training such systems from oral speech alone, textless NLP could bring the benefits of AI speech to hundreds of languages that lack a standardized writing system, including Swiss German, dialectal Arabic and many more.