Researchers at the artificial intelligence labs of Meta Platforms Inc. today announced a breakthrough with a generative AI model for speech dubbed “Voicebox,” which can accomplish a wide variety of tasks including synthesizing speech, styling, editing content and more.
According to researchers, what large language models such as OpenAI LP’s ChatGPT and diffusion models such as DALL-E did for text and images, Voicebox is now capable of doing for speech.
“Like generative systems for images and text, Voicebox creates outputs in a vast variety of styles, and it can create outputs from scratch as well as modify a sample it’s given,” the Meta AI researchers said in a blog post. “But instead of creating a picture or a passage of text, Voicebox produces high-quality audio clips.”
Voicebox is a broadly capable model that can synthesize speech across six different languages without specialized training. It can also edit content, including repairing interrupted audio, convert speaking styles and generate samples in diverse voices.
All the model needs to learn from is raw audio and its accompanying transcription. According to the researchers, other models cannot generalize across multiple tasks and must be trained specifically for each one. That sets Voicebox apart, since it can perform multiple different tasks without any task-specific training.
To make Voicebox sound more “human,” researchers built the model on Flow Matching, a method that allows the generative AI to learn from varied speech data without needing the variations to be specifically labeled. This allows the AI to perform different tasks and permits the training data to be ingested at a larger scale.
“We trained Voicebox with more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese,” the researchers said. “Voicebox is trained to predict a speech segment when given the surrounding speech and the transcript of the segment.”
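The training objective the researchers describe, predicting a hidden speech segment from the surrounding audio and the transcript, can be illustrated with a small sketch. This is not Meta's code, which has not been released. The function name `make_infilling_example`, the `span_len` parameter, the use of plain floats as stand-ins for audio frames and the choice to zero out the hidden frames are all assumptions made for illustration.

```python
import random


def make_infilling_example(frames, transcript, span_len, seed=None):
    """Hide one contiguous span of frames and return a training example.

    Hypothetical helper: `frames` is a list of floats standing in for
    audio features; a real system would use feature vectors per frame.
    """
    n = len(frames)
    if not 1 <= span_len <= n:
        raise ValueError("span_len must be between 1 and len(frames)")

    rng = random.Random(seed)
    start = rng.randint(0, n - span_len)  # inclusive on both ends
    end = start + span_len

    # True marks frames the model must predict.
    mask = [start <= i < end for i in range(n)]
    # Surrounding speech stays visible; hidden frames are zeroed (an assumption).
    context = [0.0 if hidden else frame for frame, hidden in zip(frames, mask)]

    return {
        "context": context,          # surrounding speech with the span hidden
        "mask": mask,                # which frames were hidden
        "transcript": transcript,    # text the model is also given
        "target": frames[start:end], # what the model should reconstruct
    }


frames = [0.1 * i for i in range(1, 11)]
example = make_infilling_example(frames, "hello world", span_len=3, seed=0)
print(example["mask"])
```

A model trained on examples like these would be scored only on how well it reconstructs `target`, which is what lets the same model later fill in a noisy or missing stretch of a clip.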
According to the research, using Flow Matching, the model has achieved better results than Microsoft Corp.’s VALL-E model in terms of intelligibility (a 1.9% word error rate versus 5.9% for VALL-E) and audio similarity, while being as much as 20 times faster.
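Word error rate, the intelligibility measure cited above, is conventionally computed by transcribing the generated audio with a speech recognizer and counting the word-level substitutions, deletions and insertions needed to turn that transcript into the reference text, divided by the number of reference words. The sketch below shows that standard calculation. The function name `word_error_rate` and the simple lowercase, whitespace-based tokenization are assumptions for illustration and not the evaluation setup used in the research.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1 # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("the dog barked loudly", "the cat barked"))
```

Lower is better: a rate of 1.9% means roughly two word errors per hundred reference words.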
Voicebox can use as little as two seconds of audio to match a sample’s style and use it for text-to-speech generation. That capability could support future applications such as tools for people who cannot speak, virtual assistants and voice acting in video games.
The model is also capable of infilling speech from context: When part of a clip is interrupted, it can predict which words were likely spoken and how they should sound. As a result, it can seamlessly edit audio clips in which speech is interrupted by short-duration noise, such as a dog barking.
Having been trained on numerous voices, Voicebox is also capable of simulating natural speech that is more representative of how people talk in the real world across the six languages that it is currently capable of using. This means that it can be tuned to produce a variety of different voices, tones and cadences and even modify voice audio clips to match a different style or tone.
Although the researchers called this an exciting breakthrough, they urged caution about the model’s capabilities and its potential for misuse. As a result, the Voicebox model and its code are not being made available to the public.
“While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility,” the researchers said.
This concern is not without precedent as voice simulation has existed for years and has been used for nefarious purposes before. Microsoft’s VALL-E model has similarly not been released to the public due to its capability of simulating people’s voices and thus creating a potential for misuse.
Right now, the information on Voicebox that Meta AI is sharing is in the form of the announcement, audio samples, and a research paper detailing the results it has achieved.