Voicebox

voicebox.metademolab.com

Is a: Product

Product attributes

Industry: Speech recognition, Artificial Intelligence (AI), Generative AI
Launch Date: June 16, 2023
Product Parent Company: Meta AI

Overview

Voicebox is a generative speech model built on Meta’s non-autoregressive flow matching model and trained to infill speech given audio context and text. Two versions of the model are under development: an English-only version trained on 60K hours of data and a multilingual version trained on 50K hours of data covering six languages (English, French, German, Spanish, Polish, and Portuguese). Voicebox can perform a range of tasks, including speech synthesis across six languages, transient noise removal, content editing, audio style transfer within and across languages, and generation of diverse speech samples.

Meta AI announced Voicebox on June 16, 2023, sharing audio samples from the model and a research paper detailing its methodology. However, because of the potential for misuse, Meta AI chose not to make the code publicly available, stating:

While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility.

The research paper accompanying the model's release, titled "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale," also details a classifier that can distinguish between authentic speech and audio generated with Voicebox, intended to help mitigate future misuse of the model.

Voicebox can generalize to various speech-generation tasks that it was not specifically trained for. Previously, generative AI models for speech required task-specific training on carefully prepared data. These inputs, known as monotonic, clean data, are difficult to produce and exist only in limited quantities, and they result in outputs that sound monotonous. In contrast, Voicebox can learn from raw audio and an accompanying transcription.

Building on Meta's flow matching model lets Voicebox learn a highly non-deterministic mapping between text and speech. As a result, it can learn from varied speech data without the variations having to be carefully labeled, which allows training on more diverse data at a much larger scale. Voicebox was trained on recorded speech and transcripts from public-domain audiobooks. Meta states that its flow matching method improves on autoregressive models: on zero-shot text-to-speech, Voicebox outperforms VALL-E in intelligibility (1.9% vs. 5.9% word error rate) and audio similarity (0.681 vs. 0.580) while being as much as 20 times faster. For cross-lingual style transfer, Voicebox outperforms YourTTS, reducing the average word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
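The intelligibility figures above are word error rates (WER): the number of word-level substitutions, deletions, and insertions needed to turn a transcript of the generated speech into the target text, divided by the number of words in the target text. The following is a minimal sketch of that metric in Python, not Meta's evaluation code. The function name word_error_rate and the whitespace tokenization are assumptions made for illustration; real evaluations typically also normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a WER of 1.9% corresponds to roughly two word errors per hundred reference words, and lower values indicate more intelligible speech.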

Demos

Upon the release of Voicebox, Meta AI provided a series of demos of the model's capabilities, including the following:

Transient noise removal—re-generating noise-corrupted speech and removing unwanted sounds
Content editing—correcting recorded speech, including misspoken words
Zero-shot text-to-speech synthesis—generating speech with any audio style from input reference audio
Cross-lingual style transfer—transferring style across languages, enabling speakers to translate their own voice into another language
Diverse speech generation—sampling without conditioning on any audio, to create unique and expressive audio styles


Further Resources

Title: Introducing Voicebox: The first generative AI model for speech to generalize across tasks with state-of-the-art performance
Link: https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/
Type: Web
Date: June 16, 2023

Title: Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale
Author: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu
Link: https://dl.fbaipublicfiles.com/voicebox/paper.pdf
Type: Paper
