Voicebox

Voicebox is a speech generative model built upon Meta’s non-autoregressive flow matching model, trained to infill speech given audio context and text.

Website: voicebox.metademolab.com
Is a: Product

Product attributes

Industry: Speech recognition, Artificial Intelligence (AI), Generative AI
Launch Date: June 16, 2023
Product Parent Company: Meta AI

Overview

Voicebox is a speech generative model built upon Meta’s non-autoregressive flow matching model, trained to infill speech given audio context and text. Two versions of the model are under development: an English-only version trained on 60K hours of data and a multilingual version trained on 50K hours of data covering six languages (English, French, German, Spanish, Polish, and Portuguese). Voicebox can perform a range of tasks, including speech synthesis across the six languages, removal of transient noise, content editing, transfer of audio style within and across languages, and generation of diverse speech samples.

Meta AI announced Voicebox on June 16, 2023, sharing audio samples from the model and a research paper detailing its methodology. However, citing the potential for misuse of the model, Meta AI chose not to make the code publicly available, stating:

"While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility."

The research paper accompanying the model's release, titled "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale," also details a classifier that can distinguish between authentic speech and audio generated with Voicebox, intended to help mitigate future misuse of the model.

Voicebox can generalize to speech-generation tasks that it was not specifically trained for. Previously, generative AI models for speech required task-specific training on carefully prepared data. These inputs, known as monotonic, clean data, are difficult to produce, exist only in limited quantities, and result in outputs that sound monotonous. In contrast, Voicebox can learn from raw audio and an accompanying transcription.
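
The infilling setup described in this overview (hide part of an utterance, keep the surrounding audio and the transcript as context) can be sketched in a few lines of Python. This is only an illustration, not Meta's implementation: the helper name `make_infill_example`, the use of plain floats as stand-ins for audio frames, the `mask_fraction` parameter, and the choice of a single contiguous masked span are all assumptions made for the example.

```python
import random
from typing import List, Optional, Tuple


def make_infill_example(
    frames: List[float],
    transcript: str,
    mask_fraction: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Optional[float]], List[bool], str]:
    """Hide one contiguous span of frames and return (context, mask, transcript).

    Hypothetical helper: the frame representation and mask_fraction are
    illustrative assumptions, not values taken from the Voicebox paper.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    if not 0.0 < mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be in (0, 1]")
    rng = rng or random.Random()

    n = len(frames)
    # Length of the hidden span: at least one frame, at most the whole utterance.
    span = min(n, max(1, round(n * mask_fraction)))
    start = rng.randint(0, n - span)  # randint is inclusive on both ends

    # mask[i] is True where the model would have to generate audio.
    mask = [start <= i < start + span for i in range(n)]
    # The context keeps the visible frames and blanks out the hidden ones.
    context = [None if hidden else frame for frame, hidden in zip(frames, mask)]
    return context, mask, transcript
```

In this toy setup, a model would receive `context` and `transcript` and be trained to reconstruct the frames where `mask` is true. Tasks such as noise removal or content editing can then be read as choosing which span to hide at inference time.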

Building on Meta's flow matching model means Voicebox can learn a highly non-deterministic mapping between text and speech. This lets Voicebox learn from varied speech data without the variations having to be carefully labeled, so it can train on more diverse data at a much larger scale. Voicebox was trained on recorded speech and transcripts from public-domain audiobooks. Meta states that its flow matching method improves on auto-regressive models: on zero-shot text-to-speech, Voicebox outperforms VALL-E in intelligibility (a word error rate of 1.9% vs. 5.9% for VALL-E; lower is better) and audio similarity (0.681 vs. 0.580 for VALL-E; higher is better) while being as much as 20 times faster. For cross-lingual style transfer, Voicebox outperforms YourTTS, reducing the average word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
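
For readers unfamiliar with flow matching, the training objective can be illustrated with a toy example. The sketch below follows the optimal-transport conditional path commonly used in the flow matching literature, where a noise sample `x0` is moved along a straight line toward a data sample `x1` and a model regresses the velocity of that motion. It is an assumption-laden illustration rather than Voicebox's code: the function names, the list-of-floats vectors, and the `sigma_min` default are hypothetical, and the conditioning on text and audio context described above is omitted.

```python
import random
from typing import Callable, List

Vector = List[float]


def ot_path_point(x0: Vector, x1: Vector, t: float, sigma_min: float = 1e-5) -> Vector:
    """Point at time t on the straight-line path from noise x0 to data x1."""
    scale = 1.0 - (1.0 - sigma_min) * t
    return [scale * a + t * b for a, b in zip(x0, x1)]


def ot_target_velocity(x0: Vector, x1: Vector, sigma_min: float = 1e-5) -> Vector:
    """Time derivative of ot_path_point; this is the regression target."""
    return [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]


def flow_matching_loss(
    predict_velocity: Callable[[Vector, float], Vector],
    x1: Vector,
    rng: random.Random,
    sigma_min: float = 1e-5,
) -> float:
    """One-sample Monte Carlo estimate of the conditional flow matching loss."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    xt = ot_path_point(x0, x1, t, sigma_min)
    target = ot_target_velocity(x0, x1, sigma_min)
    pred = predict_velocity(xt, t)
    # Mean squared error between predicted and target velocity.
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(x1)
```

A real system would replace `predict_velocity` with a neural network and average this loss over many utterances. At generation time, the learned velocity field is integrated from noise toward speech with an ODE solver, which is one reason a non-autoregressive model of this kind can be faster than token-by-token generation.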

Demos

Upon the release of Voicebox, Meta AI published a series of demos showing the model's capabilities, including the following:

  • Transient noise removal—re-generating noise-corrupted speech and removing unwanted sounds
  • Content editing—correcting recorded speech, including misspoken words
  • Zero-shot text-to-speech synthesis—generating speech with any audio style from input reference audio
  • Cross-lingual style transfer—transferring style across languages, enabling speakers to translate their own voice into another language
  • Diverse speech generation—sampling without conditioning on any audio, to create unique and expressive audio styles

Further Resources

  • Introducing Voicebox: The first generative AI model for speech to generalize across tasks with state-of-the-art performance. Web, June 16, 2023. https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/
  • Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu. Paper. https://dl.fbaipublicfiles.com/voicebox/paper.pdf
