Massively Multilingual Speech (MMS)

Massively Multilingual Speech (MMS) is a project from Meta that is building a single multilingual speech recognition model.


GitHub: github.com/facebookresearch/fairseq/tree/main/examples/mms
Is a: Product, Technology

Product attributes

Industry: Speech recognition, Natural language processing (NLP), Artificial Intelligence (AI), Generative AI
Launch Date: May 22, 2023
Product Parent Company: Meta AI
Competitors: Whisper (OpenAI)

Technology attributes

Related Industries: Language education

Other attributes

Announcement URL: about.fb.com/news/202...chnology/
License: CC-BY-NC 4.0
Parent Organization: Meta

Overview

The Massively Multilingual Speech (MMS) project is building a single multilingual speech recognition model. It expands speech technology to support over 1,100 languages (more than ten times as many as before), provides language identification models that can identify over 4,000 languages (more than forty times more than before), and includes pre-trained models covering over 1,400 languages and text-to-speech models for over 1,100 languages. MMS aims to make it easier for people to access information and use their devices in their preferred language. Meta made MMS available for free on May 22, 2023, open-sourcing the code and model weights under the CC-BY-NC 4.0 license.

Many languages are in danger of disappearing, and existing speech recognition and generation technology struggles to capture them. Through MMS, Meta hopes to make a small contribution to preserving the world's language diversity by helping academics, researchers, and activists document and preserve languages. MMS also has a range of use cases:

  • Creating and converting books and tutorials into audiobooks
  • Preparing documentation and converting audio or videos into structured documentation
  • Audio file analysis, identifying the main focus areas
  • Generating closed captioning for videos and audio content

Previous speech recognition models covered only roughly 100 languages, a fraction of the 7,000+ known languages spoken around the world. Upon release, MMS supports speech-to-text and text-to-speech for 1,107 languages and language identification for over 4,000 languages. The map below shows the geographic origin of MMS's language coverage.

Illustration of languages covered by MMS.

Meta's results show MMS outperforms existing models. In the future, Meta plans to increase coverage, supporting more languages and taking dialects into account.
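
As noted above, MMS also includes text-to-speech models for over 1,100 languages. Below is a minimal sketch of synthesizing speech with one of them, assuming the per-language checkpoints published on Hugging Face (e.g. facebook/mms-tts-eng) and the transformers VITS API; the sentence and output filename are illustrative.

    import torch
    import scipy.io.wavfile
    from transformers import AutoTokenizer, VitsModel

    # Per-language MMS TTS checkpoint (English); other languages use their ISO code.
    model_id = "facebook/mms-tts-eng"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = VitsModel.from_pretrained(model_id)

    inputs = tokenizer("Speech technology for more than a thousand languages.", return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform          # shape: (batch, samples)

    # Write the synthesized audio at the model's sampling rate (16 kHz).
    scipy.io.wavfile.write("mms_tts_sample.wav", rate=model.config.sampling_rate, data=waveform[0].numpy())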

Model

MMS combines Meta's self-supervised learning wav2vec 2.0 model with a new dataset containing labeled data for 1,100 languages and unlabeled data for nearly 4,000 languages. Some of these languages, including the Tatuyo language, have only a few hundred speakers and have no prior speech technology coverage.
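
Below is a minimal sketch of running MMS speech recognition, assuming the checkpoints published on Hugging Face (facebook/mms-1b-all) and the transformers Wav2Vec2 CTC API; the language code and audio file are illustrative.

    import torch
    import torchaudio
    from transformers import AutoProcessor, Wav2Vec2ForCTC

    model_id = "facebook/mms-1b-all"                 # checkpoint covering 1,100+ languages
    processor = AutoProcessor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)

    # Switch the tokenizer and language adapter to the target language (ISO code, here French).
    processor.tokenizer.set_target_lang("fra")
    model.load_adapter("fra")

    # "clip.wav" is a hypothetical mono recording; MMS expects 16 kHz input.
    waveform, sample_rate = torchaudio.load("clip.wav")
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

    inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    ids = torch.argmax(logits, dim=-1)[0]
    print(processor.decode(ids))                     # greedy CTC decoding of the transcript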

To collect audio data for thousands of languages, the project used religious texts, such as the Bible, that have been translated and recorded in many different languages. MMS includes readings of the New Testament in over 1,100 languages, providing roughly thirty-two hours of data per language. Utilizing unlabeled recordings of various other Christian religious readings increased the number of languages to over 4,000. Although the recordings are more often read by male speakers and contain religious text, the resulting model performs equally well for male and female voices and does not show bias toward religious language, which Meta attributes to its connectionist temporal classification (CTC) approach.
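
For context, connectionist temporal classification trains a speech model to emit a character (or a blank symbol) for every audio frame without requiring frame-level labels. Below is a minimal, self-contained sketch of the loss computation using PyTorch's built-in CTC loss; all shapes and labels are illustrative rather than taken from MMS.

    import torch
    import torch.nn as nn

    # Illustrative dimensions: T audio frames, C output characters (index 0 = blank),
    # N utterances per batch, S characters per reference transcript.
    T, C, N, S = 50, 32, 4, 10

    log_probs = torch.randn(T, N, C).log_softmax(dim=-1)      # frame-level character scores
    targets = torch.randint(1, C, (N, S), dtype=torch.long)   # reference character labels
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), S, dtype=torch.long)

    # CTC sums over all frame-level alignments that collapse to the reference transcript.
    ctc_loss = nn.CTCLoss(blank=0)
    loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
    print(loss.item())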

The data was preprocessed to improve quality and make it usable by machine learning algorithms. An alignment model was trained on existing data in over one hundred languages, and a final cross-validation filtering step based on model accuracy removed potentially misaligned data. To enable other researchers to create new speech datasets, Meta added the alignment algorithm to PyTorch and released the alignment model. While thirty-two hours of data per language is not enough to train conventional supervised speech recognition models, MMS builds on wav2vec 2.0 to reduce the amount of labeled data needed to train useful systems. Meta trained self-supervised models on about 500,000 hours of speech data in over 1,400 languages, nearly five times more languages than prior work. The resulting models were then fine-tuned for a specific speech task, such as multilingual speech recognition or language identification.
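
The alignment tooling mentioned above is exposed through torchaudio's forced-alignment API. Below is a minimal sketch, assuming torchaudio 2.1+ and its MMS_FA pipeline; the audio file and transcript are illustrative, and the transcript is expected as romanized, lowercase words.

    import torch
    import torchaudio
    from torchaudio.pipelines import MMS_FA as bundle

    model = bundle.get_model()               # multilingual wav2vec 2.0 alignment model
    tokenizer = bundle.get_tokenizer()       # maps romanized words to token IDs
    aligner = bundle.get_aligner()           # CTC forced alignment

    # "clip.wav" is a hypothetical recording; resample to the model's 16 kHz rate.
    waveform, sample_rate = torchaudio.load("clip.wav")
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

    transcript = "speech technology for everyone".split()
    with torch.inference_mode():
        emission, _ = model(waveform)
        token_spans = aligner(emission[0], tokenizer(transcript))

    # Each word is mapped to a span of emission frames (roughly 20 ms per frame).
    for word, spans in zip(transcript, token_spans):
        print(word, spans[0].start, spans[-1].end)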

Performance

MMS performance has been evaluated on existing benchmark datasets, including FLEURS. As the number of languages increases, performance slightly decreases. Moving from 61 to 1,107 languages increases the character error rate by only about 0.4 percent but increases the language coverage by over eighteen times. In comparison to OpenAI's Whisper, Meta found that models trained using MMS data achieved half the word error rate while covering eleven times more languages.
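
For reference, word error rate is the edit distance between the reference and hypothesis word sequences divided by the number of reference words; character error rate applies the same formula to characters. A minimal, self-contained sketch (the example strings are illustrative):

    def word_error_rate(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance between the two word sequences, via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.167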


Further Resources

  • MMS - Language Coverage (Web): https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html

