AudioCraft

AudioCraft is an open-source PyTorch library for audio processing and generation with deep learning, developed by Meta AI.


Contents

  • Overview
  • MusicGen
  • AudioGen
  • EnCodec
  • Further Resources
audiocraft.metademolab.com

Is a: AI Project, Product

Product attributes

Industry: Generative AI, AI music generation, Sound effect
Competitors: Stable Audio

AI Project attributes

Announcement URL: about.fb.com/news/202...nd-audio/
Overview

AudioCraft is an open-source PyTorch library for audio processing and generation with deep learning, developed by Meta AI. AudioCraft offers users a range of generative audio capabilities (music, sound effects, and compression after training on raw audio signals) in a single code base. It consists of three models:

[Diagram demonstrating how AudioCraft works.]

  • MusicGen—text-to-music model.
  • AudioGen—text-to-sound model.
  • EnCodec—neural audio codec.

Both MusicGen and AudioGen consist of a single autoregressive language model (LM) operating over streams of compressed discrete audio representations (tokens). Meta AI introduced an approach that leverages the internal structure of the parallel token streams, showing that a token interleaving pattern can efficiently model audio sequences while also capturing long-term dependencies in the audio. Both models rely on EnCodec to learn discrete audio tokens from raw waveforms: the codec maps an audio signal to one or several parallel streams of discrete tokens, a single autoregressive language model then models those token streams, and the generated tokens are fed to the EnCodec decoder to map them back to the audio space, producing an output waveform. Different types of conditioning models can control the generation, including a pretrained text encoder for text-to-audio applications.
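
To make the interleaving idea concrete, here is an illustrative sketch (not AudioCraft's actual implementation) of a "delay" pattern: each of the K parallel codebook streams is offset by one additional step, so a single autoregressive LM can emit one token per stream at every step and cover all K streams in T + K - 1 steps instead of flattening them into K * T steps:

```python
# Conceptual sketch of a "delay" token-interleaving pattern over K parallel
# codebook streams. The pad token and shapes are illustrative assumptions.
import torch

def delay_interleave(codes: torch.Tensor, pad_token: int = -1) -> torch.Tensor:
    """codes: [K, T] discrete tokens from K parallel codebooks."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # stream k is delayed by k steps
    return out

codes = torch.arange(12).reshape(4, 3)  # 4 codebooks, 3 frames
print(delay_interleave(codes))  # each row shifted one step further right
```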

AudioCraft was released on August 2, 2023. Meta AI chose to open-source the AudioCraft models, allowing users to train their own models on their own datasets. The AudioCraft code is released under the MIT license, and the model weights are released under the CC-BY-NC 4.0 license. Meta has released demos of the models with samples of audio generated by both the text-to-sound and text-to-music models. The company aims for the AudioCraft models to be used as tools for musicians and sound designers, helping users brainstorm new ideas or iterate on their existing compositions in new ways. Meta has also suggested MusicGen could become a new type of instrument, much as synthesizers were when they were first adopted.

MusicGen

The MusicGen model was first described in a paper released in June 2023 titled "Simple and Controllable Music Generation." The model was developed by the FAIR team at Meta AI and trained between April 2023 and May 2023. The training dataset consisted of roughly 400,000 recordings, along with text descriptions and metadata, amounting to 20,000 hours of music owned by Meta or licensed from the following sources: the Meta Music Initiative Sound Collection, the Shutterstock music collection, and the Pond5 music collection.

MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. MusicGen is available in three sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music generation and melody-guided music generation). The model was evaluated using standard music benchmarks, including those below:

  • Fréchet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish); see the sketch after this list
  • Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST)
  • CLAP Score between audio embedding and text embedding extracted from a pre-trained CLAP model
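
Fréchet Audio Distance reduces to the standard Fréchet distance between two Gaussians fitted to the embedding sets of reference and generated audio. A generic sketch follows; the embedding extraction step is omitted, and the function and variable names are illustrative, not taken from the AudioCraft codebase:

```python
# Fréchet distance between Gaussians fitted to two embedding sets
# (e.g. VGGish features of reference vs. generated audio).
import numpy as np
from scipy import linalg

def frechet_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """Each input is [n_samples, dim] of classifier embeddings."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can add tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```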

Additional qualitative studies with human participants were used to evaluate the performance of the model based on the following criteria:

  • Overall quality of the music samples
  • Relevance of the audio to the provided text input
  • Adherence to the melody for melody-guided music generation
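
The pretrained checkpoints are distributed through the audiocraft Python package. A minimal text-to-music script, assuming the package and model weights are available, might look like the following (checkpoint name and method names per the public facebookresearch/audiocraft repo; the prompt and output paths are illustrative):

```python
# Minimal sketch of text-to-music generation with MusicGen.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')  # 300M text-to-music variant
model.set_generation_params(duration=8)  # generate 8 seconds of audio

wavs = model.generate(['80s synth-pop with punchy drums'])  # [batch, channels, samples]
for i, wav in enumerate(wavs):
    # Writes a loudness-normalized WAV at the model's native sample rate.
    audio_write(f'musicgen_sample_{i}', wav.cpu(), model.sample_rate, strategy='loudness')
```
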
AudioGen

AudioGen was also developed by the FAIR team at Meta AI. A paper describing version one of the model was released in September 2022, titled "AudioGen: Textually Guided Audio Generation."

Version two of AudioGen, released as part of AudioCraft, was trained between July 2023 and August 2023 on a range of public data sources, including the following:

  • A subset of AudioSet
  • BBC sound effects
  • AudioCaps
  • Clotho v2
  • VGG-Sound
  • FSD50K
  • Free To Use Sounds
  • Sonniss Game Effects
  • WeSoundEffects
  • Paramount Motion - Odeon Cinematic Sound Effects

AudioGen consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for audio modeling. Version 2 improved on version 1 by training on 10-second samples rather than 5-second samples, using an EnCodec model retrained on environmental sound data, and dropping audio mixing augmentations. Version 2 has 1.5 billion parameters. AudioGen was evaluated using:

  • Fréchet Audio Distance and
  • Kullback-Leibler Divergence.

Again, qualitative studies with human participants were also undertaken.
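
Text-to-sound generation mirrors the MusicGen API above; a minimal sketch, assuming the audiocraft package is installed (checkpoint name per the public repo, prompts illustrative):

```python
# Minimal sketch of text-to-sound generation with AudioGen.
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')  # 1.5B-parameter model
model.set_generation_params(duration=5)  # seconds of audio per prompt

wavs = model.generate(['dog barking in the distance', 'sirens of an emergency vehicle'])
for i, wav in enumerate(wavs):
    audio_write(f'audiogen_sample_{i}', wav.cpu(), model.sample_rate, strategy='loudness')
```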

EnCodec

EnCodec was first released by Meta AI on October 25, 2022. The model was described in a paper titled "High Fidelity Neural Audio Compression." EnCodec consists of three parts:

  1. The encoder—takes uncompressed data and transforms it into a higher-dimensional, lower-frame-rate representation.
  2. The quantizer—compresses this representation to the targeted size. The quantizer is trained to give the desired size (or set of sizes) while retaining the most important information to rebuild the original signal. This compressed representation is stored on disk or sent through the network.
  3. The decoder—turns the compressed signal back into a waveform that is as similar as possible to the original. Discriminators are used to improve the perceptual quality of the generated samples by trying to differentiate between real samples and reconstructed samples.
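
A compression round trip through these three parts can be sketched with the standalone encodec package (API per the facebookresearch/encodec repo; the input file and bandwidth setting are illustrative):

```python
# Minimal EnCodec encode/decode round trip.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # target bitrate in kbps; fixes the codebook count

wav, sr = torchaudio.load('input.wav')
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)  # encoder + quantizer
    codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)  # [B, n_q, T] tokens
    reconstructed = model.decode(encoded_frames)  # decoder: tokens back to a waveform
```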


Further Resources

  • "AudioGen: Textually Guided Audio Generation" by Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi. https://arxiv.org/abs/2209.15352 (September 30, 2022)
  • "High Fidelity Neural Audio Compression" by Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. https://arxiv.org/abs/2210.13438 (October 24, 2022)
  • "Simple and Controllable Music Generation" by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. https://arxiv.org/abs/2306.05284 (June 8, 2023)
  • "Using AI to compress audio files for quick and easy sharing" (web). https://ai.meta.com/blog/ai-powered-audio-compression-technique/ (October 25, 2022)
