MusicLM is an AI model developed by Google for generating high-fidelity music from text descriptions. It uses hierarchical sequence-to-sequence modeling for conditional music generation, producing music at 24 kHz that remains consistent over several minutes. The model can be conditioned on both text and a melody, for example transforming a whistled or hummed melody to match the style described in a text prompt. Users can specify instrument types (e.g., "electronic" or "classical") as well as the vibe, mood, or emotion they are looking for.
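The sketch below illustrates this kind of conditional generation, where a text prompt and an optional hummed or whistled melody jointly drive the output waveform. The class, method names, and parameters are hypothetical illustrations of the idea, not Google's actual API.

```python
from typing import Optional

import numpy as np

SAMPLE_RATE = 24_000  # MusicLM outputs audio at 24 kHz


class MusicGenerator:
    """Hypothetical wrapper illustrating text- and melody-conditioned generation."""

    def generate(self, text: str, melody: Optional[np.ndarray] = None,
                 duration_s: float = 30.0) -> np.ndarray:
        # Conceptually: embed the text prompt (and, if given, the melody),
        # autoregressively generate discrete audio tokens conditioned on them,
        # then decode the tokens back to a 24 kHz waveform.
        num_samples = int(duration_s * SAMPLE_RATE)
        return np.zeros(num_samples, dtype=np.float32)  # placeholder waveform


model = MusicGenerator()
hummed = np.zeros(10 * SAMPLE_RATE, dtype=np.float32)  # stand-in for a hummed melody clip
audio = model.generate("soothing classical piano with a soft electronic beat",
                       melody=hummed, duration_s=30.0)
```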
MusicLM was first previewed in a January 2023 paper titled "MusicLM: Generating Music From Text." At the time, Google stated it had no immediate plans to release the model, with the paper citing ethical challenges, in particular the risk of incorporating copyrighted material from the training data into generated music. Google said it worked with musicians and hosted workshops to determine how MusicLM could support the creative process. On May 10, 2023, Google released MusicLM, allowing users to sign up and test the model in AI Test Kitchen on the web, Android, or iOS. To help improve the model, the app generates two versions of each track and asks the user to choose the better one. The initial version of MusicLM in AI Test Kitchen does not generate music with specific artists or vocals.
Generating high-quality, coherent audio presents significant challenges, including the scarcity of paired audio-text data; by comparison, the image domain has massive datasets of image-text pairs. It is also significantly harder to describe audio in a short text caption: unambiguously capturing the salient characteristics of an acoustic scene (e.g., the sounds heard in a train station or a forest) or of music (e.g., the melody, the rhythm, the timbre of the vocals, and the many instruments used in the accompaniment) in a few words is difficult. In addition, audio is structured along a temporal dimension, which makes a sequence-wide caption a much weaker level of annotation than an image caption.
Previous audio generation models based on sequence-wide, high-level captions were limited in scope, typically producing simple and short acoustic scenes; they could not turn a single text caption into an extended audio sequence with long-term structure and multiple stems. Google's previous audio generation model, AudioLM, released in 2022, casts audio synthesis as a language modeling task in a discrete representation space. By leveraging a hierarchy of coarse-to-fine discrete audio units (tokens), AudioLM achieves both high fidelity and long-term coherence over dozens of seconds. AudioLM also generates realistic audio from audio-only corpora, such as speech or piano music.
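The sketch below illustrates the coarse-to-fine idea: a chain of autoregressive stages, each producing a finer-grained token sequence conditioned on the output of the previous stage (roughly, high-level semantic tokens, then coarse acoustic tokens, then fine acoustic tokens). The stage interfaces and toy token sequences are illustrative assumptions, not AudioLM's actual implementation.

```python
from typing import Callable, List

# Each stage is an autoregressive model mapping a conditioning token sequence
# to the next, finer-grained token sequence (toy stand-ins here; the real
# stages are Transformer decoders).
Stage = Callable[[List[int]], List[int]]


def hierarchical_generate(stages: List[Stage], conditioning: List[int]) -> List[int]:
    """Run the coarse-to-fine stages in order, feeding each output to the next stage."""
    tokens = conditioning
    for stage in stages:
        tokens = stage(tokens)  # e.g. semantic -> coarse acoustic -> fine acoustic
    return tokens  # finest-level tokens, ready to be decoded into a waveform


semantic_stage: Stage = lambda cond: cond + [1] * 10       # high-level structure
coarse_acoustic_stage: Stage = lambda cond: cond + [2] * 20
fine_acoustic_stage: Stage = lambda cond: cond + [3] * 40  # fine acoustic detail

fine_tokens = hierarchical_generate(
    [semantic_stage, coarse_acoustic_stage, fine_acoustic_stage],
    conditioning=[0, 0, 0],
)
```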
MusicLM leverages AudioLM's multi-stage autoregressive modeling as its generative component while extending it to incorporate text conditioning. To overcome the scarcity of paired audio-text data, Google researchers relied on MuLan, a joint music-text model trained to project music and its corresponding text descriptions to nearby representations in a shared embedding space. This shared embedding space eliminates the need for captions during training and allows training on massive audio-only corpora; MusicLM uses MuLan embeddings computed from the audio as conditioning during training. Trained on a large dataset of unlabeled music, MusicLM can generate long, coherent music at 24 kHz from text descriptions of significant complexity, such as "enchanting jazz song with a memorable saxophone solo and a solo singer" or "Berlin 90s techno with a low bass and strong kick." To help develop future audio generation models, Google publicly released MusicCaps, a music caption dataset of 5,500 examples prepared by musicians.
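The sketch below shows the conditioning trick this enables: because MuLan maps music and text into the same embedding space, the token model can be trained on conditioning embeddings computed from the audio alone, and then be driven by text embeddings at inference time. The embedding functions and model class here are placeholders standing in for MuLan and the autoregressive stages, not real implementations.

```python
import numpy as np

EMBED_DIM = 128  # assumed size of the shared MuLan music-text embedding space
rng = np.random.default_rng(0)


def mulan_audio_embed(audio: np.ndarray) -> np.ndarray:
    """Placeholder: project an audio clip into the joint embedding space."""
    return rng.standard_normal(EMBED_DIM)


def mulan_text_embed(text: str) -> np.ndarray:
    """Placeholder: project a text description into the same embedding space."""
    return rng.standard_normal(EMBED_DIM)


class TokenModel:
    """Stand-in for the autoregressive audio-token generator."""

    def train_step(self, audio: np.ndarray, conditioning: np.ndarray) -> None:
        pass  # learn to predict the audio's tokens given the conditioning embedding

    def sample(self, conditioning: np.ndarray) -> np.ndarray:
        return np.zeros(24_000, dtype=np.float32)  # placeholder 1 s of 24 kHz audio


model = TokenModel()

# Training: no captions needed; the conditioning signal is computed from the audio itself.
audio_clip = rng.standard_normal(24_000 * 10)
model.train_step(audio_clip, mulan_audio_embed(audio_clip))

# Inference: a text embedding stands in for the audio embedding,
# because MuLan places both in the same space.
waveform = model.sample(mulan_text_embed("Berlin 90s techno with a low bass and strong kick"))
```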