Google’s New MusicLM Shakes Up the World of Music Generation with High-Fidelity Audio and Unmatched Accuracy!


Estimated reading time: 6 minutes

Imagine you are lying on your bed, tired and lonely, and you feel like listening to soothing music. But you are too lethargic to search or scroll through your playlist. This is where MusicLM can come in handy.

What if you could prompt Google to play music based purely on a text command from your wild imagination?

Sounds uncanny, right?

Yes! Now, this is possible.


Google’s MusicLM is a new AI system in the music world that generates high-fidelity audio with unmatched accuracy, using a hierarchical sequence-to-sequence approach. It outperforms previous systems in audio quality and faithfulness to the text description, and can be conditioned on both text and melody.

MusicLM uses three models for audio representations and relies on MuLan, a joint music-text model, to overcome limited paired data. The system trains on large audio-only corpora without captions, and the results from experiments are promising. Explore MusicLM on GitHub for more information.

Google's MusicLM
Photo by Dark Rider on Unsplash

Here is a groundbreaking new artificial intelligence (AI) system, the ChatGPT of the music world: MusicLM, by Google.

Their research paper makes clear that Google’s MusicLM uses a hierarchical sequence-to-sequence approach for generating music.

It produces high-quality audio at 24 kHz that stays consistent for several minutes. 

Results from experiments indicate that MusicLM surpasses previous systems in both audio quality and adherence to the given text description.

Additionally, MusicLM can be conditioned on both text and melody, transforming even hummed or whistled melodies into the style specified in the text description.

MusicCaps, a public dataset containing 5.5k pairs of music and text, complete with expert-written rich text descriptions, has been released to advance research.

Here is the GitHub link for you to explore: MusicLM Github examples.

You could generate the main soundtrack of an arcade game, a fusion of reggaeton, a catchy electric guitar riff, or otherworldly sounds.

So, try these prompts just as you would use them in DALL·E or ChatGPT.

  • Electronic dance music
  • Calming violin melody
  • Upbeat music
  • Drum rolls
  • Accordion death metal demo
  • Melodic Techno
  • Meditation song
  • 90s chorus
Image Source: Google Blog

Sample text prompts to use on MusicLM

Try some of these long-tail prompts. Feel free to get creative.

  1. Imagine a peaceful forest on a sunny day
  2. Visualize a calm ocean at sunset
  3. Think of a gently flowing river in a meadow
  4. Picture a soothing waterfall in a mountain range
  5. Envision a tranquil lake surrounded by trees and wildlife
  6. Create an energetic pop track with a strong beat and catchy melody.
  7. Compose a classical piano piece inspired by Beethoven’s Moonlight Sonata.
  8. Produce a smooth jazz track with a saxophone solo.
  9. Write a modern hip-hop beat with a bass-heavy sound.
  10. Craft a nostalgic rock ballad with electric guitar and vocals.
  11. Generate an upbeat electronic dance music track with synthesizer beats.
  12. Create a soulful R&B track with a smooth, silky vocal performance.
  13. Write a hauntingly beautiful instrumental piece for strings and woodwinds.
  14. Make an alternative rock track with an edgy, angsty vibe.
  15. Compose a folk song with acoustic guitar, harmonica, and folksy vocals.
  16. Generate an ambient electronic soundscape with ethereal synth pads.
  17. Write a classical symphony with lush orchestrations and grand crescendos.
  18. Create a Latin-infused salsa track with horns, percussion, and lively rhythm.
  19. Produce a hip-hop instrumental beat with heavy bass and rhythmic drums.
  20. Compose a country music track with twangy steel guitar and storytelling lyrics.

The idea of MusicLM

MusicLM incorporates the multi-stage autoregressive modeling of AudioLM as its generative component, extending it with text conditioning (conceptually similar to how GPT is conditioned on a prompt).

Moreover, to overcome the issue of limited paired data, Google researchers say, MusicLM relies on MuLan, a joint music-text embedding model that maps music and its corresponding text description to closely related representations in a shared embedding space.

The shared embedding space enables training on large audio-only corpora without needing captions during training. During training, MusicLM uses MuLan embeddings computed from audio as conditioning, while during inference, it uses MuLan embeddings from the text input.
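That train/inference asymmetry can be sketched roughly as follows. This is a minimal illustration only: the two "tower" functions, the embedding size, and their implementations are hypothetical stand-ins, not MuLan's real API. The key point is that both towers map into the same space, so the audio tower can supply conditioning during training while the text tower supplies it at inference time.

```python
import numpy as np

EMBED_DIM = 128  # hypothetical size of the shared MuLan embedding space

def mulan_audio_embedding(audio: np.ndarray) -> np.ndarray:
    """Stand-in for MuLan's audio tower: waveform -> shared embedding."""
    rng = np.random.default_rng(abs(hash(audio.tobytes())) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def mulan_text_embedding(text: str) -> np.ndarray:
    """Stand-in for MuLan's text tower: caption -> the SAME space."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def conditioning(audio=None, text=None) -> np.ndarray:
    """Training uses the audio tower (so no captions are needed);
    inference swaps in the text tower over the same space."""
    if audio is not None:              # training: audio-only corpus
        return mulan_audio_embedding(audio)
    return mulan_text_embedding(text)  # inference: text prompt

train_cond = conditioning(audio=np.zeros(24000))        # 1 s of audio
infer_cond = conditioning(text="calming violin melody")
assert train_cond.shape == infer_cond.shape == (EMBED_DIM,)
```

Because both vectors live in one space, the downstream generator never needs to know whether its conditioning came from audio or text.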

Models used for audio representations

MusicLM uses three models for audio representations.

Image Source: MusicLM Research Paper


The SoundStream model handles 24 kHz monophonic audio with a stride of 480 samples, yielding embeddings at 50 Hz. During training, these embeddings are quantized with a residual vector quantizer (RVQ) using 12 quantizers.
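Those numbers fit together as simple arithmetic, which this back-of-the-envelope check makes explicit (it is only a consistency check on the rates described above, not the model itself):

```python
SAMPLE_RATE = 24_000   # Hz, monophonic audio
STRIDE = 480           # audio samples per SoundStream embedding
NUM_QUANTIZERS = 12    # RVQ levels per embedding

embedding_rate = SAMPLE_RATE // STRIDE        # embeddings per second
tokens_per_second = embedding_rate * NUM_QUANTIZERS

print(embedding_rate)     # 50, i.e. 50 Hz embeddings
print(tokens_per_second)  # 600 RVQ tokens per second of audio
```

So every second of audio becomes 50 embeddings, and each embedding is refined by 12 residual quantizers, giving 600 discrete acoustic tokens per second for the generator to model.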


Like AudioLM, MusicLM uses a w2v-BERT model with 600 million parameters as its masked-language-modeling (MLM) module. After pretraining and freezing the model, embeddings are extracted from its 7th layer and quantized using the centroids of a k-means clustering learned over those embeddings.
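The quantization step described above (extract an intermediate layer, then snap each frame to its nearest learned k-means centroid) can be sketched like this. The random arrays, the 16-dim embedding size, and the 1024-centroid codebook are illustrative assumptions standing in for the real w2v-BERT outputs and learned codebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frames from the frozen w2v-BERT's 7th layer:
# shape (num_frames, embedding_dim)
embeddings = rng.standard_normal((200, 16))

# Stand-in for centroids learned by k-means over a large corpus:
# shape (num_centroids, embedding_dim)
centroids = rng.standard_normal((1024, 16))

# Quantize: each frame becomes the index of its nearest centroid.
dists = np.linalg.norm(
    embeddings[:, None, :] - centroids[None, :, :], axis=-1
)
semantic_tokens = dists.argmin(axis=1)  # shape (200,), ints in [0, 1024)

assert semantic_tokens.shape == (200,)
assert semantic_tokens.max() < 1024
```

The result is a sequence of discrete "semantic" token IDs, one per frame, which is what the autoregressive stages actually model.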


MusicLM takes the representation of an audio sequence from MuLan’s audio-embedding network. This is a continuous signal that can be used as a control signal for Transformer-based models. However, the team suggests that converting the MuLan embeddings into a more uniform representation of discrete tokens could allow finer control in future work.
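One common way a continuous embedding conditions a Transformer is to project it into the model’s width and prepend it as a prefix "token" that every later position can attend to. This is a hypothetical sketch of that general idea under made-up dimensions, not MusicLM’s actual conditioning code:

```python
import numpy as np

rng = np.random.default_rng(1)

D_MODEL = 64     # hypothetical Transformer width
MULAN_DIM = 128  # hypothetical MuLan embedding size

# Learned projection from MuLan space into the model's width.
W_proj = rng.standard_normal((MULAN_DIM, D_MODEL)) * 0.02

mulan_embedding = rng.standard_normal(MULAN_DIM)       # continuous control signal
token_embeddings = rng.standard_normal((10, D_MODEL))  # 10 audio-token embeddings

# Prepend the projected conditioning vector as a prefix position.
prefix = mulan_embedding @ W_proj                 # shape (D_MODEL,)
sequence = np.vstack([prefix, token_embeddings])  # shape (11, D_MODEL)

assert sequence.shape == (11, D_MODEL)
```

Replacing that single continuous prefix with a short run of discrete tokens is the kind of "more uniform representation" the paragraph above alludes to.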

The risks with MusicLM

Long, story-like prompts become rich captions for the AI model in the process of conditional music generation.

The 30-second song snippets generated by the text-to-music method take you into the uncharted territory of undistorted, quality music. Later, you could add your own voice as input audio to better match the vocal quality.

Of course, these music generators face fair-use and ethical challenges. But careful curation of training data and future research in this area should help human experts use such models to generate hours of music.

Check out the research paper here to understand this model in detail.


While music generation falls under the umbrella of artificial intelligence (AI), several different music-generation methods exist.

In this blog, we saw two of the more popular music-generation models: AudioLM and Google’s MusicLM.

Both models use residual vector quantization (RVQ) to generate audio snippets that can be used for training other AI models or as fair-use sound files.

However, there are risks associated with these tools, such as distorted audio output or unlicensed data finding its way into training.

Hoomale is home to articles about how companies work, how young people think and act, and what work and technology will look like in the future. Each section has engaging articles that will make you think and teach you more about these subjects.


Disclaimer: Some of the links in this post may be affiliate links, meaning if you click on the link and make a purchase, we may receive a commission at no additional cost to you. It is important to note that we only recommend products and services that we have personally used and believe to be of high quality. Thank you for your support.


