How to Build a Music Generator with MusicGen: Text-to-Music in Python
Audio AI · 9 min read


MusicGen from Meta generates original music from text descriptions — 'lo-fi hip hop 120bpm with rain sounds' becomes an actual audio file. Full setup guide, model comparison, and Python integration code.

By NEPA AI
NEPA AI · Building autonomous systems for creators and businesses
Tags: MusicGen, text to music, Meta AI, audiocraft, music generation, python, audio AI

Setting Up MusicGen

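MusicGen is Meta's open-source text-to-music model, shipped in the audiocraft library. There are no per-track fees or royalties: you run it locally on your own GPU and drive it from Python scripts to generate custom audio tracks. The released model weights carry their own license terms, so check them before commercial use.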

How It Works

MusicGen uses a T5 text encoder to turn the prompt into embeddings. An autoregressive transformer, conditioned on those embeddings, then generates discrete audio tokens, and EnCodec's decoder turns the tokens into a 32 kHz waveform (mono or stereo, depending on the checkpoint) that you can save as a .wav file. Unlike diffusion models, MusicGen builds its output sequentially: each new token is predicted from the tokens generated before it.
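To make the sequential idea concrete, here is a minimal toy sketch of autoregressive sampling with top-k filtering and a temperature. It is not MusicGen's code: `toy_next_token_logits` is a hypothetical stand-in for the transformer, the vocabulary is tiny, and the real model predicts several EnCodec token streams rather than the single stream used here.

```python
import math
import random

def sample_top_k(logits, k, temperature, rng):
    """Sample one token id from the k highest-scoring logits."""
    # Top-k filtering: keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights over the surviving tokens.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

def toy_next_token_logits(history, vocab_size):
    """Hypothetical stand-in for the transformer: favors tokens close to the last one."""
    last = history[-1] if history else 0
    return [-abs(token - last) for token in range(vocab_size)]

def generate_tokens(steps, vocab_size=16, k=4, temperature=1.0, seed=0):
    """Autoregressive loop: every new token is sampled given all previous tokens."""
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        logits = toy_next_token_logits(history, vocab_size)
        history.append(sample_top_k(logits, k, temperature, rng))
    return history

tokens = generate_tokens(steps=10)
print(tokens)
```

The `k` and `temperature` arguments play the same role as the `top_k` and `temperature` generation parameters: a smaller `k` or lower temperature makes the output more predictable, while larger values make it more varied.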

Model Sizes

Meta offers three model sizes plus a melody-conditioned variant:

| Model | Params | VRAM | Quality | Time to generate a 15 s clip (RTX 3090) |
|---|---|---|---|---|
| small | 300M | 4 GB | Good for demos | ~2 s |
| medium | 1.5B | 8 GB | Production quality | ~8 s |
| large | 3.3B | 16 GB | Best quality | ~18 s |
| melody | 1.5B | 8 GB | Melody-conditioned | ~8 s |

The melody variant takes both text and a reference audio clip, and generates music that follows the melody of the reference while matching the text description.

Installation

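python -m venv musicgen-env
source musicgen-env/bin/activate  # or .\musicgen-env\Scripts\activate on Windows
# Install PyTorch (CUDA 12.1 build) before audiocraft. audiocraft pins its torch version,
# so if pip reports a conflict, install the version it asks for.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install audiocraft
# Confirm that PyTorch can see the GPU
python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device')"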

Model weights download automatically from the Hugging Face Hub the first time you call MusicGen.get_pretrained.

Basic Generation

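import torchaudio  # used later to load the reference track for melody conditioning
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Downloads the checkpoint on first use
model = MusicGen.get_pretrained('facebook/musicgen-medium')

model.set_generation_params(
    duration=30,      # clip length in seconds
    temperature=1.0,  # sampling temperature; higher values give more varied output
    top_k=250,        # sample from the 250 most likely tokens at each step
    top_p=0.0,        # 0.0 disables nucleus sampling, so top_k is used
    cfg_coef=3.0      # classifier-free guidance strength; higher follows the text more closely
)

descriptions = [
    "lo-fi hip hop beat, 85 bpm, vinyl crackle, mellow, background music",
    "cinematic orchestral swell, strings and brass, dramatic, no vocals",
    "upbeat electronic dance music, 128 bpm, energetic, synthesizers"
]

# The three prompts are generated together as one batch: [batch, channels, samples]
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    # audio_write appends the .wav extension itself
    audio_write(f'output_{idx}', one_wav.cpu(), model.sample_rate)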

Melody-Conditioned Generation

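import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=20)

# Load the reference track as a [channels, samples] tensor plus its sample rate
melody_waveform, sr = torchaudio.load("reference_track.mp3")
# Add a batch dimension: [1, channels, samples], one reference per description
melody_waveform = melody_waveform.unsqueeze(0)

# MusicGen does not produce realistic vocals, so describe instruments and mood instead
descriptions = ["happy indie folk, acoustic guitar, warm, uplifting"]

wav = model.generate_with_chroma(
    descriptions=descriptions,
    melody_wavs=melody_waveform,
    melody_sample_rate=sr
)

audio_write('output_melody_conditioned', wav[0].cpu(), model.sample_rate)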

Batch Production

import json
from pathlib import Path
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def batch_generate(prompts_file: str, output_dir: str, model_size: str = "medium"):
    # prompts_file is a JSON list of objects with "id" and "prompt" keys
    with open(prompts_file) as f:
        prompts = json.load(f)

    # Load the model once and reuse it for every prompt
    model = MusicGen.get_pretrained(f'facebook/musicgen-{model_size}')
    model.set_generation_params(duration=30, temperature=1.0)

    for item in prompts:
        description = item['prompt']
        wav = model.generate([description])

        # audio_write appends ".wav" itself, so pass the path without an extension
        stem = Path(output_dir) / str(item['id'])
        audio_write(str(stem), wav[0].cpu(), model.sample_rate)
        print(f"✓ Saved: {stem}.wav")

batch_generate("prompts.json", "./music_output/", model_size="medium")
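The batch function reads `item['id']` and `item['prompt']` from each entry, so the prompts file needs to be a JSON list of objects with those two keys. A small sketch of how such a file could be prepared follows; the helper name `prompts_to_json` and the example ids are made up for illustration.

```python
import json

def prompts_to_json(items):
    """Validate prompt entries and serialize them as a JSON list."""
    for index, item in enumerate(items):
        # Every entry needs the two keys that batch_generate reads.
        missing = {"id", "prompt"} - set(item)
        if missing:
            raise ValueError(f"entry {index} is missing keys: {sorted(missing)}")
    return json.dumps(items, indent=2)

prompts = [
    {"id": "lofi_01", "prompt": "lo-fi hip hop beat, 85 bpm, vinyl crackle, mellow"},
    {"id": "edm_01", "prompt": "upbeat electronic dance music, 128 bpm, energetic, synthesizers"},
]

text = prompts_to_json(prompts)
print(text)

# To use it with batch_generate, save the text to disk, for example:
# with open("prompts.json", "w", encoding="utf-8") as f:
#     f.write(text)
```

Validating up front means a malformed entry fails before the model is loaded rather than partway through a long batch.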

Prompt Engineering

Specific, technical descriptions work best:

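What works: stating BPM, genre, era, instruments, mood, and use case, plus "no vocals" when you want an instrumental.

What doesn't work: long prompts and contradictory descriptors (for example, "calm" together with "aggressive").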

Good prompt template: {genre}, {bpm} bpm, {instruments}, {mood}, {use_case}, 30 seconds
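The template can be filled in programmatically. This is one possible sketch: `build_prompt` is a hypothetical helper, not part of audiocraft, and the example values are illustrative.

```python
def build_prompt(genre, bpm, instruments, mood, use_case, seconds=30):
    """Fill the template: {genre}, {bpm} bpm, {instruments}, {mood}, {use_case}, N seconds."""
    parts = [
        genre,
        f"{bpm} bpm",
        ", ".join(instruments),
        mood,
        use_case,
        f"{seconds} seconds",
    ]
    # Skip empty fields so the prompt never contains stray commas.
    return ", ".join(part.strip() for part in parts if part and part.strip())

prompt = build_prompt(
    genre="lo-fi hip hop",
    bpm=85,
    instruments=["vinyl crackle", "electric piano"],
    mood="mellow",
    use_case="background music",
)
print(prompt)
# lo-fi hip hop, 85 bpm, vinyl crackle, electric piano, mellow, background music, 30 seconds
```

Note that the length of the generated clip is set by the `duration` generation parameter, so the trailing "30 seconds" only acts as a text hint.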

Video Pipeline Integration

import json
import subprocess
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def generate_and_mix(video_path: str, music_prompt: str, output_path: str):
    # Ask ffprobe for the container metadata as JSON to read the video length
    result = subprocess.run(["ffprobe", "-v", "quiet", "-print_format", "json",
                             "-show_format", video_path], capture_output=True, check=True)

    duration = float(json.loads(result.stdout)['format']['duration'])
    model = MusicGen.get_pretrained('facebook/musicgen-medium')
    # Cap the music at 30 seconds; longer videos keep only their own audio after that
    model.set_generation_params(duration=min(duration, 30))

    wav = model.generate([music_prompt])
    # audio_write appends ".wav", producing /tmp/bg_music.wav
    audio_write("/tmp/bg_music", wav[0].cpu(), model.sample_rate)

    # Lower the music to 15% volume and mix it under the video's existing audio track
    subprocess.run([
        "ffmpeg", "-i", video_path, "-i", "/tmp/bg_music.wav",
        "-filter_complex", "[1:a]volume=0.15[bg];[0:a][bg]amix=inputs=2:duration=first",
        "-c:v", "copy", output_path
    ], check=True)

    print(f"Mixed video saved to {output_path}")
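The ffmpeg command above references `[0:a]`, so it assumes the input video already has an audio track. As a hedged sketch, the command could be built by a helper that also covers silent videos; `build_mix_command` and its `has_audio` flag are illustrative names, and detecting whether the video has audio is left to the caller.

```python
def build_mix_command(video_path, music_path, output_path,
                      has_audio=True, music_volume=0.15):
    """Return the ffmpeg argument list for adding background music to a video."""
    if has_audio:
        # Mix the quiet music under the video's own audio track.
        graph = (f"[1:a]volume={music_volume}[bg];"
                 "[0:a][bg]amix=inputs=2:duration=first[mix]")
        audio_label = "[mix]"
        extra = []
    else:
        # No original audio: the music becomes the only audio stream,
        # and -shortest stops the output when the shorter stream ends.
        graph = f"[1:a]volume={music_volume}[bg]"
        audio_label = "[bg]"
        extra = ["-shortest"]
    return [
        "ffmpeg", "-i", video_path, "-i", music_path,
        "-filter_complex", graph,
        "-map", "0:v", "-map", audio_label,
        "-c:v", "copy", *extra, output_path,
    ]

print(build_mix_command("clip.mp4", "/tmp/bg_music.wav", "out.mp4", has_audio=False))
```

The returned list can be passed straight to `subprocess.run(..., check=True)`. Building it separately keeps the ffmpeg arguments easy to inspect and test without running ffmpeg.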

Memory Optimization

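If you run out of VRAM, the simplest fixes are to switch to a smaller checkpoint, shorten the clip duration, or generate fewer prompts per batch. Half precision also reduces memory use where your setup supports it. As a last resort, move generation to the CPU by passing device='cpu' to MusicGen.get_pretrained, which is much slower but avoids the GPU memory limit.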
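One way to act on this is to pick the checkpoint from the amount of free GPU memory. The sketch below uses the approximate VRAM figures from the model table in this article; `pick_checkpoint` is a hypothetical helper and the thresholds are assumptions, not measured limits.

```python
# Approximate VRAM needs in GB, taken from the model table above (largest first).
VRAM_REQUIRED_GB = {"large": 16, "medium": 8, "small": 4}

def pick_model_size(free_vram_gb):
    """Return the largest model size that fits, or None if nothing fits."""
    for size, needed in VRAM_REQUIRED_GB.items():
        if free_vram_gb >= needed:
            return size
    return None

def pick_checkpoint(free_vram_gb):
    """Return (checkpoint name, device), falling back to the small model on CPU."""
    size = pick_model_size(free_vram_gb)
    if size is None:
        return "facebook/musicgen-small", "cpu"
    return f"facebook/musicgen-{size}", "cuda"

print(pick_checkpoint(24.0))  # ('facebook/musicgen-large', 'cuda')
print(pick_checkpoint(6.0))   # ('facebook/musicgen-small', 'cuda')
print(pick_checkpoint(2.0))   # ('facebook/musicgen-small', 'cpu')
```

In a real script, the free memory could come from `torch.cuda.mem_get_info()`, which returns free and total bytes, and the result would be passed on as `MusicGen.get_pretrained(name, device=device)`.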


The NEPA AI Music Workspace makes all this seamless. Describe what you need in plain English and let the agent handle everything else.

→ Get the AI Music Workspace at axon.nepa-ai.com/shop/music-workspace

Stop paying per-track music licensing fees — generate exactly what you need, owned by you.