Back to Blog
2026-03-22

How to Build a Music Generator with MusicGen: Text-to-Music in Python

MusicGen from Meta generates original music from text descriptions — 'lo-fi hip hop 120bpm with rain sounds' becomes an actual audio file. Full setup guide, model comparison, and Python integration code.

MusicGen is Meta's open-source text-to-music model. You describe what you want — "upbeat electronic background music, 90bpm, no vocals, 30 seconds" — and it generates original audio. No licensing issues. No royalties. No subscription.

It runs locally on consumer GPUs, integrates into Python pipelines, and produces surprisingly good output for background tracks, ambient soundscapes, and short music beds. Here's how to set it up and actually use it.

How MusicGen Works

MusicGen is a language model for audio. It uses the same attention mechanism as text transformers, but the "tokens" represent compressed audio chunks from a neural audio codec called EnCodec.

The generation process:

  1. Text prompt → encoded into conditioning tokens via a T5 text encoder
  2. Autoregressive transformer generates a sequence of audio tokens conditioned on the text
  3. Audio tokens → decoded back to waveform by EnCodec's decoder
  4. Output: 32kHz mono or stereo .wav file

This is different from diffusion-based audio models (like AudioLDM). MusicGen generates sequentially — each audio token predicts the next, conditioned on the text embedding and all previous tokens.
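The autoregressive control flow can be illustrated with a toy sketch. The real model predicts EnCodec codebook tokens with a transformer; here the vocabulary, the `next_token` rule, and the integer "text embedding" are all invented for illustration — only the loop structure (each token conditioned on the text plus everything generated so far) mirrors MusicGen.

```python
# Toy autoregressive generator: stands in for MusicGen's transformer.
# Real audio tokens come from EnCodec codebooks; here they are small ints.

def next_token(text_embedding: int, history: list[int]) -> int:
    """Invented stand-in for next-token prediction: mixes the 'text
    conditioning' with the most recent generated token and the position."""
    last = history[-1] if history else 0
    return (text_embedding + last * 3 + len(history)) % 16  # 16-token vocab

def generate(text_embedding: int, num_tokens: int) -> list[int]:
    tokens: list[int] = []
    for _ in range(num_tokens):          # one token per step, left to right
        tokens.append(next_token(text_embedding, tokens))
    return tokens                        # real tokens would go to EnCodec's decoder

seq = generate(text_embedding=7, num_tokens=8)
print(seq)
```

The key property to notice: the same prompt embedding always yields the same sequence here, whereas MusicGen samples stochastically (temperature, top-k), which is why repeated generations of the same prompt differ.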

Model Sizes

Meta released four checkpoints (three sizes plus a melody-conditioned variant):

| Model | Parameters | VRAM | Quality | Speed (RTX 3090) |
|---|---|---|---|---|
| small | 300M | 4GB | Good for demos | ~2s for 15s clip |
| medium | 1.5B | 8GB | Production quality | ~8s for 15s clip |
| large | 3.3B | 16GB | Best quality | ~18s for 15s clip |
| melody | 1.5B | 8GB | Melody-conditioned | ~8s for 15s clip |

The melody variant accepts both a text prompt and a reference audio clip — it generates music in the style/tempo/key of the reference while following the text description. Extremely useful for matching existing content.
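For scripts that run on varying hardware, the VRAM column above translates directly into a selection helper. This is a convenience sketch based on the table, not an audiocraft API; the thresholds are the table's VRAM figures.

```python
def pick_musicgen_model(vram_gb: float, want_melody: bool = False) -> str:
    """Pick the largest MusicGen checkpoint that fits the available VRAM,
    using the VRAM figures from the model table."""
    if want_melody and vram_gb >= 8:
        return "facebook/musicgen-melody"   # 1.5B, melody-conditioned
    if vram_gb >= 16:
        return "facebook/musicgen-large"    # 3.3B, best quality
    if vram_gb >= 8:
        return "facebook/musicgen-medium"   # 1.5B, production quality
    if vram_gb >= 4:
        return "facebook/musicgen-small"    # 300M, good for demos
    raise ValueError("MusicGen needs roughly 4GB of VRAM or more")

print(pick_musicgen_model(24))              # facebook/musicgen-large
print(pick_musicgen_model(8, want_melody=True))  # facebook/musicgen-melody
```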

Installation

# Create a clean environment
python -m venv musicgen-env
source musicgen-env/bin/activate  # or .\musicgen-env\Scripts\activate on Windows

# Install audiocraft (Meta's library containing MusicGen)
pip install audiocraft

# For CUDA acceleration (RTX users)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify GPU detection
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"

Models download automatically on first use (~600MB for small, ~3GB for large).

Basic Generation

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load model (downloads on first run)
model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Set generation parameters
model.set_generation_params(
    duration=30,           # seconds
    temperature=1.0,       # higher = more creative/chaotic
    top_k=250,             # sampling diversity
    top_p=0.0,             # nucleus sampling (0 = disabled)
    cfg_coef=3.0,          # classifier-free guidance strength
)

# Generate from text
descriptions = [
    "lo-fi hip hop beat, 85 bpm, vinyl crackle, mellow, background music",
    "cinematic orchestral swell, strings and brass, dramatic, no vocals",
    "upbeat electronic dance music, 128 bpm, energetic, synthesizers"
]

wav = model.generate(descriptions)  # returns tensor [batch, channels, samples]

# Save each output
for idx, one_wav in enumerate(wav):
    # Normalizes + saves as .wav
    audio_write(
        f'output_{idx}',
        one_wav.cpu(),
        model.sample_rate,
        strategy="loudness",    # auto normalize loudness
        loudness_compressor=True
    )

print("Generated", len(wav), "tracks")

Melody-Conditioned Generation

The melody model lets you use a reference audio clip as a stylistic anchor:

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=20)

# Load reference audio
melody_waveform, sr = torchaudio.load("reference_track.mp3")
melody_waveform = melody_waveform.unsqueeze(0)  # add batch dimension

descriptions = ["happy indie folk, acoustic guitar, male vocals"]

wav = model.generate_with_chroma(
    descriptions=descriptions,
    melody_wavs=melody_waveform,
    melody_sample_rate=sr,
    progress=True
)

audio_write('output_melody_conditioned', wav[0].cpu(), model.sample_rate)

The model extracts harmonic "chroma" features from your reference (melodic contour, key, chord progressions) and uses them to condition generation. The output won't sound like your reference, but it will fit it.
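To build intuition for what chroma conditioning sees, here is a minimal pitch-class detector in NumPy. This is not audiocraft's actual chroma extractor (which works on short overlapping frames); it just shows the core folding of frequency onto one of 12 pitch classes, using a synthetic 440 Hz tone that should land on A.

```python
import numpy as np

def dominant_pitch_class(wave: np.ndarray, sr: int) -> str:
    """Find the strongest frequency via FFT and fold it onto one of the
    12 pitch classes — the same folding a chromagram performs per frame."""
    spectrum = np.abs(np.fft.rfft(wave))
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    peak = freqs[np.argmax(spectrum)]
    midi = 69 + 12 * np.log2(peak / 440.0)     # map Hz to MIDI note number
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return names[int(round(midi)) % 12]

sr = 22050
t = np.arange(sr) / sr                          # one second of samples
tone = np.sin(2 * np.pi * 440.0 * t)            # concert A
print(dominant_pitch_class(tone, sr))           # A
```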

Batch Production Script

For content pipelines that need multiple tracks:

import json
from pathlib import Path
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def batch_generate(prompts_file: str, output_dir: str, model_size: str = "medium"):
    """
    Generate music for a list of prompts from a JSON file.
    
    prompts.json format:
    [
        {"id": "intro-music", "prompt": "upbeat lo-fi, 90bpm", "duration": 15},
        {"id": "background-loop", "prompt": "ambient nature sounds, calm", "duration": 60}
    ]
    """
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    with open(prompts_file) as f:
        prompts = json.load(f)
    
    model = MusicGen.get_pretrained(f'facebook/musicgen-{model_size}')
    
    # Group by duration for efficiency (model regenerates for each duration change)
    from itertools import groupby
    for duration, group in groupby(sorted(prompts, key=lambda x: x['duration']), 
                                    key=lambda x: x['duration']):
        batch = list(group)
        model.set_generation_params(duration=duration, temperature=1.0)
        
        descriptions = [item['prompt'] for item in batch]
        wavs = model.generate(descriptions, progress=True)
        
        for item, wav in zip(batch, wavs):
            out_file = output_path / item['id']
            audio_write(str(out_file), wav.cpu(), model.sample_rate,
                       strategy="loudness", loudness_compressor=True)
            print(f"✓ Saved: {item['id']}.wav")

# Run it
batch_generate("prompts.json", "./music_output/", model_size="medium")
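To bootstrap the pipeline, a prompts file in the docstring's format can be written with the standard library (the ids and prompts here are placeholders):

```python
import json

# Placeholder prompt list matching the format batch_generate() expects
prompts = [
    {"id": "intro-music", "prompt": "upbeat lo-fi, 90bpm", "duration": 15},
    {"id": "background-loop", "prompt": "ambient nature sounds, calm", "duration": 60},
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)

# Round-trip check: batch_generate() reads this same structure back
with open("prompts.json") as f:
    loaded = json.load(f)
print(len(loaded), loaded[0]["id"])
```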

Prompt Engineering for Better Results

MusicGen responds well to specific, technical descriptions:

What works:

  • BPM values: "120 bpm", "laid back 75 bpm"
  • Genre + era: "90s boom bap hip hop", "2000s chillout lounge"
  • Instrumentation: "acoustic guitar, bass, light drums, no synths"
  • Mood + use case: "background music for coding, non-distracting, lo-fi"
  • Negative descriptors: "no vocals", "no bass drop", "subtle, not intrusive"

What doesn't work:

  • Artist names (copyright/training limitations)
  • Very long prompts (>50 words often degrades coherence)
  • Contradictory descriptors: "heavy metal, peaceful, ambient" — pick one vibe

Prompt template that consistently produces usable tracks:

{genre}, {bpm} bpm, {instruments}, {mood}, {use_case}, no vocals, {duration} seconds

Example: "lo-fi jazz, 85 bpm, piano and brushed drums, mellow, background study music, no vocals, 30 seconds"
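The template drops straight into a small formatting helper — a convenience sketch, not part of audiocraft:

```python
def build_prompt(genre: str, bpm: int, instruments: str, mood: str,
                 use_case: str, duration: int) -> str:
    """Fill the prompt template from the section above."""
    return (f"{genre}, {bpm} bpm, {instruments}, {mood}, "
            f"{use_case}, no vocals, {duration} seconds")

prompt = build_prompt("lo-fi jazz", 85, "piano and brushed drums",
                      "mellow", "background study music", 30)
print(prompt)
```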

Integration with Video Pipelines

Combining MusicGen with video editing:

import subprocess
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def generate_and_mix(video_path: str, music_prompt: str, output_path: str):
    """Generate background music and mix into video."""
    # Get video duration
    result = subprocess.run([
        "ffprobe", "-v", "quiet", "-print_format", "json",
        "-show_format", video_path
    ], capture_output=True, text=True)
    
    import json
    duration = float(json.loads(result.stdout)['format']['duration'])
    
    # Generate music to match video length
    model = MusicGen.get_pretrained('facebook/musicgen-medium')
    model.set_generation_params(duration=min(duration, 30))  # MusicGen max = 30s
    
    wav = model.generate([music_prompt])
    audio_write("/tmp/bg_music", wav[0].cpu(), model.sample_rate,
               strategy="loudness")
    
    # Mix with ffmpeg: original audio at 100%, music at 15%
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path, "-i", "/tmp/bg_music.wav",
        "-filter_complex", "[1:a]volume=0.15[bg];[0:a][bg]amix=inputs=2:duration=first",
        "-c:v", "copy", output_path
    ], check=True)
    
    print(f"Mixed video saved to {output_path}")
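Since a single MusicGen pass tops out around 30 seconds, longer videos need the clip extended before mixing. One simple approach is crossfade looping the generated waveform; sketched here with NumPy on a mono sample array (the function name and fade length are illustrative, not part of audiocraft):

```python
import numpy as np

def loop_to_length(clip: np.ndarray, target_len: int, fade: int = 4096) -> np.ndarray:
    """Repeat a mono clip with linear crossfades until it spans target_len samples."""
    out = clip.copy()
    ramp = np.linspace(0.0, 1.0, fade)
    while len(out) < target_len:
        head = clip[:fade] * ramp            # fade the next repeat in
        tail = out[-fade:] * (1.0 - ramp)    # fade the current ending out
        out = np.concatenate([out[:-fade], tail + head, clip[fade:]])
    return out[:target_len]

clip = np.random.default_rng(0).standard_normal(32000)  # ~1s at 32kHz
extended = loop_to_length(clip, 96000)                  # ~3s of audio
print(len(extended))
```

Crossfading avoids the audible click a hard cut produces at each loop boundary; for production use you would also match the loop point to a beat, which the bpm you specified in the prompt makes predictable.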

Memory Optimization

For systems with limited VRAM:

import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

# Half precision roughly halves VRAM. The MusicGen wrapper is not itself
# an nn.Module, so convert its language-model component instead:
model.lm = model.lm.half()

# If the GPU still OOMs, load on CPU instead (slow but works):
# model = MusicGen.get_pretrained('facebook/musicgen-small', device='cpu')

# Clear cache between generations
torch.cuda.empty_cache()

The small model runs on 4GB VRAM — accessible on most modern GPUs including laptop cards.


The NEPA AI Music Workspace packages all of this into an agent-accessible API. Describe the music you need in plain English. Your agent handles model selection, generation, format conversion, and delivery to your content pipeline.

→ Get the AI Music Workspace at /shop/music-workspace

Stop paying per-track music licensing fees. Generate exactly what you need, owned by you.