How to Build a Music Generator with MusicGen: Text-to-Music in Python
MusicGen from Meta generates original music from text descriptions — 'lo-fi hip hop 120bpm with rain sounds' becomes an actual audio file. Full setup guide, model comparison, and Python integration code.
MusicGen is Meta's open-source text-to-music model. You describe what you want — "upbeat electronic background music, 90bpm, no vocals, 30 seconds" — and it generates original audio. No licensing issues. No royalties. No subscription.
It runs locally on consumer GPUs, integrates into Python pipelines, and produces surprisingly good output for background tracks, ambient soundscapes, and short music beds. Here's how to set it up and actually use it.
How MusicGen Works
MusicGen is a language model for audio. It uses the same attention mechanism as text transformers, but the "tokens" represent compressed audio chunks from a neural audio codec called EnCodec.
The generation process:
- Text prompt → encoded into conditioning tokens via a T5 text encoder
- Autoregressive transformer generates a sequence of audio tokens conditioned on the text
- Audio tokens → decoded back to waveform by EnCodec's decoder
- Output: 32kHz mono or stereo `.wav` file
This is different from diffusion-based audio models (like AudioLDM). MusicGen generates sequentially — each audio token predicts the next, conditioned on the text embedding and all previous tokens.
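The arithmetic behind that token sequence is worth seeing. A back-of-the-envelope sketch, assuming EnCodec runs at roughly a 50 Hz frame rate with 4 parallel codebooks per frame (figures not stated above, so treat them as assumptions):

```python
# Back-of-the-envelope token budget for a MusicGen clip.
# Assumed figures (not from the text above): EnCodec frame rate ~50 Hz,
# 4 parallel codebooks per frame.
FRAME_RATE_HZ = 50
CODEBOOKS = 4

def token_budget(seconds: int) -> tuple[int, int]:
    """Return (frames, total codebook entries) for a clip of this length."""
    frames = seconds * FRAME_RATE_HZ
    return frames, frames * CODEBOOKS

frames, entries = token_budget(30)
print(frames, entries)  # 1500 frames, 6000 codebook entries
```

Under those assumptions, a 30-second clip is a sequence of about 1,500 autoregressive steps, which is why generation time scales with clip length.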
Model Sizes
Meta released four model sizes:
| Model | Parameters | VRAM | Quality | Speed (RTX 3090) |
|---|---|---|---|---|
| small | 300M | 4GB | Good for demos | ~2s for 15s clip |
| medium | 1.5B | 8GB | Production quality | ~8s for 15s clip |
| large | 3.3B | 16GB | Best quality | ~18s for 15s clip |
| melody | 1.5B | 8GB | Melody-conditioned | ~8s for 15s clip |
The melody variant accepts both a text prompt and a reference audio clip — it generates music in the style/tempo/key of the reference while following the text description. Extremely useful for matching existing content.
Installation
```bash
# Create a clean environment
python -m venv musicgen-env
source musicgen-env/bin/activate  # or .\musicgen-env\Scripts\activate on Windows

# Install audiocraft (Meta's library containing MusicGen)
pip install audiocraft

# For CUDA acceleration (RTX users)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify GPU detection
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```
Models download automatically on first use (~600MB for small, ~3GB for large).
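If you are unsure which checkpoint your card can handle, a small helper can map available VRAM to the table above. The thresholds below are just the table's figures, not an official API:

```python
def pick_for_vram(vram_gb: float) -> str:
    """Map available VRAM to a MusicGen checkpoint, per the table above."""
    if vram_gb >= 16:
        return "facebook/musicgen-large"
    if vram_gb >= 8:
        return "facebook/musicgen-medium"
    return "facebook/musicgen-small"

# With torch installed, query the card and pick:
#   pick_for_vram(torch.cuda.get_device_properties(0).total_memory / 1e9)
print(pick_for_vram(24))  # facebook/musicgen-large
```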
Basic Generation
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load model (downloads on first run)
model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Set generation parameters
model.set_generation_params(
    duration=30,      # seconds
    temperature=1.0,  # higher = more creative/chaotic
    top_k=250,        # sampling diversity
    top_p=0.0,        # nucleus sampling (0 = disabled)
    cfg_coef=3.0,     # classifier-free guidance strength
)

# Generate from text
descriptions = [
    "lo-fi hip hop beat, 85 bpm, vinyl crackle, mellow, background music",
    "cinematic orchestral swell, strings and brass, dramatic, no vocals",
    "upbeat electronic dance music, 128 bpm, energetic, synthesizers",
]
wav = model.generate(descriptions)  # returns tensor [batch, channels, samples]

# Save each output
for idx, one_wav in enumerate(wav):
    # Normalizes + saves as .wav
    audio_write(
        f'output_{idx}',
        one_wav.cpu(),
        model.sample_rate,
        strategy="loudness",  # auto normalize loudness
        loudness_compressor=True,
    )

print("Generated", len(wav), "tracks")
```
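One practical touch-up: generated clips often end abruptly. A short fade-out before saving avoids clicks when a track loops. A minimal NumPy sketch (the tensor from `model.generate()` converts via `one_wav.cpu().numpy()`; the helper itself is not part of audiocraft):

```python
import numpy as np

def fade_out(wav: np.ndarray, sample_rate: int, seconds: float = 1.5) -> np.ndarray:
    """Linearly ramp the last `seconds` of audio down to silence."""
    n = min(int(sample_rate * seconds), wav.shape[-1])
    out = wav.astype(np.float64).copy()
    out[..., -n:] *= np.linspace(1.0, 0.0, n)  # broadcast over channels
    return out

# Half-second fade on one second of 32kHz audio
faded = fade_out(np.ones((1, 32000)), sample_rate=32000, seconds=0.5)
```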
Melody-Conditioned Generation
The melody model lets you use a reference audio clip as a stylistic anchor:
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=20)

# Load reference audio
melody_waveform, sr = torchaudio.load("reference_track.mp3")
melody_waveform = melody_waveform.unsqueeze(0)  # add batch dimension

descriptions = ["happy indie folk, acoustic guitar, male vocals"]
wav = model.generate_with_chroma(
    descriptions=descriptions,
    melody_wavs=melody_waveform,
    melody_sample_rate=sr,
    progress=True,
)

audio_write('output_melody_conditioned', wav[0].cpu(), model.sample_rate)
The model extracts harmonic "chroma" features from your reference (its melodic contour, key, and chord progressions over time) and uses them to condition generation. The output won't sound like your reference, but it will fit it.
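To see what "chroma" means concretely: each moment of audio is summarized by how much energy falls into each of the 12 pitch classes, regardless of octave. A toy illustration of the pitch-class mapping only (audiocraft ships its own chroma extractor; this is just the idea):

```python
import numpy as np

def pitch_class(freq_hz: float) -> int:
    """Map a frequency to one of 12 pitch classes (C=0, C#=1, ..., B=11)."""
    midi = 69 + 12 * np.log2(freq_hz / 440.0)  # MIDI note number; A4 = 69
    return int(round(midi)) % 12

print(pitch_class(440.0))   # 9  (A)
print(pitch_class(261.63))  # 0  (C)
```

A chromagram stacks one such 12-bin energy profile per time frame, which is why key and chord movement survive the transfer while timbre does not.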
Batch Production Script
For content pipelines that need multiple tracks:
```python
import json
from itertools import groupby
from pathlib import Path

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write


def batch_generate(prompts_file: str, output_dir: str, model_size: str = "medium"):
    """
    Generate music for a list of prompts from a JSON file.

    prompts.json format:
    [
        {"id": "intro-music", "prompt": "upbeat lo-fi, 90bpm", "duration": 15},
        {"id": "background-loop", "prompt": "ambient nature sounds, calm", "duration": 60}
    ]
    """
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    with open(prompts_file) as f:
        prompts = json.load(f)

    model = MusicGen.get_pretrained(f'facebook/musicgen-{model_size}')

    # Group by duration for efficiency (model regenerates for each duration change)
    for duration, group in groupby(sorted(prompts, key=lambda x: x['duration']),
                                   key=lambda x: x['duration']):
        batch = list(group)
        model.set_generation_params(duration=duration, temperature=1.0)
        descriptions = [item['prompt'] for item in batch]
        wavs = model.generate(descriptions, progress=True)

        for item, wav in zip(batch, wavs):
            out_file = output_path / item['id']
            audio_write(str(out_file), wav.cpu(), model.sample_rate,
                        strategy="loudness", loudness_compressor=True)
            print(f"✓ Saved: {item['id']}.wav")


# Run it
batch_generate("prompts.json", "./music_output/", model_size="medium")
```
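The prompts file the script expects can be produced programmatically as well. A small sketch that writes the exact format from the docstring (the ids and prompts here are illustrative):

```python
import json

# Matches the prompts.json schema the batch script reads
prompts = [
    {"id": "intro-music", "prompt": "upbeat lo-fi, 90bpm", "duration": 15},
    {"id": "background-loop", "prompt": "ambient nature sounds, calm", "duration": 60},
]

with open("prompts.json", "w") as f:
    json.dump(prompts, f, indent=2)
```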
Prompt Engineering for Better Results
MusicGen responds well to specific, technical descriptions:
What works:
- BPM values: "120 bpm", "laid back 75 bpm"
- Genre + era: "90s boom bap hip hop", "2000s chillout lounge"
- Instrumentation: "acoustic guitar, bass, light drums, no synths"
- Mood + use case: "background music for coding, non-distracting, lo-fi"
- Negative descriptors: "no vocals", "no bass drop", "subtle, not intrusive"
What doesn't work:
- Artist names (copyright/training limitations)
- Very long prompts (>50 words often degrades coherence)
- Contradictory descriptors: "heavy metal, peaceful, ambient" (pick one vibe)
Prompt template that consistently produces usable tracks:
{genre}, {bpm} bpm, {instruments}, {mood}, {use_case}, no vocals, {duration} seconds
Example: "lo-fi jazz, 85 bpm, piano and brushed drums, mellow, background study music, no vocals, 30 seconds"
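The template is easy to enforce in code so every generated prompt stays in the shape MusicGen handles well. A small helper (the field names come from the template above; values are illustrative):

```python
def build_prompt(genre: str, bpm: int, instruments: str,
                 mood: str, use_case: str, duration: int) -> str:
    """Fill the prompt template: genre, bpm, instruments, mood, use case."""
    return (f"{genre}, {bpm} bpm, {instruments}, {mood}, {use_case}, "
            f"no vocals, {duration} seconds")

print(build_prompt("lo-fi jazz", 85, "piano and brushed drums",
                   "mellow", "background study music", 30))
```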
Integration with Video Pipelines
Combining MusicGen with video editing:
```python
import json
import subprocess

from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write


def generate_and_mix(video_path: str, music_prompt: str, output_path: str):
    """Generate background music and mix into video."""
    # Get video duration
    result = subprocess.run([
        "ffprobe", "-v", "quiet", "-print_format", "json",
        "-show_format", video_path
    ], capture_output=True, text=True)
    duration = float(json.loads(result.stdout)['format']['duration'])

    # Generate music to match video length
    model = MusicGen.get_pretrained('facebook/musicgen-medium')
    model.set_generation_params(duration=min(duration, 30))  # MusicGen max = 30s
    wav = model.generate([music_prompt])
    audio_write("/tmp/bg_music", wav[0].cpu(), model.sample_rate,
                strategy="loudness")

    # Mix with ffmpeg: original audio at 100%, music at 15%
    subprocess.run([
        "ffmpeg", "-i", video_path, "-i", "/tmp/bg_music.wav",
        "-filter_complex", "[1:a]volume=0.15[bg];[0:a][bg]amix=inputs=2:duration=first",
        "-c:v", "copy", output_path
    ], check=True)
    print(f"Mixed video saved to {output_path}")
```
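For videos longer than the 30-second cap, looping the generated bed is usually better than stitching multiple clips. One way to do it: ffmpeg's `-stream_loop -1` repeats the input that follows it, and `-shortest` stops when the video ends. A sketch that builds (but does not run) the command:

```python
def loop_mix_cmd(video_path: str, music_path: str, output_path: str,
                 music_volume: float = 0.15) -> list[str]:
    """Build an ffmpeg command that loops the music bed under the video.

    -stream_loop -1 loops the second input indefinitely; -shortest stops
    the mix when the (first) video input runs out.
    """
    return [
        "ffmpeg", "-i", video_path,
        "-stream_loop", "-1", "-i", music_path,
        "-filter_complex",
        f"[1:a]volume={music_volume}[bg];[0:a][bg]amix=inputs=2:duration=first",
        "-c:v", "copy", "-shortest", output_path,
    ]

cmd = loop_mix_cmd("talk.mp4", "bg_music.wav", "talk_mixed.mp4")
# subprocess.run(cmd, check=True)
```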
Memory Optimization
For systems with limited VRAM:
```python
import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')

# Half precision on the language model roughly halves its VRAM footprint.
# (MusicGen itself is not an nn.Module, so cast the transformer directly;
# leaving EnCodec in fp32 keeps decoding quality intact.)
model.lm = model.lm.half()

# If the GPU still OOMs, load on CPU instead (slow but works):
# model = MusicGen.get_pretrained('facebook/musicgen-small', device='cpu')

# Clear cache between generations
torch.cuda.empty_cache()
```
The small model runs on 4GB VRAM — accessible on most modern GPUs including laptop cards.
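As a sanity check on the VRAM column in the table above, the weights alone account for most of it. A rough estimate, assuming 4 bytes per parameter in fp32 and 2 in fp16 (activations and caches add overhead on top):

```python
def weight_vram_gb(params_billions: float, bytes_per_param: int) -> float:
    """VRAM for model weights alone: params * bytes, expressed in GB."""
    return params_billions * bytes_per_param  # 1e9 params x bytes / 1e9 B-per-GB

print(weight_vram_gb(3.3, 4))  # ~13.2 GB fp32: why large wants a 16GB card
print(weight_vram_gb(3.3, 2))  # ~6.6 GB fp16 after the half-precision cast
print(weight_vram_gb(0.3, 4))  # ~1.2 GB fp32: why small fits in 4GB
```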
The NEPA AI Music Workspace packages all of this into an agent-accessible API. Describe the music you need in plain English. Your agent handles model selection, generation, format conversion, and delivery to your content pipeline.
→ Get the AI Music Workspace at /shop/music-workspace
Stop paying per-track music licensing fees. Generate exactly what you need, owned by you.