Setting Up MusicGen
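MusicGen is Meta's open-source text-to-music model, distributed as part of the AudioCraft library. It runs locally on your own GPU and is driven from Python scripts, so you can generate custom audio tracks without per-track fees. The code is MIT-licensed, while the pretrained weights are released under CC-BY-NC 4.0, so check the license terms before using generated audio commercially.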
How It Works
MusicGen uses a T5 text encoder to turn the prompt into embeddings, which condition an autoregressive transformer that generates discrete audio tokens. EnCodec's decoder then converts those tokens into a 32 kHz waveform (mono for the checkpoints covered here; separate stereo checkpoints also exist), which is saved as a .wav file. Unlike diffusion models, MusicGen builds the clip sequentially: each new token is predicted from the tokens generated before it.
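To get a feel for how much work the autoregressive transformer does, the sketch below counts the audio tokens behind a clip. It assumes the figures stated in the MusicGen model card (an EnCodec tokenizer with 4 codebooks sampled at 50 Hz, decoded to 32 kHz audio); the `ClipBudget` and `clip_budget` names are illustrative and not part of AudioCraft.

```python
from dataclasses import dataclass

# Assumed constants, taken from the MusicGen model card
SAMPLE_RATE_HZ = 32_000   # output audio sample rate
FRAME_RATE_HZ = 50        # EnCodec frames (time steps) per second
NUM_CODEBOOKS = 4         # tokens emitted per frame


@dataclass
class ClipBudget:
    frames: int    # autoregressive time steps
    tokens: int    # total discrete audio tokens
    samples: int   # PCM samples in the decoded waveform


def clip_budget(duration_s: float) -> ClipBudget:
    """Estimate how many frames, tokens and samples a clip of this length needs."""
    frames = int(duration_s * FRAME_RATE_HZ)
    return ClipBudget(
        frames=frames,
        tokens=frames * NUM_CODEBOOKS,
        samples=int(duration_s * SAMPLE_RATE_HZ),
    )


budget = clip_budget(30)
print(budget)  # ClipBudget(frames=1500, tokens=6000, samples=960000)
```

Under these assumptions a 30-second clip corresponds to 1,500 sequential decoding steps, which is why generation time grows roughly linearly with duration.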
Model Sizes
Meta publishes four main checkpoints, three sizes plus a melody-conditioned variant:
| Size | Params | VRAM | Quality | Generation time (RTX 3090) |
|---|---|---|---|---|
| small | 300M | 4 GB | Good for demos | ~2 s per 15 s clip |
| medium | 1.5B | 8 GB | Production quality | ~8 s per 15 s clip |
| large | 3.3B | 16 GB | Best quality | ~18 s per 15 s clip |
| melody | 1.5B | 8 GB | Melody-conditioned | ~8 s per 15 s clip |
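The melody variant takes both a text prompt and a reference audio clip. It extracts the melody from the reference (as a chromagram) and generates new music that follows that melody in the style described by the text.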
Installation
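python -m venv musicgen-env
source musicgen-env/bin/activate  # or .\musicgen-env\Scripts\activate on Windows
# Install the CUDA build of PyTorch first, then AudioCraft
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install audiocraft
# Verify that PyTorch can see the GPU (prints a fallback message instead of crashing when it cannot)
python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device')"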
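Model weights download automatically from the Hugging Face Hub the first time you call MusicGen.get_pretrained and are cached locally for later runs.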
Basic Generation
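import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the 1.5B-parameter checkpoint (downloaded on first use)
model = MusicGen.get_pretrained('facebook/musicgen-medium')
model.set_generation_params(
    duration=30,      # clip length in seconds
    temperature=1.0,  # sampling temperature; higher values are more random
    top_k=250,        # sample only from the 250 most likely tokens
    top_p=0.0,        # 0 disables nucleus sampling, so top_k is used
    cfg_coef=3.0      # classifier-free guidance strength
)

descriptions = [
    "lo-fi hip hop beat, 85 bpm, vinyl crackle, mellow, background music",
    "cinematic orchestral swell, strings and brass, dramatic, no vocals",
    "upbeat electronic dance music, 128 bpm, energetic, synthesizers"
]

# One batched call returns one clip per description: [batch, channels, samples]
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    # audio_write appends the .wav extension itself
    audio_write(f'output_{idx}', one_wav.cpu(), model.sample_rate)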
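The generation parameters are easier to tune once you can see what they do to a single decoding step. The following is a minimal pure-Python sketch, not AudioCraft's actual implementation: `apply_cfg` and `sample_top_k` are hypothetical helper names, and the logits are made-up numbers over a four-token vocabulary. It shows classifier-free guidance pushing the logits away from the unconditional prediction by `cfg_coef`, followed by temperature scaling and top-k sampling.

```python
import math
import random


def apply_cfg(cond_logits, uncond_logits, cfg_coef):
    """Classifier-free guidance: move from the unconditional logits
    toward the text-conditioned logits, scaled by cfg_coef."""
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def sample_top_k(logits, top_k, temperature, rng):
    """Keep the top_k highest logits, apply temperature, then sample one index."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ranked, weights=weights, k=1)[0]


rng = random.Random(0)
cond = [2.0, 1.0, 0.5, -1.0]     # logits given the text prompt (illustrative)
uncond = [1.0, 1.0, 1.0, 1.0]    # logits with the prompt dropped (illustrative)

guided = apply_cfg(cond, uncond, cfg_coef=3.0)
print(guided)  # [4.0, 1.0, -0.5, -5.0]

# With top_k=2 only tokens 0 and 1 can ever be chosen
token = sample_top_k(guided, top_k=2, temperature=1.0, rng=rng)
print(token)
```

In this toy setting, raising `cfg_coef` widens the gap between prompt-favoured tokens and the rest, while a lower `temperature` or smaller `top_k` makes the choice more deterministic.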
Melody-Conditioned Generation
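import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=20)

# Load the reference clip as [channels, samples]; MP3 decoding needs an ffmpeg-enabled torchaudio backend
melody_waveform, sr = torchaudio.load("reference_track.mp3")
melody_waveform = melody_waveform.unsqueeze(0)  # add a batch dimension: [1, channels, samples]

# MusicGen does not produce realistic vocals, so the prompt stays instrumental
descriptions = ["happy indie folk, acoustic guitar, warm, upbeat"]
wav = model.generate_with_chroma(
    descriptions=descriptions,
    melody_wavs=melody_waveform,
    melody_sample_rate=sr
)
audio_write('output_melody_conditioned', wav[0].cpu(), model.sample_rate)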
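The name `generate_with_chroma` refers to chroma features, which describe a melody by pitch class (C, C#, D, and so on) regardless of octave. The toy sketch below illustrates only that octave-folding idea on a list of note frequencies. It is an assumption-laden simplification: a real chromagram is computed from a spectrogram of the audio, and `pitch_class` and `chroma_histogram` are hypothetical names, not AudioCraft functions.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz: float) -> str:
    """Map a frequency to its nearest equal-tempered pitch class (A4 = 440 Hz)."""
    midi_note = round(69 + 12 * math.log2(freq_hz / 440.0))
    return NOTE_NAMES[midi_note % 12]


def chroma_histogram(freqs_hz):
    """Count how often each of the 12 pitch classes occurs, ignoring octaves."""
    counts = {name: 0 for name in NOTE_NAMES}
    for freq in freqs_hz:
        counts[pitch_class(freq)] += 1
    return counts


# A4 and A3 land in the same bin because chroma discards the octave
melody = [440.0, 220.0, 261.63, 329.63]
hist = chroma_histogram(melody)
print({name: n for name, n in hist.items() if n})  # {'C': 1, 'E': 1, 'A': 2}
```

Because only pitch-class content is kept, the generated track can follow the reference tune while the text prompt decides instrumentation and mood.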
Batch Production
import json
from pathlib import Path
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def batch_generate(prompts_file: str, output_dir: str, model_size: str = "medium"):
    # prompts_file holds a JSON list of {"id": ..., "prompt": ...} objects
    with open(prompts_file) as f:
        prompts = json.load(f)

    # Load the model once and reuse it for every prompt
    model = MusicGen.get_pretrained(f'facebook/musicgen-{model_size}')
    model.set_generation_params(duration=30, temperature=1.0)

    for item in prompts:
        wav = model.generate([item['prompt']])
        # Pass the stem only: audio_write appends ".wav" and returns the final path
        stem = Path(output_dir) / str(item['id'])
        out_file = audio_write(stem, wav[0].cpu(), model.sample_rate)
        print(f"✓ Saved: {out_file}")

batch_generate("prompts.json", "./music_output/", model_size="medium")
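The batch script reads `item['id']` and `item['prompt']` from each entry, so `prompts.json` needs to be a list of objects with those two keys. Here is a small sketch that writes an example file and validates it before any GPU time is spent. The `load_prompts` helper, the example ids, and the temporary path are illustrative assumptions.

```python
import json
import tempfile
from pathlib import Path


def load_prompts(prompts_file):
    """Load a prompts file and check that every entry has a unique 'id' and a 'prompt'."""
    with open(prompts_file) as f:
        prompts = json.load(f)
    if not isinstance(prompts, list):
        raise ValueError("expected a JSON list of prompt objects")
    seen = set()
    for i, item in enumerate(prompts):
        if not isinstance(item, dict) or "id" not in item or "prompt" not in item:
            raise ValueError(f"entry {i} needs both 'id' and 'prompt' keys")
        key = str(item["id"])
        if key in seen:
            # Duplicate ids would silently overwrite each other's output file
            raise ValueError(f"duplicate id: {key}")
        seen.add(key)
    return prompts


example = [
    {"id": "intro_lofi", "prompt": "lo-fi hip hop beat, 85 bpm, vinyl crackle, mellow"},
    {"id": "trailer_swell", "prompt": "cinematic orchestral swell, strings and brass, no vocals"},
]

# Written to a temporary directory here; in practice this would be prompts.json
path = Path(tempfile.mkdtemp()) / "prompts.json"
path.write_text(json.dumps(example, indent=2))

for item in load_prompts(path):
    print(item["id"], "->", item["prompt"])
```

Validating up front means a malformed entry fails immediately rather than halfway through a long batch run.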
Prompt Engineering
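Specific, technical descriptions work best.
What works: BPM, genre, era, instruments, mood, use case, and an explicit "no vocals".
What doesn't work: long, rambling prompts and contradictory descriptors (for example, "calm" together with "aggressive").
Good prompt template:
{genre}, {bpm} bpm, {instruments}, {mood}, {use_case}
Clip length is controlled by the duration parameter of set_generation_params, not by the prompt text.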
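A template like this is easy to fill programmatically, which keeps prompts consistent across a batch. The sketch below uses a hypothetical `build_prompt` helper (not part of AudioCraft) covering the genre, BPM, instruments, mood, and use-case fields. It leaves out any length hint on the assumption that clip length is set through `duration` rather than the prompt.

```python
def build_prompt(genre, bpm, instruments, mood, use_case):
    """Join the template fields into one comma-separated prompt, skipping empty fields."""
    parts = [genre, f"{bpm} bpm", ", ".join(instruments), mood, use_case]
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_prompt(
    genre="lo-fi hip hop",
    bpm=85,
    instruments=["vinyl crackle", "soft piano"],
    mood="mellow",
    use_case="background music",
)
print(prompt)
# lo-fi hip hop, 85 bpm, vinyl crackle, soft piano, mellow, background music
```

The resulting string can be passed directly in the list given to `model.generate`.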
Video Pipeline Integration
import json
import subprocess
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

def generate_and_mix(video_path: str, music_prompt: str, output_path: str):
    # Read the video duration (in seconds) with ffprobe
    result = subprocess.run(["ffprobe", "-v", "quiet", "-print_format", "json",
                             "-show_format", video_path],
                            capture_output=True, check=True)
    duration = float(json.loads(result.stdout)['format']['duration'])

    # Generate at most 30 s of music; longer videos get music for the first 30 s only
    model = MusicGen.get_pretrained('facebook/musicgen-medium')
    model.set_generation_params(duration=min(duration, 30))
    wav = model.generate([music_prompt])
    audio_write("/tmp/bg_music", wav[0].cpu(), model.sample_rate)  # writes /tmp/bg_music.wav

    # Mix the music at 15% volume under the video's existing audio track
    # (the input video must have one); the video stream is copied unchanged
    subprocess.run([
        "ffmpeg", "-i", video_path, "-i", "/tmp/bg_music.wav",
        "-filter_complex", "[1:a]volume=0.15[bg];[0:a][bg]amix=inputs=2:duration=first",
        "-c:v", "copy", output_path
    ], check=True)
    print(f"Mixed video saved to {output_path}")
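Since the generated clip is capped at 30 seconds, a longer video would run out of music partway through. One possible workaround, sketched here under assumptions, is to loop the music input with ffmpeg's `-stream_loop -1` input option and let `amix`'s `duration=first` end the mix when the video's own audio ends. The `build_mix_command` helper is hypothetical; it only builds the argument list, and the resulting command should be tested against your ffmpeg version before relying on it.

```python
def build_mix_command(video_path, music_path, output_path,
                      music_volume=0.15, loop_music=True):
    """Build an ffmpeg argument list that mixes background music under a video."""
    cmd = ["ffmpeg", "-y", "-i", video_path]
    if loop_music:
        # -stream_loop is an input option, so it must precede the music's -i
        cmd += ["-stream_loop", "-1"]
    cmd += ["-i", music_path]
    filter_graph = (
        f"[1:a]volume={music_volume}[bg];"
        "[0:a][bg]amix=inputs=2:duration=first"
    )
    cmd += ["-filter_complex", filter_graph, "-c:v", "copy", output_path]
    return cmd


cmd = build_mix_command("input.mp4", "/tmp/bg_music.wav", "output.mp4")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`. A simple loop will have an audible seam at the repeat point, so prompts that describe steady, ambient material tend to suit this approach better.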
Memory Optimization
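If you run out of VRAM, switch to a smaller checkpoint, shorten the clip duration, or generate fewer prompts per batch. On a GPU, AudioCraft already runs the model in half precision (float16). As a last resort, move generation to the CPU by passing device='cpu' to MusicGen.get_pretrained, which works but is much slower.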
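As a rough illustration, the choice of checkpoint can be automated from the VRAM figures in the model-size table. The sketch below treats those figures as approximate thresholds. `pick_model_size` is a hypothetical helper, and the CPU fallback to the small checkpoint is an assumption, not an AudioCraft feature.

```python
# Approximate VRAM needs in GB, taken from the model-size table
VRAM_REQUIRED_GB = {"small": 4, "medium": 8, "large": 16}


def pick_model_size(free_vram_gb: float):
    """Return (model_size, device) for the largest checkpoint that should fit."""
    for size in ("large", "medium", "small"):
        if free_vram_gb >= VRAM_REQUIRED_GB[size]:
            return size, "cuda"
    # Not enough VRAM for any checkpoint: fall back to the small model on CPU
    return "small", "cpu"


for vram in (24, 10, 2):
    print(vram, "GB ->", pick_model_size(vram))
# 24 GB -> ('large', 'cuda')
# 10 GB -> ('medium', 'cuda')
# 2 GB -> ('small', 'cpu')

# Possible usage (requires torch and audiocraft):
#   free_bytes, _total = torch.cuda.mem_get_info()
#   size, device = pick_model_size(free_bytes / 1024**3)
#   model = MusicGen.get_pretrained(f'facebook/musicgen-{size}', device=device)
```

Actual memory use also depends on batch size and clip duration, so it is worth leaving some headroom below these thresholds.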
The NEPA AI Music Workspace makes all this seamless. Describe what you need in plain English and let the agent handle everything else.
→ Get the AI Music Workspace at axon.nepa-ai.com/shop/music-workspace
Stop paying per-track music licensing fees — generate exactly what you need, owned by you.



