Back to Blog
2026-03-22

How to Separate Stems from Any Song with AI (Demucs Deep Dive)

Extract vocals, drums, bass, and instruments from any song in minutes using Meta's Demucs model. Complete Python walkthrough with real code, GPU tips, and quality benchmarks.

Stem separation used to require expensive studio software or a full re-recording session. Today you can run Meta's Demucs on your GPU and split any song into isolated tracks in under two minutes — no API key, no subscription, no cloud upload.

Here's exactly how it works, what the output quality looks like, and how to integrate it into your production workflow.

What Is Stem Separation?

A "stem" is an isolated audio component of a song — just the vocals, just the drums, just the bass, or just everything else. Stem separation (also called source separation) uses deep learning to reverse the mixing process and reconstruct those individual tracks from a stereo file.

Use cases:

  • Karaoke/instrumental: Remove vocals for practice tracks
  • Remixing: Grab the drum loop or bass line from a commercial track
  • Restoration: Fix a poorly mixed recording by isolating and re-processing individual elements
  • Content creation: Pull an a cappella or isolated instrumental for sampling, mashups, and edits (note: separation doesn't clear the underlying copyright)
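Under the hood, separation inverts mixing: a master is (to a first approximation) the sum of its stems, and the model estimates stems whose sum reconstructs the input. A toy NumPy illustration of that additivity constraint (synthetic signals, not an actual separator):

```python
import numpy as np

# Two synthetic "stems": a 440 Hz tone and broadband noise
sr = 44100
t = np.linspace(0, 1.0, sr, endpoint=False)
vocals = 0.5 * np.sin(2 * np.pi * 440 * t)
drums = 0.2 * np.random.default_rng(0).standard_normal(sr)

# Mixing is just addition; a perfect separator would recover each addend
mixture = vocals + drums
print(float(np.abs(mixture - (vocals + drums)).max()))  # 0.0
```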

Demucs: The State-of-the-Art Local Model

Meta AI's Demucs (Deep Extractor for Music Sources) is the best open-source stem separator available. The default htdemucs model (v4, a Hybrid Transformer architecture) combines time-domain and frequency-domain processing and produces 4 stems: vocals, drums, bass, and other.

Quality vs. speed tiers:

| Model | Quality | Relative processing time (GPU) | Notes |
|---|---|---|---|
| htdemucs | ⭐⭐⭐⭐ | ~2x | Default, best balance |
| htdemucs_ft | ⭐⭐⭐⭐⭐ | ~6x | Fine-tuned, slowest |
| mdx_extra | ⭐⭐⭐⭐ | ~3x | MDX competition winner |
| htdemucs_6s | ⭐⭐⭐⭐ | ~4x | 6 stems (adds piano + guitar) |
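If you select the model from code, it helps to keep the tier choice in one place. A tiny sketch (the model identifiers are real Demucs names from the table above; the tier labels and mapping are my own):

```python
# Map a desired quality/speed trade-off to a Demucs model name
MODEL_TIERS = {
    "fast": "htdemucs",        # default, best balance
    "best": "htdemucs_ft",     # fine-tuned, slowest
    "six_stem": "htdemucs_6s", # adds piano + guitar
}

def choose_model(tier: str = "fast") -> str:
    """Return a Demucs model name for the requested tier, defaulting to htdemucs."""
    return MODEL_TIERS.get(tier, "htdemucs")

print(choose_model("best"))  # htdemucs_ft
```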

Installation

# Requires Python 3.8+, PyTorch with CUDA for GPU acceleration
pip install demucs

# Verify GPU is available
python -c "import torch; print(torch.cuda.is_available())"

If you're on CPU only, it'll still work — just expect ~10x slower processing.

Basic Usage: CLI

# Two-stem split (outputs vocals.wav + no_vocals.wav to ./separated/)
demucs --two-stems=vocals "path/to/song.mp3"

# Full 4-stem separation
demucs -n htdemucs "path/to/song.mp3"

# Use the fine-tuned model for better quality
demucs -n htdemucs_ft "path/to/song.mp3"

# 6-stem model (adds guitar and piano)
demucs -n htdemucs_6s "path/to/song.mp3"

# Specify output directory
demucs -n htdemucs -o ./stems/ "path/to/song.mp3"

# Process an entire folder
demucs -n htdemucs ./my_songs/

Output structure:

separated/
  htdemucs/
    song_name/
      vocals.wav
      drums.wav
      bass.wav
      other.wav
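Given that layout, collecting the stem files back into Python is a few lines of pathlib. A small sketch (collect_stems is a hypothetical helper; the directory names follow the tree above):

```python
from pathlib import Path

def collect_stems(separated_dir: str, model: str, song: str) -> dict:
    """Map stem name -> wav path for one song in Demucs' output layout."""
    song_dir = Path(separated_dir) / model / song
    return {p.stem: p for p in sorted(song_dir.glob("*.wav"))}

# Example: collect_stems("separated", "htdemucs", "song_name")
# yields keys like "bass", "drums", "other", "vocals"
```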

Python API: Full Control

import torch
from demucs.pretrained import get_model
from demucs.apply import apply_model
from demucs.audio import AudioFile, save_audio
from pathlib import Path

def separate_stems(
    input_path: str,
    model_name: str = "htdemucs",
    output_dir: str = "./stems",
    device: str = "auto"
) -> dict:
    """
    Separate audio stems using Demucs.
    Returns dict of {stem_name: tensor} for downstream processing.
    """
    # Auto-detect device
    if device == "auto":
        device = "cuda" if torch.cuda.is_available() else "cpu"
    
    print(f"Using device: {device}")
    
    # Load model
    model = get_model(model_name)
    model.to(device)
    model.eval()
    
    # Load audio
    wav = AudioFile(input_path).read(
        streams=0,
        samplerate=model.samplerate,
        channels=model.audio_channels
    )
    ref = wav.mean(0)
    wav = (wav - ref.mean()) / ref.std()
    wav = wav.unsqueeze(0).to(device)
    
    # Run separation
    with torch.no_grad():
        sources = apply_model(
            model, 
            wav,
            device=device,
            shifts=1,          # Number of random shifts (higher = better but slower)
            split=True,        # Split into chunks for memory efficiency
            overlap=0.25,      # Overlap between chunks
            progress=True
        )
    
    # Denormalize
    sources = sources * ref.std() + ref.mean()
    
    # Save stems
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    stems = {}
    for source, name in zip(sources[0], model.sources):
        stem_path = output_path / f"{name}.wav"
        # Move to CPU before writing; the tensor may still live on the GPU
        save_audio(source.cpu(), str(stem_path), samplerate=model.samplerate)
        stems[name] = source
        print(f"Saved: {stem_path}")
    
    return stems

# Example usage
stems = separate_stems(
    input_path="my_song.mp3",
    model_name="htdemucs_ft",  # Fine-tuned = best quality
    output_dir="./output_stems"
)

print("Stems extracted:", list(stems.keys()))
# Output: ['drums', 'bass', 'other', 'vocals']
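Since stems are additive, you can rebuild an instrumental (karaoke) track by summing everything except vocals. A minimal sketch (mix_instrumental is a hypothetical helper; shown with NumPy arrays, but it only uses +, so the torch tensors returned above work the same way):

```python
import numpy as np

def mix_instrumental(stems: dict):
    """Sum every stem except vocals into a single instrumental track."""
    parts = [audio for name, audio in stems.items() if name != "vocals"]
    out = parts[0]
    for p in parts[1:]:
        out = out + p
    return out

# Example with dummy arrays standing in for the separated stems
stems = {
    "vocals": np.ones(4),
    "drums": np.full(4, 2.0),
    "bass": np.full(4, 3.0),
    "other": np.full(4, 4.0),
}
print(mix_instrumental(stems))  # [9. 9. 9. 9.]
```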

GPU Memory Optimization

On a 4GB VRAM card, long tracks can OOM. Use these flags:

# Reduce chunk size to prevent CUDA OOM
sources = apply_model(
    model, wav,
    device="cuda",
    split=True,
    segment=7.8,    # Chunk length in seconds (htdemucs models cap this at ~7.8s)
    overlap=0.25
)

# Or use float16 for ~50% VRAM reduction
model = model.half()
wav = wav.half()

For batch processing, clear the cache between files:

import gc
gc.collect()
torch.cuda.empty_cache()

Post-Processing: Clean Up the Stems

Raw Demucs output sometimes has "bleed" — a bit of drums in the vocal stem, etc. You can reduce this with a simple noise gate:

import numpy as np
import soundfile as sf
from scipy.ndimage import uniform_filter1d

def apply_noise_gate(audio: np.ndarray, threshold_db: float = -40.0, sr: int = 44100) -> np.ndarray:
    """Apply a simple noise gate to reduce stem bleed. Expects (channels, samples)."""
    threshold_linear = 10 ** (threshold_db / 20)
    # Instantaneous level per sample, averaged across channels (not windowed RMS)
    level = np.sqrt(np.mean(audio**2, axis=0, keepdims=True))
    mask = level > threshold_linear
    # Smooth the on/off mask over ~10 ms to avoid clicks at gate transitions
    mask_smooth = uniform_filter1d(mask.astype(float), size=int(sr * 0.01))
    return audio * mask_smooth

# Load stem and apply gate
vocals, sr = sf.read("stems/vocals.wav")
vocals_clean = apply_noise_gate(vocals.T, threshold_db=-45)
sf.write("stems/vocals_clean.wav", vocals_clean.T, sr)
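For reference, the gate's first line is the standard dB-to-linear conversion, amplitude = 10^(dB/20); the two thresholds used above work out to:

```python
# Convert the gate thresholds from dB to linear amplitude: 10 ** (dB / 20)
lin_default = 10 ** (-40 / 20)  # function default: 0.01
lin_call = 10 ** (-45 / 20)     # value passed in the call above: ~0.0056
print(lin_default, lin_call)
```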

Real-World Quality Benchmarks

Testing on a 3:30 pop track (RTX 3090):

| Metric | htdemucs | htdemucs_ft | mdx_extra |
|---|---|---|---|
| Processing time | 48s | 2m 10s | 1m 05s |
| Vocal SDR | 8.2 dB | 9.1 dB | 8.6 dB |
| Drum SDR | 11.4 dB | 12.0 dB | 11.8 dB |
| Vocal bleed | Low | Very low | Low |

SDR (Signal-to-Distortion Ratio) is the standard metric — higher is better. For reference, a professional studio isolation typically scores 12–15 dB.
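If you want to score your own separations, global SDR has a simple closed form: 10·log10(‖reference‖² / ‖reference − estimate‖²), where the reference is the true stem. A NumPy sketch (benchmark suites typically use windowed variants via museval; this is the plain global version):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Global signal-to-distortion ratio in dB (higher = better)."""
    error = reference - estimate
    return float(10 * np.log10(np.sum(reference**2) / np.sum(error**2)))

# A stem recovered at half amplitude scores ~6 dB (10 * log10(4))
ref = np.sin(np.linspace(0, 100, 44100))
print(round(sdr(ref, 0.5 * ref), 2))  # 6.02
```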

Workflow Integration

Once you have clean stems, the possibilities open up:

# Pitch-shift just the vocals without affecting the music
from librosa.effects import pitch_shift
import librosa, soundfile as sf

vocals, sr = librosa.load("stems/vocals.wav", sr=None, mono=False)
vocals_shifted = pitch_shift(vocals, sr=sr, n_steps=2)  # +2 semitones
sf.write("stems/vocals_pitched.wav", vocals_shifted.T, sr)

# Analyze BPM from isolated drum stem
drums, sr = librosa.load("stems/drums.wav")
tempo, beats = librosa.beat.beat_track(y=drums, sr=sr)
print(f"Track BPM: {float(tempo):.1f}")  # float() guards against ndarray tempo in newer librosa

The Audio Workspace includes this full stem separation pipeline along with 32 other production-ready audio methods — noise reduction, EQ automation, podcast editing, music generation with MusicGen, mastering chains, and batch processing. Everything runs locally on your GPU.

→ Get Audio Workspace on the Shop