Back to Blog
2026-03-22

How to Run Whisper Locally for Free Transcription (Full Setup Guide)

OpenAI's Whisper model runs entirely locally — no API key, no per-minute fees. This guide covers every model size, GPU acceleration, accuracy tips, and how to integrate it into a video processing pipeline.

OpenAI Whisper is the transcription model that changed the industry when it dropped in 2022. It runs locally on consumer hardware, supports 99 languages, handles background noise surprisingly well, and is completely free. The only cost is compute time.

If you're still paying per-minute for transcription — whether that's AWS Transcribe, Rev, or the OpenAI API — you should read this.

Model Sizes and Tradeoffs

Whisper ships in five sizes. Picking the right one matters:

| Model | Parameters | VRAM | Speed (RTX 3090) | WER (English) | Best For |
|---|---|---|---|---|---|
| tiny | 39M | under 1GB | ~32x real-time | ~15% | Live captions, drafts |
| base | 74M | under 1GB | ~16x real-time | ~12% | Fast workflows |
| small | 244M | ~2GB | ~8x real-time | ~8% | Good balance |
| medium | 769M | ~5GB | ~4x real-time | ~5% | Production quality |
| large-v3 | 1.5B | ~10GB | ~2x real-time | ~3% | Maximum accuracy |

WER = Word Error Rate. Lower is better. large-v3 at 3% means roughly 3 mistakes per 100 words on clean audio.

For most content creators, medium is the sweet spot: fast enough to process an hour of footage in 15 minutes, accurate enough to publish captions without heavy editing.
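If you'd rather encode that decision than remember the table, a small helper can pick the largest model that fits your GPU. The VRAM thresholds below are rough figures taken from the table above, not exact requirements:

```python
def pick_model(vram_gb: float) -> str:
    """Pick the largest Whisper model that fits in the given VRAM.

    Thresholds are rough figures from the table above, with no
    headroom for other processes. Adjust to taste.
    """
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    if vram_gb >= 1:
        return "base"
    return "tiny"

print(pick_model(12))  # large-v3
print(pick_model(6))   # medium
```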

Installation

# Standard Python package
pip install openai-whisper

# For CUDA acceleration (NVIDIA GPU — recommended)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install openai-whisper

# Faster Whisper (C++ backend — 4x faster, same accuracy)
pip install faster-whisper

# ffmpeg is required for audio extraction
# Ubuntu/Debian:
sudo apt install ffmpeg

# macOS:
brew install ffmpeg

Models are downloaded automatically on first use and cached to ~/.cache/whisper/.
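The pip package also installs a `whisper` command-line tool, so you can transcribe without writing any Python:

```shell
# Transcribe and write SRT captions next to the input file
whisper interview.mp3 --model medium --language en --output_format srt

# Write every supported format (txt, vtt, srt, tsv, json) to a directory
whisper interview.mp3 --model medium --output_format all --output_dir ./transcripts
```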

Basic Transcription

import whisper

# Load model (downloads on first run)
model = whisper.load_model("medium")

# Transcribe a file
result = model.transcribe("interview.mp4")

# Access the transcript
print(result["text"])          # full text
print(result["language"])      # detected language
print(result["segments"])      # segment-level timestamps

# Save to file
with open("transcript.txt", "w") as f:
    f.write(result["text"])

Whisper handles video files directly — it extracts audio internally via ffmpeg. You don't need to pre-process.

Timestamped Output (SRT/VTT)

For captions:

import whisper

def format_timestamp(seconds: float) -> str:
    """Convert seconds to SRT timestamp format HH:MM:SS,mmm"""
    milliseconds = int(seconds * 1000)
    hours, remainder = divmod(milliseconds, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, milliseconds = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


def transcribe_to_srt(video_path: str, model_size: str = "medium") -> str:
    """Transcribe video and return SRT caption content."""
    model = whisper.load_model(model_size)
    
    result = model.transcribe(
        video_path,
        word_timestamps=True,   # enable word-level timing
        verbose=False
    )
    
    srt_lines = []
    for i, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        
        srt_lines.append(f"{i}")
        srt_lines.append(f"{start} --> {end}")
        srt_lines.append(text)
        srt_lines.append("")  # blank line between entries
    
    return "\n".join(srt_lines)


# Generate SRT file
srt_content = transcribe_to_srt("podcast_episode.mp4", "medium")
with open("podcast_episode.srt", "w") as f:
    f.write(srt_content)

print("SRT captions saved")
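Some players and web embeds want WebVTT instead of SRT. The two formats are nearly identical: a VTT file starts with a `WEBVTT` header line, and timestamps use a dot rather than a comma before the milliseconds. A minimal converter for SRT text like the function above produces:

```python
def srt_to_vtt(srt_content: str) -> str:
    """Convert SRT caption text to WebVTT.

    WebVTT needs a leading WEBVTT header line and uses '.' instead
    of ',' as the millisecond separator in cue timestamps.
    """
    lines = []
    for line in srt_content.splitlines():
        if "-->" in line:
            line = line.replace(",", ".")
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines)

vtt = srt_to_vtt("1\n00:00:01,000 --> 00:00:03,500\nHello there\n")
print(vtt.splitlines()[0])  # WEBVTT
```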

Faster Whisper (4x Speed Boost)

faster-whisper uses CTranslate2 — a C++ inference engine — instead of PyTorch. Same accuracy, significantly faster:

from faster_whisper import WhisperModel

# Load model (float16 for GPU, int8 for CPU)
model = WhisperModel("medium", device="cuda", compute_type="float16")

# Transcribe
segments, info = model.transcribe("video.mp4", beam_size=5)

print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s] {segment.text}")

For batch processing on GPU, faster-whisper with float16 is 2-4x faster than standard Whisper with virtually identical accuracy.

Accuracy Tips

Preprocessing Audio

Bad audio = bad transcript. Fix it before transcribing:

# Remove background noise with ffmpeg
ffmpeg -i input.mp4 -af "highpass=f=200,lowpass=f=3000,anlmdn=s=7" clean.mp4

# Normalize loudness (helps Whisper with quiet recordings)
ffmpeg -i input.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1.5" normalized.mp4

# Extract audio only (faster than transcribing full video)
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
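Those ffmpeg calls are easy to script from Python. A small sketch that mirrors the audio-extraction command above (ffmpeg must already be on your PATH):

```python
import subprocess

def build_extract_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build the ffmpeg command to extract 16 kHz mono WAV audio."""
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz, Whisper's native sample rate
        "-ac", "1",              # mono
        audio_path,
    ]

def extract_audio(video_path: str, audio_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(build_extract_cmd(video_path, audio_path), check=True)
```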

Transcription Options That Improve Accuracy

result = model.transcribe(
    "audio.wav",
    
    # Language hint (skip auto-detection if you know the language)
    language="en",
    
    # Initial prompt seeds the context — helps with jargon, names, formatting
    initial_prompt="This is a podcast about AI automation and software development.",
    
    # Temperature: lower = more conservative (fewer hallucinations)
    temperature=0.0,
    
    # Condition on previous text (helps coherence across long files)
    condition_on_previous_text=True,
    
    # Word timestamps for precise caption timing
    word_timestamps=True,
    
    # Suppress non-speech tokens (reduces hallucinated symbols/punctuation)
    suppress_tokens="-1",  # "-1" = the default set of non-speech tokens
    
    # Verbose progress logging
    verbose=True
)

The Initial Prompt Technique

Whisper's biggest weakness is proper nouns — names, technical terms, brand names. You can pre-load context with an initial prompt:

# For a tech podcast episode
result = model.transcribe(
    "episode.mp4",
    initial_prompt=(
        "Discussion about OpenAI, Anthropic, Claude, GPT-4, NEPA AI, "
        "LangChain, Playwright, Ollama. Tech startup founders."
    )
)

Whisper uses this as a context window, dramatically improving recognition of terms it would otherwise get wrong.
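If you keep a glossary per show or client, you can build that prompt automatically. One caveat: Whisper only keeps roughly the last 224 tokens of the initial prompt, so cap its length. A sketch, with a character budget as a rough stand-in for the token limit:

```python
def build_initial_prompt(terms: list[str], max_chars: int = 800) -> str:
    """Join glossary terms into an initial prompt for Whisper.

    Whisper truncates the prompt to roughly its last 224 tokens, so we
    keep the tail within a rough character budget rather than counting
    tokens exactly.
    """
    prompt = "Vocabulary: " + ", ".join(terms) + "."
    return prompt[-max_chars:] if len(prompt) > max_chars else prompt

prompt = build_initial_prompt(["OpenAI", "Anthropic", "LangChain", "Ollama"])
print(prompt)  # Vocabulary: OpenAI, Anthropic, LangChain, Ollama.
```

Then pass it through: `model.transcribe("episode.mp4", initial_prompt=prompt)`.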

Batch Processing Pipeline

For processing large volumes of video content:

import json
from datetime import datetime
from pathlib import Path
from faster_whisper import WhisperModel

def batch_transcribe(input_dir: str, output_dir: str, model_size: str = "medium"):
    """
    Transcribe all video/audio files in a directory.
    Saves JSON (with timestamps) + TXT (plain text) for each file.
    """
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    extensions = {".mp4", ".mov", ".mp3", ".wav", ".m4a", ".webm"}
    files = [f for f in input_path.iterdir() if f.suffix.lower() in extensions]
    
    print(f"Loading Whisper {model_size} model...")
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    
    for i, video_file in enumerate(files):
        print(f"\n[{i+1}/{len(files)}] Transcribing: {video_file.name}")
        
        stem = video_file.stem
        
        # Check if already processed
        if (output_path / f"{stem}.json").exists():
            print(f"  Skipping (already processed)")
            continue
        
        segments, info = model.transcribe(
            str(video_file),
            language="en",
            beam_size=5,
            word_timestamps=True
        )
        
        all_segments = list(segments)  # consume generator
        full_text = " ".join(s.text for s in all_segments).strip()
        
        # Save JSON with full data
        data = {
            "file": video_file.name,
            "language": info.language,
            "duration": info.duration,
            "transcribed_at": datetime.now().isoformat(),
            "text": full_text,
            "segments": [
                {"start": s.start, "end": s.end, "text": s.text}
                for s in all_segments
            ]
        }
        
        (output_path / f"{stem}.json").write_text(json.dumps(data, indent=2))
        (output_path / f"{stem}.txt").write_text(full_text)
        
        print(f"  ✓ Duration: {info.duration:.0f}s | Language: {info.language}")
    
    print(f"\nDone. Transcripts saved to {output_dir}")

batch_transcribe("./footage/", "./transcripts/", model_size="medium")
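Once those JSON transcripts exist, searching your footage is just a scan over the saved segments. A minimal sketch against the JSON layout written by the pipeline above:

```python
import json
from pathlib import Path

def search_transcripts(transcript_dir: str, keyword: str) -> list[dict]:
    """Find every segment containing keyword across saved JSON transcripts."""
    hits = []
    for json_file in Path(transcript_dir).glob("*.json"):
        data = json.loads(json_file.read_text())
        for seg in data["segments"]:
            if keyword.lower() in seg["text"].lower():
                hits.append({
                    "file": data["file"],
                    "start": seg["start"],
                    "text": seg["text"].strip(),
                })
    return hits

for hit in search_transcripts("./transcripts/", "pricing"):
    print(f"{hit['file']} @ {hit['start']:.0f}s: {hit['text']}")
```

Each hit carries the source file and start time, so you can jump straight to that moment in your editor.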

Memory Management on GPU

For long files on limited VRAM, use faster-whisper: int8 quantization shrinks the model's memory footprint, and its batched pipeline processes voice-activity chunks with a tunable batch size:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# int8 weights need far less VRAM than float16
model = WhisperModel("medium", device="cuda", compute_type="int8_float16")

# Batched inference: higher batch_size = faster but more VRAM
batched = BatchedInferencePipeline(model=model)
segments, info = batched.transcribe("long_video.mp4", batch_size=4)

for segment in segments:
    print(f"[{segment.start:.1f}s] {segment.text}")

The NEPA AI Video Workspace includes Whisper integration at its core: automatic transcription on import, SRT generation, keyword extraction, and viral moment detection based on transcript analysis. Your footage pipeline starts with a transcript.

→ Get the AI Video Workspace at /shop/video-workspace

Every video transcribed, searchable, and ready for editing — automatically.