How to Run Whisper Locally for Free Transcription (Full Setup Guide)
OpenAI's Whisper model runs entirely locally — no API key, no per-minute fees. This guide covers every model size, GPU acceleration, accuracy tips, and how to integrate it into a video processing pipeline.
OpenAI Whisper is the transcription model that changed the industry when it dropped in 2022. It runs locally on consumer hardware, supports 99 languages, handles background noise surprisingly well, and is completely free. The only cost is compute time.
If you're still paying per-minute for transcription — whether that's AWS Transcribe, Rev, or the OpenAI API — you should read this.
Model Sizes and Tradeoffs
Whisper ships in five sizes. Picking the right one matters:
| Model | Parameters | VRAM | Speed (RTX 3090) | WER (English) | Best For |
|---|---|---|---|---|---|
| tiny | 39M | under 1GB | ~32x real-time | ~15% | Live captions, drafts |
| base | 74M | under 1GB | ~16x real-time | ~12% | Fast workflows |
| small | 244M | ~2GB | ~8x real-time | ~8% | Good balance |
| medium | 769M | ~5GB | ~4x real-time | ~5% | Production quality |
| large-v3 | 1.5B | ~10GB | ~2x real-time | ~3% | Maximum accuracy |
WER = Word Error Rate. Lower is better. large-v3 at 3% means roughly 3 mistakes per 100 words on clean audio.
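To make the metric concrete: WER is word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level edit distance (Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```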
For most content creators, medium is the sweet spot: fast enough to process an hour of footage in 15 minutes, accurate enough to publish captions without heavy editing.
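If your pipeline runs on machines with different GPUs, the table translates into a trivial helper (a sketch: the thresholds are the approximate VRAM figures above, not hard limits, and `pick_model` is a name invented here; the strings match `whisper.load_model()` names):

```python
def pick_model(vram_gb: float) -> str:
    """Return the largest Whisper model that fits in the given VRAM (GB),
    per the rough requirements in the table above."""
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    if vram_gb >= 1:
        return "base"
    return "tiny"

print(pick_model(8))  # medium fits; large-v3 does not
```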
Installation
```bash
# Standard Python package
pip install openai-whisper

# For CUDA acceleration (NVIDIA GPU — recommended)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install openai-whisper

# Faster Whisper (C++ backend — 4x faster, same accuracy)
pip install faster-whisper

# ffmpeg is required for audio extraction
sudo apt install ffmpeg   # Ubuntu/Debian
brew install ffmpeg       # macOS
```
Models are downloaded automatically on first use and cached to ~/.cache/whisper/.
Basic Transcription
```python
import whisper

# Load model (downloads on first run)
model = whisper.load_model("medium")

# Transcribe a file
result = model.transcribe("interview.mp4")

# Access the transcript
print(result["text"])      # full text
print(result["language"])  # detected language
print(result["segments"])  # segment-level timestamps (start, end, text)

# Save to file
with open("transcript.txt", "w") as f:
    f.write(result["text"])
```
Whisper handles video files directly — it extracts audio internally via ffmpeg. You don't need to pre-process.
Timestamped Output (SRT/VTT)
For captions:
```python
import whisper
import datetime

def format_timestamp(seconds: float) -> str:
    """Convert seconds to SRT timestamp format HH:MM:SS,mmm"""
    td = datetime.timedelta(seconds=seconds)
    hours, remainder = divmod(int(td.total_seconds()), 3600)
    minutes, secs = divmod(remainder, 60)
    milliseconds = int((td.total_seconds() % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"

def transcribe_to_srt(video_path: str, model_size: str = "medium") -> str:
    """Transcribe video and return SRT caption content."""
    model = whisper.load_model(model_size)
    result = model.transcribe(
        video_path,
        word_timestamps=True,  # enable word-level timing
        verbose=False
    )
    srt_lines = []
    for i, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        srt_lines.append(f"{i}")
        srt_lines.append(f"{start} --> {end}")
        srt_lines.append(text)
        srt_lines.append("")  # blank line between entries
    return "\n".join(srt_lines)

# Generate SRT file
srt_content = transcribe_to_srt("podcast_episode.mp4", "medium")
with open("podcast_episode.srt", "w") as f:
    f.write(srt_content)
print("SRT captions saved")
```
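For web players, WebVTT is nearly the same format as SRT: a `WEBVTT` header, plus dots instead of commas in the timestamps. A small converter covers it (a sketch; `srt_to_vtt` is a helper name invented here):

```python
def srt_to_vtt(srt_content: str) -> str:
    """Convert SRT captions to WebVTT: prepend the WEBVTT header and
    swap the decimal comma for a dot on timestamp lines."""
    vtt_lines = ["WEBVTT", ""]
    for line in srt_content.splitlines():
        if "-->" in line:
            line = line.replace(",", ".")
        vtt_lines.append(line)
    return "\n".join(vtt_lines)

vtt_content = srt_to_vtt("1\n00:00:00,000 --> 00:00:02,500\nHello there\n")
```

WebVTT allows the numeric cue identifiers SRT uses, so everything except the header and timestamps can pass through unchanged.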
Faster Whisper (4x Speed Boost)
faster-whisper uses CTranslate2 — a C++ inference engine — instead of PyTorch. Same accuracy, significantly faster:
```python
from faster_whisper import WhisperModel

# Load model (float16 for GPU, int8 for CPU)
model = WhisperModel("medium", device="cuda", compute_type="float16")

# Transcribe (returns a generator of segments plus file info)
segments, info = model.transcribe("video.mp4", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s] {segment.text}")
```
For batch processing on GPU, faster-whisper with float16 is 2-4x faster than standard Whisper with virtually identical accuracy.
Accuracy Tips
Preprocessing Audio
Bad audio = bad transcript. Fix it before transcribing:
```bash
# Remove background noise with ffmpeg
ffmpeg -i input.mp4 -af "highpass=f=200,lowpass=f=3000,anlmdn=s=7" clean.mp4

# Normalize loudness (helps Whisper with quiet recordings)
ffmpeg -i input.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1.5" normalized.mp4

# Extract audio only (faster than transcribing full video)
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```
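If you drive ffmpeg from Python rather than the shell, it is safer to build an argument list than to interpolate strings into a command (a sketch; `ffmpeg_extract_cmd` is a helper invented here, mirroring the extraction command above):

```python
import subprocess

def ffmpeg_extract_cmd(video_path: str, wav_path: str) -> list[str]:
    """Argument list for ffmpeg: strip video, 16-bit PCM WAV, 16 kHz mono
    (Whisper resamples everything to 16 kHz mono internally anyway)."""
    return [
        "ffmpeg", "-y", "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        wav_path,
    ]

# To actually run it:
# subprocess.run(ffmpeg_extract_cmd("video.mp4", "audio.wav"), check=True)
```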
Transcription Options That Improve Accuracy
```python
result = model.transcribe(
    "audio.wav",
    # Language hint (skip auto-detection if you know the language)
    language="en",
    # Initial prompt seeds the context — helps with jargon, names, formatting
    initial_prompt="This is a podcast about AI automation and software development.",
    # Temperature: lower = more conservative (fewer hallucinations)
    temperature=0.0,
    # Condition on previous text (helps coherence across long files)
    condition_on_previous_text=True,
    # Word timestamps for precise caption timing
    word_timestamps=True,
    # Suppress tokens: -1 expands to the default set of non-speech tokens,
    # which prevents common hallucinated symbols
    suppress_tokens=[-1],
    # Verbose progress logging
    verbose=True
)
```
The Initial Prompt Technique
Whisper's biggest weakness is proper nouns — names, technical terms, brand names. You can pre-load context with an initial prompt:
```python
# For a tech podcast episode
result = model.transcribe(
    "episode.mp4",
    initial_prompt=(
        "Discussion about OpenAI, Anthropic, Claude, GPT-4, NEPA AI, "
        "LangChain, Playwright, Ollama. Tech startup founders."
    )
)
```
Whisper uses this as a context window, dramatically improving recognition of terms it would otherwise get wrong.
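One caveat worth baking into your tooling: Whisper keeps only roughly the last 224 tokens of the initial prompt, so an overly long term list gets silently truncated from the front. A sketch of a prompt builder that drops low-priority terms first (`build_initial_prompt` and the character budget are assumptions for illustration, not a Whisper API):

```python
def build_initial_prompt(terms: list[str], char_budget: int = 600) -> str:
    """Join domain terms into an initial_prompt string. Whisper keeps only
    the tail of a long prompt, so put high-priority terms at the END of
    the list; if trimming is needed, drop terms from the front."""
    kept = list(terms)
    # +1 accounts for the trailing period
    while kept and len(", ".join(kept)) + 1 > char_budget:
        kept.pop(0)  # drop lowest-priority (front) terms first
    return ", ".join(kept) + "." if kept else ""
```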
Batch Processing Pipeline
For processing large volumes of video content:
```python
import json
from datetime import datetime
from pathlib import Path

from faster_whisper import WhisperModel

def batch_transcribe(input_dir: str, output_dir: str, model_size: str = "medium"):
    """
    Transcribe all video/audio files in a directory.
    Saves JSON (with timestamps) + TXT (plain text) for each file.
    """
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)

    extensions = {".mp4", ".mov", ".mp3", ".wav", ".m4a", ".webm"}
    files = [f for f in input_path.iterdir() if f.suffix.lower() in extensions]

    print(f"Loading Whisper {model_size} model...")
    model = WhisperModel(model_size, device="cuda", compute_type="float16")

    for i, video_file in enumerate(files):
        print(f"\n[{i+1}/{len(files)}] Transcribing: {video_file.name}")
        stem = video_file.stem

        # Skip files already processed in a previous run
        if (output_path / f"{stem}.json").exists():
            print("  Skipping (already processed)")
            continue

        segments, info = model.transcribe(
            str(video_file),
            language="en",
            beam_size=5,
            word_timestamps=True
        )
        all_segments = list(segments)  # consume the generator
        full_text = " ".join(s.text for s in all_segments).strip()

        # Save JSON with full data
        data = {
            "file": video_file.name,
            "language": info.language,
            "duration": info.duration,
            "transcribed_at": datetime.now().isoformat(),
            "text": full_text,
            "segments": [
                {"start": s.start, "end": s.end, "text": s.text}
                for s in all_segments
            ]
        }
        (output_path / f"{stem}.json").write_text(json.dumps(data, indent=2))
        (output_path / f"{stem}.txt").write_text(full_text)
        print(f"  ✓ Duration: {info.duration:.0f}s | Language: {info.language}")

    print(f"\nDone. Transcripts saved to {output_dir}")

batch_transcribe("./footage/", "./transcripts/", model_size="medium")
```
Memory Management on GPU
For long files on limited VRAM:
```python
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# Process very long videos in shorter chunks.
# chunk_length is a faster-whisper option (the standard openai-whisper
# transcribe() does not expose it). Default is 30s — decrease it to lower
# peak VRAM usage at a small cost in cross-chunk context.
segments, info = model.transcribe(
    "long_video.mp4",
    chunk_length=20,
)
```

For higher throughput at the cost of more VRAM, recent faster-whisper versions also ship a `BatchedInferencePipeline` that accepts a `batch_size` argument.
The NEPA AI Video Workspace includes Whisper integration at its core: automatic transcription on import, SRT generation, keyword extraction, and viral moment detection based on transcript analysis. Your footage pipeline starts with a transcript.
→ Get the AI Video Workspace at /shop/video-workspace
Every video transcribed, searchable, and ready for editing — automatically.