BMX Racer's Guide to Whisper Transcription
Whisper is a game-changer for transcription. OpenAI released it in 2022, and it's a beast: it runs locally, supports 99 languages, handles noisy audio well, and, best of all, it's free.
If you're still paying per minute for transcription, whether through AWS Transcribe, Rev, or the OpenAI API, read on.
Model Sizes & Speed
Whisper comes in five sizes. Choose wisely:
| Model | Params | VRAM | Real-Time Speed (RTX 3090) | WER (English) | Best For |
|---|---|---|---|---|---|
| tiny | 39M | <1GB | ~32x real-time | ~15% | Live captions, drafts |
| base | 74M | <1GB | ~16x real-time | ~12% | Fast workflows |
| small | 244M | ~2GB | ~8x real-time | ~8% | Good balance |
| medium | 769M | ~5GB | ~4x real-time | ~5% | Production quality |
| large-v3 | 1.5B | ~10GB | ~2x real-time | ~3% | Maximum accuracy |
WER = Word Error Rate. Lower is better. large-v3 at 3% means roughly 3 mistakes per 100 words on clean audio.
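To make the metric concrete, here is a minimal sketch of how WER is commonly computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions for illustration; published benchmarks apply their own text normalization, so their numbers will not match this exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the rider cleared the last jump"
hypothesis = "the writer cleared the jump"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

In this sample the hypothesis has one substituted word and one dropped word against a six-word reference, so the result is 2/6.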
For most creators, the medium model is the sweet spot: at roughly 4x real-time it processes an hour of footage in about 15 minutes, and it is accurate enough for publishing.
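The processing-time arithmetic is just audio length divided by the speed factor. The sketch below applies it to every row of the table; the `SPEED_FACTOR` values are the approximate RTX 3090 figures quoted there, and `estimated_minutes` is a hypothetical helper name. Treat the output as a rough planning number, since real throughput depends on your GPU and audio.

```python
# Approximate speed factors copied from the table in this guide (RTX 3090).
SPEED_FACTOR = {
    "tiny": 32,
    "base": 16,
    "small": 8,
    "medium": 4,
    "large-v3": 2,
}


def estimated_minutes(audio_minutes: float, model_size: str) -> float:
    """Estimated processing time: audio length divided by the speed factor."""
    return audio_minutes / SPEED_FACTOR[model_size]


for name in SPEED_FACTOR:
    print(f"{name:>8}: {estimated_minutes(60, name):.1f} min per hour of audio")
```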
Installation
# Standard Python package
pip install openai-whisper
# CUDA acceleration (NVIDIA GPU—recommended)
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install openai-whisper
# faster-whisper (CTranslate2 backend; up to 4x faster at comparable accuracy)
pip install faster-whisper
# ffmpeg is required for audio extraction
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
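Models download on first use. The openai-whisper package caches them in ~/.cache/whisper/, while faster-whisper pulls its converted models from the Hugging Face Hub and uses that cache instead.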
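Since a missing ffmpeg binary is a common cause of a failed first run, a quick preflight check can save some confusion. This is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper name, and it only checks that the executable is on PATH, not that it works.

```python
import shutil


def missing_tools(tools=("ffmpeg",)):
    """Return the names of required command-line tools not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Install before transcribing:", ", ".join(missing))
else:
    print("ffmpeg found on PATH")
```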
Basic Transcription
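import whisper

# Load the model once; the weights download on first use
model = whisper.load_model("medium")
result = model.transcribe("interview.mp4")

print(result["text"])      # full transcript as one string
print(result["language"])  # detected language code
print(result["segments"])  # list of dicts with start, end, and text

with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])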
Whisper works directly on video files. It extracts audio using ffmpeg.
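The segments list is usually more useful than the raw text, because each entry carries `start`, `end`, and `text` keys. The sketch below formats such entries into readable lines; `format_segments` is a hypothetical helper, and the sample data is hand-written to stand in for `result["segments"]` so the snippet runs without a model.

```python
def format_segments(segments):
    """Render segment dicts as '[start -> end] text' lines."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.2f}s -> {seg['end']:.2f}s] {seg['text'].strip()}")
    return lines


# Hand-written sample standing in for result["segments"]
sample = [
    {"start": 0.0, "end": 3.2, "text": " Welcome back to the show."},
    {"start": 3.2, "end": 7.85, "text": " Today we're talking about gate starts."},
]
for line in format_segments(sample):
    print(line)
```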
Timestamped Output (SRT/VTT)
For captions:
import whisper

def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # timedelta has no hours or minutes attributes, so split milliseconds by hand
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def transcribe_to_srt(video_path: str, model_size: str = "medium") -> str:
    model = whisper.load_model(model_size)
    # Segment timestamps are always returned; word_timestamps=True
    # additionally attaches per-word timings to each segment
    result = model.transcribe(video_path, word_timestamps=True)
    srt_lines = []
    for i, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        srt_lines.append(f"{i}")
        srt_lines.append(f"{start} --> {end}")
        srt_lines.append(text)
        srt_lines.append("")  # blank line between entries
    return "\n".join(srt_lines)

srt_content = transcribe_to_srt("podcast_episode.mp4", "medium")
with open("podcast_episode.srt", "w", encoding="utf-8") as f:
    f.write(srt_content)
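WebVTT output follows the same pattern with three differences: the file starts with a `WEBVTT` header line, cue numbers are optional, and the millisecond separator is a period instead of a comma. The sketch below is self-contained and works on any list of dicts with `start`, `end`, and `text` keys; `format_vtt_timestamp` and `segments_to_vtt` are hypothetical helper names, and the sample segments are made up.

```python
def format_vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def segments_to_vtt(segments) -> str:
    """Build a WebVTT document from dicts with start, end, and text keys."""
    lines = ["WEBVTT", ""]
    for segment in segments:
        start = format_vtt_timestamp(segment["start"])
        end = format_vtt_timestamp(segment["end"])
        lines.append(f"{start} --> {end}")
        lines.append(segment["text"].strip())
        lines.append("")  # blank line between cues
    return "\n".join(lines)


# Made-up sample standing in for result["segments"]
sample = [
    {"start": 0.0, "end": 2.5, "text": " Riders ready."},
    {"start": 3661.042, "end": 3663.0, "text": " Watch the gate."},
]
print(segments_to_vtt(sample))
```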
Faster Whisper (4x Speed Boost)
faster-whisper reimplements Whisper on CTranslate2, a C++ inference engine:
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
# segments is a generator: transcription runs as you iterate over it
segments, info = model.transcribe("video.mp4", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s] {segment.text}")
For GPU batch processing, faster-whisper with float16 typically runs 2-4x faster than openai-whisper. The exact gain depends on your hardware and model size.
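The snippet above assumes an NVIDIA GPU. On a machine without one, faster-whisper can also run on CPU with int8 quantization. A minimal sketch of switching between the two follows; `pick_device_config` and `load_model` are hypothetical helpers, and the `cuda_available` flag is something you would supply yourself (for example from your own hardware check).

```python
def pick_device_config(cuda_available: bool) -> dict:
    """Choose WhisperModel keyword arguments for the available hardware."""
    if cuda_available:
        return {"device": "cuda", "compute_type": "float16"}
    # int8 quantization keeps CPU memory use and runtime manageable
    return {"device": "cpu", "compute_type": "int8"}


def load_model(model_size: str, cuda_available: bool):
    # Imported here so the helper above works without faster-whisper installed
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, **pick_device_config(cuda_available))


print(pick_device_config(cuda_available=False))
```

Expect CPU runs to be noticeably slower than the GPU figures quoted in this guide.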
Accuracy Tips
Preprocessing Audio
Bad audio = bad transcript. Fix it:
# Remove background noise
ffmpeg -i input.mp4 -af "highpass=f=200,lowpass=f=3000,anlmdn=s=7" clean.mp4
# Normalize loudness
ffmpeg -i input.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1.5" normalized.mp4
# Extract audio only (faster than transcribing full video)
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 -f wav audio.wav
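If you want to run the extraction step from a Python pipeline, one option is to wrap the same command with `subprocess`. This is a sketch under assumptions: `build_extract_command` and `extract_audio` are hypothetical helpers, and the `-y` flag (overwrite an existing output file) is an addition not present in the command above.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, wav_path: str) -> list:
    """Build the ffmpeg command for 16 kHz mono 16-bit PCM WAV extraction."""
    return [
        "ffmpeg", "-y",          # assumption: overwrite an existing output file
        "-i", video_path,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        "-f", "wav",
        wav_path,
    ]


def extract_audio(video_path: str) -> Path:
    """Run ffmpeg and return the path of the extracted WAV file."""
    wav_path = Path(video_path).with_suffix(".wav")
    subprocess.run(build_extract_command(video_path, str(wav_path)), check=True)
    return wav_path


print(" ".join(build_extract_command("video.mp4", "audio.wav")))
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.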
Transcription Options That Improve Accuracy
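result = model.transcribe(
    "audio.wav",
    language="en",  # skip automatic language detection
    initial_prompt="This is a podcast about AI automation and software development.",
    temperature=0.0,  # deterministic decoding, no fallback to higher temperatures
    condition_on_previous_text=True,  # carry context from one window to the next
    word_timestamps=True,  # attach per-word timings to each segment
    suppress_tokens=[-1],  # -1 expands to Whisper's default set of non-speech symbols
)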
The Initial Prompt Technique
Whisper tends to misspell names and jargon it has no context for. Listing them in an initial prompt biases decoding toward those spellings:
result = model.transcribe(
    "episode.mp4",
    initial_prompt=(
        "Discussion about OpenAI, Anthropic, Claude, GPT-4, NEPA AI, LangChain, Playwright, Ollama. Tech startup founders."
    ),
)
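If you keep a glossary per show or client, the prompt can be assembled from it rather than typed by hand. The sketch below is one way to do that; `build_initial_prompt` is a hypothetical helper, and the `max_words` cap is an assumed, rough stand-in for the fact that Whisper only uses a limited window of prompt tokens, so very long glossaries would be cut off anyway.

```python
def build_initial_prompt(context: str, terms: list, max_words: int = 150) -> str:
    """Join a context sentence and a glossary into one prompt string.

    max_words is a rough, assumed stand-in for Whisper's prompt token limit.
    """
    unique_terms = list(dict.fromkeys(terms))  # drop duplicates, keep order
    words = context.split()
    kept = []
    for term in unique_terms:
        if len(words) + len(term.split()) > max_words:
            break  # stop before the prompt grows past the budget
        kept.append(term)
        words.extend(term.split())
    if not kept:
        return context
    return f"{context} Terms: {', '.join(kept)}."


prompt = build_initial_prompt(
    "Discussion between tech startup founders.",
    ["OpenAI", "Anthropic", "Claude", "LangChain", "Claude", "Ollama"],
)
print(prompt)
```

The resulting string can be passed as `initial_prompt` in the transcribe call.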
Batch Processing Pipeline
For large volumes of video:
import json
from datetime import datetime
from pathlib import Path
from faster_whisper import WhisperModel

def batch_transcribe(input_dir: str, output_dir: str, model_size: str = "medium"):
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    extensions = {".mp4", ".mov", ".mp3", ".wav", ".m4a", ".webm"}
    files = sorted(f for f in input_path.iterdir() if f.suffix.lower() in extensions)
    print(f"Loading Whisper {model_size} model...")
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    for i, video_file in enumerate(files, start=1):
        stem = video_file.stem
        # Skip files that already have a transcript so reruns can resume
        if (output_path / f"{stem}.json").exists():
            continue
        print(f"[{i}/{len(files)}] {video_file.name}")
        segments, info = model.transcribe(
            str(video_file),
            language="en",
            beam_size=5,
            word_timestamps=True,
        )
        # segments is a generator: materialize it once, because a second
        # pass over an exhausted generator would yield nothing
        segments = list(segments)
        full_text = " ".join(s.text.strip() for s in segments)
        data = {
            "file": video_file.name,
            "language": info.language,
            "duration": info.duration,
            "transcribed_at": datetime.now().isoformat(),
            "text": full_text,
            "segments": [
                {"start": s.start, "end": s.end, "text": s.text}
                for s in segments
            ],
        }
        (output_path / f"{stem}.json").write_text(json.dumps(data, indent=2), encoding="utf-8")
        (output_path / f"{stem}.txt").write_text(full_text, encoding="utf-8")
    print(f"Done. Transcripts saved to {output_dir}")

batch_transcribe("./footage/", "./transcripts/", model_size="medium")
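Once the JSON files exist, making the footage searchable takes only a few lines. The sketch below scans a transcript folder for a keyword and returns the matching segments with their start times. It assumes the JSON layout written by the pipeline above (`file` and `segments` keys, each segment with `start` and `text`); `search_transcripts` is a hypothetical helper name.

```python
import json
from pathlib import Path


def search_transcripts(transcript_dir: str, keyword: str) -> list:
    """Return (file, start_seconds, text) for every segment containing keyword."""
    needle = keyword.lower()
    hits = []
    for json_file in sorted(Path(transcript_dir).glob("*.json")):
        data = json.loads(json_file.read_text(encoding="utf-8"))
        for segment in data.get("segments", []):
            if needle in segment["text"].lower():
                hits.append((data["file"], segment["start"], segment["text"].strip()))
    return hits


for file_name, start, text in search_transcripts("./transcripts/", "whisper"):
    print(f"{file_name} @ {start:.1f}s: {text}")
```

This is a plain case-insensitive substring match, which is usually enough to jump to the right spot in an edit.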
Memory Management on GPU
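For long files, use faster-whisper's BatchedInferencePipeline, which decodes several audio chunks at once (openai-whisper's transcribe has no batch_size option). Lower batch_size if you run out of VRAM:
from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe(
    "long_video.mp4",
    chunk_length=20,  # seconds of audio per chunk
    batch_size=8,     # chunks decoded in parallel
)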
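To see what a 20-second chunk length means for a given file, the sketch below splits a duration into consecutive fixed-length windows. It is only an illustration of the idea: `chunk_windows` is a hypothetical helper, and a transcription library may pick its actual chunk boundaries differently (for example around detected speech).

```python
def chunk_windows(duration: float, chunk_length: float = 20.0) -> list:
    """Split [0, duration] into consecutive (start, end) windows in seconds."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_length, duration)
        windows.append((start, end))
        start = end
    return windows


windows = chunk_windows(65.0)
print(f"{len(windows)} chunks:", windows)
```

With a batch size of 8, up to eight of these windows would be decoded together, so peak memory grows with batch size rather than with file length.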
NEPA AI's Video Workspace has Whisper integrated: automatic transcription, SRT generation, keyword extraction, and viral moment detection. Your footage pipeline starts with a transcript.
→ Get the NEPA AI Video Workspace at axon.nepa-ai.com/shop/video-workspace
Every video transcribed, searchable, and ready for editing—automatically.



