How to Run Whisper Locally for Free Transcription (Full Setup Guide)

OpenAI's Whisper model runs entirely locally — no API key, no per-minute fees. This guide covers every model size, GPU acceleration, accuracy tips, and how to integrate it into a video processing pipeline.

By NEPA AI
NEPA AI · Building autonomous systems for creators and businesses
Tags: whisper, transcription, openai whisper, local AI, speech to text, python, ffmpeg, video editing

BMX Racer's Guide to Whisper Transcription

Whisper is a game-changer for transcription. OpenAI released it as an open-source model in 2022, and it's a beast: it runs locally, supports 99 languages, handles noisy audio well, and, best of all, it's free.

If you're still paying per-minute for transcriptions—whether that's AWS Transcribe, Rev, or the OpenAI API—read on.

Model Sizes & Speed

Whisper comes in five sizes. Choose wisely:

| Model | Params | VRAM | Real-Time Speed (RTX 3090) | WER (English) | Best For |
|---|---|---|---|---|---|
| tiny | 39M | <1GB | ~32x real-time | ~15% | Live captions, drafts |
| base | 74M | <1GB | ~16x real-time | ~12% | Fast workflows |
| small | 244M | ~2GB | ~8x real-time | ~8% | Good balance |
| medium | 769M | ~5GB | ~4x real-time | ~5% | Production quality |
| large-v3 | 1.5B | ~10GB | ~2x real-time | ~3% | Maximum accuracy |

WER = Word Error Rate. Lower is better. large-v3 at 3% means roughly 3 mistakes per 100 words on clean audio.
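
To check WER on your own footage, you can compare a Whisper transcript against a hand-corrected reference. The sketch below is a minimal word-level edit-distance calculation; the function name `word_error_rate` is illustrative, and it assumes simple lowercase whitespace tokenization (published benchmarks usually apply heavier text normalization, so numbers will not match them exactly).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One wrong word in a six-word sentence gives a WER of 1/6, or about 17%.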

For most creators, the medium model is the sweet spot: at ~4x real-time on an RTX 3090 it processes an hour of footage in about 15 minutes, and it is accurate enough for publishing.
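
The arithmetic behind that choice can be sketched as a small lookup. The numbers are copied from the table above (RTX 3090 figures, so treat them as rough on other hardware), and the helper names `estimate_minutes` and `largest_model_for` are illustrative rather than part of any library.

```python
# Approximate figures from the model table; "<1GB" is treated as 1 GB.
MODELS = [
    # (name, VRAM needed in GB, speed as a multiple of real time)
    ("tiny", 1, 32),
    ("base", 1, 16),
    ("small", 2, 8),
    ("medium", 5, 4),
    ("large-v3", 10, 2),
]

def estimate_minutes(audio_minutes: float, model_size: str) -> float:
    """Rough processing time: audio length divided by the real-time multiple."""
    speed = {name: factor for name, _, factor in MODELS}[model_size]
    return audio_minutes / speed

def largest_model_for(vram_gb: float):
    """Return the most accurate model that fits in the given VRAM, or None."""
    fitting = [name for name, need, _ in MODELS if need <= vram_gb]
    return fitting[-1] if fitting else None
```

For example, `estimate_minutes(60, "medium")` gives 15.0, and `largest_model_for(6)` picks `"medium"`.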

Installation

# Standard Python package
pip install openai-whisper

# CUDA acceleration (NVIDIA GPU, recommended): install the CUDA 12.1 build of PyTorch first
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install openai-whisper

# Faster Whisper (CTranslate2 backend, up to 4x faster at comparable accuracy)
pip install faster-whisper

# ffmpeg is required for audio extraction
# Ubuntu/Debian:
sudo apt install ffmpeg

# macOS:
brew install ffmpeg

Models download on first use and are cached in ~/.cache/whisper/.
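
Before the first run, it can help to confirm that ffmpeg is on the PATH and to see which models are already cached. This is a small sketch using only the standard library; `check_environment` is an illustrative name, and it assumes the default cache location mentioned above with models stored as `.pt` files.

```python
import shutil
from pathlib import Path

def check_environment(cache_dir: Path = Path.home() / ".cache" / "whisper") -> dict:
    """Report the ffmpeg location (None if missing) and cached model files."""
    ffmpeg_path = shutil.which("ffmpeg")
    if cache_dir.is_dir():
        cached = sorted(p.name for p in cache_dir.glob("*.pt"))
    else:
        cached = []  # nothing downloaded yet
    return {"ffmpeg": ffmpeg_path, "cached_models": cached}

if __name__ == "__main__":
    report = check_environment()
    print("ffmpeg:", report["ffmpeg"] or "NOT FOUND, install it first")
    print("cached models:", report["cached_models"] or "none yet")
```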

Basic Transcription

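import whisper

# Downloads the model on first use, then loads it from the local cache
model = whisper.load_model("medium")
result = model.transcribe("interview.mp4")

print(result["text"])      # full transcript as one string
print(result["language"])  # detected language code, e.g. "en"
print(result["segments"])  # list of dicts with start, end, and text per segment

with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write(result["text"])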

Whisper works directly on video files. It extracts audio using ffmpeg.

Timestamped Output (SRT/VTT)

For captions:

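import whisper

def format_timestamp(seconds: float) -> str:
    # SRT expects HH:MM:SS,mmm. Work in whole milliseconds to avoid rounding drift.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"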

def transcribe_to_srt(video_path: str, model_size: str = "medium") -> str:
    model = whisper.load_model(model_size)
    
    result = model.transcribe(
        video_path,
        word_timestamps=True
    )
    
    srt_lines = []
    for i, segment in enumerate(result["segments"], start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        text = segment["text"].strip()
        
        srt_lines.append(f"{i}")
        srt_lines.append(f"{start} --> {end}")
        srt_lines.append(text)
        srt_lines.append("")  # blank line between entries
    
    return "\n".join(srt_lines)

srt_content = transcribe_to_srt("podcast_episode.mp4", "medium")
with open("podcast_episode.srt", "w") as f:
    f.write(srt_content)
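
For web players that expect WebVTT instead of SRT, the same segment list can be written in VTT form. The sketch below is self-contained and assumes segments shaped like Whisper's `result["segments"]` (dicts with `start`, `end`, and `text`); the helper names are illustrative. VTT differs from SRT mainly in the `WEBVTT` header, the dot before the milliseconds, and optional cue numbers.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (VTT uses a dot, SRT uses a comma)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def segments_to_vtt(segments) -> str:
    """Build a WebVTT document from dicts with start, end, and text keys."""
    lines = ["WEBVTT", ""]  # required header followed by a blank line
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line between cues
    return "\n".join(lines)

# Usage with an openai-whisper result:
#   vtt = segments_to_vtt(result["segments"])
```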

Faster Whisper (4x Speed Boost)

faster-whisper reimplements Whisper on top of CTranslate2, a C++ inference engine, which is where the speed comes from:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

segments, info = model.transcribe("video.mp4", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s] {segment.text}")

For GPU batch processing, faster-whisper with float16 is roughly 2-4x faster than openai-whisper at the same model size.
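
The example above hard-codes `device="cuda"`, which fails on machines without an NVIDIA GPU. One possible fallback, sketched here under the assumption that CTranslate2 (installed alongside faster-whisper) is what detects the GPU, is to pick float16 on CUDA and int8 on CPU. The function name `pick_device_and_compute_type` is illustrative.

```python
def pick_device_and_compute_type():
    """Return ("cuda", "float16") when a CUDA GPU is visible, else ("cpu", "int8")."""
    try:
        import ctranslate2  # dependency of faster-whisper
        has_cuda = ctranslate2.get_cuda_device_count() > 0
    except ImportError:
        has_cuda = False  # library missing, assume CPU only
    return ("cuda", "float16") if has_cuda else ("cpu", "int8")

# Usage (requires faster-whisper to be installed):
#   from faster_whisper import WhisperModel
#   device, compute_type = pick_device_and_compute_type()
#   model = WhisperModel("medium", device=device, compute_type=compute_type)
```

int8 on CPU trades a little accuracy for a large drop in memory use and runtime; expect it to be far slower than the GPU figures quoted earlier.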

Accuracy Tips

Preprocessing Audio

Bad audio = bad transcript. Fix it:

# Remove background noise
ffmpeg -i input.mp4 -af "highpass=f=200,lowpass=f=3000,anlmdn=s=7" clean.mp4

# Normalize loudness
ffmpeg -i input.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1.5" normalized.mp4

# Extract audio only (faster than transcribing full video)
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 -f wav audio.wav
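
The three commands above can also be chained into a single ffmpeg pass from Python. This is a sketch that only builds the argument list; the filter strings are the ones shown above, while the function name `build_preprocess_cmd` and the `-y` overwrite flag are additions for illustration.

```python
import subprocess

def build_preprocess_cmd(src: str, dst: str,
                         denoise: bool = True, normalize: bool = True) -> list:
    """Build one ffmpeg command: optional denoise and loudnorm, then 16 kHz mono WAV."""
    filters = []
    if denoise:
        filters.append("highpass=f=200,lowpass=f=3000,anlmdn=s=7")
    if normalize:
        filters.append("loudnorm=I=-16:LRA=11:TP=-1.5")

    cmd = ["ffmpeg", "-y", "-i", src, "-vn"]  # -y overwrites dst, -vn drops video
    if filters:
        cmd += ["-af", ",".join(filters)]
    cmd += ["-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", "-f", "wav", dst]
    return cmd

def preprocess(src: str, dst: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_preprocess_cmd(src, dst), check=True)
```

Calling `preprocess("input.mp4", "audio.wav")` would produce a cleaned WAV ready to pass to `model.transcribe`.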

Transcription Options That Improve Accuracy

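result = model.transcribe(
    "audio.wav",
    # Skip language auto-detection when the language is known
    language="en",
    # Prime the decoder with the topic and vocabulary of the recording
    initial_prompt="This is a podcast about AI automation and software development.",
    # Deterministic decoding with no sampling
    temperature=0.0,
    # Feed earlier output back as context; set to False if the model starts repeating itself
    condition_on_previous_text=True,
    # Add per-word timings to each segment
    word_timestamps=True,
    # -1 suppresses Whisper's default set of non-speech symbol tokens
    suppress_tokens=[-1]
)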

The Initial Prompt Technique

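Whisper treats the initial prompt as text that came just before the audio, so listing names and jargon there nudges it toward spelling them correctly: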

result = model.transcribe(
    "episode.mp4",
    initial_prompt=(
        "Discussion about OpenAI, Anthropic, Claude, GPT-4, NEPA AI, LangChain, Playwright, Ollama. Tech startup founders."
    )
)
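
If you keep a glossary per show or client, the prompt can be assembled from it. The sketch below is an assumption-laden helper, not part of Whisper: `build_initial_prompt` is a made-up name, and the 150-word cap is a crude stand-in for Whisper's prompt limit (it keeps only roughly the last 224 tokens of the prompt). It keeps the tail of the prompt for the same reason.

```python
def build_initial_prompt(terms, context: str = "", max_words: int = 150) -> str:
    """Join de-duplicated glossary terms into a prompt, keeping at most max_words words."""
    seen = set()
    unique = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if key and key not in seen:  # case-insensitive de-duplication
            seen.add(key)
            unique.append(cleaned)

    prompt = ", ".join(unique) + "." if unique else ""
    if context.strip():
        prompt = f"{context.strip()} {prompt}".strip()

    # Keep the end of the prompt, since the tail is what survives truncation
    words = prompt.split()
    return " ".join(words[-max_words:])

# Usage:
#   prompt = build_initial_prompt(["OpenAI", "Anthropic", "LangChain"],
#                                 context="Discussion between tech startup founders.")
#   result = model.transcribe("episode.mp4", initial_prompt=prompt)
```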

Batch Processing Pipeline

For large volumes of video:

import json
from datetime import datetime
from pathlib import Path

from faster_whisper import WhisperModel

def batch_transcribe(input_dir: str, output_dir: str, model_size: str = "medium"):
    input_path = Path(input_dir)
    output_path = Path(output_dir)
    output_path.mkdir(exist_ok=True)
    
    extensions = {".mp4", ".mov", ".mp3", ".wav", ".m4a", ".webm"}
    files = [f for f in input_path.iterdir() if f.suffix.lower() in extensions]
    
    print(f"Loading Whisper {model_size} model...")
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    
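    for video_file in files:
        stem = video_file.stem
        
        # Skip files that already have a transcript so reruns can resume
        if (output_path / f"{stem}.json").exists():
            continue
        
        segments, info = model.transcribe(
            str(video_file),
            language="en",
            beam_size=5,
            word_timestamps=True
        )
        
        # faster-whisper yields segments lazily. Materialize the generator once,
        # otherwise the "segments" list below would come out empty.
        segments = list(segments)
        full_text = " ".join(s.text.strip() for s in segments)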
        
        data = {
            "file": video_file.name,
            "language": info.language,
            "duration": info.duration,
            "transcribed_at": datetime.now().isoformat(),
            "text": full_text,
            "segments": [
                {"start": s.start, "end": s.end, "text": s.text}
                for s in segments
            ]
        }
        
        (output_path / f"{stem}.json").write_text(json.dumps(data, indent=2))
        (output_path / f"{stem}.txt").write_text(full_text)
        
    print(f"Done. Transcripts saved to {output_dir}")

batch_transcribe("./footage/", "./transcripts/", model_size="medium")
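
Once the JSON files exist, the transcripts are searchable with a few lines of standard-library code. This sketch assumes the JSON layout written by `batch_transcribe` above (a `file` name plus a `segments` list with `start`, `end`, and `text`); the function name `search_transcripts` is illustrative.

```python
import json
from pathlib import Path

def search_transcripts(transcript_dir: str, keyword: str):
    """Return (file name, start time in seconds, text) for each segment containing keyword."""
    needle = keyword.lower()
    hits = []
    for json_file in sorted(Path(transcript_dir).glob("*.json")):
        data = json.loads(json_file.read_text(encoding="utf-8"))
        for seg in data.get("segments", []):
            if needle in seg["text"].lower():  # case-insensitive substring match
                hits.append((data.get("file", json_file.name),
                             seg["start"],
                             seg["text"].strip()))
    return hits

# Usage:
#   for name, start, text in search_transcripts("./transcripts/", "pricing"):
#       print(f"{name} @ {start:.1f}s: {text}")
```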

Memory Management on GPU

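For long files, recent faster-whisper releases include a batched pipeline that splits the audio into chunks and decodes several at once. Lower batch_size if you run out of VRAM:

from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# chunk_length (seconds) and batch_size belong to the batched pipeline;
# openai-whisper's model.transcribe does not accept them
segments, info = batched_model.transcribe(
    "long_video.mp4",
    chunk_length=20,
    batch_size=8
)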

NEPA AI's Video Workspace has Whisper integrated: automatic transcription, SRT generation, keyword extraction, and viral moment detection. Your footage pipeline starts with a transcript.

→ Get the NEPA AI Video Workspace at axon.nepa-ai.com/shop/video-workspace

Every video transcribed, searchable, and ready for editing—automatically.