Key Takeaways
- faster-whisper is 4–8× faster than original Whisper: Uses CTranslate2 with INT8 quantisation. On an RTX 4090, it transcribes a 45-minute recording in ~90 seconds (about 29× realtime) at large-v3 quality.
- SovereignScore 100/100: Audio never leaves your machine. No API key, no usage limits, no retention policy to worry about. After the one-time model download, everything runs air-gapped.
- Whisper.cpp for CPU users: If you don't have a GPU, whisper.cpp with GGML quantised models runs on any CPU. It is slower but fully functional; a Raspberry Pi 5 can transcribe meeting audio overnight.
- Real-time transcription is achievable: faster-whisper with the tiny model achieves ~200ms latency on GPU, enabling live caption generation during calls or lectures.
Introduction
Direct Answer: How do I run Whisper speech-to-text locally on Ubuntu 24.04 without the OpenAI API in 2026?
Install faster-whisper with pip install faster-whisper and transcribe any audio file with five lines of Python: from faster_whisper import WhisperModel; model = WhisperModel("large-v3", device="cuda") (use device="cpu" without a GPU); segments, info = model.transcribe("audio.mp3"); for segment in segments: print(f"[{segment.start:.1f}s] {segment.text}"). Whisper large-v3 runs on any NVIDIA GPU with 8GB+ VRAM, or more slowly on CPU. The model is downloaded once from Hugging Face (3.1GB for large-v3) and cached locally, so subsequent runs are fully offline. To run Whisper as a REST service, whisper-asr-webservice provides an OpenAI-compatible /v1/audio/transcriptions endpoint; existing OpenAI Whisper API clients only need their base URL pointed at it.
Part 1: Installation
# Install faster-whisper and audio dependencies
pip install faster-whisper --break-system-packages
# Audio processing (for converting formats)
sudo apt-get install -y ffmpeg
# Verify installation
python3 -c "import faster_whisper; print('faster-whisper:', faster_whisper.__version__)"
ffmpeg -version | head -1
Expected output:
faster-whisper: 1.1.0
ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers
Part 2: Basic Transcription
# transcribe.py: transcribe a single audio file
import time

from faster_whisper import WhisperModel

# Model options (quality vs speed):
#   "tiny"     - fastest, lowest quality (~1GB VRAM)
#   "base"     - fast, good quality (~1GB VRAM)
#   "small"    - balanced (~2GB VRAM)
#   "medium"   - high quality (~5GB VRAM)
#   "large-v3" - best quality (~6GB VRAM), recommended for important transcription
#   "turbo"    - close to large-v3 quality with ~8x faster inference (~6GB VRAM)
MODEL_SIZE = "large-v3"
AUDIO_FILE = "meeting.mp3"  # Any format ffmpeg supports

print(f"Loading {MODEL_SIZE} model...")
model = WhisperModel(
    MODEL_SIZE,
    device="cuda",        # "cuda" for GPU, "cpu" for CPU-only
    compute_type="int8",  # INT8 quantisation: faster and lighter on VRAM, minimal quality loss
    # compute_type="float16",  # Higher quality, uses more VRAM
)
print("Model loaded.")

print(f"Transcribing: {AUDIO_FILE}")
start = time.time()
segments, info = model.transcribe(
    AUDIO_FILE,
    beam_size=5,
    language="en",         # Set to None for auto-detection
    word_timestamps=True,  # Per-word timestamps (useful for subtitles)
    vad_filter=True,       # Voice Activity Detection: skip silence
    vad_parameters=dict(min_silence_duration_ms=500),
)

# Process segments (the generator does the actual transcription as it is consumed)
transcript = []
for segment in segments:
    line = f"[{segment.start:6.1f}s → {segment.end:6.1f}s] {segment.text.strip()}"
    print(line)
    transcript.append(segment.text.strip())

elapsed = time.time() - start
print(f"\nDuration: {info.duration:.1f}s audio | Transcribed in: {elapsed:.1f}s")
print(f"Speed: {info.duration / elapsed:.1f}x realtime")
print(f"Language detected: {info.language} (probability: {info.language_probability:.2f})")

# Save transcript
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(transcript))
print("Saved: transcript.txt")
python3 transcribe.py
Expected output (45-minute meeting on RTX 4090):
Loading large-v3 model...
Model loaded.
Transcribing: meeting.mp3
[ 0.0s → 3.2s] Good morning everyone, thanks for joining the call.
[ 3.2s → 8.1s] Let's start by reviewing the Q2 roadmap from last week.
[ 8.1s → 12.4s] We have three main priorities this quarter...
...
Duration: 2700.0s audio | Transcribed in: 92.3s
Speed: 29.3x realtime
Language detected: en (probability: 0.99)
Saved: transcript.txt
29× realtime — a 45-minute audio file transcribes in about 90 seconds.
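The realtime factor is simply audio duration divided by wall-clock time, and the same ratio can be used to project how long other recordings might take. The sketch below uses the figures from the run above; the helper names are illustrative, and the projection assumes throughput stays constant across files, which real audio will not guarantee.

```python
def realtime_factor(audio_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


def projected_seconds(audio_seconds: float, factor: float) -> float:
    """Estimated wall-clock time for a recording at a given realtime factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# Figures from the run above: 2700.0 s of audio transcribed in 92.3 s.
factor = realtime_factor(2700.0, 92.3)
print(f"{factor:.1f}x realtime")                     # 29.3x realtime
print(f"{projected_seconds(3600.0, factor):.0f} s")  # 123 s for a 1-hour file
```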
Part 3: Batch Transcription Script
# batch_transcribe.py: transcribe all audio files in a directory
import json
import time
from pathlib import Path

from faster_whisper import WhisperModel

AUDIO_DIR = Path("./audio")
OUTPUT_DIR = Path("./transcripts")
OUTPUT_DIR.mkdir(exist_ok=True)
AUDIO_EXTENSIONS = {".mp3", ".mp4", ".m4a", ".wav", ".ogg", ".flac", ".mkv", ".webm"}

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

audio_files = [f for f in AUDIO_DIR.iterdir() if f.suffix.lower() in AUDIO_EXTENSIONS]
print(f"Found {len(audio_files)} audio files")

for i, audio_path in enumerate(sorted(audio_files), 1):
    out_txt = OUTPUT_DIR / (audio_path.stem + ".txt")
    out_json = OUTPUT_DIR / (audio_path.stem + ".json")

    if out_txt.exists():
        print(f"[{i}/{len(audio_files)}] Skipping (already done): {audio_path.name}")
        continue

    print(f"[{i}/{len(audio_files)}] Processing: {audio_path.name}")
    start = time.time()
    segments, info = model.transcribe(
        str(audio_path),
        beam_size=5,
        vad_filter=True,
    )
    seg_list = list(segments)  # Consume the generator
    elapsed = time.time() - start

    # Plain text
    out_txt.write_text("\n".join(s.text.strip() for s in seg_list), encoding="utf-8")

    # JSON with timestamps (useful for subtitle generation)
    out_json.write_text(json.dumps([
        {"start": s.start, "end": s.end, "text": s.text.strip()}
        for s in seg_list
    ], indent=2), encoding="utf-8")

    print(f"  Done: {info.duration:.0f}s audio in {elapsed:.1f}s "
          f"({info.duration/elapsed:.1f}x realtime) → {out_txt.name}")

print("Batch transcription complete.")
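The JSON files written by the batch script hold start, end and text for each segment, which is enough to build subtitles. As a minimal sketch, the code below turns that structure into SRT text. The function names and the sample data are illustrative, not part of faster-whisper.

```python
import json


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from dicts with 'start', 'end' and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, 1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Illustrative data in the same shape batch_transcribe.py writes to JSON.
example = json.loads(
    '[{"start": 0.0, "end": 3.2, "text": "Good morning everyone."},'
    ' {"start": 3.2, "end": 8.1, "text": "Let us review the roadmap."}]'
)
print(segments_to_srt(example))
```

To convert a real transcript, load one of the files from ./transcripts with json.loads and write the returned string to a .srt file next to the audio.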
Part 4: Real-Time Microphone Transcription
# realtime_transcribe.py: live captions from the microphone
# Requires: pip install sounddevice numpy --break-system-packages
import queue
import threading
import time

import numpy as np
import sounddevice as sd
from faster_whisper import WhisperModel

# Use "tiny" or "base" for low-latency live transcription
# ("large-v3" has too much latency for real-time)
model = WhisperModel("base", device="cuda", compute_type="int8")

SAMPLE_RATE = 16000  # Whisper expects 16kHz
CHUNK_SECONDS = 5    # Process audio in 5-second chunks

audio_queue = queue.Queue()


def audio_callback(indata, frames, time_info, status):
    """Called by sounddevice for each audio chunk."""
    audio_queue.put(indata.copy())


def transcribe_loop():
    """Background thread: pulls chunks from queue and transcribes."""
    buffer = np.array([], dtype=np.float32)
    while True:
        # Accumulate ~5 seconds of audio
        chunk = audio_queue.get()
        buffer = np.append(buffer, chunk.flatten())

        if len(buffer) >= SAMPLE_RATE * CHUNK_SECONDS:
            audio_data = buffer[:SAMPLE_RATE * CHUNK_SECONDS]
            buffer = buffer[SAMPLE_RATE * CHUNK_SECONDS:]

            segments, _ = model.transcribe(
                audio_data,
                beam_size=1,  # Faster but lower quality
                language="en",
                vad_filter=True,
            )
            for segment in segments:
                text = segment.text.strip()
                if text:
                    print(f"[{time.strftime('%H:%M:%S')}] {text}")


# Start background transcription thread
t = threading.Thread(target=transcribe_loop, daemon=True)
t.start()

print("🎙 Live transcription started. Speak into your microphone.")
print(" Press Ctrl+C to stop.\n")

with sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype="float32",
    callback=audio_callback,
    blocksize=int(SAMPLE_RATE * 0.5),  # 0.5 second blocks
):
    try:
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\nStopped.")
Expected output:
🎙 Live transcription started. Speak into your microphone.
Press Ctrl+C to stop.
[14:32:05] Hello, this is a test of the real-time transcription system.
[14:32:11] The latency is approximately five to seven seconds.
[14:32:18] You can use this for live captions during video calls.
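The script above buffers CHUNK_SECONDS of audio before each transcription call, so the delay between speech and its caption is dominated by the chunk length rather than by model inference. The following rough sketch shows that arithmetic. The function name is illustrative, and the 0.3-second inference figure is the base-model number quoted elsewhere in this article, not something this code measures.

```python
def caption_delay(chunk_seconds: float, inference_seconds: float) -> tuple[float, float]:
    """Return (best, worst) delay in seconds between speech and its caption.

    Speech at the end of a chunk waits only for inference; speech at the
    start of a chunk also waits for the rest of the chunk to be recorded.
    """
    best = inference_seconds
    worst = chunk_seconds + inference_seconds
    return best, worst


# inference_seconds=0.3 is an assumed figure taken from the article's text.
for chunk in (5.0, 3.0, 2.0):
    best, worst = caption_delay(chunk, 0.3)
    print(f"chunk={chunk:.0f}s  best~{best:.1f}s  worst~{worst:.1f}s")
```

Under these assumptions, shortening the chunk lowers the worst-case delay almost one-for-one, at the cost of giving the model less context per call.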
Part 5: REST API for Drop-In OpenAI Compatibility
Deploy Whisper as a service with an OpenAI-compatible API:
# Run whisper-asr-webservice — drop-in replacement for OpenAI Whisper API
docker run -d \
--gpus all \
-p 127.0.0.1:9000:9000 \
--name whisper-api \
--restart unless-stopped \
-e ASR_MODEL=large-v3 \
-e ASR_ENGINE=faster_whisper \
onerahmet/openai-whisper-asr-webservice:latest-gpu
# Wait for model download on first run
sleep 30
docker logs whisper-api | tail -5
Expected output:
Loading model large-v3 with device cuda...
Model loaded.
INFO: Started server process [1]
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000
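A fixed sleep 30 may be too short when the first-run model download is slow. One hedged alternative is to poll the published port until the server returns any HTTP response. The sketch below uses only the standard library; the function name is hypothetical, and the check is made at the HTTP level because a published Docker port can accept TCP connections before the application inside the container is listening.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Poll url until the server returns any HTTP response, or timeout expires."""
    # An empty ProxyHandler bypasses any configured proxy; the target is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is up.
            return True
        except OSError:
            # Connection refused, reset, or timed out: not ready yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the container started above, wait_for_http("http://127.0.0.1:9000/") would return True once the service answers and False after five minutes without a response.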
# Test with a curl request (same API as OpenAI)
curl -s http://localhost:9000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model=whisper-1 | python3 -m json.tool | grep text | head -3
Expected output:
"text": "Good morning everyone, thanks for joining the call..."
Drop-in for OpenAI Python SDK:
from openai import OpenAI

# Point to local Whisper API instead of OpenAI
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:9000/v1",
)

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )
print(transcript.text)
If you were already using the OpenAI SDK, the only change is the base_url; the api_key can be any placeholder.
Part 6: CPU-Only with Whisper.cpp
For hardware without a GPU:
# Build whisper.cpp from source
sudo apt-get install -y build-essential cmake
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build --config Release -j$(nproc)
# Download a GGML model (smaller models are faster on CPU)
bash ./models/download-ggml-model.sh base.en # ~142MB
# Or for better quality:
bash ./models/download-ggml-model.sh medium.en # ~1.5GB
# Transcribe
./build/bin/whisper-cli -m models/ggml-medium.en.bin \
-f audio.mp3 \
-l en \
--output-txt \
--output-file transcript
Expected output (AMD Ryzen 9 7950X, no GPU):
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-medium.en.bin'
...
[00:00:00.000 --> 00:00:03.200] Good morning everyone, thanks for joining the call.
...
whisper_print_timings: total time = 184823.91 ms
~3 minutes to transcribe a 45-minute meeting on CPU — acceptable for batch processing.
Troubleshooting
CUDA out of memory with large-v3 model
Fix: Switch to INT8 compute type (compute_type="int8") or a smaller model (medium). INT8 reduces VRAM from 6GB to ~3GB with minimal quality loss.
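The fallback order described in this fix can also be automated. The sketch below tries (model, compute type) pairs from most to least demanding and keeps the first that loads. It is deliberately generic: load_first_that_fits and fake_loader are hypothetical names, the loader is a stand-in rather than a call into faster-whisper, and treating RuntimeError as "out of memory" is an assumption about how the failure surfaces.

```python
def load_first_that_fits(candidates, loader):
    """Try (model_size, compute_type) pairs in order; return the first that loads.

    loader is any callable taking (model_size, compute_type). RuntimeError is
    treated as "did not fit", which is an assumption, not a documented contract.
    """
    errors = []
    for model_size, compute_type in candidates:
        try:
            return model_size, compute_type, loader(model_size, compute_type)
        except RuntimeError as exc:
            errors.append(f"{model_size}/{compute_type}: {exc}")
    raise RuntimeError("No candidate loaded:\n" + "\n".join(errors))


# Preference order: best quality first, then progressively less VRAM.
CANDIDATES = [("large-v3", "float16"), ("large-v3", "int8"), ("medium", "int8")]


def fake_loader(model_size, compute_type):
    """Stand-in for a real model constructor; pretends only 'medium' fits."""
    if model_size != "medium":
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"<model {model_size} {compute_type}>"


size, ctype, model = load_first_that_fits(CANDIDATES, fake_loader)
print(f"Loaded {size} with compute_type={ctype}")  # Loaded medium with compute_type=int8
```

With faster-whisper installed, the loader could be a lambda wrapping WhisperModel(size, device="cuda", compute_type=ct). Memory can also run out later, during transcription, so a successful load is not a guarantee.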
Audio file not found / unsupported format
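Fix: Check the file path first. faster-whisper decodes audio through PyAV, which bundles the FFmpeg libraries, so most formats work without a system ffmpeg install. If a file still fails to decode, install ffmpeg with sudo apt-get install ffmpeg and convert it first: ffmpeg -i input.m4a output.wav.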
Real-time transcription has high latency
Fix: Use the "tiny" model (~150ms inference on GPU) and reduce the chunk size to 2–3 seconds, since the chunk length sets the minimum delay before a caption can appear. Trade-off: lower accuracy on short sentences and accents.
Conclusion
Local Whisper transcription via faster-whisper is production-ready in 2026: 29× realtime on an RTX 4090, 100/100 sovereign score (audio never leaves the machine), and full API compatibility with existing OpenAI Whisper integrations. The REST API wrapper makes migration from the OpenAI API a one-line change.
Combine with Open WebUI for voice-to-text input in your local chat interface, or integrate into automation pipelines with Python for DevOps Automation.
People Also Ask
How accurate is local Whisper compared to cloud services?
Whisper large-v3 achieves word-error rates (WER) of 2–5% on clean English audio, comparable to Google Speech-to-Text and Microsoft Azure Speech. On accented English, multilingual audio, or audio with background noise, cloud services have a slight edge due to continuous training on more data. For meeting transcription with clear audio, local Whisper large-v3 is indistinguishable from cloud quality. The turbo model (released late 2024) comes close to large-v3 accuracy at roughly 8× the speed, making it the recommended default for most production use cases.
Does Whisper work in real-time for subtitles during video calls?
With the tiny or base model on a GPU, yes: per-chunk inference latency is 150–500ms, which is acceptable for live captions, although the end-to-end delay also includes the length of each audio chunk. A Zoom/Meet plugin approach: route audio through a virtual audio device, run faster-whisper on 3-second chunks, and display the results in a floating window. The base model achieves ~90% accuracy on clear speech with ~300ms latency on an RTX 3060. For production live-captioning quality, medium with ~800ms latency is the recommended minimum.
Can Whisper transcribe multiple speakers separately (diarisation)?
Whisper alone does not perform speaker diarisation (identifying who said what). Combine it with pyannote.audio for speaker diarisation: pyannote segments audio by speaker, faster-whisper transcribes each segment, then the results are merged. The pyannote/speaker-diarization-3.1 model requires a free Hugging Face account token. Combined with faster-whisper large-v3, this gives fully local speaker-attributed transcription.
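The merge step can be sketched without either library. Note that this is a slightly different approach from the one described above: instead of transcribing each diarised segment separately, it assumes the audio was transcribed once and assigns each transcript segment to the speaker whose turn overlaps it most. The tuples below are hypothetical stand-ins for faster-whisper segments and pyannote speaker turns; neither library is called.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(transcript, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    transcript: list of (start, end, text) tuples.
    turns:      list of (start, end, speaker) tuples from a diarisation step.
    """
    labelled = []
    for seg_start, seg_end, text in transcript:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(seg_start, seg_end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labelled.append((seg_start, seg_end, best_speaker, text))
    return labelled


# Hypothetical data for illustration only.
transcript = [(0.0, 3.2, "Good morning everyone."), (3.2, 8.1, "Thanks, let's begin.")]
turns = [(0.0, 3.0, "SPEAKER_00"), (3.0, 9.0, "SPEAKER_01")]
for start, end, speaker, text in assign_speakers(transcript, turns):
    print(f"[{start:5.1f}s -> {end:5.1f}s] {speaker}: {text}")
```

Segments that overlap no speaker turn are labelled "unknown" rather than guessed, which keeps gaps in the diarisation visible in the output.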
Further Reading
- How to Install Ollama and Run LLMs Locally — pair local transcription with local LLM summarisation
- Open WebUI: Install and Configure — voice input in Open WebUI uses Whisper for transcription
- Best Local LLM Models for Coding in 2026 — process transcripts with a local LLM
- Python for DevOps Automation — automate transcription pipelines with Python
Tested on: Ubuntu 24.04 LTS (RTX 4090), Ubuntu 24.04 LTS (AMD Ryzen 9 7950X CPU only), macOS Sequoia 15.4 (M3 Max). faster-whisper 1.1.0, CUDA 12.4. Last verified: April 28, 2026.