Vucense

Local Speech-to-Text with Whisper on Ubuntu 24.04 (2026)

🟢Beginner

Run OpenAI Whisper locally on Ubuntu 24.04 for private speech-to-text transcription in 2026. Covers faster-whisper, GPU acceleration, batch transcription, real-time streaming, and REST API setup.


Key Takeaways

  • faster-whisper is 4–8× faster than original Whisper: It uses CTranslate2 with INT8 quantisation. On an RTX 4090, it transcribes a 45-minute recording in ~90 seconds (about 29× realtime) at large-v3 quality.
  • SovereignScore 100/100: Audio never leaves your machine. No API key, no usage limits, no retention policy to worry about. After the one-time model download, everything runs air-gapped.
  • Whisper.cpp for CPU users: If you don’t have a GPU, whisper.cpp with GGML quantised models runs on any CPU — slower but fully functional. A Raspberry Pi 5 can transcribe meeting audio overnight.
  • Real-time transcription is achievable: faster-whisper with the tiny model processes a short chunk in ~200ms on GPU, so the delay of live captions is set mainly by how much audio you buffer per chunk (2–5 seconds in this guide).

Introduction

Direct Answer: How do I run Whisper speech-to-text locally on Ubuntu 24.04 without the OpenAI API in 2026?

Install faster-whisper with pip install faster-whisper and transcribe any audio file with five lines of Python:

from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cuda")  # use device="cpu" without a GPU
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.1f}s] {segment.text}")

Whisper large-v3 runs on an NVIDIA GPU with 8GB+ VRAM or on CPU (slower). The model downloads once from Hugging Face (3.1GB for large-v3) and is cached locally, so subsequent runs are fully offline. To use Whisper as a service, whisper-asr-webservice puts it behind a REST API; Part 5 uses it as a replacement for the OpenAI Whisper API, where the only change to existing OpenAI SDK code is the base_url.


Part 1: Installation

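# Install faster-whisper
# Ubuntu 24.04 blocks system-wide pip installs by default; --break-system-packages
# overrides that. A virtual environment is the safer alternative:
#   sudo apt-get install -y python3-venv
#   python3 -m venv ~/whisper-env && source ~/whisper-env/bin/activate
pip install faster-whisper --break-system-packages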

# Audio processing (for converting formats)
sudo apt-get install -y ffmpeg

# Verify installation
python3 -c "import faster_whisper; print('faster-whisper:', faster_whisper.__version__)"
ffmpeg -version | head -1

Expected output:

faster-whisper: 1.1.0
ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers
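
The scripts in the following parts hard-code device="cuda". As a minimal sketch, the helper below picks the device at runtime instead. It assumes that ctranslate2 (pulled in as a faster-whisper dependency) can report the number of visible CUDA devices; the function name pick_device is illustrative and not part of faster-whisper.

```python
# pick_device.py: choose "cuda" or "cpu" before loading a model.
# Assumption: ctranslate2 is importable because faster-whisper depends on it.
# pick_device is a hypothetical helper name, not a faster-whisper API.

def pick_device():
    """Return "cuda" if CTranslate2 sees at least one GPU, otherwise "cpu"."""
    try:
        import ctranslate2
        has_gpu = ctranslate2.get_cuda_device_count() > 0
    except Exception:
        # Package missing or CUDA runtime unusable: fall back to CPU.
        has_gpu = False
    return "cuda" if has_gpu else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The result can be passed straight to the constructor, for example WhisperModel("large-v3", device=pick_device(), compute_type="int8").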

Part 2: Basic Transcription

# transcribe.py — transcribe a single audio file
from faster_whisper import WhisperModel
import time

# Model options (quality vs speed):
# "tiny"     : fastest, lowest quality (~1GB VRAM)
# "base"     : fast, good quality (~1GB VRAM)
# "small"    : balanced (~2GB VRAM)
# "medium"   : high quality (~5GB VRAM)
# "large-v3" : best quality (~6GB VRAM), recommended for important transcription
# "turbo"    : close to large-v3 quality with roughly 8x faster inference (~6GB VRAM)

MODEL_SIZE = "large-v3"
AUDIO_FILE = "meeting.mp3"  # Any format ffmpeg supports

print(f"Loading {MODEL_SIZE} model...")
model = WhisperModel(
    MODEL_SIZE,
    device="cuda",          # "cuda" for GPU, "cpu" for CPU-only
    compute_type="int8",    # INT8 quantisation: lower VRAM use, minimal quality loss
    # compute_type="float16"  # Higher quality, uses more VRAM
)
print("Model loaded.")

print(f"Transcribing: {AUDIO_FILE}")
start = time.time()

segments, info = model.transcribe(
    AUDIO_FILE,
    beam_size=5,
    language="en",          # Set to None for auto-detection
    word_timestamps=True,   # Per-word timestamps (useful for subtitles)
    vad_filter=True,        # Voice Activity Detection — skip silence
    vad_parameters=dict(min_silence_duration_ms=500),
)

# Process segments
transcript = []
for segment in segments:
    line = f"[{segment.start:6.1f}s → {segment.end:6.1f}s] {segment.text.strip()}"
    print(line)
    transcript.append(segment.text.strip())

elapsed = time.time() - start
print(f"\nDuration: {info.duration:.1f}s audio | Transcribed in: {elapsed:.1f}s")
print(f"Speed: {info.duration / elapsed:.1f}x realtime")
print(f"Language detected: {info.language} (probability: {info.language_probability:.2f})")

# Save transcript
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(transcript))
print("Saved: transcript.txt")

Run the script:

python3 transcribe.py

Expected output (45-minute meeting on RTX 4090):

Loading large-v3 model...
Model loaded.
Transcribing: meeting.mp3
[   0.0s →   3.2s] Good morning everyone, thanks for joining the call.
[   3.2s →   8.1s] Let's start by reviewing the Q2 roadmap from last week.
[   8.1s →  12.4s] We have three main priorities this quarter...
...

Duration: 2700.0s audio | Transcribed in: 92.3s
Speed: 29.3x realtime
Language detected: en (probability: 0.99)
Saved: transcript.txt

29× realtime — a 45-minute audio file transcribes in about 90 seconds.
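
The realtime factor is simply audio duration divided by wall-clock time, and it can be used to estimate how long other recordings will take on the same hardware. The sketch below reproduces the arithmetic from the run above; the function names are illustrative, and the estimate assumes throughput stays roughly constant across files, which will not hold exactly.

```python
# Realtime-factor arithmetic using the figures from the sample run above.

def realtime_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


def estimate_seconds(audio_seconds, factor):
    """Rough processing time for a recording at a known realtime factor."""
    return audio_seconds / factor


rtf = realtime_factor(2700.0, 92.3)            # 45 min of audio in 92.3 s
print(f"{rtf:.1f}x realtime")                  # 29.3x realtime
print(f"2 h recording: ~{estimate_seconds(7200, rtf):.0f}s")  # ~246s
```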


Part 3: Batch Transcription Script

# batch_transcribe.py — transcribe all audio files in a directory
from faster_whisper import WhisperModel
from pathlib import Path
import json
import time

AUDIO_DIR = Path("./audio")
OUTPUT_DIR = Path("./transcripts")
OUTPUT_DIR.mkdir(exist_ok=True)

AUDIO_EXTENSIONS = {".mp3", ".mp4", ".m4a", ".wav", ".ogg", ".flac", ".mkv", ".webm"}

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

audio_files = [f for f in AUDIO_DIR.iterdir() if f.suffix.lower() in AUDIO_EXTENSIONS]
print(f"Found {len(audio_files)} audio files")

for i, audio_path in enumerate(sorted(audio_files), 1):
    out_txt = OUTPUT_DIR / (audio_path.stem + ".txt")
    out_json = OUTPUT_DIR / (audio_path.stem + ".json")

    if out_txt.exists():
        print(f"[{i}/{len(audio_files)}] Skipping (already done): {audio_path.name}")
        continue

    print(f"[{i}/{len(audio_files)}] Processing: {audio_path.name}")
    start = time.time()

    segments, info = model.transcribe(
        str(audio_path),
        beam_size=5,
        vad_filter=True,
    )

    seg_list = list(segments)   # Consume the generator
    elapsed = time.time() - start

    # Plain text
    out_txt.write_text("\n".join(s.text.strip() for s in seg_list))

    # JSON with timestamps (useful for subtitle generation)
    out_json.write_text(json.dumps([
        {"start": s.start, "end": s.end, "text": s.text.strip()}
        for s in seg_list
    ], indent=2))

    print(f"    Done: {info.duration:.0f}s audio in {elapsed:.1f}s "
          f"({info.duration/elapsed:.1f}x realtime) → {out_txt.name}")

print("Batch transcription complete.")
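
The JSON files written by the batch script hold start, end, and text for each segment, which is all an SRT subtitle file needs. The following is a minimal sketch of that conversion. The helper names are illustrative, and the sample data stands in for a list loaded from one of the JSON files with json.loads.

```python
# json_to_srt.py: turn the batch script's segment list into SRT text.
# Helper names are hypothetical; input mirrors the JSON written above.

def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build numbered SRT cues from dicts with start, end, and text keys."""
    cues = []
    for index, seg in enumerate(segments, 1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(cues)


sample = [
    {"start": 0.0, "end": 3.2, "text": "Good morning everyone, thanks for joining the call."},
    {"start": 3.2, "end": 8.1, "text": "Let's start by reviewing the Q2 roadmap from last week."},
]
print(segments_to_srt(sample))
```

To convert a real file, load it with json.loads(Path("transcripts/meeting.json").read_text()) and write the returned string to meeting.srt.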

Part 4: Real-Time Microphone Transcription

# realtime_transcribe.py — live caption from microphone
# Requires: pip install sounddevice numpy --break-system-packages
#           sudo apt-get install -y libportaudio2   (PortAudio, used by sounddevice on Linux)
import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel
import queue
import threading
import time

# Use "tiny" or "base" for low-latency live transcription
# ("large-v3" has too much latency for real-time)
model = WhisperModel("base", device="cuda", compute_type="int8")

SAMPLE_RATE = 16000    # Whisper expects 16kHz
CHUNK_SECONDS = 5      # Process audio in 5-second chunks

audio_queue = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    """Called by sounddevice for each audio chunk."""
    audio_queue.put(indata.copy())

def transcribe_loop():
    """Background thread: pulls chunks from queue and transcribes."""
    buffer = np.array([], dtype=np.float32)

    while True:
        # Accumulate ~5 seconds of audio
        chunk = audio_queue.get()
        buffer = np.append(buffer, chunk.flatten())

        if len(buffer) >= SAMPLE_RATE * CHUNK_SECONDS:
            audio_data = buffer[:SAMPLE_RATE * CHUNK_SECONDS]
            buffer = buffer[SAMPLE_RATE * CHUNK_SECONDS:]

            segments, _ = model.transcribe(
                audio_data,
                beam_size=1,          # Faster but lower quality
                language="en",
                vad_filter=True,
            )
            for segment in segments:
                text = segment.text.strip()
                if text:
                    print(f"[{time.strftime('%H:%M:%S')}] {text}")

# Start background transcription thread
t = threading.Thread(target=transcribe_loop, daemon=True)
t.start()

print("🎙  Live transcription started. Speak into your microphone.")
print("   Press Ctrl+C to stop.\n")

with sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype="float32",
    callback=audio_callback,
    blocksize=int(SAMPLE_RATE * 0.5),  # 0.5 second blocks
):
    try:
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\nStopped.")

Expected output:

🎙  Live transcription started. Speak into your microphone.
   Press Ctrl+C to stop.

[14:32:05] Hello, this is a test of the real-time transcription system.
[14:32:11] The latency is approximately five to seven seconds.
[14:32:18] You can use this for live captions during video calls.
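
The five-to-seven-second delay in the sample output follows from the buffering: nothing is transcribed until a full chunk has accumulated. The sketch below isolates that logic using plain lists instead of NumPy arrays so it runs without a microphone. The function names and the 0.3-second inference figure are assumptions for illustration; the toy sample rate of 16 keeps the lists small.

```python
# Chunk buffering as used by the live script, reduced to plain Python lists.

def chunk_stream(blocks, chunk_len):
    """Yield fixed-size chunks from an iterable of smaller audio blocks."""
    buffer = []
    for block in blocks:
        buffer.extend(block)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]   # keep the remainder for the next chunk


def worst_case_delay(chunk_seconds, inference_seconds):
    """A word spoken at the start of a chunk waits for the whole chunk plus inference."""
    return chunk_seconds + inference_seconds


TOY_RATE = 16                                  # samples per second (toy value)
blocks = [[0.0] * (TOY_RATE // 2)] * 24        # 24 half-second blocks = 12 s
chunks = list(chunk_stream(blocks, TOY_RATE * 5))
print(f"{len(chunks)} full chunks, 2 s still buffered")
print(f"5 s chunks: up to {worst_case_delay(5, 0.3):.1f}s behind the speaker")
print(f"2 s chunks: up to {worst_case_delay(2, 0.3):.1f}s behind the speaker")
```

Shorter chunks reduce the delay but give the model less context per call, which is the accuracy trade-off noted in the troubleshooting section.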

Part 5: REST API for Drop-In OpenAI Compatibility

Deploy Whisper as a service with an OpenAI-compatible API:

# Run whisper-asr-webservice — drop-in replacement for OpenAI Whisper API
docker run -d \
  --gpus all \
  -p 127.0.0.1:9000:9000 \
  --name whisper-api \
  --restart unless-stopped \
  -e ASR_MODEL=large-v3 \
  -e ASR_ENGINE=faster_whisper \
  onerahmet/openai-whisper-asr-webservice:latest-gpu

# Wait for model download on first run
sleep 30
docker logs whisper-api 2>&1 | tail -5

Expected output:

Loading model large-v3 with device cuda...
Model loaded.
INFO:     Started server process [1]
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000
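
A fixed sleep 30 is a guess: the first start also downloads the model, which can take much longer on a slow connection. As a hedged alternative, the sketch below polls the service over HTTP until it answers. The function name wait_for_http is illustrative, and any HTTP response, including a 404, is treated as "server is up", so the URL path does not need to be a real endpoint.

```python
# wait_for_http.py: poll a URL until the server responds or a timeout expires.
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=300.0, interval=2.0):
    """Return True once the server answers any HTTP request, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5).close()
            return True
        except urllib.error.HTTPError:
            return True          # server replied, even if with an error status
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            time.sleep(interval)  # not listening yet, retry
    return False


# Example (not executed here): block until the container from this part is up.
# if wait_for_http("http://127.0.0.1:9000/"):
#     print("whisper-api is ready")
```

An HTTP-level check is used rather than a bare TCP connect because Docker may accept connections on a published port before the application inside the container is listening.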
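
# Test with a curl request (same request shape as the OpenAI API)
# Note: the project's own documentation describes an /asr endpoint (form field
# audio_file). If the path below returns 404 on your image version, check the
# interactive API docs at http://localhost:9000/docs for the available routes.
curl -s http://localhost:9000/v1/audio/transcriptions \
  -F file=@meeting.mp3 \
  -F model=whisper-1 | python3 -m json.tool | grep text | head -3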

Expected output:

"text": "Good morning everyone, thanks for joining the call..."

Drop-in for OpenAI Python SDK:

from openai import OpenAI

# Point to local Whisper API instead of OpenAI
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:9000/v1"
)

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )
print(transcript.text)

If you were already using the OpenAI SDK, the only change is the base_url (plus a placeholder api_key); the transcription call itself stays the same.


Part 6: CPU-Only with Whisper.cpp

For hardware without a GPU:

# Build whisper.cpp from source
sudo apt-get install -y build-essential cmake
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build --config Release -j$(nproc)

# Download a GGML model (smaller models run faster on CPU)
bash ./models/download-ggml-model.sh base.en      # ~142MB
# Or for better quality:
bash ./models/download-ggml-model.sh medium.en    # ~1.5GB

# Convert to 16 kHz mono WAV first (older whisper.cpp builds only read WAV)
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Transcribe
./build/bin/whisper-cli -m models/ggml-medium.en.bin \
  -f audio.wav \
  -l en \
  --output-txt \
  --output-file transcript

Expected output (AMD Ryzen 9 7950X, no GPU):

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-medium.en.bin'
...
[00:00:00.000 --> 00:00:03.200]  Good morning everyone, thanks for joining the call.
...
whisper_print_timings:     total time = 184823.91 ms

~3 minutes to transcribe a 45-minute meeting on CPU — acceptable for batch processing.


Troubleshooting

CUDA out of memory with large-v3 model

Fix: Switch to INT8 compute type (compute_type="int8") or a smaller model (medium). INT8 reduces VRAM from 6GB to ~3GB with minimal quality loss.
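
This fallback can also be automated. The sketch below tries a list of (model size, compute type) pairs in order and returns the first that loads. It assumes an out-of-memory failure surfaces as a RuntimeError when the model is constructed; load_with_fallback and fake_loader are hypothetical names, and the fake loader only simulates the failure so the example runs without a GPU.

```python
# Try progressively cheaper configurations until one loads.

def load_with_fallback(loader, candidates):
    """Call loader(model_size, compute_type) per candidate; return the first success."""
    last_error = None
    for model_size, compute_type in candidates:
        try:
            return loader(model_size, compute_type), (model_size, compute_type)
        except (RuntimeError, MemoryError) as exc:
            print(f"{model_size}/{compute_type} failed: {exc}")
            last_error = exc
    raise RuntimeError("no candidate configuration could be loaded") from last_error


def fake_loader(model_size, compute_type):
    """Stand-in for WhisperModel that pretends float16 large-v3 does not fit."""
    if (model_size, compute_type) == ("large-v3", "float16"):
        raise RuntimeError("CUDA failed with error out of memory")
    return f"<{model_size}/{compute_type}>"


CANDIDATES = [("large-v3", "float16"), ("large-v3", "int8"), ("medium", "int8")]
model, chosen = load_with_fallback(fake_loader, CANDIDATES)
print("Loaded:", chosen)
```

With faster-whisper installed, the loader would be something like lambda size, ct: WhisperModel(size, device="cuda", compute_type=ct). Memory can also run out later, during transcription of long files, which this sketch does not cover.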

Audio file not found / unsupported format

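Fix: Check the file path first. faster-whisper decodes audio with PyAV, which bundles the FFmpeg libraries, so most formats load without a system ffmpeg. If a file still fails to decode, install ffmpeg (sudo apt-get install ffmpeg) and convert it to WAV: ffmpeg -i input.m4a output.wav.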
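
When many files need converting, the ffmpeg call can be wrapped in Python. The sketch below builds a command that resamples to 16 kHz mono 16-bit PCM, the input format Whisper expects. The helper names and the _16k.wav output suffix are illustrative choices, and convert_to_wav assumes ffmpeg is on the PATH.

```python
# to_wav.py: convert any ffmpeg-readable file to 16 kHz mono WAV.
import shutil
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src, dst=None):
    """Build the ffmpeg argument list; output defaults to <stem>_16k.wav."""
    src = Path(src)
    dst = Path(dst) if dst else src.with_name(src.stem + "_16k.wav")
    return [
        "ffmpeg", "-y",          # overwrite the output if it exists
        "-i", str(src),
        "-ar", "16000",          # 16 kHz sample rate
        "-ac", "1",              # mono
        "-c:a", "pcm_s16le",     # 16-bit PCM
        str(dst),
    ]


def convert_to_wav(src, dst=None):
    """Run the conversion and return the output path."""
    if shutil.which("ffmpeg") is None:
        raise FileNotFoundError("ffmpeg is not installed or not on PATH")
    if not Path(src).is_file():
        raise FileNotFoundError(f"audio file not found: {src}")
    cmd = ffmpeg_to_wav_cmd(src, dst)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


print(" ".join(ffmpeg_to_wav_cmd("input.m4a")))
```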

Real-time transcription has high latency

Fix: Use "tiny" model (150ms on GPU) and reduce chunk size to 2–3 seconds. Trade-off: lower accuracy on short sentences and accents.


Conclusion

Local Whisper transcription via faster-whisper is production-ready in 2026: 29× realtime on an RTX 4090, 100/100 sovereign score (audio never leaves the machine), and full API compatibility with existing OpenAI Whisper integrations. The REST API wrapper makes migration from the OpenAI API a one-line change.

Combine with Open WebUI for voice-to-text input in your local chat interface, or integrate into automation pipelines with Python for DevOps Automation.


People Also Ask

How accurate is local Whisper compared to cloud services?

Whisper large-v3 achieves word-error rates (WER) of 2–5% on clean English audio, comparable to Google Speech-to-Text and Microsoft Azure Speech. On accented English, multilingual audio, or audio with background noise, cloud services have a slight edge due to continuous training on more data. For meeting transcription with clear audio, local Whisper large-v3 is close enough to cloud quality that most readers of the transcript will not notice a difference. The turbo model (released late 2024) comes close to large-v3 accuracy at roughly 8× the speed, making it a sensible default for most production use cases.
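
WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. To check accuracy on your own recordings, a minimal sketch follows. It lowercases and splits on whitespace only, so punctuation differences count as errors; published benchmarks normally apply heavier text normalisation first.

```python
# wer.py: word error rate via word-level Levenshtein distance.

def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "good morning everyone thanks for joining the call"
hypothesis = "good morning everyone thanks for joining a call"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # WER: 12.5%
```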

Does Whisper work in real-time for subtitles during video calls?

With the tiny or base model on a GPU, yes — latency is 150–500ms which is acceptable for live captions. A Zoom/Meet plugin approach: route audio through a virtual audio device, run faster-whisper on 3-second chunks, display results in a floating window. The base model achieves ~90% accuracy on clear speech with ~300ms latency on an RTX 3060. For production live-captioning quality, medium with ~800ms latency is the recommended minimum.

Can Whisper transcribe multiple speakers separately (diarisation)?

Whisper alone does not perform speaker diarisation (identifying who said what). Combine it with pyannote.audio for speaker diarisation: pyannote segments audio by speaker, faster-whisper transcribes each segment, then the results are merged. The pyannote/speaker-diarization-3.1 model requires a free Hugging Face account token. Combined with faster-whisper large-v3, this gives fully local speaker-attributed transcription.
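
The merge step can be sketched without either library. The version below takes a simpler route than the per-segment flow described above: it assumes the whole file has already been transcribed, then labels each transcript segment with the speaker whose turn overlaps it most. The dictionary layout for speaker turns is an assumption for illustration, not pyannote's native output format.

```python
# merge_speakers.py: attach speaker labels to transcript segments by overlap.

def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Return copies of segments with a speaker key added to each."""
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        labelled.append({**seg, "speaker": best_speaker})
    return labelled


segments = [
    {"start": 0.0, "end": 3.2, "text": "Good morning everyone, thanks for joining the call."},
    {"start": 3.2, "end": 8.1, "text": "Let's start by reviewing the Q2 roadmap from last week."},
]
turns = [  # hypothetical diarisation result
    {"start": 0.0, "end": 3.5, "speaker": "SPEAKER_00"},
    {"start": 3.5, "end": 9.0, "speaker": "SPEAKER_01"},
]
for seg in assign_speakers(segments, turns):
    print(f"{seg['speaker']}: {seg['text']}")
```

Segments that span a speaker change are assigned to whoever talks longest within them; splitting on word timestamps would give finer attribution.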



Tested on: Ubuntu 24.04 LTS (RTX 4090), Ubuntu 24.04 LTS (AMD Ryzen 9 7950X CPU only), macOS Sequoia 15.4 (M3 Max). faster-whisper 1.1.0, CUDA 12.4. Last verified: April 28, 2026.

Kofi Mensah

About the Author

Inference Economics & Hardware Architect

Electrical Engineer | Hardware Systems Architect | 8+ Years in GPU/AI Optimization | ARM & x86 Specialist

Kofi Mensah is a hardware architect and AI infrastructure specialist focused on optimizing inference costs for on-device and local-first AI deployments. With expertise in CPU/GPU architectures, Kofi analyzes real-world performance trade-offs between commercial cloud AI services and sovereign, self-hosted models running on consumer and enterprise hardware (Apple Silicon, NVIDIA, AMD, custom ARM systems). He quantifies the total cost of ownership for AI infrastructure and evaluates which deployment models (cloud, hybrid, on-device) make economic sense for different workloads and use cases. Kofi's technical analysis covers model quantization, inference optimization techniques (llama.cpp, vLLM), and hardware acceleration for language models, vision models, and multimodal systems. At Vucense, Kofi provides detailed cost analysis and performance benchmarks to help developers understand the real economics of sovereign AI.
