
Local Speech-to-Text with Whisper on Ubuntu 24.04 (2026)

SovereignScore: 100 / 100

🟢Beginner

Run OpenAI Whisper locally on Ubuntu 24.04 for private speech-to-text transcription in 2026. Covers faster-whisper, GPU acceleration, batch transcription, real-time streaming, and REST API setup.

Current

By Kofi Mensah

Feb 3, 2026

17 min

Key Takeaways

faster-whisper (by SYSTRAN) is the recommended Whisper implementation in 2026. It runs the same model weights on CTranslate2 with optional INT8 quantisation and is typically several times faster than the original OpenAI Whisper at the same accuracy on an equivalent GPU; the 4-8x figure used in this article depends on the GPU, batch size and compute type.
Whisper large-v3 achieves word-error rates under 5% on clean English speech and covers roughly 100 languages. On a consumer GPU such as an RTX 3090 it processes 1 hour of audio in under 2 minutes at full quality. On Apple silicon such as an M2 Max, faster-whisper runs on the CPU only, because CTranslate2 has no Metal backend; whisper.cpp is the usual route to the Apple GPU.
The sovereign case for local Whisper is strong: audio is among the most sensitive inputs you can give an AI system (it may contain voice biometrics, personal conversations, and meeting content), and sending it to a hosted API puts a copy outside your control, subject to the provider's retention policy.
whisper.cpp is the alternative for machines without a GPU. It uses GGML models, optionally quantised, and runs on almost any hardware including a Raspberry Pi 5, making local transcription accessible without a graphics card.

Key Takeaways

faster-whisper is 4–8× faster than original Whisper: Uses CTranslate2 with INT8 quantisation. On an RTX 4090, it transcribes a 45-minute recording in about 90 seconds at large-v3 quality (roughly 29× realtime, see Part 2).
SovereignScore 100/100: Audio never leaves your machine. No API key, no usage limits, no retention policy to worry about. After the one-time model download, everything runs air-gapped.
Whisper.cpp for CPU users: If you don’t have a GPU, whisper.cpp with GGML quantised models runs on any CPU — slower but fully functional. A Raspberry Pi 5 can transcribe meeting audio overnight.
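Real-time transcription is achievable: faster-whisper with the tiny model needs roughly 200 ms of GPU inference per short chunk. End-to-end caption delay is dominated by the chunk length (about 5 to 7 seconds with the 5-second chunks used in Part 4), so shorten the chunks when generating live captions during calls or lectures.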

Introduction

Direct Answer: How do I run Whisper speech-to-text locally on Ubuntu 24.04 without the OpenAI API in 2026?

Install faster-whisper with pip install faster-whisper and transcribe any audio file in a few lines of Python: from faster_whisper import WhisperModel; model = WhisperModel("large-v3", device="cuda") (use device="cpu" without a GPU); segments, info = model.transcribe("audio.mp3"); for segment in segments: print(f"[{segment.start:.1f}s] {segment.text}"). Whisper large-v3 runs comfortably on an NVIDIA GPU with 8GB+ VRAM, or on CPU (slower). The model is downloaded once from Hugging Face (3.1GB for large-v3) and cached locally, so subsequent runs are fully offline. To use Whisper as a service, Part 5 runs whisper-asr-webservice behind a local REST endpoint so that existing OpenAI Whisper client code only needs to be pointed at a different base URL; check the service's /docs page to confirm which routes your image version exposes.
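The choice between device="cuda" and device="cpu" can also be made at runtime. The following is a minimal sketch, assuming the ctranslate2 package (installed as a dependency of faster-whisper) is importable and provides get_cuda_device_count(). The helper names detect_cuda_devices, pick_device and transcribe_file are hypothetical, and the compute types are one reasonable default rather than a project recommendation.

```python
def detect_cuda_devices() -> int:
    """Return the number of CUDA devices CTranslate2 can see, or 0 if unavailable."""
    try:
        import ctranslate2  # assumption: present because faster-whisper depends on it
        return int(ctranslate2.get_cuda_device_count())
    except Exception:
        # Missing package or broken CUDA runtime: treat as "no GPU".
        return 0


def pick_device(cuda_devices: int) -> dict:
    """Map a CUDA device count to WhisperModel keyword arguments."""
    if cuda_devices > 0:
        return {"device": "cuda", "compute_type": "float16"}
    return {"device": "cpu", "compute_type": "int8"}


def transcribe_file(path: str, model_size: str = "large-v3") -> None:
    """Load the model on the best available device and print each segment."""
    from faster_whisper import WhisperModel  # imported lazily; needs faster-whisper installed

    model = WhisperModel(model_size, **pick_device(detect_cuda_devices()))
    segments, _info = model.transcribe(path)
    for segment in segments:
        print(f"[{segment.start:.1f}s] {segment.text}")
```

Calling transcribe_file("audio.mp3") then behaves like the inline snippet above, but falls back to the CPU instead of failing when no GPU is visible.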

Part 1: Installation

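# Install faster-whisper in a virtual environment (recommended: Ubuntu 24.04
# marks the system Python as externally managed, see PEP 668)
sudo apt-get install -y python3-venv
python3 -m venv ~/whisper-env
source ~/whisper-env/bin/activate
pip install faster-whisper
# Alternative without a venv (can conflict with apt-managed packages):
# pip install faster-whisper --break-system-packages
# GPU use also needs the NVIDIA driver plus the cuBLAS and cuDNN libraries.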

# Audio processing (for converting formats)
sudo apt-get install -y ffmpeg

# Verify installation
python3 -c "import faster_whisper; print('faster-whisper:', faster_whisper.__version__)"
ffmpeg -version | head -1

Expected output:

faster-whisper: 1.1.0
ffmpeg version 6.1.1-3ubuntu5 Copyright (c) 2000-2023 the FFmpeg developers

Part 2: Basic Transcription

# transcribe.py — transcribe a single audio file
from faster_whisper import WhisperModel
import time

# Model options (quality vs speed; VRAM figures are rough guides that vary with compute_type):
# "tiny"     : fastest, lowest quality (~1GB VRAM)
# "base"     : fast, good quality (~1GB VRAM)
# "small"    : balanced (~2GB VRAM)
# "medium"   : high quality (~5GB VRAM)
# "large-v3" : best quality (~6GB VRAM), recommended for important transcription
# "turbo"    : large-v3-turbo, a pruned large-v3 with much faster decoding and slightly lower accuracy (~6GB VRAM)

MODEL_SIZE = "large-v3"
AUDIO_FILE = "meeting.mp3"  # Any format ffmpeg supports

print(f"Loading {MODEL_SIZE} model...")
model = WhisperModel(
    MODEL_SIZE,
    device="cuda",          # "cuda" for GPU, "cpu" for CPU-only
    compute_type="int8",    # INT8 quantisation: roughly halves VRAM, minimal quality loss
    # compute_type="float16"  # Half-precision weights on GPU, uses more VRAM
)
print("Model loaded.")

print(f"Transcribing: {AUDIO_FILE}")
start = time.time()

segments, info = model.transcribe(
    AUDIO_FILE,
    beam_size=5,
    language="en",          # Set to None for auto-detection
    word_timestamps=True,   # Per-word timestamps (useful for subtitles)
    vad_filter=True,        # Voice Activity Detection — skip silence
    vad_parameters=dict(min_silence_duration_ms=500),
)

# Process segments
transcript = []
for segment in segments:
    line = f"[{segment.start:6.1f}s → {segment.end:6.1f}s] {segment.text.strip()}"
    print(line)
    transcript.append(segment.text.strip())

elapsed = time.time() - start
print(f"\nDuration: {info.duration:.1f}s audio | Transcribed in: {elapsed:.1f}s")
print(f"Speed: {info.duration / elapsed:.1f}x realtime")
print(f"Language detected: {info.language} (probability: {info.language_probability:.2f})")

# Save transcript
with open("transcript.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(transcript))
print("Saved: transcript.txt")

python3 transcribe.py

Expected output (45-minute meeting on RTX 4090):

Loading large-v3 model...
Model loaded.
Transcribing: meeting.mp3
[   0.0s →   3.2s] Good morning everyone, thanks for joining the call.
[   3.2s →   8.1s] Let's start by reviewing the Q2 roadmap from last week.
[   8.1s →  12.4s] We have three main priorities this quarter...
...

Duration: 2700.0s audio | Transcribed in: 92.3s
Speed: 29.3x realtime
Language detected: en (probability: 0.99)
Saved: transcript.txt

29× realtime — a 45-minute audio file transcribes in about 90 seconds.
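The script above requests word_timestamps=True but only prints segment-level text. As a hedged sketch of how the per-word data could be read: when word timestamps are enabled, each segment carries a words list whose items have start, end and word attributes. The helper names iter_words and format_words are hypothetical and not part of faster-whisper.

```python
def iter_words(segments):
    """Yield (start, end, word) tuples from segments transcribed with word_timestamps=True."""
    for segment in segments:
        # segment.words is None when word timestamps were not requested.
        for word in segment.words or []:
            yield (word.start, word.end, word.word.strip())


def format_words(segments) -> list[str]:
    """Render one line per word, e.g. '0.00-0.40 Good'."""
    return [f"{start:.2f}-{end:.2f} {text}" for start, end, text in iter_words(segments)]
```

In transcribe.py this would replace the segment loop with something like: for line in format_words(segments): print(line). Word-level timing adds work during decoding, so leave word_timestamps off when segment timestamps are enough.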

Part 3: Batch Transcription Script

# batch_transcribe.py — transcribe all audio files in a directory
from faster_whisper import WhisperModel
from pathlib import Path
import json
import time

AUDIO_DIR = Path("./audio")
OUTPUT_DIR = Path("./transcripts")
OUTPUT_DIR.mkdir(exist_ok=True)

AUDIO_EXTENSIONS = {".mp3", ".mp4", ".m4a", ".wav", ".ogg", ".flac", ".mkv", ".webm"}

model = WhisperModel("large-v3", device="cuda", compute_type="int8")

audio_files = [f for f in AUDIO_DIR.iterdir() if f.suffix.lower() in AUDIO_EXTENSIONS]
print(f"Found {len(audio_files)} audio files")

for i, audio_path in enumerate(sorted(audio_files), 1):
    out_txt = OUTPUT_DIR / (audio_path.stem + ".txt")
    out_json = OUTPUT_DIR / (audio_path.stem + ".json")

    if out_txt.exists():
        print(f"[{i}/{len(audio_files)}] Skipping (already done): {audio_path.name}")
        continue

    print(f"[{i}/{len(audio_files)}] Processing: {audio_path.name}")
    start = time.time()

    segments, info = model.transcribe(
        str(audio_path),
        beam_size=5,
        vad_filter=True,
    )

    seg_list = list(segments)   # Consume the generator
    elapsed = time.time() - start

    # Plain text
    out_txt.write_text("\n".join(s.text.strip() for s in seg_list))

    # JSON with timestamps (useful for subtitle generation)
    out_json.write_text(json.dumps([
        {"start": s.start, "end": s.end, "text": s.text.strip()}
        for s in seg_list
    ], indent=2))

    print(f"    Done: {info.duration:.0f}s audio in {elapsed:.1f}s "
          f"({info.duration/elapsed:.1f}x realtime) → {out_txt.name}")

print("Batch transcription complete.")
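The JSON files written by the batch script hold start, end and text for every segment, which is enough to build subtitles. A minimal sketch of that conversion to the SRT format, using only the standard library; the function names srt_timestamp, segments_to_srt and convert_json_to_srt are hypothetical helpers, not part of faster-whisper.

```python
import json
from pathlib import Path


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT text from a list of {"start", "end", "text"} dicts."""
    blocks = []
    for index, seg in enumerate(segments, 1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    # Each block ends with a newline; joining with another gives the blank separator line.
    return "\n".join(blocks)


def convert_json_to_srt(json_path: Path) -> Path:
    """Write an .srt file next to a transcript .json and return its path."""
    segments = json.loads(json_path.read_text(encoding="utf-8"))
    srt_path = json_path.with_suffix(".srt")
    srt_path.write_text(segments_to_srt(segments), encoding="utf-8")
    return srt_path
```

Running convert_json_to_srt over every file in ./transcripts (for example with Path("./transcripts").glob("*.json")) produces one subtitle file per recording.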

Part 4: Real-Time Microphone Transcription

# realtime_transcribe.py — live caption from microphone
# Requires: pip install sounddevice numpy (inside your venv, or add --break-system-packages)
# and the PortAudio library: sudo apt-get install -y libportaudio2
import sounddevice as sd
import numpy as np
from faster_whisper import WhisperModel
import queue
import threading
import time

# Use "tiny" or "base" for low-latency live transcription
# ("large-v3" has too much latency for real-time)
model = WhisperModel("base", device="cuda", compute_type="int8")

SAMPLE_RATE = 16000    # Whisper expects 16kHz
CHUNK_SECONDS = 5      # Process audio in 5-second chunks

audio_queue = queue.Queue()

def audio_callback(indata, frames, time_info, status):
    """Called by sounddevice for each audio chunk."""
    audio_queue.put(indata.copy())

def transcribe_loop():
    """Background thread: pulls chunks from queue and transcribes."""
    buffer = np.array([], dtype=np.float32)

    while True:
        # Accumulate ~5 seconds of audio
        chunk = audio_queue.get()
        buffer = np.append(buffer, chunk.flatten())

        if len(buffer) >= SAMPLE_RATE * CHUNK_SECONDS:
            audio_data = buffer[:SAMPLE_RATE * CHUNK_SECONDS]
            buffer = buffer[SAMPLE_RATE * CHUNK_SECONDS:]

            segments, _ = model.transcribe(
                audio_data,
                beam_size=1,          # Faster but lower quality
                language="en",
                vad_filter=True,
            )
            for segment in segments:
                text = segment.text.strip()
                if text:
                    print(f"[{time.strftime('%H:%M:%S')}] {text}")

# Start background transcription thread
t = threading.Thread(target=transcribe_loop, daemon=True)
t.start()

print("🎙  Live transcription started. Speak into your microphone.")
print("   Press Ctrl+C to stop.\n")

with sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype="float32",
    callback=audio_callback,
    blocksize=int(SAMPLE_RATE * 0.5),  # 0.5 second blocks
):
    try:
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\nStopped.")

Expected output:

🎙  Live transcription started. Speak into your microphone.
   Press Ctrl+C to stop.

[14:32:05] Hello, this is a test of the real-time transcription system.
[14:32:11] The latency is approximately five to seven seconds.
[14:32:18] You can use this for live captions during video calls.
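The loop in this part cuts the stream into back-to-back 5-second chunks, so a word that straddles a boundary can be split across two transcriptions. One common mitigation, which is not part of the script above, is to let consecutive chunks overlap. The sketch below shows the windowing and the latency arithmetic only; split_with_overlap and worst_case_latency are hypothetical helpers, and text recognised twice in the overlapping region would still need de-duplicating.

```python
def split_with_overlap(samples, chunk_len: int, overlap: int):
    """Yield windows of chunk_len samples, each sharing `overlap` samples with the previous one.

    Works on any sliceable sequence, including lists and NumPy arrays.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be >= 0 and smaller than chunk_len")
    step = chunk_len - overlap
    start = 0
    while start + chunk_len <= len(samples):
        yield samples[start:start + chunk_len]
        start += step


def worst_case_latency(chunk_seconds: float, inference_seconds: float) -> float:
    """Speech at the start of a chunk waits for the chunk to fill, then for inference."""
    return chunk_seconds + inference_seconds
```

With SAMPLE_RATE = 16000, a 5-second window with 1 second of overlap is split_with_overlap(buffer, 5 * 16000, 1 * 16000). The latency helper makes the trade-off explicit: chunk length, not model speed, accounts for most of the delay seen in the expected output.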

Part 5: REST API for Drop-In OpenAI Compatibility

Deploy Whisper as a service with an OpenAI-compatible API:

# Run whisper-asr-webservice — drop-in replacement for OpenAI Whisper API
docker run -d \
  --gpus all \
  -p 127.0.0.1:9000:9000 \
  --name whisper-api \
  --restart unless-stopped \
  -e ASR_MODEL=large-v3 \
  -e ASR_ENGINE=faster_whisper \
  onerahmet/openai-whisper-asr-webservice:latest-gpu

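# The first start downloads the model (about 3GB for large-v3), which can take several minutes.
# Follow the logs until "Application startup complete" appears, then press Ctrl+C.
docker logs -f whisper-api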

Expected output:

Loading model large-v3 with device cuda...
Model loaded.
INFO:     Started server process [1]
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000
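Model download time varies with bandwidth, so a script that depends on the service can poll it until it answers. A minimal standard-library sketch; wait_for_http is a hypothetical helper, and the /docs path in the usage note is an assumption based on the service being a FastAPI application that serves its interactive documentation there.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url: str, timeout: float = 600.0, interval: float = 2.0) -> bool:
    """Poll url until the server responds; return False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status, so it is up.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, reset, or timed out: not ready yet.
            time.sleep(interval)
    return False
```

For example, wait_for_http("http://localhost:9000/docs") returns True once the application inside the container is serving requests. An HTTP check is used rather than a plain TCP connect because Docker's published port can accept connections before the application has finished loading the model.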

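# Test with a curl request (same request shape as the OpenAI API)
curl -s http://localhost:9000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model=whisper-1 | python3 -m json.tool | grep text | head -3
# If this returns 404, your image version may only expose its native route,
# documented as POST /asr with the form field audio_file. The available
# routes are listed at http://localhost:9000/docs.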

Expected output:

"text": "Good morning everyone, thanks for joining the call..."

Drop-in for OpenAI Python SDK:

from openai import OpenAI

# Point to local Whisper API instead of OpenAI
client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:9000/v1"
)

with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f
    )
print(transcript.text)

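If you were already using the OpenAI SDK, only the client construction changes (base_url plus a placeholder api_key); the transcription call itself stays the same, provided the server exposes the /v1/audio/transcriptions route.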

Part 6: CPU-Only with Whisper.cpp

For hardware without a GPU:

# Build whisper.cpp from source
sudo apt-get install -y build-essential cmake
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build --config Release -j$(nproc)

# Download a GGML model (smaller models are faster on CPU)
bash ./models/download-ggml-model.sh base.en      # ~142MB
# Or for better quality:
bash ./models/download-ggml-model.sh medium.en    # ~1.5GB

# Convert the input to 16 kHz mono 16-bit WAV, the format whisper-cli reliably accepts
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Transcribe
./build/bin/whisper-cli -m models/ggml-medium.en.bin \
  -f audio.wav \
  -l en \
  --output-txt \
  --output-file transcript

Expected output (AMD Ryzen 9 7950X, no GPU):

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-medium.en.bin'
...
[00:00:00.000 --> 00:00:03.200]  Good morning everyone, thanks for joining the call.
...
whisper_print_timings:     total time = 184823.91 ms

~3 minutes to transcribe a 45-minute meeting on CPU — acceptable for batch processing.
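For batch jobs the two command-line steps (convert with ffmpeg, then transcribe with whisper-cli) can be driven from Python. A hedged sketch using subprocess: the whisper-cli flags are the ones shown in this part, the ffmpeg flags request 16 kHz mono 16-bit PCM, and the function names and the ".16k.wav" naming are hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src: Path, dst: Path) -> list[str]:
    """ffmpeg command converting any input to 16 kHz mono 16-bit PCM WAV."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(dst)]


def whisper_cli_cmd(binary: Path, model: Path, wav: Path,
                    out_prefix: Path, language: str = "en") -> list[str]:
    """whisper-cli command that writes <out_prefix>.txt."""
    return [str(binary), "-m", str(model), "-f", str(wav),
            "-l", language, "--output-txt", "--output-file", str(out_prefix)]


def transcribe_with_whisper_cpp(src: Path, binary: Path, model: Path) -> Path:
    """Convert src to WAV, run whisper-cli on it, and return the transcript path."""
    # Use a distinct name so a .wav input is never overwritten in place.
    wav = src.with_name(src.stem + ".16k.wav")
    out_prefix = src.with_suffix("")
    subprocess.run(ffmpeg_to_wav_cmd(src, wav), check=True)
    subprocess.run(whisper_cli_cmd(binary, model, wav, out_prefix), check=True)
    return out_prefix.with_suffix(".txt")
```

A call might look like transcribe_with_whisper_cpp(Path("audio.mp3"), Path("./build/bin/whisper-cli"), Path("models/ggml-medium.en.bin")). Passing argument lists instead of a shell string avoids quoting problems with file names that contain spaces.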

Troubleshooting

`CUDA out of memory` with large-v3 model

Fix: Use the INT8 compute type (compute_type="int8"), which cuts VRAM for large-v3 from about 6GB to about 3GB with minimal quality loss. If you are already on INT8, as in the scripts above, drop to a smaller model such as medium.
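The same fallback can be automated by trying configurations from most to least demanding. This is a sketch under two assumptions that should be checked on your setup: that an out-of-memory failure while loading surfaces as a RuntimeError, and that it happens at load time rather than later during transcription. load_with_fallback, CANDIDATES and load_whisper are hypothetical names.

```python
def load_with_fallback(load_model, candidates):
    """Try each (model_size, compute_type) pair in order; return (model, pair) for the first that loads."""
    last_error = None
    for model_size, compute_type in candidates:
        try:
            return load_model(model_size, compute_type), (model_size, compute_type)
        except RuntimeError as exc:  # assumption: CUDA OOM is raised as RuntimeError
            last_error = exc
    raise RuntimeError("no candidate configuration could be loaded") from last_error


# Ordered from highest quality / VRAM use to lowest.
CANDIDATES = [
    ("large-v3", "float16"),
    ("large-v3", "int8"),
    ("medium", "int8"),
    ("small", "int8"),
]


def load_whisper(model_size: str, compute_type: str):
    """Loader for the real library; imported lazily so the helper above stays testable."""
    from faster_whisper import WhisperModel

    return WhisperModel(model_size, device="cuda", compute_type=compute_type)
```

Usage would be model, chosen = load_with_fallback(load_whisper, CANDIDATES), after which chosen records which configuration fitted into memory.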

Audio file not found / unsupported format

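Fix: Check the file path first. faster-whisper decodes audio through PyAV, which bundles the FFmpeg libraries, so most formats load without a system ffmpeg. If a file still fails to decode, or you are feeding whisper.cpp, install ffmpeg with sudo apt-get install ffmpeg and convert the file first: ffmpeg -i input.m4a output.wav.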

Real-time transcription has high latency

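Fix: Use the "tiny" model (roughly 150 to 200 ms of GPU inference per chunk) and reduce the chunk size to 2–3 seconds (CHUNK_SECONDS in Part 4), since chunk length accounts for most of the delay. Trade-off: lower accuracy on short sentences and accents.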

Conclusion

Local Whisper transcription via faster-whisper is production-ready in 2026: 29× realtime on an RTX 4090 and a 100/100 sovereign score (audio never leaves the machine). With a local REST wrapper that exposes the OpenAI-style transcription route, migrating an existing OpenAI Whisper integration comes down to changing the client's base URL.

Combine with Open WebUI for voice-to-text input in your local chat interface, or integrate into automation pipelines with Python for DevOps Automation.
