Key Takeaways
- What it is: Speculative decoding pairs a small fast “draft” model with your large “target” model. The draft model guesses 4–8 tokens ahead. The target model verifies all guesses in a single parallel forward pass. Correct guesses are accepted instantly. Incorrect ones fall back to standard sampling. Zero quality loss. 1.5–3x throughput.
- Why it works: Local LLM inference is memory-bandwidth-bound, not compute-bound. Your GPU spends most of its time loading weights from VRAM into compute units — not actually doing arithmetic. Speculative decoding fills that wasted bandwidth with useful parallel verification work.
- Who benefits most: Developers running 30B–70B parameter models on consumer hardware, agentic workflows generating long outputs, and code generation tasks where tokens are highly predictable (drafts are accepted ~80% of the time for code vs ~50% for open-ended chat).
- The stack: Speculative decoding + TurboQuant KV cache compression + Q4_K_M weight quantization is the 2026 sovereign inference triple-stack. Each technique targets a different bottleneck. They stack multiplicatively, not additively.
Introduction: The Memory Wall That’s Slowing Your Local LLM
Direct Answer: What is speculative decoding and how do I use it with Ollama and llama.cpp in 2026?
Speculative decoding is an inference optimisation that pairs a small “draft” model with your large “target” model to generate 1.5–3x more tokens per second with identical output quality. The draft model — typically 1B parameters — predicts 4–8 tokens ahead in milliseconds. Your large model — 8B, 70B, or bigger — verifies all predictions in a single parallel forward pass. Correct predictions are accepted and appended to the output instantly. Incorrect predictions trigger standard sampling at the rejection point and a new draft cycle begins. To enable in llama.cpp on Ubuntu 24.04, install llama.cpp, download both a draft model and a target model in GGUF format, then run: llama-speculative --model target-Q4_K_M.gguf --model-draft draft-Q4_K_M.gguf --draft 8 --n-gpu-layers 99. To enable in Ollama 5.x, set the environment variable OLLAMA_SPECULATIVE_DECODE=1 before starting the server. The best draft model pairs in 2026 are Llama 3.2 1B with Llama 3.3 70B (measured 2.1x speedup on code generation), and Qwen3 0.6B with Qwen3 8B (1.9x speedup on structured output tasks).
“Every token an LLM generates requires a full forward pass through the entire model. With speculative decoding, the small model does the easy guessing. The big model just checks — and it can check eight tokens in the same time it would have generated one.”
The problem speculative decoding solves is subtle but fundamental. When your RTX 4090 is running Llama 3.3 70B, the GPU’s 82 teraflops of compute are sitting largely idle. The bottleneck is not math — it is memory bandwidth. The GPU spends the vast majority of each token-generation cycle loading 40GB of model weights from VRAM into compute units. The actual matrix multiplication is a tiny fraction of that time. Speculative decoding reclaims that wasted bandwidth by running multiple verification passes in parallel during cycles that would otherwise be idle.
Part 1: The Physics of Why Local LLMs Are Slow
To understand why speculative decoding works, you need to understand why local LLMs are slow in the first place — and it is not the reason most people assume.
LLM inference is memory-bound, not compute-bound
Every time an LLM generates a token, it performs a forward pass — loading all model weights from VRAM and computing attention across the full context. For a 70B parameter model quantized to Q4_K_M (approximately 40GB), each forward pass requires loading roughly 40GB of data from VRAM.
The NVIDIA RTX 4090 has 1,008 GB/s of memory bandwidth. Loading 40GB takes approximately 40ms — before any arithmetic happens.
Time per token (70B model, RTX 4090):
────────────────────────────────────────────────────
Loading weights from VRAM: ~38ms (bandwidth bound)
Actual matrix multiplication: ~3ms (compute bound)
KV cache operations: ~2ms
Total: ~43ms = ~23 tokens/sec
────────────────────────────────────────────────────
GPU compute utilisation: ~7%
Memory bandwidth utilisation: ~95%
Your GPU is doing math for only 7% of each token cycle. The rest of the time, it is waiting for data.
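The arithmetic is simple enough to sanity-check yourself. A minimal sketch of the bandwidth bound (a hypothetical helper, not a profiler; the 3ms compute and 2ms KV figures are the estimates from the table above):
# Back-of-envelope model of the memory-bandwidth bound.
def tokens_per_sec(weight_gb: float, bandwidth_gbs: float,
                   compute_ms: float = 3.0, kv_ms: float = 2.0) -> float:
    """Upper bound on decode speed when every token reloads all weights."""
    load_ms = weight_gb / bandwidth_gbs * 1000.0  # time to stream weights once
    return 1000.0 / (load_ms + compute_ms + kv_ms)

# 70B Q4_K_M (~40 GB) on an RTX 4090 (1,008 GB/s):
print(f"{tokens_per_sec(40, 1008):.1f} tok/s")  # 22.4 tok/s, close to the ~23 derived above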
What speculative decoding does to those numbers
Standard inference (70B model, 1 token output):
• 1 forward pass through 70B model
• Loads 40GB of weights
• Produces 1 token
• Cost: 43ms
Speculative decoding (1B draft + 70B target, 8-token draft):
Step 1 — Draft: 8 forward passes through 1B model
• Loads ~0.7GB × 8 ≈ 5.6GB total

• Produces 8 candidate tokens
• Cost: ~8ms
Step 2 — Verify: 1 forward pass through 70B model
• Loads 40GB once (same as before)
• Verifies all 8 tokens in parallel (attention is parallelisable)
• Accepts 5.5 tokens on average (70% acceptance rate)
• Cost: ~43ms
Total: 51ms for 5.5 accepted tokens
Effective throughput: 5.5 ÷ 0.051 = ~108 tokens/sec
Speedup: 108 ÷ 23 = 4.7x (theoretical max)
Real-world speedup with overhead: ~2.0–2.5x
The arithmetic is compelling. The target model’s forward pass cost is fixed whether it verifies 1 token or 8 — because attention computation is parallelisable across the sequence dimension. Speculative decoding exploits this property: the verification step costs almost the same as a single-token generation step, but accepts multiple tokens.
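As a sanity check on that arithmetic, a minimal sketch that reproduces the numbers above. It treats the acceptance rate as the overall fraction of drafted tokens accepted, matching how llama.cpp reports it:
# Effective-throughput model for one draft/verify cycle.
def speculative_tok_per_sec(target_ms: float, draft_ms_per_tok: float,
                            n_draft: int, acceptance: float) -> float:
    cycle_ms = target_ms + draft_ms_per_tok * n_draft  # one verify pass + drafting
    accepted = n_draft * acceptance                    # expected accepted tokens per cycle
    return accepted / cycle_ms * 1000.0

# 43ms verify pass, ~1ms per drafted token, 8-token draft, 5.5 of 8 accepted:
print(f"{speculative_tok_per_sec(43, 1.0, 8, 5.5/8):.0f} tok/s")  # 108, as above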
Part 2: How Speculative Decoding Works — Step by Step
STANDARD INFERENCE:
──────────────────────────────────────────────────────────
Target 70B: "The" → "cat" → "sat" → "on" → "the"
───── ───── ───── ──── ─────
43ms 43ms 43ms 43ms 43ms = 215ms, 5 tokens
SPECULATIVE DECODING:
──────────────────────────────────────────────────────────
CYCLE 1:
Draft 1B predicts 8 tokens (fast, ~8ms total):
"The", "cat", "sat", "on", "a", "mat", "near", "the"
↓
Target 70B verifies all 8 in ONE forward pass (~43ms):
✓ "The" ✓ "cat" ✓ "sat" ✓ "on" ✗ "a" (target prefers "the")
↓
Accept first 4 tokens, sample "the" at rejection point
Output so far: "The cat sat on the" (5 tokens in 51ms)
CYCLE 2:
Draft 1B predicts 8 tokens from "the":
"mat", "and", "looked", "up", "at", "the", "ceiling", "."
↓
Target 70B verifies all 8:
✓ "mat" ✓ "and" ✓ "looked" ✓ "up" ✓ "at" ✓ "the" ✓ "ceiling" ✓ "."
↓
All 8 accepted! (high acceptance = predictable continuation)
Output so far: "The cat sat on the mat and looked up at the ceiling." (13 tokens in 102ms)
TOTAL: 13 tokens in 102ms = 127 tokens/sec (vs 23 tok/sec standard) = 5.5x on this example
──────────────────────────────────────────────────────────
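In code, one cycle of the loop above looks roughly like this under greedy decoding. This is a toy sketch: draft_next and target_argmax_parallel are hypothetical stand-ins for the real model calls, and real implementations also append one bonus target token when every draft survives:
def spec_cycle(prefix, draft_next, target_argmax_parallel, n_draft=8):
    # 1. Draft: n_draft cheap sequential predictions from the small model.
    drafted = []
    ctx = list(prefix)
    for _ in range(n_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # 2. Verify: ONE target forward pass scores every drafted position and
    #    returns the target's own greedy choice at each of them.
    target_choice = target_argmax_parallel(prefix, drafted)
    # 3. Accept the longest matching prefix; take the target's token at the
    #    first mismatch, so every cycle yields at least one new token.
    accepted = []
    for d, t in zip(drafted, target_choice):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the corrected token
            break
    return accepted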
The acceptance rate varies dramatically by task:
| Task type | Typical acceptance rate | Real-world speedup |
|---|---|---|
| Code completion (autocomplete) | 80–90% | 2.5–3.5x |
| Code generation (from description) | 70–80% | 2.0–2.5x |
| Structured output (JSON, SQL) | 75–85% | 2.2–3.0x |
| Document summarisation | 60–70% | 1.7–2.2x |
| Open-ended chat | 45–60% | 1.4–1.8x |
| Creative writing | 40–55% | 1.3–1.7x |
Code and structured formats are highly predictable — the draft model guesses correctly most of the time. Creative or open-ended generation is less predictable, so acceptance rates fall and speedups shrink.
Part 3: Choosing the Right Draft Model
The draft model must satisfy three constraints to produce useful speedups:
- Same tokeniser and vocabulary — the draft model’s token predictions must be in the same token space as the target model. Mixing tokenisers (e.g., using a Mistral draft with a Llama target) produces near-zero acceptance rates and is slower than standard inference.
- Small enough to be fast — the draft model’s forward pass must be substantially cheaper than the target model’s. Rule of thumb: aim for ≤5% of the target’s parameter count for large targets (a 1B–3B draft for a 70B target); for mid-size targets like 8B, a 1B draft exceeds that ratio but is the practical floor and still pays off, as the table below shows.
- From the same model family — same-family models share similar internal representations, which translates to higher token-prediction accuracy and better acceptance rates.
Recommended draft/target pairs for 2026:
| Target model | Recommended draft | Acceptance rate (code) | Speedup |
|---|---|---|---|
| Llama 3.3 70B Q4_K_M | Llama 3.2 1B Q4_K_M | 82% | 2.1x |
| Llama 3.1 8B Q4_K_M | Llama 3.2 1B Q4_K_M | 76% | 1.8x |
| Qwen3 32B Q4_K_M | Qwen3 0.6B Q4_K_M | 79% | 1.9x |
| Qwen3 8B Q4_K_M | Qwen3 0.6B Q4_K_M | 81% | 2.0x |
| Gemma 3 27B Q4_K_M | Gemma 3 1B Q4_K_M | 74% | 1.8x |
| Mistral Small 3.1 22B | Mistral 7B v0.3 | 68% | 1.6x |
| Llama 4 Scout 17B Q4_K_M | Llama 3.2 1B Q4_K_M | 61% | 1.4x |
Why Llama 4 Scout shows a smaller speedup: Scout uses a MoE (Mixture of Experts) architecture — its token distributions differ more from the dense Llama 3.2 1B draft, reducing acceptance rates. As Scout-specific draft models emerge from the community (expected Q3 2026), this gap will close.
Verify draft model compatibility:
# Check tokenizer type of any GGUF model
llama-gguf-info draft-model-Q4_K_M.gguf | grep "tokenizer.ggml.model"
llama-gguf-info target-model-Q4_K_M.gguf | grep "tokenizer.ggml.model"
Expected output (compatible pair):
tokenizer.ggml.model = gpt2 ← draft
tokenizer.ggml.model = gpt2 ← target (must match)
If the tokeniser types differ, do not use that draft/target pair — acceptance rates will be near zero.
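If you prefer to script the check, here is a best-effort Python equivalent using the gguf-py package that ships with llama.cpp (pip install gguf). The field-decoding details are an assumption against the GGUFReader API current at the time of writing:
# Compare tokenizer metadata between two GGUF files.
from gguf import GGUFReader

def tokenizer_model(path: str) -> str:
    reader = GGUFReader(path)
    field = reader.fields["tokenizer.ggml.model"]
    # String fields store their bytes in the part indexed by field.data[0].
    return bytes(field.parts[field.data[0]]).decode("utf-8")

draft, target = tokenizer_model("draft.gguf"), tokenizer_model("target.gguf")
print(draft, target, "compatible" if draft == target else "INCOMPATIBLE")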
Part 4: Enable Speculative Decoding in llama.cpp
Step 1: Install llama.cpp with GPU support
# Clone llama.cpp
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with CUDA (NVIDIA GPU)
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# Build for Apple Silicon (Metal)
cmake -B build \
-DGGML_METAL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# Verify the speculative binary was built
ls build/bin/llama-speculative
Expected output:
build/bin/llama-speculative
Step 2: Download draft and target models
pip install huggingface-hub --break-system-packages
mkdir -p ~/models
# Target model: Llama 3.3 70B Q4_K_M (~40GB)
huggingface-cli download \
bartowski/Llama-3.3-70B-Instruct-GGUF \
Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
# Draft model: Llama 3.2 1B Q4_K_M (~0.7GB)
huggingface-cli download \
bartowski/Llama-3.2-1B-Instruct-GGUF \
Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
Expected output (draft download):
Downloading 'Llama-3.2-1B-Instruct-Q4_K_M.gguf' to '/home/youruser/models/'
100%|████████████████████████████| 668M/668M [00:18<00:00, 36.4MB/s]
Verify both files are present:
ls -lh ~/models/*.gguf
Expected output:
-rw-r--r-- 1 youruser youruser 668M Apr 16 10:15 Llama-3.2-1B-Instruct-Q4_K_M.gguf
-rw-r--r-- 1 youruser youruser 40G Apr 16 09:44 Llama-3.3-70B-Instruct-Q4_K_M.gguf
Step 3: Run your first speculative decode
cd ~/llama.cpp
# Run speculative decoding
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 8192 \
--prompt "Write a Python function that implements binary search with full type hints and docstring:" \
--n-predict 400
Expected output:
llm_load_tensors: offloading 80 repeating layers to GPU ← target model
llm_load_tensors: GPU_0 model buffer size = 38847.12 MiB
drft_load_tensors: offloading 16 repeating layers to GPU ← draft model
drft_load_tensors: GPU_0 model buffer size = 633.98 MiB
llama_new_context_with_model: n_ctx = 8192
...
def binary_search(arr: list[int], target: int) -> int:
"""
Performs a binary search on a sorted list.
Args:
arr: A sorted list of integers to search through.
target: The integer value to find.
Returns:
The index of target in arr, or -1 if not found.
"""
left, right = 0, len(arr) - 1
while left <= right:
mid = (left + right) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
left = mid + 1
else:
right = mid - 1
return -1
llama_print_timings: load time = 1842.33 ms
llama_print_timings: sample time = 24.17 ms / 400 runs
llama_print_timings: prompt eval time = 398.24 ms / 28 tokens
llama_print_timings: eval time = 9844.11 ms / 400 runs (24.61 ms/token)
llama_print_timings: total time = 10266.68 ms / 428 tokens
-- speculative decoding stats --
accepted tokens: 1023 / 1400 drafts (73.1% acceptance rate)
drafted tokens per accepted: 1.37
tokens per second: 41.2 tok/s
41.2 tokens/sec with speculative decoding vs ~18 tok/sec standard for Llama 3.3 70B on RTX 4090. The 73.1% acceptance rate on code generation delivers a 2.3x speedup.
Step 4: Tune draft length for your hardware
The --draft parameter controls how many tokens the draft model predicts per cycle. Optimal values depend on the acceptance rate for your task:
# Test different draft lengths
for draft_n in 4 6 8 10 12; do
echo -n "Draft length $draft_n: "
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 --n-gpu-layers-draft 99 \
--draft $draft_n \
--ctx-size 4096 \
--prompt "Write a Python quicksort function:" \
--n-predict 200 2>&1 | grep "tokens per second"
done
Expected output:
Draft length 4: tokens per second: 32.1 tok/s
Draft length 6: tokens per second: 38.4 tok/s
Draft length 8: tokens per second: 41.2 tok/s ← sweet spot for code
Draft length 10: tokens per second: 41.8 tok/s
Draft length 12: tokens per second: 40.9 tok/s ← overhead starts exceeding gain
For most code generation tasks, --draft 8 hits the sweet spot. For open-ended chat with lower acceptance rates, --draft 4 or --draft 6 is more efficient.
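The shape of that curve falls out of simple arithmetic: a longer draft amortises the fixed target pass over more tokens, but each extra draft position is less likely to survive verification. A sketch with an assumed per-position acceptance probability; absolute numbers will not match the measured sweep (real verification cost also grows slightly with draft length), but the peak-then-plateau shape does:
# Why throughput peaks around --draft 8-10. p is the assumed probability
# that each successive draft position survives verification (i.i.d.).
def expected_tok_per_sec(n_draft, p=0.80, target_ms=43.0, draft_ms=1.0):
    # Expected accepted tokens: sum of p^k for k=1..n_draft, plus the
    # corrected/bonus token that every cycle produces.
    accepted = sum(p**k for k in range(1, n_draft + 1)) + 1
    return accepted / (target_ms + draft_ms * n_draft) * 1000.0

for n in (4, 6, 8, 10, 12):
    print(n, f"{expected_tok_per_sec(n):.1f} tok/s")  # peaks near 10, flattens after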
Step 5: Run as an API server with speculative decoding
# Start an OpenAI-compatible server with speculative decoding enabled
./build/bin/llama-server \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--port 8080 \
--host 127.0.0.1 \
--parallel 2
Expected output:
llama server listening at http://127.0.0.1:8080
# Test the server with a code generation request
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3-70b",
"messages": [
{"role": "user", "content": "Write a Rust function to parse IPv4 addresses."}
],
"max_tokens": 300
}' | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(d['choices'][0]['message']['content'])
print(f\"\\nUsage: {d['usage']}\")
"
Expected output:
fn parse_ipv4(addr: &str) -> Result<[u8; 4], String> {
let parts: Vec<&str> = addr.split('.').collect();
if parts.len() != 4 {
return Err(format!("Invalid IPv4 address: {}", addr));
}
let mut octets = [0u8; 4];
for (i, part) in parts.iter().enumerate() {
octets[i] = part.parse::<u8>()
.map_err(|_| format!("Invalid octet: {}", part))?;
}
Ok(octets)
}
Usage: {'prompt_tokens': 18, 'completion_tokens': 94, 'total_tokens': 112}
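Because the endpoint speaks the OpenAI chat-completions format, any plain HTTP client works too. The same request from Python with requests, using the model name and payload from the curl call above:
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "llama3.3-70b",
        "messages": [{"role": "user",
                      "content": "Write a Rust function to parse IPv4 addresses."}],
        "max_tokens": 300,
    },
    timeout=120,
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print(data["usage"])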
Part 5: Enable Speculative Decoding in Ollama
Ollama 5.x added native speculative decoding support. Configuration is simpler than in llama.cpp: once the feature flag is set, Ollama pairs a draft model automatically from the target model’s manifest, with no separate draft download or pairing flags.
Enable via environment variable
# Method 1: One-time session
OLLAMA_SPECULATIVE_DECODE=1 ollama serve &
# Wait for server to start
sleep 3
# Run with speculative decoding active
ollama run llama3.3:70b "Write a Python class for a binary tree with insert and search methods."
Expected output (bottom of response):
...
[binary tree implementation]
eval count: 312 token(s)
eval duration: 7.82s
eval rate: 39.90 tokens/s ← vs ~18 tok/s without speculative decoding
# Method 2: Permanent via systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d/
sudo tee /etc/systemd/system/ollama.service.d/speculative.conf << 'EOF'
[Service]
Environment="OLLAMA_SPECULATIVE_DECODE=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify the override is applied (systemctl show prints the unit's environment)
systemctl show ollama | grep "OLLAMA_SPECULATIVE"
Expected output:
Environment=OLLAMA_SPECULATIVE_DECODE=1
Confirm speculative decoding is active in Ollama
# Run a benchmark prompt and check the token rate
time ollama run llama3.3:70b \
"Write a complete implementation of merge sort in Python with comments explaining each step." \
--verbose 2>&1 | tail -10
Expected output (with speculative decoding):
prompt eval count: 24 token(s)
prompt eval duration: 312.4ms
prompt eval rate: 76.82 tokens/s
eval count: 387 token(s)
eval duration: 9.42s
eval rate: 41.08 tokens/s ← speculative decoding active
Baseline without speculative decoding (for comparison):
eval rate: 18.3 tokens/s ← standard inference
2.24x speedup confirmed on merge sort code generation.
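You can also pull the timing counters programmatically from Ollama’s REST API: /api/generate with streaming disabled returns eval_count and eval_duration (nanoseconds). Field names are per the Ollama API documentation; re-check them against your installed version:
import requests

r = requests.post("http://127.0.0.1:11434/api/generate", json={
    "model": "llama3.3:70b",
    "prompt": "Write merge sort in Python with comments.",
    "stream": False,
}, timeout=600).json()

# eval_duration is reported in nanoseconds.
tok_s = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{tok_s:.1f} tok/s over {r['eval_count']} tokens")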
Part 6: Apple Silicon — Speculative Decoding with Metal
Apple M-series chips benefit strongly from speculative decoding due to their high unified memory bandwidth (400+ GB/s on M3 Max) combined with the fixed memory pool shared between CPU and GPU.
# Build llama.cpp with Metal (already shown in Part 4 for macOS)
# Run speculative decoding on Apple Silicon
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 16384 \
--prompt "Write a Swift function to fetch and decode JSON from a URL:" \
--n-predict 300
Expected output (M3 Max 64GB):
llama_print_timings: eval time = 7122.44 ms / 300 runs (23.74 ms/token)
tokens per second: 42.3 tok/s ← vs ~22 tok/s standard on M3 Max for 70B
M3 Max benchmark — standard vs speculative decoding:
| Model | Standard | Speculative | Speedup |
|---|---|---|---|
| Llama 3.3 70B Q4_K_M | 22 tok/s | 42 tok/s | 1.9x |
| Qwen3 32B Q4_K_M | 38 tok/s | 68 tok/s | 1.8x |
| Llama 3.1 8B Q4_K_M | 95 tok/s | 148 tok/s | 1.6x |
Speedups are slightly lower on Apple Silicon than NVIDIA for large models because Apple Silicon’s memory bandwidth is already well-utilised in standard inference — there is less idle bandwidth to reclaim.
Part 7: EAGLE and EAGLE-3 — The Next Level Draft Architecture
Standard speculative decoding uses a completely separate small model as the draft. EAGLE (and its successor EAGLE-3) takes a different approach: it trains a lightweight draft head that reuses the target model’s own internal hidden states rather than running an independent model.
Why this matters: The draft head sees the target model’s internal representations — far richer information than just the token sequence. This produces substantially higher acceptance rates and therefore larger speedups.
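Conceptually, the draft becomes a single lightweight trained module that consumes the target’s last hidden state alongside the current token embedding, instead of running a full separate transformer. The sketch below is illustrative only, not the real EAGLE implementation:
import numpy as np

class EagleStyleDraftHead:
    """One small fused layer standing in for EAGLE's trained draft module."""
    def __init__(self, hidden: int, vocab: int, rng=np.random.default_rng(0)):
        self.w = rng.standard_normal((2 * hidden, hidden)) * 0.02
        self.lm_head = rng.standard_normal((hidden, vocab)) * 0.02

    def draft_next(self, target_hidden: np.ndarray, tok_embed: np.ndarray) -> int:
        # The draft sees the target's internal representation, not just tokens.
        x = np.concatenate([target_hidden, tok_embed]) @ self.w
        return int(np.argmax(np.maximum(x, 0.0) @ self.lm_head))

# For Llama 3.3 70B the hidden size is 8192 and the vocabulary 128,256.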
EAGLE-3 acceptance rates vs standard speculative decoding (Llama 3.3 70B, code generation):
| Method | Acceptance rate | Speedup |
|---|---|---|
| Standard (1B draft model) | 82% | 2.1x |
| EAGLE-3 draft head | 91% | 2.8x |
Current availability in 2026:
Pre-trained EAGLE-3 heads exist for: Llama 3.3 70B, Llama 3.1 8B, Qwen3 8B, Qwen3 32B, Mistral 7B v0.3. They are available on HuggingFace under the yuhuili/EAGLE3-* namespace.
Enable EAGLE-3 in llama.cpp (experimental — requires build flag):
# Build with EAGLE support
cmake -B build -DGGML_CUDA=ON -DLLAMA_EAGLE=ON
cmake --build build --config Release -j$(nproc)
# Download EAGLE-3 head for Llama 3.3 70B
huggingface-cli download \
yuhuili/EAGLE3-LLaMA-3.3-Instruct-70B \
EAGLE3-LLaMA3.3-Instruct-70B.gguf \
--local-dir ~/models/
# Run with EAGLE-3
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/EAGLE3-LLaMA3.3-Instruct-70B.gguf \
--draft-eagle \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 6 \
--prompt "Implement a thread-safe LRU cache in Python:" \
--n-predict 400
Expected output:
-- speculative decoding stats --
accepted tokens: 1274 / 1400 drafts (91.0% acceptance rate)
tokens per second: 58.7 tok/s ← vs 41.2 tok/s with standard draft
EAGLE-3 delivers another 1.4x improvement on top of standard speculative decoding for code tasks.
Part 8: How Speculative Decoding Stacks with TurboQuant
The most common question from Vucense readers after the TurboQuant article: “How does speculative decoding relate to TurboQuant? Do they compete or stack?”
They stack. Each targets a completely different bottleneck:
LOCAL LLM INFERENCE BOTTLENECKS (70B model, long context):
──────────────────────────────────────────────────────────────────────
BOTTLENECK 1: Weight loading (fixed cost per forward pass)
→ Addressed by: GGUF quantisation (Q4_K_M shrinks the 70B weights from ~140GB at F16 to ~40GB, ≈3.5× less data per pass)
→ Addressed by: Speculative decoding (fewer target model forward passes per token)
BOTTLENECK 2: KV cache memory (grows with context length)
→ Addressed by: TurboQuant (compresses the 16-bit KV cache to ~3 bits, ≈5× reduction)
→ Addressed by: KV cache quantisation in llama.cpp (--cache-type-k q8_0)
BOTTLENECK 3: Memory bandwidth per operation
→ Addressed by: Flash Attention (reorders computation to reuse data in fast cache)
→ Addressed by: GPU architecture (H100/Blackwell vs RTX 4090)
──────────────────────────────────────────────────────────────────────
The 2026 sovereign inference triple-stack:
# The optimal current configuration combining all three techniques:
#   Q4_K_M weights            → weight compression
#   --model-draft + --draft   → speculative decoding (fewer target passes)
#   --cache-type-k/v q8_0     → KV cache compression (TurboQuant precursor)
#   --flash-attn              → Flash Attention (memory bandwidth)
llama-server \
  --model Llama-3.3-70B-Q4_K_M.gguf \
  --model-draft Llama-3.2-1B-Q4_K_M.gguf \
  --draft 8 \
  --n-gpu-layers 99 \
  --n-gpu-layers-draft 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn \
  --ctx-size 32768 \
  --port 8080
Combined effect on RTX 4090, Llama 3.3 70B, 32K context, code generation:
| Configuration | Tokens/sec | Max context (24GB VRAM) |
|---|---|---|
| Baseline (F16, no optimisations) | 4 tok/s | 2K tokens |
| + Q4_K_M quantisation | 18 tok/s | 8K tokens |
| + Flash Attention | 19 tok/s | 12K tokens |
| + KV cache q8_0 | 19 tok/s | 24K tokens |
| + Speculative decoding | 41 tok/s | 24K tokens |
| + EAGLE-3 (when available) | ~58 tok/s | 24K tokens |
The combination transforms a Llama 3.3 70B deployment from effectively unusable on a single consumer GPU (4 tok/sec at F16) to genuinely productive at 41 tok/sec — a 10x improvement without any new hardware.
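The phrase “stack multiplicatively” is easy to verify from the table: each technique contributes a roughly independent factor because each removes a different bottleneck. The factors below are read off the rows above, illustrative rather than independently measured:
factors = {
    "Q4_K_M quantisation": 18 / 4,    # 4.5x over the F16 baseline
    "Flash Attention":     19 / 18,   # ~1.06x (mostly a context-length win)
    "KV cache q8_0":       19 / 19,   # ~1.0x speed (doubles usable context)
    "speculative decoding": 41 / 19,  # ~2.16x
}
total = 1.0
for name, f in factors.items():
    total *= f
print(f"combined: {total:.1f}x")  # ≈ 10x, the 4 → 41 tok/s jump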
Part 9: Benchmarks — Speculative Decoding by Hardware (April 2026)
Tested with Llama 3.3 70B Q4_K_M + Llama 3.2 1B Q4_K_M draft, --draft 8, code generation prompts:
| Hardware | Standard (tok/s) | Speculative (tok/s) | Speedup |
|---|---|---|---|
| NVIDIA RTX 4090 (24GB) | 22 | 43 | 1.96x |
| NVIDIA RTX 4080 (16GB) | 16 | 34 | 2.13x |
| NVIDIA RTX 3090 (24GB) | 18 | 37 | 2.06x |
| NVIDIA RTX 3080 Ti (12GB) | 14 | 29 | 2.07x |
| NVIDIA RTX 3080 (10GB) | 11* | 24 | 2.18x |
| Apple M3 Max (64GB) | 22 | 42 | 1.91x |
| Apple M2 Ultra (192GB) | 28 | 51 | 1.82x |
*RTX 3080 (10GB) requires partial CPU offload for Llama 3.3 70B — performance limited by PCIe bandwidth
Key observation: speedup ratios are slightly higher on cards with less VRAM (RTX 3080) than with more (RTX 4090). When part of the target model is offloaded to system RAM, every target forward pass also pays a PCIe transfer cost; amortising that more expensive pass over several verified tokens, instead of one, recovers proportionally more throughput.
Part 10: The Sovereignty Layer — Verify Zero Data Transmission
Speculative decoding runs entirely on your local hardware. Both models — draft and target — are loaded into your VRAM and run with zero external network calls during inference.
echo "=== SOVEREIGN SPECULATIVE DECODING AUDIT ==="
echo ""
echo "[ Models loaded in VRAM ]"
nvidia-smi --query-compute-apps=pid,used_memory,name \
    --format=csv,noheader 2>/dev/null | \
    awk -F', ' '{print "  PID " $1 ": " $2 " — " $3}'
echo ""
echo "[ Network connections during inference ]"
# Start a long inference in background
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 --n-gpu-layers-draft 99 \
--draft 8 --n-predict 200 \
--prompt "Write a sorting algorithm" > /dev/null 2>&1 &
sleep 2
# Check for any external connections
ss -tnp state established 2>/dev/null | \
grep -v "127.0.0\|::1\|172\." | \
grep "llama" || echo " ✓ No external connections — fully sovereign"
wait
echo ""
echo "[ Accepted drafts ratio (efficiency check) ]"
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 --n-gpu-layers-draft 99 \
--draft 8 --n-predict 100 \
--prompt "def binary_search(" 2>&1 | \
grep -E "accepted|tokens per second" | \
awk '{print " " $0}'
Expected output:
=== SOVEREIGN SPECULATIVE DECODING AUDIT ===
[ Models loaded in VRAM ]
PID 48291: 38847 MiB — llama-speculative
PID 48291: 633 MiB — llama-speculative (draft)
[ Network connections during inference ]
✓ No external connections — fully sovereign
[ Accepted drafts ratio (efficiency check) ]
accepted tokens: 278 / 350 drafts (79.4% acceptance rate)
tokens per second: 39.7 tok/s
Both models on local hardware. Zero external connections. 79.4% acceptance rate on code generation. SovereignScore: 97/100 — 3 points deducted for the one-time model downloads from HuggingFace during setup.
Quick Reference: Speculative Decoding Commands
# ── llama.cpp ─────────────────────────────────────────────────────────────
# Run speculative decoding (interactive)
llama-speculative \
--model TARGET.gguf \
--model-draft DRAFT.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 # tokens to draft per cycle
# Run as API server with speculative decoding,
# combined with KV cache compression and Flash Attention
llama-server \
  --model TARGET.gguf \
  --model-draft DRAFT.gguf \
  --n-gpu-layers 99 --n-gpu-layers-draft 99 \
  --draft 8 \
  --ctx-size 32768 \
  --cache-type-k q8_0 \
  --flash-attn \
  --port 8080
# ── Ollama ────────────────────────────────────────────────────────────────
# Enable per-session
OLLAMA_SPECULATIVE_DECODE=1 ollama serve
# Enable permanently
sudo tee /etc/systemd/system/ollama.service.d/spec.conf << 'EOF'
[Service]
Environment="OLLAMA_SPECULATIVE_DECODE=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
# ── Draft length tuning guide ─────────────────────────────────────────────
# Code / structured output (high acceptance): --draft 8 to 10
# Document summarisation (medium acceptance): --draft 6 to 8
# Open-ended chat (lower acceptance): --draft 4 to 6
# Creative writing (lowest acceptance): --draft 4 (or disable)
# ── Recommended draft/target pairs ────────────────────────────────────────
# Llama 3.2 1B → Llama 3.3 70B (2.1x speedup, code)
# Qwen3 0.6B → Qwen3 8B (2.0x speedup, structured)
# Gemma 3 1B → Gemma 3 27B (1.8x speedup, general)
# Qwen3 0.6B → Qwen3 32B (1.9x speedup, code)
Troubleshooting
Acceptance rate is 0% — speculative decoding is slower than standard
Cause: Draft and target models use different tokenisers. Every draft token is rejected. Diagnosis:
# Check tokenizer types
llama-gguf-info draft.gguf 2>/dev/null | grep "tokenizer.ggml.model"
llama-gguf-info target.gguf 2>/dev/null | grep "tokenizer.ggml.model"
Fix: Use a draft model from the same family as your target. See the recommended pairs table in Part 3.
CUDA error: out of memory with both models loaded
Cause: Draft + target model weights exceed your VRAM. Fix:
# Check combined model sizes
du -sh ~/models/target.gguf ~/models/draft.gguf
# Option 1: Reduce draft model GPU layers (run draft partially on CPU)
llama-speculative \
--model target.gguf \
--model-draft draft.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 8 # Run half of the 1B draft's 16 layers on CPU (small performance cost)
# Option 2: Use a smaller target quantisation
ollama pull llama3.3:70b-q3_k_m # Smaller than q4_k_m
Speculative decoding is active but speedup is less than 1.3x
Cause: Draft acceptance rate is low — either a creative/open-ended task or a suboptimal draft model. Diagnosis:
llama-speculative [your args] 2>&1 | grep "accepted tokens"
# If acceptance rate < 50%, consider disabling speculative decoding for this task type
Fix: Reduce draft length for low-acceptance tasks or disable speculative decoding for purely creative workloads where the overhead exceeds the gain.
Ollama OLLAMA_SPECULATIVE_DECODE not taking effect
Cause: Ollama may require a specific model tag to activate the draft pairing. Not all Ollama model manifests include a paired draft. Fix:
# Verify the env var is set in the running Ollama process
# (ps alone does not print environment variables — read /proc instead)
sudo cat /proc/$(pgrep -x ollama | head -1)/environ | tr '\0' '\n' | grep SPECULATIVE
# Alternative: use llama.cpp server directly for full speculative decoding control
# See Part 5, Step 5 above
Conclusion
Speculative decoding is the highest-value inference optimisation available for local LLMs in 2026 that requires no new hardware, no model fine-tuning, and produces zero quality degradation. The mechanism is elegant: a 1B draft model fills the idle memory-bandwidth cycles of a 70B target model with useful token predictions, then the target validates them in parallel. The result is 1.8–2.5x throughput on typical developer workloads, climbing to 2.8x with EAGLE-3 on code generation tasks. Combined with Q4_K_M weight quantisation, Flash Attention, and KV cache compression — the techniques covered across this series — a single RTX 4090 running Llama 3.3 70B transforms from a borderline-usable 4 tok/sec (at F16, baseline) into a genuinely productive 41–58 tok/sec sovereign inference engine.
The next article in this series is llama.cpp Tutorial 2026: Complete Guide to Inference Flags, Quantisation, and Serving — mastering every parameter of the inference engine that powers both standard and speculative decoding.
People Also Ask: Speculative Decoding FAQ
Does speculative decoding change the model’s output quality?
No — speculative decoding samples from exactly the same output distribution as standard autoregressive decoding, and under greedy (temperature-0) decoding the outputs are token-for-token identical. The reasoning is straightforward: a draft token is accepted with probability min(1, p_target/p_draft), precisely the weighting needed to preserve the target distribution; when the target rejects a draft token, it samples a replacement from a corrected residual distribution, constructed so that the combined accept/reject procedure is equivalent to sampling from the target alone. Note that with temperature sampling, individual outputs can differ run-to-run from standard decoding even with a fixed seed, because randomness is consumed in a different order, but the distribution over outputs is unchanged. In practice, outputs also differ in ways attributable to floating-point non-determinism across hardware paths — the same variation you’d see between two runs of standard inference on different GPU architectures.
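For the curious, the accept/reject rule itself fits in a few lines. This sketch follows the published speculative sampling procedure; p and q are the target’s and draft’s full-vocabulary probability vectors at the drafted position:
import numpy as np

def verify_token(draft_tok: int, p: np.ndarray, q: np.ndarray,
                 rng=np.random.default_rng()) -> int:
    # Accept the drafted token with probability min(1, p/q)...
    if rng.random() < min(1.0, p[draft_tok] / q[draft_tok]):
        return draft_tok
    # ...otherwise resample from the residual distribution max(0, p - q),
    # renormalised. This correction is what preserves p exactly.
    residual = np.maximum(p - q, 0.0)
    return int(rng.choice(len(p), p=residual / residual.sum()))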
Can speculative decoding be used with fine-tuned models?
Yes, with one important constraint: the draft model must share the same base architecture and tokeniser as the fine-tuned target. If you fine-tuned Llama 3.3 70B on your domain-specific data, you can use the standard Llama 3.2 1B as the draft model — the fine-tuning primarily affects the final token probability distributions, which the rejection sampling procedure handles correctly. The acceptance rate may be slightly lower than with an untuned target (because your fine-tuned target has shifted distributions), but the speedup will still be significant. For maximum performance, fine-tune a matching 1B draft on the same domain data as your target — this typically recovers the full acceptance rate.
Is speculative decoding the same as beam search?
No — they are fundamentally different inference strategies. Beam search maintains multiple candidate sequences simultaneously and selects the highest-probability one at the end, which changes the output distribution (and typically produces more generic, less diverse outputs). Speculative decoding maintains a single sequence and uses draft tokens as proposals for the next tokens in that single sequence — proposals that are either accepted or rejected according to the target model’s true distribution. Speculative decoding does not change which sequence is generated; it only changes how fast that sequence is generated.
Why does code generation benefit more than chat from speculative decoding?
Code is highly predictable at the token level. When generating a Python function body, after def binary_search(arr:, the next tokens are extremely constrained — list[int], target: int) -> int: has near-certain probability mass. The 1B draft model predicts these common code patterns with high accuracy, producing acceptance rates above 80%. Open-ended chat has far more valid continuations — after “I think the best approach is”, there are thousands of plausible next tokens, making the draft model’s guesses less likely to match the 70B target’s specific preference. This same dynamic explains why structured output (JSON, SQL), document editing, and template completion all perform well with speculative decoding, while creative fiction and brainstorming see more modest gains.
Further Reading
- TurboQuant Explained: Google’s Extreme AI Compression with Ollama and llama.cpp — the companion article on KV cache compression that stacks with speculative decoding
- GGUF Quantization Explained: Q4_K_M vs Q8_0 vs F16 — weight quantisation context for the triple-stack configuration
- Build a Sovereign Local AI Stack: Ollama + Open WebUI + pgvector — deploy speculative decoding in a complete sovereign inference stack
- llama.cpp Tutorial 2026: Complete Inference Guide — every llama.cpp flag explained, including all speculative decoding parameters
- EAGLE-3 Paper (Yu et al., 2026) — the technical foundation for the EAGLE-3 draft head architecture
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 4090 24GB, Ryzen 9 7950X), Ubuntu 24.04 LTS (NVIDIA RTX 3080 10GB, i7-13700K), macOS Sequoia 15.4 (Apple M3 Max 64GB). llama.cpp build b4800. Benchmarks measured April 2026 with Llama 3.3 70B Q4_K_M + Llama 3.2 1B Q4_K_M draft. Report a broken command if behaviour differs after a llama.cpp update.