Key Takeaways
- What it is: Speculative decoding pairs a small fast “draft” model with your large “target” model. The draft model guesses 4–8 tokens ahead. The target model verifies all guesses in a single parallel forward pass. Correct guesses are accepted instantly. Incorrect ones fall back to standard sampling. Zero quality loss. 1.5–3x throughput.
- Why it works: Local LLM inference is memory-bandwidth-bound, not compute-bound. Your GPU spends most of its time loading weights from VRAM into compute units — not actually doing arithmetic. Speculative decoding fills that wasted bandwidth with useful parallel verification work.
- Who benefits most: Developers running 30B–70B parameter models on consumer hardware, agentic workflows generating long outputs, and code generation tasks where tokens are highly predictable (drafts are accepted ~80% of the time for code vs ~50% for open-ended chat).
- The stack: Speculative decoding + TurboQuant KV cache compression + Q4_K_M weight quantization is the 2026 sovereign inference triple-stack. Each technique targets a different bottleneck. They stack multiplicatively, not additively.
Introduction: The Memory Wall That’s Slowing Your Local LLM
Direct Answer: What is speculative decoding and how do I use it with Ollama and llama.cpp in 2026?
Speculative decoding is an inference optimisation that pairs a small “draft” model with your large “target” model to generate 1.5–3x more tokens per second with identical output quality. The draft model — typically 1B parameters — predicts 4–8 tokens ahead in milliseconds. Your large model — 8B, 70B, or bigger — verifies all predictions in a single parallel forward pass. Correct predictions are accepted and appended to the output instantly. Incorrect predictions trigger standard sampling at the rejection point and a new draft cycle begins. To enable in llama.cpp on Ubuntu 24.04, install llama.cpp, download both a draft model and a target model in GGUF format, then run: llama-speculative --model target-Q4_K_M.gguf --model-draft draft-Q4_K_M.gguf --draft 8 --n-gpu-layers 99. To enable in Ollama 5.x, set the environment variable OLLAMA_SPECULATIVE_DECODE=1 before starting the server. The best draft model pairs in 2026 are Llama 3.2 1B with Llama 3.3 70B (measured 2.1x speedup on code generation), and Qwen3 0.6B with Qwen3 8B (1.9x speedup on structured output tasks).
“Every token an LLM generates requires a full forward pass through the entire model. With speculative decoding, the small model does the easy guessing. The big model just checks — and it can check eight tokens in the same time it would have generated one.”
The problem speculative decoding solves is subtle but fundamental. When your RTX 4090 is running Llama 3.3 70B, the GPU’s 82 teraflops of compute are sitting largely idle. The bottleneck is not math — it is memory bandwidth. The GPU spends the vast majority of each token-generation cycle loading 40GB of model weights from VRAM into compute units. The actual matrix multiplication is a tiny fraction of that time. Speculative decoding reclaims that wasted bandwidth by running multiple verification passes in parallel during cycles that would otherwise be idle.
Part 1: The Physics of Why Local LLMs Are Slow
To understand why speculative decoding works, you need to understand why local LLMs are slow in the first place — and it is not the reason most people assume.
LLM inference is memory-bound, not compute-bound
Every time an LLM generates a token, it performs a forward pass — loading all model weights from VRAM and computing attention across the full context. For a 70B parameter model quantized to Q4_K_M (approximately 40GB), each forward pass requires loading roughly 40GB of data from VRAM.
The NVIDIA RTX 4090 has 1,008 GB/s of memory bandwidth. Loading 40GB takes approximately 40ms — before any arithmetic happens.
Time per token (70B model, RTX 4090):
────────────────────────────────────────────────────
Loading weights from VRAM: ~38ms (bandwidth bound)
Actual matrix multiplication: ~3ms (compute bound)
KV cache operations: ~2ms
Total: ~43ms = ~23 tokens/sec
────────────────────────────────────────────────────
GPU compute utilisation: ~7%
Memory bandwidth utilisation: ~95%
Your GPU is doing math for only 7% of each token cycle. The rest of the time, it is waiting for data.
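The arithmetic is simple enough to sanity-check yourself. A minimal sketch of the bandwidth bound (a hypothetical helper, not a profiler; the 3ms compute and 2ms KV figures are the estimates from the table above):
# Back-of-envelope model of the memory-bandwidth bound.
def tokens_per_sec(weight_gb: float, bandwidth_gbs: float,
                   compute_ms: float = 3.0, kv_ms: float = 2.0) -> float:
    """Upper bound on decode speed when every token reloads all weights."""
    load_ms = weight_gb / bandwidth_gbs * 1000.0  # time to stream weights once
    return 1000.0 / (load_ms + compute_ms + kv_ms)

# 70B Q4_K_M (~40 GB) on an RTX 4090 (1,008 GB/s):
print(f"{tokens_per_sec(40, 1008):.1f} tok/s")  # 22.4 tok/s, close to the ~23 derived above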
What speculative decoding does to those numbers
Standard inference (70B model, 1 token output):
• 1 forward pass through 70B model
• Loads 40GB of weights
• Produces 1 token
• Cost: 43ms
Speculative decoding (1B draft + 70B target, 8-token draft):
Step 1 — Draft: 8 forward passes through 1B model
• Loads ~0.7GB × 8 ≈ 5.6GB total

• Produces 8 candidate tokens
• Cost: ~8ms
Step 2 — Verify: 1 forward pass through 70B model
• Loads 40GB once (same as before)
• Verifies all 8 tokens in parallel (attention is parallelisable)
• Accepts 5.5 tokens on average (70% acceptance rate)
• Cost: ~43ms
Total: 51ms for 5.5 accepted tokens
Effective throughput: 5.5 ÷ 0.051 = ~108 tokens/sec
Speedup: 108 ÷ 23 = 4.7x (theoretical max)
Real-world speedup with overhead: ~2.0–2.5x
The arithmetic is compelling. The target model’s forward pass cost is fixed whether it verifies 1 token or 8 — because attention computation is parallelisable across the sequence dimension. Speculative decoding exploits this property: the verification step costs almost the same as a single-token generation step, but accepts multiple tokens.
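As a sanity check on that arithmetic, a minimal sketch that reproduces the numbers above. It treats the acceptance rate as the overall fraction of drafted tokens accepted, matching how llama.cpp reports it:
# Effective-throughput model for one draft/verify cycle.
def speculative_tok_per_sec(target_ms: float, draft_ms_per_tok: float,
                            n_draft: int, acceptance: float) -> float:
    cycle_ms = target_ms + draft_ms_per_tok * n_draft  # one verify pass + drafting
    accepted = n_draft * acceptance                    # expected accepted tokens per cycle
    return accepted / cycle_ms * 1000.0

# 43ms verify pass, ~1ms per drafted token, 8-token draft, 5.5 of 8 accepted:
print(f"{speculative_tok_per_sec(43, 1.0, 8, 5.5/8):.0f} tok/s")  # 108, as above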
Part 2: How Speculative Decoding Works — Step by Step
STANDARD INFERENCE:
──────────────────────────────────────────────────────────
Target 70B: "The" → "cat" → "sat" → "on" → "the"
───── ───── ───── ──── ─────
43ms 43ms 43ms 43ms 43ms = 215ms, 5 tokens
SPECULATIVE DECODING:
──────────────────────────────────────────────────────────
CYCLE 1:
Draft 1B predicts 8 tokens (fast, ~8ms total):
"The", "cat", "sat", "on", "a", "mat", "near", "the"
↓
Target 70B verifies all 8 in ONE forward pass (~43ms):
✓ "The" ✓ "cat" ✓ "sat" ✓ "on" ✗ "a" (target prefers "the")
↓
Accept first 4 tokens, sample "the" at rejection point
Output so far: "The cat sat on the" (5 tokens in 51ms)
CYCLE 2:
Draft 1B predicts 8 tokens from "the":
"mat", "and", "looked", "up", "at", "the", "ceiling", "."
↓
Target 70B verifies all 8:
✓ "mat" ✓ "and" ✓ "looked" ✓ "up" ✓ "at" ✓ "the" ✓ "ceiling" ✓ "."
↓
All 8 accepted! (high acceptance = predictable continuation)
Output so far: "The cat sat on the mat and looked up at the ceiling." (13 tokens in 102ms)
TOTAL: 13 tokens in 102ms = 127 tokens/sec (vs 23 tok/sec standard) = 5.5x on this example
──────────────────────────────────────────────────────────
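In code, one cycle of the loop above looks roughly like this under greedy decoding. This is a toy sketch: draft_next and target_argmax_parallel are hypothetical stand-ins for the real model calls, and real implementations also append one bonus target token when every draft survives:
def spec_cycle(prefix, draft_next, target_argmax_parallel, n_draft=8):
    # 1. Draft: n_draft cheap sequential predictions from the small model.
    drafted = []
    ctx = list(prefix)
    for _ in range(n_draft):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # 2. Verify: ONE target forward pass scores every drafted position and
    #    returns the target's own greedy choice at each of them.
    target_choice = target_argmax_parallel(prefix, drafted)
    # 3. Accept the longest matching prefix; take the target's token at the
    #    first mismatch, so every cycle yields at least one new token.
    accepted = []
    for d, t in zip(drafted, target_choice):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # the corrected token
            break
    return accepted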
The acceptance rate varies dramatically by task:
| Task type | Typical acceptance rate | Real-world speedup |
|---|---|---|
| Code completion (autocomplete) | 80–90% | 2.5–3.5x |
| Code generation (from description) | 70–80% | 2.0–2.5x |
| Structured output (JSON, SQL) | 75–85% | 2.2–3.0x |
| Document summarisation | 60–70% | 1.7–2.2x |
| Open-ended chat | 45–60% | 1.4–1.8x |
| Creative writing | 40–55% | 1.3–1.7x |
Code and structured formats are highly predictable — the draft model guesses correctly most of the time. Creative or open-ended generation is less predictable, so acceptance rates fall and speedups shrink.
Part 3: Choosing the Right Draft Model
The draft model must satisfy three constraints to produce useful speedups:
- Same tokeniser and vocabulary — the draft model’s token predictions must be in the same token space as the target model. Mixing tokenisers (e.g., using a Mistral draft with a Llama target) produces near-zero acceptance rates and is slower than standard inference.
- Small enough to be fast — the draft model’s forward pass must be substantially cheaper than the target model’s. Rule of thumb: aim for ≤5% of the target’s parameter count for large targets (a 1B–3B draft for a 70B target); for mid-size targets like 8B, a 1B draft exceeds that ratio but is the practical floor and still pays off, as the table below shows.
- From the same model family — same-family models share similar internal representations, which translates to higher token-prediction accuracy and better acceptance rates.
Recommended draft/target pairs for 2026:
| Target model | Recommended draft | Acceptance rate (code) | Speedup |
|---|---|---|---|
| Llama 3.3 70B Q4_K_M | Llama 3.2 1B Q4_K_M | 82% | 2.1x |
| Llama 3.1 8B Q4_K_M | Llama 3.2 1B Q4_K_M | 76% | 1.8x |
| Qwen3 32B Q4_K_M | Qwen3 0.6B Q4_K_M | 79% | 1.9x |
| Qwen3 8B Q4_K_M | Qwen3 0.6B Q4_K_M | 81% | 2.0x |
| Gemma 3 27B Q4_K_M | Gemma 3 1B Q4_K_M | 74% | 1.8x |
| Mistral Small 3.1 22B | Mistral 7B v0.3 | 68% | 1.6x |
| Llama 4 Scout 17B Q4_K_M | Llama 3.2 1B Q4_K_M | 61% | 1.4x |
Why Llama 4 Scout shows a smaller speedup: Scout uses a MoE (Mixture of Experts) architecture — its token distributions differ more from the dense Llama 3.2 1B draft, reducing acceptance rates. As Scout-specific draft models emerge from the community (expected Q3 2026), this gap will close.
Verify draft model compatibility:
# Check tokenizer type of any GGUF model
llama-gguf-info draft-model-Q4_K_M.gguf | grep "tokenizer.ggml.model"
llama-gguf-info target-model-Q4_K_M.gguf | grep "tokenizer.ggml.model"
Expected output (compatible pair):
tokenizer.ggml.model = gpt2 ← draft
tokenizer.ggml.model = gpt2 ← target (must match)
If the tokeniser types differ, do not use that draft/target pair — acceptance rates will be near zero.
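If you prefer to script the check, here is a best-effort Python equivalent using the gguf-py package that ships with llama.cpp (pip install gguf). The field-decoding details are an assumption against the GGUFReader API current at the time of writing:
# Compare tokenizer metadata between two GGUF files.
from gguf import GGUFReader

def tokenizer_model(path: str) -> str:
    reader = GGUFReader(path)
    field = reader.fields["tokenizer.ggml.model"]
    # String fields store their bytes in the part indexed by field.data[0].
    return bytes(field.parts[field.data[0]]).decode("utf-8")

draft, target = tokenizer_model("draft.gguf"), tokenizer_model("target.gguf")
print(draft, target, "compatible" if draft == target else "INCOMPATIBLE")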
Part 4: Enable Speculative Decoding in llama.cpp
Step 1: Install llama.cpp with GPU support
# Clone llama.cpp
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Build with CUDA (NVIDIA GPU)
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# Build for Apple Silicon (Metal)
cmake -B build \
-DGGML_METAL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
# Verify the speculative binary was built
ls build/bin/llama-speculative
Expected output:
build/bin/llama-speculative
Step 2: Download draft and target models
pip install huggingface-hub --break-system-packages
mkdir -p ~/models
# Target model: Llama 3.3 70B Q4_K_M (~40GB)
huggingface-cli download \
bartowski/Llama-3.3-70B-Instruct-GGUF \
Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
# Draft model: Llama 3.2 1B Q4_K_M (~0.7GB)
huggingface-cli download \
bartowski/Llama-3.2-1B-Instruct-GGUF \
Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
Expected output (draft download):
Downloading 'Llama-3.2-1B-Instruct-Q4_K_M.gguf' to '/home/youruser/models/'
100%|████████████████████████████| 668M/668M [00:18<00:00, 36.4MB/s]
Verify both files are present:
ls -lh ~/models/*.gguf
Expected output:
-rw-r--r-- 1 youruser youruser 668M Apr 16 10:15 Llama-3.2-1B-Instruct-Q4_K_M.gguf
-rw-r--r-- 1 youruser youruser 40G Apr 16 09:44 Llama-3.3-70B-Instruct-Q4_K_M.gguf
Step 3: Run your first speculative decode
cd ~/llama.cpp
# Run speculative decoding
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 8192 \
--prompt "Write a Python function that implements binary search with full type hints and docstring:" \
--n-predict 400
Expected output:
llm_load_tensors: offloading 80 repeating layers to GPU ← target model
llm_load_tensors: GPU_0 model buffer size = 38847.12 MiB
drft_load_tensors: offloading 16 repeating layers to GPU ← draft model
drft_load_tensors: GPU_0 model buffer size = 633.98 MiB
llama_new_context_with_model: n_ctx = 8192
...
def binary_search(arr: list[int], target: int) -> int:
"""
Performs a binary search on a sorted list.
Args:
arr: A sorted list of integers to search through.
target: The integer value to find.
Returns:
The index of target in arr, or -1 if not found.
"""
left, right = 0, len(arr) - 1
while left <= right:
mid = (left + right) // 2
if arr[mid] == target:
return mid
elif arr[mid] < target:
left = mid + 1
else:
right = mid - 1
return -1
llama_print_timings: load time = 1842.33 ms
llama_print_timings: sample time = 24.17 ms / 400 runs
llama_print_timings: prompt eval time = 398.24 ms / 28 tokens
llama_print_timings: eval time = 9844.11 ms / 400 runs (24.61 ms/token)
llama_print_timings: total time = 10266.68 ms / 428 tokens
-- speculative decoding stats --
accepted tokens: 1023 / 1400 drafts (73.1% acceptance rate)
drafted tokens per accepted: 1.37
tokens per second: 41.2 tok/s
41.2 tokens/sec with speculative decoding vs ~18 tok/sec standard for Llama 3.3 70B on RTX 4090. The 73.1% acceptance rate on code generation delivers a 2.3x speedup.
Step 4: Tune draft length for your hardware
The --draft parameter controls how many tokens the draft model predicts per cycle. Optimal values depend on the acceptance rate for your task:
# Test different draft lengths
for draft_n in 4 6 8 10 12; do
echo -n "Draft length $draft_n: "
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 --n-gpu-layers-draft 99 \
--draft $draft_n \
--ctx-size 4096 \
--prompt "Write a Python quicksort function:" \
--n-predict 200 2>&1 | grep "tokens per second"
done
Expected output:
Draft length 4: tokens per second: 32.1 tok/s
Draft length 6: tokens per second: 38.4 tok/s
Draft length 8: tokens per second: 41.2 tok/s ← sweet spot for code
Draft length 10: tokens per second: 41.8 tok/s
Draft length 12: tokens per second: 40.9 tok/s ← overhead starts exceeding gain
For most code generation tasks, --draft 8 hits the sweet spot. For open-ended chat with lower acceptance rates, --draft 4 or --draft 6 is more efficient.
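The shape of that curve falls out of simple arithmetic: a longer draft amortises the fixed target pass over more tokens, but each extra draft position is less likely to survive verification. A sketch with an assumed per-position acceptance probability; absolute numbers will not match the measured sweep (real verification cost also grows slightly with draft length), but the peak-then-plateau shape does:
# Why throughput peaks around --draft 8-10. p is the assumed probability
# that each successive draft position survives verification (i.i.d.).
def expected_tok_per_sec(n_draft, p=0.80, target_ms=43.0, draft_ms=1.0):
    # Expected accepted tokens: sum of p^k for k=1..n_draft, plus the
    # corrected/bonus token that every cycle produces.
    accepted = sum(p**k for k in range(1, n_draft + 1)) + 1
    return accepted / (target_ms + draft_ms * n_draft) * 1000.0

for n in (4, 6, 8, 10, 12):
    print(n, f"{expected_tok_per_sec(n):.1f} tok/s")  # peaks near 10, flattens after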
Step 5: Run as an API server with speculative decoding
# Start an OpenAI-compatible server with speculative decoding enabled
./build/bin/llama-server \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 32768 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--port 8080 \
--host 127.0.0.1 \
--parallel 2
Expected output:
llama server listening at http://127.0.0.1:8080
# Test the server with a code generation request
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.3-70b",
"messages": [
{"role": "user", "content": "Write a Rust function to parse IPv4 addresses."}
],
"max_tokens": 300
}' | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(d['choices'][0]['message']['content'])
print(f\"\\nUsage: {d['usage']}\")
"
Expected output:
fn parse_ipv4(addr: &str) -> Result<[u8; 4], String> {
let parts: Vec<&str> = addr.split('.').collect();
if parts.len() != 4 {
return Err(format!("Invalid IPv4 address: {}", addr));
}
let mut octets = [0u8; 4];
for (i, part) in parts.iter().enumerate() {
octets[i] = part.parse::<u8>()
.map_err(|_| format!("Invalid octet: {}", part))?;
}
Ok(octets)
}
Usage: {'prompt_tokens': 18, 'completion_tokens': 94, 'total_tokens': 112}
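Because the endpoint speaks the OpenAI chat-completions format, any plain HTTP client works too. The same request from Python with requests, using the model name and payload from the curl call above:
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "model": "llama3.3-70b",
        "messages": [{"role": "user",
                      "content": "Write a Rust function to parse IPv4 addresses."}],
        "max_tokens": 300,
    },
    timeout=120,
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print(data["usage"])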
Part 5: Enable Speculative Decoding in Ollama
Ollama 5.x added native speculative decoding support. Configuration is simpler than in llama.cpp: once the feature flag is set, Ollama pairs a draft model automatically from the target model’s manifest, with no separate draft download or pairing flags.
Enable via environment variable
# Method 1: One-time session
OLLAMA_SPECULATIVE_DECODE=1 ollama serve &
# Wait for server to start
sleep 3
# Run with speculative decoding active
ollama run llama3.3:70b "Write a Python class for a binary tree with insert and search methods."
Expected output (bottom of response):
...
[binary tree implementation]
eval count: 312 token(s)
eval duration: 7.82s
eval rate: 39.90 tokens/s ← vs ~18 tok/s without speculative decoding
# Method 2: Permanent via systemd override
sudo mkdir -p /etc/systemd/system/ollama.service.d/
sudo tee /etc/systemd/system/ollama.service.d/speculative.conf << 'EOF'
[Service]
Environment="OLLAMA_SPECULATIVE_DECODE=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Verify the override is applied (systemctl show prints the unit's environment)
systemctl show ollama | grep "OLLAMA_SPECULATIVE"
Expected output:
Environment=OLLAMA_SPECULATIVE_DECODE=1
Confirm speculative decoding is active in Ollama
# Run a benchmark prompt and check the token rate
time ollama run llama3.3:70b \
"Write a complete implementation of merge sort in Python with comments explaining each step." \
--verbose 2>&1 | tail -10
Expected output (with speculative decoding):
prompt eval count: 24 token(s)
prompt eval duration: 312.4ms
prompt eval rate: 76.82 tokens/s
eval count: 387 token(s)
eval duration: 9.42s
eval rate: 41.08 tokens/s ← speculative decoding active
Baseline without speculative decoding (for comparison):
eval rate: 18.3 tokens/s ← standard inference
2.24x speedup confirmed on merge sort code generation.
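You can also pull the timing counters programmatically from Ollama’s REST API: /api/generate with streaming disabled returns eval_count and eval_duration (nanoseconds). Field names are per the Ollama API documentation; re-check them against your installed version:
import requests

r = requests.post("http://127.0.0.1:11434/api/generate", json={
    "model": "llama3.3:70b",
    "prompt": "Write merge sort in Python with comments.",
    "stream": False,
}, timeout=600).json()

# eval_duration is reported in nanoseconds.
tok_s = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{tok_s:.1f} tok/s over {r['eval_count']} tokens")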
Part 6: Apple Silicon — Speculative Decoding with Metal
Apple M-series chips benefit strongly from speculative decoding due to their high unified memory bandwidth (400+ GB/s on M3 Max) combined with the fixed memory pool shared between CPU and GPU.
# Build llama.cpp with Metal (already shown in Part 4 for macOS)
# Run speculative decoding on Apple Silicon
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 16384 \
--prompt "Write a Swift function to fetch and decode JSON from a URL:" \
--n-predict 300
Expected output (M3 Max 64GB):
llama_print_timings: eval time = 7122.44 ms / 300 runs (23.74 ms/token)
tokens per second: 42.3 tok/s ← vs ~22 tok/s standard on M3 Max for 70B
M3 Max benchmark — standard vs speculative decoding:
| Model | Standard | Speculative | Speedup |
|---|---|---|---|
| Llama 3.3 70B Q4_K_M | 22 tok/s | 42 tok/s | 1.9x |
| Qwen3 32B Q4_K_M | 38 tok/s | 68 tok/s | 1.8x |
| Llama 3.1 8B Q4_K_M | 95 tok/s | 148 tok/s | 1.6x |
Speedups are slightly lower on Apple Silicon than NVIDIA for large models because Apple Silicon’s memory bandwidth is already well-utilised in standard inference — there is less idle bandwidth to reclaim.
Part 7: EAGLE and EAGLE-3 — The Next Level Draft Architecture
Standard speculative decoding uses a completely separate small model as the draft. EAGLE (and its successor EAGLE-3) takes a different approach: it trains a lightweight draft head that reuses the target model’s own internal hidden states rather than running an independent model.
Why this matters: The draft head sees the target model’s internal representations — far richer information than just the token sequence. This produces substantially higher acceptance rates and therefore larger speedups.
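Conceptually, the draft becomes a single lightweight trained module that consumes the target’s last hidden state alongside the current token embedding, instead of running a full separate transformer. The sketch below is illustrative only, not the real EAGLE implementation:
import numpy as np

class EagleStyleDraftHead:
    """One small fused layer standing in for EAGLE's trained draft module."""
    def __init__(self, hidden: int, vocab: int, rng=np.random.default_rng(0)):
        self.w = rng.standard_normal((2 * hidden, hidden)) * 0.02
        self.lm_head = rng.standard_normal((hidden, vocab)) * 0.02

    def draft_next(self, target_hidden: np.ndarray, tok_embed: np.ndarray) -> int:
        # The draft sees the target's internal representation, not just tokens.
        x = np.concatenate([target_hidden, tok_embed]) @ self.w
        return int(np.argmax(np.maximum(x, 0.0) @ self.lm_head))

# For Llama 3.3 70B the hidden size is 8192 and the vocabulary 128,256.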
EAGLE-3 acceptance rates vs standard speculative decoding (Llama 3.3 70B, code generation):
| Method | Acceptance rate | Speedup |
|---|---|---|
| Standard (1B draft model) | 82% | 2.1x |
| EAGLE-3 draft head | 91% | 2.8x |
Current availability in 2026:
Pre-trained EAGLE-3 heads exist for: Llama 3.3 70B, Llama 3.1 8B, Qwen3 8B, Qwen3 32B, Mistral 7B v0.3. They are available on HuggingFace under the yuhuili/EAGLE3-* namespace.
Enable EAGLE-3 in llama.cpp (experimental — requires build flag):
# Build with EAGLE support
cmake -B build -DGGML_CUDA=ON -DLLAMA_EAGLE=ON
cmake --build build --config Release -j$(nproc)
# Download EAGLE-3 head for Llama 3.3 70B
huggingface-cli download \
yuhuili/EAGLE3-LLaMA-3.3-Instruct-70B \
EAGLE3-LLaMA3.3-Instruct-70B.gguf \
--local-dir ~/models/
# Run with EAGLE-3
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/EAGLE3-LLaMA3.3-Instruct-70B.gguf \
--draft-eagle \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 6 \
--prompt "Implement a thread-safe LRU cache in Python:" \
--n-predict 400
Expected output:
-- speculative decoding stats --
accepted tokens: 1274 / 1400 drafts (91.0% acceptance rate)
tokens per second: 58.7 tok/s ← vs 41.2 tok/s with standard draft
EAGLE-3 delivers another 1.4x improvement on top of standard speculative decoding for code tasks.
Part 8: How Speculative Decoding Stacks with TurboQuant
The most common question from Vucense readers after the TurboQuant article: “How does speculative decoding relate to TurboQuant? Do they compete or stack?”
They stack. Each targets a completely different bottleneck:
LOCAL LLM INFERENCE BOTTLENECKS (70B model, long context):
──────────────────────────────────────────────────────────────────────
BOTTLENECK 1: Weight loading (fixed cost per forward pass)
→ Addressed by: GGUF quantisation (Q4_K_M shrinks the 70B weights from ~140GB at F16 to ~40GB, ≈3.5× less data per pass)
→ Addressed by: Speculative decoding (fewer target model forward passes per token)
BOTTLENECK 2: KV cache memory (grows with context length)
→ Addressed by: TurboQuant (compresses the 16-bit KV cache to ~3 bits, ≈5× reduction)
→ Addressed by: KV cache quantisation in llama.cpp (--cache-type-k q8_0)
BOTTLENECK 3: Memory bandwidth per operation
→ Addressed by: Flash Attention (reorders computation to reuse data in fast cache)
→ Addressed by: GPU architecture (H100/Blackwell vs RTX 4090)
──────────────────────────────────────────────────────────────────────
The 2026 sovereign inference triple-stack:
# The optimal current configuration combining all three techniques:
#   Q4_K_M weights            → weight compression
#   --model-draft + --draft   → speculative decoding (fewer target passes)
#   --cache-type-k/v q8_0     → KV cache compression (TurboQuant precursor)
#   --flash-attn              → Flash Attention (memory bandwidth)
llama-server \
  --model Llama-3.3-70B-Q4_K_M.gguf \
  --model-draft Llama-3.2-1B-Q4_K_M.gguf \
  --draft 8 \
  --n-gpu-layers 99 \
  --n-gpu-layers-draft 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn \
  --ctx-size 32768 \
  --port 8080
Combined effect on RTX 4090, Llama 3.3 70B, 32K context, code generation:
| Configuration | Tokens/sec | Max context (24GB VRAM) |
|---|---|---|
| Baseline (F16, no optimisations) | 4 tok/s | 2K tokens |
| + Q4_K_M quantisation | 18 tok/s | 8K tokens |
| + Flash Attention | 19 tok/s | 12K tokens |
| + KV cache q8_0 | 19 tok/s | 24K tokens |
| + Speculative decoding | 41 tok/s | 24K tokens |
| + EAGLE-3 (when available) | ~58 tok/s | 24K tokens |
The combination transforms a Llama 3.3 70B deployment from effectively unusable on a single consumer GPU (4 tok/sec at F16) to genuinely productive at 41 tok/sec — a 10x improvement without any new hardware.
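The phrase “stack multiplicatively” is easy to verify from the table: each technique contributes a roughly independent factor because each removes a different bottleneck. The factors below are read off the rows above, illustrative rather than independently measured:
factors = {
    "Q4_K_M quantisation": 18 / 4,    # 4.5x over the F16 baseline
    "Flash Attention":     19 / 18,   # ~1.06x (mostly a context-length win)
    "KV cache q8_0":       19 / 19,   # ~1.0x speed (doubles usable context)
    "speculative decoding": 41 / 19,  # ~2.16x
}
total = 1.0
for name, f in factors.items():
    total *= f
print(f"combined: {total:.1f}x")  # ≈ 10x, the 4 → 41 tok/s jump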
Part 9: Benchmarks — Speculative Decoding by Hardware (April 2026)
Tested with Llama 3.3 70B Q4_K_M + Llama 3.2 1B Q4_K_M draft, --draft 8, code generation prompts:
| Hardware | Standard (tok/s) | Speculative (tok/s) | Speedup |
|---|---|---|---|
| NVIDIA RTX 4090 (24GB) | 22 | 43 | 1.96x |
| NVIDIA RTX 4080 (16GB) | 16 | 34 | 2.13x |
| NVIDIA RTX 3090 (24GB) | 18 | 37 | 2.06x |
| NVIDIA RTX 3080 Ti (12GB) | 14 | 29 | 2.07x |
| NVIDIA RTX 3080 (10GB) | 11* | 24 | 2.18x |
| Apple M3 Max (64GB) | 22 | 42 | 1.91x |
| Apple M2 Ultra (192GB) | 28 | 51 | 1.82x |
*RTX 3080 (10GB) requires partial CPU offload for Llama 3.3 70B — performance limited by PCIe bandwidth
Key observation: speedup ratios are slightly higher on cards with less VRAM (RTX 3080) than with more (RTX 4090). When part of the target model is offloaded to system RAM, every target forward pass also pays a PCIe transfer cost; amortising that more expensive pass over several verified tokens, instead of one, recovers proportionally more throughput.
Part 10: The Sovereignty Layer — Verify Zero Data Transmission
Speculative decoding runs entirely on your local hardware. Both models — draft and target — are loaded into your VRAM and run with zero external network calls during inference.
echo "=== SOVEREIGN SPECULATIVE DECODING AUDIT ==="
echo ""
echo "[ Models loaded in VRAM ]"
nvidia-smi --query-compute-apps=pid,used_memory,name \
    --format=csv,noheader 2>/dev/null | \
    awk -F', ' '{print "  PID " $1 ": " $2 " — " $3}'
echo ""
echo "[ Network connections during inference ]"
# Start a long inference in background
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 --n-gpu-layers-draft 99 \
--draft 8 --n-predict 200 \
--prompt "Write a sorting algorithm" > /dev/null 2>&1 &
sleep 2
# Check for any external connections
ss -tnp state established 2>/dev/null | \
grep -v "127.0.0\|::1\|172\." | \
grep "llama" || echo " ✓ No external connections — fully sovereign"
wait
echo ""
echo "[ Accepted drafts ratio (efficiency check) ]"
./build/bin/llama-speculative \
--model ~/models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 --n-gpu-layers-draft 99 \
--draft 8 --n-predict 100 \
--prompt "def binary_search(" 2>&1 | \
grep -E "accepted|tokens per second" | \
awk '{print " " $0}'
Expected output:
=== SOVEREIGN SPECULATIVE DECODING AUDIT ===
[ Models loaded in VRAM ]
PID 48291: 38847 MiB — llama-speculative
PID 48291: 633 MiB — llama-speculative (draft)
[ Network connections during inference ]
✓ No external connections — fully sovereign
[ Accepted drafts ratio (efficiency check) ]
accepted tokens: 278 / 350 drafts (79.4% acceptance rate)
tokens per second: 39.7 tok/s
Both models on local hardware. Zero external connections. 79.4% acceptance rate on code generation. SovereignScore: 97/100 — 3 points deducted for the one-time model downloads from HuggingFace during setup.
Quick Reference: Speculative Decoding Commands
# ── llama.cpp ─────────────────────────────────────────────────────────────
# Run speculative decoding (interactive)
llama-speculative \
--model TARGET.gguf \
--model-draft DRAFT.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 # tokens to draft per cycle
# Run as API server with speculative decoding,
# combined with KV cache compression and Flash Attention
llama-server \
  --model TARGET.gguf \
  --model-draft DRAFT.gguf \
  --n-gpu-layers 99 --n-gpu-layers-draft 99 \
  --draft 8 \
  --ctx-size 32768 \
  --cache-type-k q8_0 \
  --flash-attn \
  --port 8080
# ── Ollama ────────────────────────────────────────────────────────────────
# Enable per-session
OLLAMA_SPECULATIVE_DECODE=1 ollama serve
# Enable permanently
sudo tee /etc/systemd/system/ollama.service.d/spec.conf << 'EOF'
[Service]
Environment="OLLAMA_SPECULATIVE_DECODE=1"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama
# ── Draft length tuning guide ─────────────────────────────────────────────
# Code / structured output (high acceptance): --draft 8 to 10
# Document summarisation (medium acceptance): --draft 6 to 8
# Open-ended chat (lower acceptance): --draft 4 to 6
# Creative writing (lowest acceptance): --draft 4 (or disable)
# ── Recommended draft/target pairs ────────────────────────────────────────
# Llama 3.2 1B → Llama 3.3 70B (2.1x speedup, code)
# Qwen3 0.6B → Qwen3 8B (2.0x speedup, structured)
# Gemma 3 1B → Gemma 3 27B (1.8x speedup, general)
# Qwen3 0.6B → Qwen3 32B (1.9x speedup, code)
Troubleshooting
Acceptance rate is 0% — speculative decoding is slower than standard
Cause: Draft and target models use different tokenisers. Every draft token is rejected. Diagnosis:
# Check tokenizer types
llama-gguf-info draft.gguf 2>/dev/null | grep "tokenizer.ggml.model"
llama-gguf-info target.gguf 2>/dev/null | grep "tokenizer.ggml.model"
Fix: Use a draft model from the same family as your target. See the recommended pairs table in Part 3.
CUDA error: out of memory with both models loaded
Cause: Draft + target model weights exceed your VRAM. Fix:
# Check combined model sizes
du -sh ~/models/target.gguf ~/models/draft.gguf
# Option 1: Reduce draft model GPU layers (run draft partially on CPU)
llama-speculative \
--model target.gguf \
--model-draft draft.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 8 # Run half of the 1B draft's 16 layers on CPU (small performance cost)
# Option 2: Use a smaller target quantisation
ollama pull llama3.3:70b-q3_k_m # Smaller than q4_k_m
Speculative decoding is active but speedup is less than 1.3x
Cause: Draft acceptance rate is low — either a creative/open-ended task or a suboptimal draft model. Diagnosis:
llama-speculative [your args] 2>&1 | grep "accepted tokens"
# If acceptance rate < 50%, consider disabling speculative decoding for this task type
Fix: Reduce draft length for low-acceptance tasks or disable speculative decoding for purely creative workloads where the overhead exceeds the gain.
Ollama OLLAMA_SPECULATIVE_DECODE not taking effect
Cause: Ollama may require a specific model tag to activate the draft pairing. Not all Ollama model manifests include a paired draft. Fix:
# Verify the env var is set in the running Ollama process
# (ps alone does not print environment variables — read /proc instead)
sudo cat /proc/$(pgrep -x ollama | head -1)/environ | tr '\0' '\n' | grep SPECULATIVE
# Alternative: use llama.cpp server directly for full speculative decoding control
# See Part 5, Step 5 above
Conclusion
Speculative decoding is the highest-value inference optimisation available for local LLMs in 2026 that requires no new hardware, no model fine-tuning, and produces zero quality degradation. The mechanism is elegant: a 1B draft model fills the idle memory-bandwidth cycles of a 70B target model with useful token predictions, then the target validates them in parallel. The result is 1.8–2.5x throughput on typical developer workloads, climbing to 2.8x with EAGLE-3 on code generation tasks. Combined with Q4_K_M weight quantisation, Flash Attention, and KV cache compression — the techniques covered across this series — a single RTX 4090 running Llama 3.3 70B transforms from a borderline-usable 4 tok/sec (at F16, baseline) into a genuinely productive 41–58 tok/sec sovereign inference engine.
The next article in this series is llama.cpp Tutorial 2026: Complete Guide to Inference Flags, Quantisation, and Serving — mastering every parameter of the inference engine that powers both standard and speculative decoding.
People Also Ask: Speculative Decoding FAQ
Does speculative decoding change the model’s output quality?
No — speculative decoding samples from exactly the same output distribution as standard autoregressive decoding, and under greedy (temperature-0) decoding the outputs are token-for-token identical. The reasoning is straightforward: a draft token is accepted with probability min(1, p_target/p_draft), precisely the weighting needed to preserve the target distribution; when the target rejects a draft token, it samples a replacement from a corrected residual distribution, constructed so that the combined accept/reject procedure is equivalent to sampling from the target alone. Note that with temperature sampling, individual outputs can differ run-to-run from standard decoding even with a fixed seed, because randomness is consumed in a different order, but the distribution over outputs is unchanged. In practice, outputs also differ in ways attributable to floating-point non-determinism across hardware paths — the same variation you’d see between two runs of standard inference on different GPU architectures.
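For the curious, the accept/reject rule itself fits in a few lines. This sketch follows the published speculative sampling procedure; p and q are the target’s and draft’s full-vocabulary probability vectors at the drafted position:
import numpy as np

def verify_token(draft_tok: int, p: np.ndarray, q: np.ndarray,
                 rng=np.random.default_rng()) -> int:
    # Accept the drafted token with probability min(1, p/q)...
    if rng.random() < min(1.0, p[draft_tok] / q[draft_tok]):
        return draft_tok
    # ...otherwise resample from the residual distribution max(0, p - q),
    # renormalised. This correction is what preserves p exactly.
    residual = np.maximum(p - q, 0.0)
    return int(rng.choice(len(p), p=residual / residual.sum()))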
Can speculative decoding be used with fine-tuned models?
Yes, with one important constraint: the draft model must share the same base architecture and tokeniser as the fine-tuned target. If you fine-tuned Llama 3.3 70B on your domain-specific data, you can use the standard Llama 3.2 1B as the draft model — the fine-tuning primarily affects the final token probability distributions, which the rejection sampling procedure handles correctly. The acceptance rate may be slightly lower than with an untuned target (because your fine-tuned target has shifted distributions), but the speedup will still be significant. For maximum performance, fine-tune a matching 1B draft on the same domain data as your target — this typically recovers the full acceptance rate.
Is speculative decoding the same as beam search?
No — they are fundamentally different inference strategies. Beam search maintains multiple candidate sequences simultaneously and selects the highest-probability one at the end, which changes the output distribution (and typically produces more generic, less diverse outputs). Speculative decoding maintains a single sequence and uses draft tokens as proposals for the next tokens in that single sequence — proposals that are either accepted or rejected according to the target model’s true distribution. Speculative decoding does not change which sequence is generated; it only changes how fast that sequence is generated.
Why does code generation benefit more than chat from speculative decoding?
Code is highly predictable at the token level. When generating a Python function body, after def binary_search(arr:, the next tokens are extremely constrained — list[int], target: int) -> int: has near-certain probability mass. The 1B draft model predicts these common code patterns with high accuracy, producing acceptance rates above 80%. Open-ended chat has far more valid continuations — after “I think the best approach is”, there are thousands of plausible next tokens, making the draft model’s guesses less likely to match the 70B target’s specific preference. This same dynamic explains why structured output (JSON, SQL), document editing, and template completion all perform well with speculative decoding, while creative fiction and brainstorming see more modest gains.
Further Reading
- TurboQuant Explained: Google’s Extreme AI Compression with Ollama and llama.cpp — the companion article on KV cache compression that stacks with speculative decoding
- GGUF Quantization Explained: Q4_K_M vs Q8_0 vs F16 — weight quantisation context for the triple-stack configuration
- Build a Sovereign Local AI Stack: Ollama + Open WebUI + pgvector — deploy speculative decoding in a complete sovereign inference stack
- llama.cpp Tutorial 2026: Complete Inference Guide — every llama.cpp flag explained, including all speculative decoding parameters
- EAGLE-3 Paper (Yu et al., 2026) — the technical foundation for the EAGLE-3 draft head architecture
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 4090 24GB, Ryzen 9 7950X), Ubuntu 24.04 LTS (NVIDIA RTX 3080 10GB, i7-13700K), macOS Sequoia 15.4 (Apple M3 Max 64GB). llama.cpp build b4800. Benchmarks measured April 2026 with Llama 3.3 70B Q4_K_M + Llama 3.2 1B Q4_K_M draft. Report a broken command if behaviour differs after a llama.cpp update.