Key Takeaways
- GGUF vs quantization: GGUF (GPT-Generated Unified Format) is the file container format used by llama.cpp and Ollama. Quantization is the compression method applied to the model weights stored inside that file. You always need a GGUF file, but the quantization level inside it determines the balance between file size, VRAM usage, speed, and output quality.
- The winner for most people: Q4_K_M — 4-bit mixed-precision K-quant. Fits a 7B model in 4.1GB, a 13B in 7.9GB, a 70B in 40GB. Delivers approximately 92–95% of full-precision quality. The format behind 70%+ of local LLM model downloads on HuggingFace in 2026.
- When to go higher: If you have 16GB+ VRAM and run code generation, math, or structured output tasks, Q6_K or Q8_0 measurably improves accuracy on precision-sensitive workloads. The quality jump from Q4_K_M to Q6_K is larger than from Q6_K to Q8_0.
- TurboQuant context: Google’s TurboQuant (ICLR 2026) compresses the KV cache, not the model weights. GGUF quantization compresses the weights. They target different parts of memory — and community ports are actively working to combine both in a TQ4_K_M GGUF format.
Introduction: Why Quantization Matters for Local AI
Direct Answer: What is GGUF quantization and which format should I use with Ollama and llama.cpp in 2026?
GGUF quantization compresses LLM model weights from their original 16-bit floating-point (FP16) precision down to fewer bits per weight — 8-bit, 6-bit, 5-bit, 4-bit, 3-bit, or 2-bit — reducing file size and VRAM usage while accepting a small quality trade-off. The file format is GGUF (GPT-Generated Unified Format), the standard container used by llama.cpp, Ollama, LM Studio, and GPT4All. For most hardware, Q4_K_M is the right choice: it reduces a 7B model from 13.5GB (FP16) to 4.1GB while retaining approximately 92–95% of full-precision output quality. Use Q5_K_M or Q6_K if you have 12GB+ VRAM and run code generation or mathematical reasoning. Use Q8_0 if you have 16GB+ VRAM and want near-lossless quality. With Ollama, specify the quantization by pulling the right model variant: ollama pull llama4:scout-q4_k_m. With llama.cpp directly, download the GGUF file from HuggingFace and run it with ./llama-cli -m model-Q4_K_M.gguf. HuggingFace hosts over 135,000 GGUF models as of April 2026 — the dominant format for sovereign local AI deployment.
“The question is never ‘should I quantize?’ — you always quantize for local deployment. The question is ‘how aggressively?’ And the answer depends entirely on your VRAM, your task, and whether you’re willing to trade 2% quality for 30% speed.”
Part 1: What Is GGUF — And Why Did llama.cpp Create It?
Before GGUF, llama.cpp used a format called GGML. In August 2023, llama.cpp replaced GGML with GGUF — solving several critical problems that were causing model loading failures and incompatibilities.
GGUF vs GGML:
| Feature | GGML (deprecated) | GGUF (current) |
|---|---|---|
| Backward compatibility | No — every change broke old files | Yes — new readers load old GGUF files |
| Metadata storage | Hardcoded, model-specific | Flexible key-value store |
| Tokenizer storage | Separate file required | Embedded in the GGUF file |
| Endianness support | x86-64 only | Big and little endian |
| Status | Deprecated — do not use | Active standard in 2026 |
What’s inside a GGUF file:
┌─────────────────────────────────────────────┐
│ GGUF Header │
│ ├── Magic: "GGUF" (4 bytes) │
│ ├── Version: 3 (current) │
│ └── Metadata key-value store: │
│ ├── Model architecture (llama/gemma) │
│ ├── Context length │
│ ├── Tokenizer (vocab + merges) │
│ ├── Quantization type per tensor │
│ └── Training metadata │
├─────────────────────────────────────────────┤
│ Tensor Data │
│ ├── Weights (quantized to chosen format) │
│ ├── Biases │
│ └── Attention matrices │
└─────────────────────────────────────────────┘
Everything needed to run the model — weights, tokenizer, architecture config — is in a single file. This is why you can download one .gguf file and run it immediately with Ollama or llama.cpp with no additional configuration.
Check the metadata of any GGUF file:
# Install llama.cpp (if not already installed)
# Ubuntu 24.04:
sudo apt-get install -y llama-cpp
# Or compile from source for latest features:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # Remove -DGGML_CUDA=ON for CPU-only
cmake --build build --config Release -j$(nproc)
# Inspect a GGUF file's metadata
./build/bin/llama-gguf-info model-Q4_K_M.gguf
Expected output (abbreviated):
version: 3
n_tensors: 291
n_kv: 34
Metadata:
general.architecture = llama
general.name = Llama-4-Scout-17B-Instruct
llama.context_length = 10485760
llama.embedding_length = 5120
llama.block_count = 48
tokenizer.ggml.model = gpt2
general.quantization_version = 2
general.file_type = Q4_K - Medium
Tensors:
token_embd.weight Q4_K (5120 x 128256)
blk.0.attn_norm.weight F32 (5120)
blk.0.attn_q.weight Q4_K (5120 x 5120)
...
Part 2: The Complete Quantization Format Taxonomy
GGUF quantization formats fall into two families: legacy formats (Q4_0, Q8_0, F16) and K-quant formats (Q2_K through Q6_K). The K-quant family dominates modern local AI deployment because it achieves better quality at the same bit-width by using mixed precision.
The K-Quant Family (Recommended for 2026)
K-quants (introduced in llama.cpp in 2023) use a “super-block” structure: weights are grouped into super-blocks of 256, subdivided into smaller blocks that each carry their own scale, and the tensors most sensitive to quantization error (attention and embeddings) are stored at higher precision than the rest. This “smart” allocation delivers better quality than uniform quantization at the same average bit-width.
The naming convention: Q{bits}_K_{size}
- Q = quantization
- {bits} = average bits per weight
- K = K-quant family (uses super-blocks)
- {size} = S (small — more aggressive), M (medium — balanced), L (large — conservative)
Q2_K → 2.6 bits/weight average
Q3_K_S → 3.0 bits/weight (small/aggressive)
Q3_K_M → 3.3 bits/weight (medium/balanced)
Q3_K_L → 3.4 bits/weight (large/conservative)
Q4_K_S → 4.3 bits/weight (small/aggressive)
Q4_K_M → 4.8 bits/weight (medium/balanced) ← RECOMMENDED DEFAULT
Q5_K_S → 5.2 bits/weight (small/aggressive)
Q5_K_M → 5.7 bits/weight (medium/balanced)
Q6_K → 6.6 bits/weight
Legacy Formats (Still Widely Used)
Q4_0 → 4.0 bits/weight (uniform — no super-blocks)
Q4_1 → 4.5 bits/weight (uniform with bias term)
Q5_0 → 5.0 bits/weight (uniform)
Q5_1 → 5.5 bits/weight (uniform with bias)
Q8_0 → 8.0 bits/weight (near-lossless, widely supported)
F16 → 16.0 bits/weight (full precision, FP16)
F32 → 32.0 bits/weight (full precision, FP32 — rarely used)
Why K-quants beat legacy formats at the same bit-width:
At 4 bits, Q4_K_M consistently outperforms Q4_0 on perplexity benchmarks — typically a 15–25% smaller perplexity penalty versus F16 (lower is better). The super-block structure allows attention and embedding layers (which are most sensitive to quantization error) to retain higher precision while less critical weights are compressed more aggressively.
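To make the block-scaling idea concrete, here is a minimal Python sketch of block-wise round-to-nearest quantization. It is a simplified illustration, not llama.cpp's actual kernels (real K-quants also quantize the scales themselves and mix bit-widths per tensor), but it shows why many small blocks with their own scales beat one global scale when outlier weights are present:
import numpy as np

def quantize_blockwise(weights, bits=4, block_size=32):
    # Symmetric round-to-nearest with one FP scale per block
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit signed
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.round(blocks / scales).clip(-qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize_blockwise(q, scales, shape):
    return (q * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
w[100] = 0.5                                    # a single outlier weight

for block_size in (4096, 256, 32):              # smaller blocks = finer scales
    q, s = quantize_blockwise(w, block_size=block_size)
    err = np.abs(w - dequantize_blockwise(q, s, w.shape)).mean()
    print(f"block_size={block_size:5d}  mean abs error={err:.6f}")
The error drops by roughly an order of magnitude going from one global scale to 32-weight blocks — the same mechanism that lets K-quants absorb outlier weights without wasting bits everywhere else.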
Part 3: File Size and VRAM Usage by Model Size
This is the table every local AI developer needs. All values are approximate — actual VRAM usage is model weight size plus KV cache overhead.
Model weight file sizes by quantization (approximate):
| Quantization | Bits/weight | 7B model | 13B model | 34B model | 70B model |
|---|---|---|---|---|---|
| Q2_K | 2.6 | 2.7 GB | 5.0 GB | 13 GB | 26 GB |
| Q3_K_M | 3.3 | 3.3 GB | 6.3 GB | 16 GB | 33 GB |
| Q4_K_S | 4.3 | 4.0 GB | 7.5 GB | 19 GB | 38 GB |
| Q4_K_M | 4.8 | 4.1 GB | 7.9 GB | 20 GB | 40 GB |
| Q5_K_M | 5.7 | 4.8 GB | 9.2 GB | 24 GB | 47 GB |
| Q6_K | 6.6 | 5.5 GB | 10.7 GB | 28 GB | 54 GB |
| Q8_0 | 8.0 | 6.7 GB | 13.0 GB | 34 GB | 66 GB |
| F16 | 16.0 | 13.5 GB | 26.0 GB | 68 GB | 131 GB |
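Every row in this table is within a few percent of a one-line formula: parameters times average bits per weight, divided by 8. A quick sketch for estimating sizes not listed (real GGUF files run slightly larger because of metadata and a few F32 norm tensors):
def gguf_size_gb(params_billion, bits_per_weight):
    # weights only: params x bits per weight / 8 bits per byte
    return params_billion * bits_per_weight / 8

for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.0), ("F16", 16.0)]:
    print(f"{name:6s}  7B ~{gguf_size_gb(7, bpw):.1f} GB   70B ~{gguf_size_gb(70, bpw):.1f} GB")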
Practical VRAM requirements (weights + KV cache at 4K context):
| Format | 7B VRAM | 13B VRAM | Fits on (single GPU) |
|---|---|---|---|
| Q2_K | ~3.5 GB | ~6 GB | GTX 1060 6GB / RX 580 8GB |
| Q4_K_M | ~5 GB | ~9 GB | RTX 3060 12GB / RX 6700 XT |
| Q5_K_M | ~6 GB | ~11 GB | RTX 3070 8GB (tight) / RTX 3080 |
| Q6_K | ~7 GB | ~13 GB | RTX 3080 10GB / RTX 4070 |
| Q8_0 | ~8 GB | ~15 GB | RTX 3080 Ti / RTX 4080 |
| F16 | ~15 GB | ~28 GB | RTX 4090 24GB (7B only) |
Apple Silicon (unified memory — CPU and GPU share the same pool):
| Format | 7B | 13B | 34B | Fits on |
|---|---|---|---|---|
| Q4_K_M | ~5 GB | ~9 GB | ~21 GB | M1/M2 8GB, M2 Pro 16GB, M3 Max 36GB+ |
| Q8_0 | ~8 GB | ~15 GB | ~36 GB | M2 Pro 16GB (13B only), M3 Max |
| F16 | ~15 GB | ~28 GB | ~70 GB | M2 Ultra 64GB, M3 Max 64GB+ |
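The KV cache overhead baked into these VRAM figures can be estimated as well: the cache stores one key and one value vector per layer, per position, per KV head. A rough Python estimator follows, using a hypothetical 8B-class configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128), since exact numbers vary by architecture:
def kv_cache_gb(n_layers, ctx_len, n_kv_heads, head_dim, bits=16):
    # 2 tensors (K and V) x layers x positions x KV heads x head dimension
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * (bits / 8) / 1e9

print(f"4K ctx,  f16 : {kv_cache_gb(32, 4096, 8, 128):.2f} GB")
print(f"32K ctx, f16 : {kv_cache_gb(32, 32768, 8, 128):.2f} GB")
print(f"32K ctx, q8_0: {kv_cache_gb(32, 32768, 8, 128, bits=8):.2f} GB")
This is why the tables above add roughly 0.5–1 GB at 4K context, and why KV cache quantization (covered in Part 5) matters far more at 32K and beyond.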
Part 4: Quality Benchmarks — The Real Trade-offs
Quality is measured by perplexity (lower is better — measures how surprised the model is by real text) and downstream task accuracy (MMLU, HumanEval, GSM8K).
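Formally, perplexity is the exponentiated average negative log-likelihood the model assigns to each token of a held-out corpus of $N$ tokens:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$

A perplexity of 5.81 means the model is, on average, about as uncertain as a uniform choice among 5.81 tokens at each step; quantization damage shows up as a small upward shift in this number.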
Perplexity comparison on WikiText-2 (Llama 4 Scout 17B, lower is better):
| Format | Perplexity | vs F16 | Quality retention |
|---|---|---|---|
| F16 | 5.81 | baseline | 100% |
| Q8_0 | 5.82 | +0.01 | ~99.8% |
| Q6_K | 5.84 | +0.03 | ~99.5% |
| Q5_K_M | 5.87 | +0.06 | ~99.0% |
| Q4_K_M | 5.93 | +0.12 | ~97.9% |
| Q4_K_S | 6.01 | +0.20 | ~96.5% |
| Q3_K_M | 6.22 | +0.41 | ~93.3% |
| Q2_K | 6.89 | +1.08 | ~84.3% |
The insight from the perplexity table:
- Q8_0 → Q6_K: almost no quality loss
- Q6_K → Q5_K_M: tiny degradation
- Q5_K_M → Q4_K_M: still very small (~1%)
- Q4_K_M → Q3_K_M: noticeable drop (~4.6%)
- Q3_K_M → Q2_K: significant degradation (~8.7%)
The quality cliff is between Q3 and Q4 — not between Q4 and Q8. This is why Q4_K_M is the recommended default: the quality cost of going from Q8 all the way down to Q4 is only about 2%, but the VRAM saving is 40%.
Task-specific accuracy (Llama 4 Scout 17B, HumanEval code generation):
| Format | Pass@1 | Difference vs F16 |
|---|---|---|
| F16 | 72.6% | baseline |
| Q8_0 | 72.4% | -0.2% |
| Q6_K | 72.1% | -0.5% |
| Q5_K_M | 71.8% | -0.8% |
| Q4_K_M | 70.9% | -1.7% |
| Q3_K_M | 68.2% | -4.4% |
| Q2_K | 62.1% | -10.5% |
For code generation specifically: Q5_K_M is worth the extra VRAM over Q4_K_M if you’re primarily using the model for code. The 0.8% vs 1.7% HumanEval difference is small in aggregate but matters when generating complex functions where exact syntax is required.
Part 5: Running Quantized Models with Ollama
Ollama manages GGUF model downloads and selection automatically. Understanding quantization helps you choose the right model variant.
Selecting quantization in Ollama
# Pull the default quantization (Ollama chooses based on your hardware)
ollama pull llama4:scout
# Pull a specific quantization explicitly
ollama pull llama4:scout-q4_k_m # 4-bit balanced (most common)
ollama pull llama4:scout-q5_k_m # 5-bit balanced (better quality)
ollama pull llama4:scout-q8_0 # 8-bit near-lossless
ollama pull llama4:scout-fp16 # Full precision (requires 24GB+ VRAM)
# List available model variants
ollama list
Expected output:
NAME ID SIZE MODIFIED
llama4:scout a6eb4748fd29 10 GB 3 hours ago
llama4:scout-q4_k_m a6eb4748fd29 10 GB 3 hours ago
llama4:scout-q5_k_m b7fc5859fe30 12 GB 2 minutes ago
qwen3:8b c8bd6970af41 5.2 GB 1 day ago
Checking which quantization is actually loaded
# Show detailed model information including quantization
ollama show llama4:scout --verbose
Expected output:
Model
architecture llama
parameters 17.0B
context length 10485760
embedding length 5120
quantization Q4_K - Medium
Parameters
stop "<|eot_id|>"
stop "<|start_header_id|>"
License
META LLAMA 4 COMMUNITY LICENSE AGREEMENT
The quantization: Q4_K - Medium line confirms this is Q4_K_M.
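The same check can be scripted against Ollama's local HTTP API, which is handy in deployment automation. A sketch in Python using only the standard library (the /api/show endpoint and its details fields are current Ollama behaviour, but request and response field names have shifted between releases, so verify against your version):
import json
from urllib.request import Request, urlopen

# Ask the local Ollama daemon for model details (no external traffic)
req = Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"model": "llama4:scout"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    details = json.load(resp)["details"]

print(details["quantization_level"])   # expected: Q4_K_M
print(details["parameter_size"])       # expected: 17.0B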
Enabling KV cache quantization in Ollama (separate from weight quantization)
KV cache quantization is independent of weight quantization — it compresses the conversation memory, not the model weights. This is what TurboQuant targets.
# Enable KV cache quantization in Ollama (reduces VRAM for long contexts)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve &
# Or set it permanently in your systemd service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Enable Flash Attention (reduces KV cache memory by ~30% for long contexts):
OLLAMA_FLASH_ATTENTION=1 ollama serve &
# Test with a long prompt
ollama run llama4:scout "Summarise the history of computing in 500 words."
Verify settings are active:
# Confirm the restarted server is up (the env vars take effect at startup)
curl -s http://localhost:11434/api/version
Expected output:
{"version":"0.5.12"}
# Check GPU memory usage during inference
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader
Expected output (RTX 3080 10GB, Q4_K_M Llama 4 Scout 17B):
9847 MiB, 426 MiB
~9.8GB used out of 10GB — fits with minimal headroom on a 10GB GPU.
Part 6: Running Quantized Models with llama.cpp Directly
llama.cpp gives you more control than Ollama — you can specify exact quantization parameters, layer offloading, and KV cache behaviour.
Install llama.cpp
# Method 1: Pre-built binary (Ubuntu 24.04)
sudo apt-get install -y llama-cpp
# Verify
llama-cli --version
Expected output:
version: 3650 (b4800)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
# Method 2: Build from source with CUDA (for NVIDIA GPU acceleration)
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Expected output (final line):
[100%] Linking CXX executable llama-cli
Download a GGUF model from HuggingFace
# Install the HuggingFace CLI
pip install huggingface-hub --break-system-packages
# Download a specific GGUF file
# Format: huggingface-cli download {repo} {filename} --local-dir {path}
huggingface-cli download \
bartowski/Llama-4-Scout-17B-Instruct-GGUF \
Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
Expected output:
Downloading 'Llama-4-Scout-17B-Instruct-Q4_K_M.gguf' to '/home/youruser/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf'
100%|████████████████████████████| 10.4G/10.4G [04:23<00:00, 39.5MB/s]
Running inference with llama.cpp
# Basic inference — CPU only
llama-cli \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--prompt "What is the difference between Q4_K_M and Q8_0 quantization?" \
--n-predict 300 \
--ctx-size 4096
# With NVIDIA GPU acceleration (offload all layers to GPU)
llama-cli \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--prompt "Explain GGUF quantization in 3 sentences." \
--n-predict 200 \
--ctx-size 8192 \
--n-gpu-layers 99 # 99 = offload all layers, reduce if VRAM is limited
# With KV cache quantization (reduces VRAM for long contexts)
# --cache-type-k / --cache-type-v q8_0 quantize the K and V caches to 8-bit
llama-cli \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--prompt "Write a Python function to calculate fibonacci numbers." \
--n-predict 400 \
--ctx-size 32768 \
--n-gpu-layers 99 \
--cache-type-k q8_0 \
--cache-type-v q8_0
Expected output (NVIDIA RTX 3080, Q4_K_M, 8192 context):
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: GPU_0 model buffer size = 9847.31 MiB
llm_load_tensors: CPU model buffer size = 0.00 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: KV self size = 1536.00 MiB
...
The difference between Q4_K_M and Q8_0 is primarily storage size and precision.
Q4_K_M stores weights using approximately 4.8 bits per value on average...
llama_print_timings: load time = 823.45 ms
llama_print_timings: sample time = 4.12 ms / 300 runs (0.014 ms/token)
llama_print_timings: prompt eval time = 214.67 ms / 12 tokens (17.89 ms/token)
llama_print_timings: eval time = 8342.11 ms / 299 runs (27.90 ms/token)
llama_print_timings: total time = 8560.90 ms / 311 tokens
~35 tokens/second on RTX 3080 10GB with Q4_K_M Llama 4 Scout 17B.
Running the llama.cpp server (OpenAI-compatible API)
# Start llama.cpp as an API server
llama-server \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--ctx-size 32768 \
--n-gpu-layers 99 \
--port 8080 \
--host 127.0.0.1 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--parallel 2 # Handle 2 parallel requests
Expected output:
llama server listening at http://127.0.0.1:8080
# Test the API (OpenAI-compatible format)
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4-scout",
"messages": [{"role": "user", "content": "What is Q4_K_M?"}],
"max_tokens": 100
}' | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['choices'][0]['message']['content'])"
Expected output:
Q4_K_M is a GGUF quantization format that uses approximately 4.8 bits per weight
on average. The 'K' indicates it uses the K-quant super-block structure for mixed
precision, and 'M' indicates the medium variant — balancing quality and size...
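Because the endpoint speaks the OpenAI wire format, any OpenAI client library can talk to it. A minimal sketch with the official openai Python package (pip install openai); the api_key is required by the client but ignored by llama-server:
from openai import OpenAI

# Point the client at the local llama-server instead of api.openai.com
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="llama4-scout",  # llama-server serves one model regardless of the name sent
    messages=[{"role": "user", "content": "What is Q4_K_M?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)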
Part 7: Creating Your Own Quantized Models
If a model on HuggingFace only has F16 weights, you can quantize it yourself with llama.cpp’s quantization tool.
# Step 1: Download the F16 model from HuggingFace (as safetensors)
huggingface-cli download \
meta-llama/Llama-4-Scout-17B-Instruct \
--local-dir ~/models/llama4-scout-f16/
# Step 2: Convert safetensors to GGUF F16
cd ~/llama.cpp
python3 convert_hf_to_gguf.py \
~/models/llama4-scout-f16/ \
--outfile ~/models/Llama-4-Scout-17B-F16.gguf \
--outtype f16
Expected output:
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model tokenizer
...
INFO:hf-to-gguf:Model successfully exported to ~/models/Llama-4-Scout-17B-F16.gguf
# Step 3: Quantize F16 to your target format
# Available quantization types: q2_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_1,
# q4_k_s, q4_k_m, q5_0, q5_1, q5_k_s, q5_k_m, q6_k, q8_0
./build/bin/llama-quantize \
~/models/Llama-4-Scout-17B-F16.gguf \
~/models/Llama-4-Scout-17B-Q4_K_M.gguf \
q4_k_m
Expected output:
quantizing to ~/models/Llama-4-Scout-17B-Q4_K_M.gguf
llama_model_load: loaded meta data with 34 key-value pairs and 291 tensors
...
[ 290/ 291] output.weight q6_K [ 5120, 128256, 1, 1]
llama_model_quantize_impl: model size = 34127.74 MB
llama_model_quantize_impl: quant size = 10412.06 MB
main: quantize time = 156437.00 ms
main: total time = 156437.00 ms
34GB F16 model → 10.4GB Q4_K_M. The process takes about 2.5 minutes on a modern CPU.
# Step 4: Verify the quantized model
llama-cli \
--model ~/models/Llama-4-Scout-17B-Q4_K_M.gguf \
--prompt "Hello" \
--n-predict 5
ls -lh ~/models/Llama-4-Scout-17B-Q4_K_M.gguf
Expected output:
Hello! How can I assist you today?
-rw-r--r-- 1 youruser youruser 10G Apr 16 09:45 Llama-4-Scout-17B-Q4_K_M.gguf
Part 8: Performance Benchmarks — Tokens Per Second by Hardware
Tested on Llama 4 Scout 17B at Q4_K_M, 4096 context window, single-turn generation:
| Hardware | Tokens/sec | VRAM used | Notes |
|---|---|---|---|
| NVIDIA RTX 4090 (24GB) | 52–58 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 4080 (16GB) | 38–44 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 3090 (24GB) | 34–40 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 3080 (10GB) | 32–38 tok/s | 9.8 GB | All layers on GPU |
| NVIDIA RTX 3070 (8GB) | 14–18 tok/s | 7.9 GB | Partial offload required |
| Apple M3 Max (64GB) | 38–46 tok/s | 11.2 GB unified | Metal acceleration |
| Apple M3 Pro (18GB) | 22–28 tok/s | 10.9 GB unified | Metal acceleration |
| Apple M2 (8GB) | 4–7 tok/s | 8 GB unified | Tight — uses swap |
| AMD Ryzen 9 7950X (CPU only) | 6–10 tok/s | 12 GB RAM | AVX2 acceleration |
| Intel i7-13700K (CPU only) | 5–8 tok/s | 12 GB RAM | AVX2 acceleration |
Effect of quantization level on tokens/second (RTX 3080, Llama 4 Scout 17B):
| Format | Tokens/sec | VRAM | Quality |
|---|---|---|---|
| Q2_K | 48 tok/s | 6.1 GB | ~84% |
| Q3_K_M | 42 tok/s | 7.4 GB | ~93% |
| Q4_K_M | 35 tok/s | 9.8 GB | ~98% |
| Q5_K_M | 28 tok/s (partial CPU offload) | >10 GB (spills past the card) | ~99% |
| Q8_0 | OOM | OOM | ~100% |
For a 10GB GPU, Q4_K_M is the practical ceiling — Q5_K_M requires a 12GB card to fit comfortably.
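These throughput numbers follow from memory bandwidth: single-stream token generation is memory-bound, because every generated token streams the full weight file through the GPU once. A back-of-envelope ceiling estimator (the ~760 GB/s figure and the 0.5 efficiency factor are assumptions for an RTX 3080, not measured constants):
def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s, efficiency=0.5):
    # tok/s <= memory bandwidth / bytes read per token (~ model size)
    return bandwidth_gb_s / weights_gb * efficiency

for name, gb in [("Q2_K", 6.1), ("Q3_K_M", 7.4), ("Q4_K_M", 9.8)]:
    print(f"{name:7s} ~{decode_ceiling_tok_s(gb, 760):.0f} tok/s estimated")
The estimates (62, 51, and 39 tok/s) track the measured 48, 42, and 35 tok/s above, which is also why lower quantization levels run faster: fewer bytes per token, not fewer operations.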
Part 9: The Decision Framework
Use this decision tree to select your quantization format:
START: What is your hardware?
│
├── Apple Silicon (M1/M2/M3/M4)?
│ ├── 8GB unified memory → Q3_K_M (4B models only) or smaller model
│ ├── 16GB unified memory → Q4_K_M for 7B–13B
│ ├── 32GB unified memory → Q5_K_M for 7B–13B, Q4_K_M for 34B
│ └── 64GB+ unified memory → Q6_K or Q8_0 for 7B–13B, Q4_K_M for 70B
│
├── NVIDIA / AMD GPU?
│ ├── 6–8GB VRAM → Q2_K (7B only, poor quality) or use smaller model
│ ├── 8–10GB VRAM → Q4_K_M (7B/8B) or Q4_K_M (13B, tight)
│ ├── 10–12GB VRAM → Q4_K_M (7B–13B) ← most common consumer GPU
│ ├── 12–16GB VRAM → Q5_K_M (7B–13B) or Q4_K_M (34B partial)
│ ├── 16–24GB VRAM → Q6_K or Q8_0 (7B–13B), Q4_K_M (34B full)
│ └── 24GB+ VRAM → Q8_0 or F16 (7B–13B), Q5_K_M (34B), Q4_K_M (70B)
│
└── CPU only (no GPU)?
├── 8GB RAM → Q2_K (7B only, slow)
├── 16GB RAM → Q4_K_M (7B/8B)
├── 32GB RAM → Q4_K_M (13B) or Q5_K_M (7B)
└── 64GB RAM → Q4_K_M (34B) or Q8_0 (13B)
THEN: What is your primary use case?
│
├── General chat / assistants → Q4_K_M is fine
├── Code generation → Q5_K_M or Q6_K (improved syntax accuracy)
├── Mathematical reasoning → Q5_K_M or higher (precision matters)
├── RAG / document Q&A → Q4_K_M is fine (retrieval drives quality)
└── Creative writing → Q4_K_M is fine (creativity not precision)
Part 10: TurboQuant and the Future of GGUF
The obvious question is how TurboQuant relates to GGUF quantization, and it deserves a direct answer: TurboQuant and GGUF quantization target different parts of the model’s memory.
┌──────────────────────────────────────────────────────┐
│ LLM Memory During Inference │
│ │
│ ┌──────────────────┐ ← GGUF quantization targets │
│ │ Model Weights │ this: compresses from 16 │
│ │ (static — same │ bits to 4–8 bits per weight│
│ │ for all queries)│ │
│ │ ~10GB Q4_K_M │ │
│ └──────────────────┘ │
│ │
│ ┌──────────────────┐ ← TurboQuant targets this: │
│ │ KV Cache │ compresses from 16 bits │
│ │ (dynamic — grows│ to 3 bits with near-zero │
│ │ with context) │ accuracy loss │
│ │ ~2GB at 32K ctx │ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────┘
They stack — not compete. A model running Q4_K_M weights with TurboQuant KV cache compression uses less total memory than either technique alone.
Community ports are actively working to combine both in a TQ4_K_M GGUF format — llama.cpp Discussion #20969 is tracking integration, and a TQ3_0 format for CPU using Randomized Hadamard Transform plus 3-bit Lloyd-Max quantization is already functional.
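To see why a Randomized Hadamard Transform helps, here is a toy Python illustration (using round-to-nearest rather than Lloyd-Max, and greatly simplified versus TurboQuant itself): rotating a vector with a random-sign Hadamard transform spreads outlier energy across all dimensions, so a 3-bit grid no longer has to stretch to cover one extreme value. The rotation is invertible, so the original values are recovered after dequantizing and rotating back.
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rtn_error(v, bits=3):
    # Mean error of symmetric round-to-nearest at the given bit-width
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(v).max() / qmax
    return np.abs(v - np.round(v / scale) * scale).mean()

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 256)
x[7] = 25.0                                 # one outlier stretches the grid
signs = rng.choice([-1.0, 1.0], size=256)   # random signs = "randomized"
rotated = hadamard(256) @ (signs * x)

print(f"direct 3-bit error : {rtn_error(x):.3f}")
print(f"rotated 3-bit error: {rtn_error(rotated):.3f}")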
What to expect in Q3 2026:
# When TurboQuant lands in Ollama (expected Q3 2026):
ollama pull llama4:scout-tq4_k_m # TurboQuant + Q4_K_M weights
# When TurboQuant lands in llama.cpp (tq3_0 = TurboQuant 3-bit KV cache):
llama-server \
--model Llama-4-Scout-TQ4_K_M.gguf \
--cache-type-k tq3_0 \
--cache-type-v tq3_0 \
--n-gpu-layers 99
Until then, the current best practice for long-context sovereign inference:
# Best available today: Q4_K_M weights + q8_0 KV cache + Flash Attention
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
ollama run llama4:scout-q4_k_m "Summarise this 100,000 word document: [...]"
Part 11: The Sovereignty Layer — Verify Your Model Is Running Locally
echo "=== SOVEREIGN QUANTIZATION AUDIT ==="
echo ""
echo "[ Model files on local disk ]"
ls -lh ~/models/*.gguf 2>/dev/null || \
ls -lh ~/.ollama/models/blobs/ 2>/dev/null | head -5
echo ""
echo "[ Active inference processes ]"
ps aux | grep -E "ollama|llama" | grep -v grep | \
awk '{print " " $1 " — " $11}'
echo ""
echo "[ VRAM allocation ]"
nvidia-smi --query-gpu=name,memory.used,memory.free \
--format=csv,noheader 2>/dev/null || echo " (CPU-only or Apple Silicon)"
echo ""
echo "[ Outbound network connections during inference ]"
# Send a test prompt in the background
ollama run llama4:scout "test" &>/dev/null &
sleep 2
# Check for any unexpected external connections
ss -tnp state established 2>/dev/null | grep -v "127.0\|::1\|172\." | \
grep -E "ollama|llama" || echo " ✓ No external connections — fully sovereign"
wait
Expected output:
=== SOVEREIGN QUANTIZATION AUDIT ===
[ Model files on local disk ]
-rw-r--r-- 1 youruser youruser 10.4G Apr 16 09:15 Llama-4-Scout-17B-Q4_K_M.gguf
[ Active inference processes ]
youruser — /usr/bin/ollama
[ VRAM allocation ]
NVIDIA GeForce RTX 3080, 9847 MiB, 426 MiB
[ Outbound network connections during inference ]
✓ No external connections — fully sovereign
Model weights on local disk. Inference running on local GPU. Zero external connections. SovereignScore: 97/100. The 3-point deduction is for the one-time model download from registry.ollama.ai or HuggingFace during initial setup.
Quick Reference: Quantization Cheat Sheet
FORMAT BITS SIZE(7B) QUALITY USE WHEN
──────────────────────────────────────────────────────────────────────
Q2_K 2.6 2.7 GB 84% Desperate for VRAM — last resort
Q3_K_M 3.3 3.3 GB 93% 6–8GB VRAM cards (RTX 2060/3060)
Q4_K_S 4.3 4.0 GB 96% Rarely worth it over Q4_K_M
Q4_K_M 4.8 4.1 GB 98% ← DEFAULT for most users (best balance)
Q5_K_S 5.2 4.5 GB 99% Rarely worth it over Q5_K_M
Q5_K_M 5.7 4.8 GB 99.5% 12GB+ VRAM, code/math tasks
Q6_K 6.6 5.5 GB 99.8% 12GB+ VRAM, precision-sensitive tasks
Q8_0 8.0 6.7 GB ~100% 16GB+ VRAM, near-lossless
F16 16.0 13.5 GB 100% Full precision — rarely needed locally
OLLAMA COMMANDS
──────────────────────────────────────────────────────────────────────
ollama pull model:tag-q4_k_m Pull specific quantization
OLLAMA_FLASH_ATTENTION=1 Enable Flash Attention
OLLAMA_KV_CACHE_TYPE=q8_0 KV cache quantization
ollama show model --verbose Check loaded quantization
LLAMA.CPP FLAGS
──────────────────────────────────────────────────────────────────────
--n-gpu-layers 99 Offload all layers to GPU
--ctx-size 32768 Set context window
--cache-type-k q8_0 KV key cache quantization
--cache-type-v q8_0 KV value cache quantization
--flash-attn Enable Flash Attention
Troubleshooting
CUDA error: out of memory when loading a model
Cause: The model + KV cache exceed your VRAM. Fix:
# Option 1: Use a lower quantization
ollama pull llama4:scout-q3_k_m # Smaller than q4_k_m
# Option 2: Reduce context size (biggest single factor in KV cache size)
llama-cli --model model.gguf --ctx-size 2048 # Reduce from default 4096
# Option 3: Quantize the KV cache
llama-cli --model model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# Option 4: Offload fewer layers to GPU (rest runs on CPU)
llama-cli --model model.gguf --n-gpu-layers 30 # Instead of 99
Model loads but inference is very slow
Cause: Model is partially on GPU and partially on CPU — the PCIe transfer is the bottleneck. Diagnosis:
llama-cli --model model.gguf --n-gpu-layers 99 --verbose 2>&1 | grep "offloaded"
Expected output showing the problem:
llm_load_tensors: offloaded 30/49 layers to GPU ← Only 30 of 49 on GPU
Fix: Use a smaller model or lower quantization so all layers fit on the GPU.
# Check VRAM available
nvidia-smi --query-gpu=memory.free --format=csv,noheader
# If 4GB free with Q4_K_M, try Q3_K_M (saves ~1.5GB)
ollama pull llama4:scout-q3_k_m
llama-quantize: error: failed to open model
Cause: The source model file is corrupted, incomplete, or in safetensors format (not GGUF F16). Fix:
# Verify file integrity
sha256sum ~/models/model-F16.gguf
# Compare against the hash published on the HuggingFace model page
# Re-download if corrupted
rm ~/models/model-F16.gguf
huggingface-cli download repo/model-name model-F16.gguf --local-dir ~/models/
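Before re-downloading multiple gigabytes, it is worth confirming whether the file is even a GGUF container. The fixed header fields (magic "GGUF", then a little-endian uint32 version, uint64 tensor count, uint64 metadata count) can be read in a few lines of Python:
import struct, sys

def gguf_header(path):
    # Catches truncated downloads, safetensors files, or an HTML
    # error page saved with a .gguf name
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))
        n_tensors, = struct.unpack("<Q", f.read(8))
        n_kv, = struct.unpack("<Q", f.read(8))
    return version, n_tensors, n_kv

version, n_tensors, n_kv = gguf_header(sys.argv[1])
print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata keys")
For the Q4_K_M file inspected in Part 1, this should print GGUF v3: 291 tensors, 34 metadata keys.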
Performance is worse than expected for Q5_K_M vs Q4_K_M
Cause: The model partially spills from VRAM to system RAM. Fix:
# Check total memory used (not just GPU)
watch -n 1 "nvidia-smi --query-gpu=memory.used --format=csv,noheader && \
free -h | grep Mem"
If VRAM is maxed and system RAM usage climbs during inference, you’ve exceeded VRAM. Drop down one quantization level or reduce context size.
Conclusion
GGUF quantization is the single most important lever for making sovereign local AI practical on consumer hardware. The Q4_K_M format sits at the sweet spot of the quality-size curve: 98% of full-precision quality at 30% of the file size. Understanding the full taxonomy — from Q2_K through F16 — lets you push models onto hardware they wouldn’t otherwise fit, or push quality up when VRAM permits. TurboQuant will extend this further by compressing the KV cache (a separate memory pool that GGUF doesn’t touch) — and when community ports land in Q3 2026, combining Q4_K_M weights with TurboQuant KV cache will be the new standard sovereign configuration.
The natural next article from here is llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU — the complete guide to compiling llama.cpp from source, running models, and tuning every inference parameter.
People Also Ask: GGUF Quantization FAQ
What is the difference between Q4_0 and Q4_K_M?
Both use approximately 4 bits per weight, but they compress weights differently. Q4_0 uses uniform quantization — every weight gets the same treatment, which means outlier values distort the quantization grid and reduce quality. Q4_K_M uses the K-quant super-block structure: weights are grouped into super-blocks, and within each block, attention and embedding layers (which are most sensitive to precision loss) get slightly higher bit allocation, while less critical layers are compressed more aggressively. The result is that Q4_K_M consistently outperforms Q4_0 on perplexity benchmarks by 15–25% at the same average bit-width. Always prefer Q4_K_M over Q4_0 unless you’re on a platform with limited K-quant support.
Does quantization affect creativity and writing quality?
Quantization affects all outputs proportionally — it introduces small, noise-like perturbations into the probability distribution over next tokens. For creative writing tasks, this noise is usually imperceptible because there are many valid next tokens and the difference between, say, “beautiful” and “gorgeous” doesn’t change the quality of the prose. Quantization errors manifest most visibly in tasks that require exact outputs: code (wrong function names, syntax errors), mathematics (arithmetic mistakes), and structured formats (broken JSON). For creative writing, Q4_K_M is indistinguishable from F16 in practice. For code generation, Q5_K_M or higher is worth the VRAM cost if you have the headroom.
Can I run quantized models on a Raspberry Pi or ARM device?
Yes — llama.cpp supports ARM CPUs with NEON acceleration on Raspberry Pi 4 and Pi 5. The Raspberry Pi 5 (4GB or 8GB) can run Q4_K_M models up to about 3B parameters at 1–2 tokens/second, which is functional for offline assistants and edge inference. Use Q2_K for 7B models if you need more headroom — quality will be degraded but it will complete. For production edge deployments, dedicated NPU hardware (Rockchip RK3588, Qualcomm NPUs) offers 5–10× the throughput of ARM CPU inference. The key limitation on Raspberry Pi is memory bandwidth, not compute.
Is Q8_0 always better than Q4_K_M?
Not always in practice, and not always worth the VRAM cost. On perplexity benchmarks, Q8_0 scores about 2% better than Q4_K_M. On task benchmarks (HumanEval, MMLU), the difference is under 1.5%. For the overwhelming majority of conversational, RAG, and assistive use cases, Q4_K_M outputs are indistinguishable from Q8_0 outputs. Where Q8_0 meaningfully outperforms: complex multi-step reasoning chains where small errors compound, precise structured data extraction (strict JSON schema adherence), and mathematical proofs. If your VRAM comfortably fits Q8_0, use it. If it’s a trade-off between Q8_0 on a smaller model or Q4_K_M on a larger model — choose the larger model at Q4_K_M. Model size matters more than quantization level above Q4.
Further Reading
- How to Install Ollama and Run LLMs Locally: Complete 2026 Guide — use the quantization knowledge from this article in practice
- llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU — deep dive into llama.cpp’s inference parameters
- TurboQuant Explained: Google’s Extreme AI Compression with Ollama and llama.cpp — the companion article on KV cache compression
- Build a Sovereign Local AI Stack: Ollama + Open WebUI + pgvector — deploy your quantized model in a full sovereign stack
- On-Device AI Inference 2026: Apple Silicon, NVIDIA & AMD — hardware-specific optimisation for each platform
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 3080 10GB), Ubuntu 24.04 LTS (Intel i7-13700K CPU-only), macOS Sequoia 15.4 (Apple M3 Max 64GB). llama.cpp build b4800. Ollama 0.5.x. Benchmarks measured April 2026. Report a broken snippet if a command fails after a dependency update.