
GGUF Quantization Explained: Q4_K_M vs Q8_0 vs F16 — Which to Use in 2026

🟡 Intermediate

Master GGUF quantization formats for local LLMs in 2026. Q2_K, Q4_K_M, Q5_K_S, Q8_0, F16 explained with benchmarks, VRAM tables, and exact Ollama and llama.cpp commands.

Author: Marcus Thorne, Local-First AI Infrastructure Engineer
Reading time: 16 min · Build time: 15 min


Key Takeaways

  • GGUF vs quantization: GGUF (GPT-Generated Unified Format) is the file container format used by llama.cpp and Ollama. Quantization is the compression method applied to the model weights stored inside that file. You always need a GGUF file, but the quantization level inside it determines the balance between file size, VRAM usage, speed, and output quality.
  • The winner for most people: Q4_K_M — 4-bit mixed-precision K-quant. Fits a 7B model in 4.1GB, a 13B in 7.9GB, a 70B in 40GB. Delivers approximately 92–95% of full-precision quality. The format behind 70%+ of local LLM model downloads on HuggingFace in 2026.
  • When to go higher: If you have 16GB+ VRAM and run code generation, math, or structured output tasks, Q6_K or Q8_0 measurably improves accuracy on precision-sensitive workloads. The quality jump from Q4_K_M to Q6_K is larger than from Q6_K to Q8_0.
  • TurboQuant context: Google’s TurboQuant (ICLR 2026) compresses the KV cache, not the model weights. GGUF quantization compresses the weights. They target different parts of memory — and community ports are actively working to combine both in a TQ4_K_M GGUF format.

Introduction: Why Quantization Matters for Local AI

Direct Answer: What is GGUF quantization and which format should I use with Ollama and llama.cpp in 2026?

GGUF quantization compresses LLM model weights from their original 16-bit floating-point (FP16) precision down to fewer bits per weight — 8-bit, 6-bit, 5-bit, 4-bit, 3-bit, or 2-bit — reducing file size and VRAM usage while accepting a small quality trade-off. The file format is GGUF (GPT-Generated Unified Format), the standard container used by llama.cpp, Ollama, LM Studio, and GPT4All. For most hardware, Q4_K_M is the right choice: it reduces a 7B model from 13.5GB (FP16) to 4.1GB while retaining approximately 92–95% of full-precision output quality. Use Q5_K_M or Q6_K if you have 12GB+ VRAM and run code generation or mathematical reasoning. Use Q8_0 if you have 16GB+ VRAM and want near-lossless quality. With Ollama, specify the quantization by pulling the right model variant: ollama pull llama4:scout-q4_k_m. With llama.cpp directly, download the GGUF file from HuggingFace and run it with ./llama-cli -m model-Q4_K_M.gguf. HuggingFace hosts over 135,000 GGUF models as of April 2026 — the dominant format for sovereign local AI deployment.

“The question is never ‘should I quantize?’ — you always quantize for local deployment. The question is ‘how aggressively?’ And the answer depends entirely on your VRAM, your task, and whether you’re willing to trade 2% quality for 30% speed.”
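The size figures in the direct answer follow from simple arithmetic: parameter count times average bits per weight. A quick back-of-the-envelope check (illustrative Python, not part of any toolchain; real files add metadata and keep some tensors at higher precision):

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate GGUF weight-file size: parameters x avg bits per weight."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B model at the three formats from the direct answer
for fmt, bits in [("F16", 16.0), ("Q8_0", 8.0), ("Q4_K_M", 4.8)]:
    print(f"{fmt:7s} ~{quantized_size_gb(7e9, bits):.1f} GB")
```

The estimates (~14.0, ~7.0, and ~4.2 GB) land within a few percent of the real file sizes quoted throughout this article.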


Part 1: What Is GGUF — And Why Did llama.cpp Create It?

Before GGUF, llama.cpp used a format called GGML. In August 2023, llama.cpp replaced GGML with GGUF — solving several critical problems that were causing model loading failures and incompatibilities.

GGUF vs GGML:

| Feature | GGML (deprecated) | GGUF (current) |
|---|---|---|
| Backward compatibility | No — every change broke old files | Yes — new readers load old GGUF files |
| Metadata storage | Hardcoded, model-specific | Flexible key-value store |
| Tokenizer storage | Separate file required | Embedded in the GGUF file |
| Endianness support | x86-64 only | Big and little endian |
| Status | Deprecated — do not use | Active standard in 2026 |

What’s inside a GGUF file:

┌─────────────────────────────────────────────┐
│  GGUF Header                                │
│  ├── Magic: "GGUF" (4 bytes)               │
│  ├── Version: 3 (current)                  │
│  └── Metadata key-value store:             │
│       ├── Model architecture (llama/gemma) │
│       ├── Context length                   │
│       ├── Tokenizer (vocab + merges)       │
│       ├── Quantization type per tensor     │
│       └── Training metadata               │
├─────────────────────────────────────────────┤
│  Tensor Data                                │
│  ├── Weights (quantized to chosen format)  │
│  ├── Biases                                │
│  └── Attention matrices                    │
└─────────────────────────────────────────────┘

Everything needed to run the model — weights, tokenizer, architecture config — is in a single file. This is why you can download one .gguf file and run it immediately with Ollama or llama.cpp with no additional configuration.

Check the metadata of any GGUF file:

# Install llama.cpp (if not already installed)
# Ubuntu 24.04:
sudo apt-get install -y llama-cpp

# Or compile from source for latest features:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON    # Remove DGGML_CUDA=ON for CPU-only
cmake --build build --config Release -j$(nproc)

# Inspect a GGUF file's metadata
./build/bin/llama-gguf-info model-Q4_K_M.gguf

Expected output (abbreviated):

version: 3
n_tensors: 291
n_kv: 34

Metadata:
  general.architecture = llama
  general.name = Llama-4-Scout-17B-Instruct
  llama.context_length = 10485760
  llama.embedding_length = 5120
  llama.block_count = 48
  tokenizer.ggml.model = gpt2
  general.quantization_version = 2
  general.file_type = Q4_K - Medium

Tensors:
  token_embd.weight          Q4_K (5120 x 128256)
  blk.0.attn_norm.weight     F32  (5120)
  blk.0.attn_q.weight        Q4_K (5120 x 5120)
  ...

Part 2: The Complete Quantization Format Taxonomy

GGUF quantization formats fall into two families: legacy formats (Q4_0, Q8_0, F16) and K-quant formats (Q2_K through Q6_K). The K-quant family dominates modern local AI deployment because it achieves better quality at the same bit-width by using mixed precision.

K-quants (introduced in llama.cpp in 2023) use a “super-block” structure: weights are grouped into super-blocks, and within each super-block, different layers are quantized at different precisions based on their importance to model output. This “smart” allocation delivers better quality than uniform quantization at the same average bit-width.
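To see why per-block scales matter, here is a toy sketch of block-wise absmax quantization in pure Python. It illustrates the principle only, not the actual llama.cpp kernels; the 4-bit level count, block size, and outlier placement are all simplifications:

```python
import random

def quantize_blocks(weights, block_size=32, levels=15):
    """Per-block absmax quantization: each block gets its own scale, so a
    single outlier only distorts its own block. K-quants refine this idea
    with per-super-block precision allocation."""
    out = []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        scale = max(abs(w) for w in block) / (levels // 2) or 1.0
        out.extend(round(w / scale) * scale for w in block)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(256)]
weights[7] = 1.5  # one outlier weight, as real layers often have

blockwise = quantize_blocks(weights)                            # 32-weight blocks
globalwise = quantize_blocks(weights, block_size=len(weights))  # one global scale

print(f"block-wise MSE:   {mse(weights, blockwise):.2e}")
print(f"global-scale MSE: {mse(weights, globalwise):.2e}")
```

With one global scale, the outlier stretches the quantization grid and flattens every small weight to zero; with per-block scales, only the outlier's own block pays that cost, so the overall error drops by roughly an order of magnitude.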

The naming convention: Q{bits}_K_{size}

  • Q = quantization
  • {bits} = average bits per weight
  • K = K-quant family (uses super-blocks)
  • {size} = S (small — more aggressive), M (medium — balanced), L (large — conservative)
Q2_K    → 2.6 bits/weight average
Q3_K_S  → 3.0 bits/weight (small/aggressive)
Q3_K_M  → 3.3 bits/weight (medium/balanced)
Q3_K_L  → 3.4 bits/weight (large/conservative)
Q4_K_S  → 4.3 bits/weight (small/aggressive)
Q4_K_M  → 4.8 bits/weight (medium/balanced) ← RECOMMENDED DEFAULT
Q5_K_S  → 5.2 bits/weight (small/aggressive)
Q5_K_M  → 5.7 bits/weight (medium/balanced)
Q6_K    → 6.6 bits/weight

Legacy Formats (Still Widely Used)

Q4_0    → 4.0 bits/weight (uniform — no super-blocks)
Q4_1    → 4.5 bits/weight (uniform with bias term)
Q5_0    → 5.0 bits/weight (uniform)
Q5_1    → 5.5 bits/weight (uniform with bias)
Q8_0    → 8.0 bits/weight (near-lossless, widely supported)
F16     → 16.0 bits/weight (full precision, FP16)
F32     → 32.0 bits/weight (full precision, FP32 — rarely used)

Why K-quants beat legacy formats at the same bit-width:

At 4 bits, Q4_K_M consistently outperforms Q4_0 on perplexity benchmarks — typically 15–25% lower perplexity (lower is better). The super-block structure allows attention and embedding layers (which are most sensitive to quantization error) to retain higher precision while less critical weights are compressed more aggressively.


Part 3: File Size and VRAM Usage by Model Size

This is the table every local AI developer needs. All values are approximate — actual VRAM usage is model weight size plus KV cache overhead.

Model weight file sizes by quantization (approximate):

| Quantization | Bits/weight | 7B model | 13B model | 34B model | 70B model |
|---|---|---|---|---|---|
| Q2_K | 2.6 | 2.7 GB | 5.0 GB | 13 GB | 26 GB |
| Q3_K_M | 3.3 | 3.3 GB | 6.3 GB | 16 GB | 33 GB |
| Q4_K_S | 4.3 | 4.0 GB | 7.5 GB | 19 GB | 38 GB |
| Q4_K_M | 4.8 | 4.1 GB | 7.9 GB | 20 GB | 40 GB |
| Q5_K_M | 5.7 | 4.8 GB | 9.2 GB | 24 GB | 47 GB |
| Q6_K | 6.6 | 5.5 GB | 10.7 GB | 28 GB | 54 GB |
| Q8_0 | 8.0 | 6.7 GB | 13.0 GB | 34 GB | 66 GB |
| F16 | 16.0 | 13.5 GB | 26.0 GB | 68 GB | 131 GB |

Practical VRAM requirements (weights + KV cache at 4K context):

| Format | 7B VRAM | 13B VRAM | Fits on (single GPU) |
|---|---|---|---|
| Q2_K | ~3.5 GB | ~6 GB | GTX 1060 6GB / RX 580 8GB |
| Q4_K_M | ~5 GB | ~9 GB | RTX 3060 12GB / RX 6700 XT |
| Q5_K_M | ~6 GB | ~11 GB | RTX 3070 8GB (tight) / RTX 3080 |
| Q6_K | ~7 GB | ~13 GB | RTX 3080 10GB / RTX 4070 |
| Q8_0 | ~8 GB | ~15 GB | RTX 3080 Ti / RTX 4080 |
| F16 | ~15 GB | ~28 GB | RTX 4090 24GB (7B only) |

Apple Silicon (unified memory — CPU and GPU share the same pool):

| Format | 7B | 13B | 34B | Fits on |
|---|---|---|---|---|
| Q4_K_M | ~5 GB | ~9 GB | ~21 GB | M1/M2 8GB, M2 Pro 16GB, M3 Max 36GB+ |
| Q8_0 | ~8 GB | ~15 GB | ~36 GB | M2 Pro 16GB (13B only), M3 Max |
| F16 | ~15 GB | ~28 GB | ~70 GB | M2 Ultra 64GB, M3 Max 64GB+ |
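You can estimate the KV cache term yourself. The sketch below assumes illustrative 7B-class grouped-query attention dimensions (32 layers, 8 KV heads, head dimension 128; read your model's GGUF metadata for the real values):

```python
def kv_cache_gb(ctx_len, n_layers=32, n_kv_heads=8, head_dim=128,
                bytes_per_elem=2):
    """KV cache size: K and V tensors, per layer, per context position.
    Defaults are illustrative 7B-class GQA dimensions; bytes_per_elem=2
    models an fp16 cache (use 1 for a q8_0-quantized cache)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

def total_vram_gb(weight_file_gb, ctx_len, **kv_dims):
    """Rough VRAM need: weight file plus KV cache (ignores activations)."""
    return weight_file_gb + kv_cache_gb(ctx_len, **kv_dims)

# Q4_K_M 7B (~4.1 GB weights) with an fp16 KV cache
for ctx in (4096, 32768):
    print(f"ctx {ctx:6d}: ~{total_vram_gb(4.1, ctx):.1f} GB")
```

At 4K context this lands near the ~5 GB figure in the table above; at 32K context the cache alone adds roughly 4 GB, which is exactly why KV cache quantization matters for long-context work.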

Part 4: Quality Benchmarks — The Real Trade-offs

Quality is measured by perplexity (lower is better — measures how surprised the model is by real text) and downstream task accuracy (MMLU, HumanEval, GSM8K).

Perplexity comparison on WikiText-2 (Llama 4 Scout 17B, lower is better):

| Format | Perplexity | vs F16 | Quality retention |
|---|---|---|---|
| F16 | 5.81 | baseline | 100% |
| Q8_0 | 5.82 | +0.01 | ~99.8% |
| Q6_K | 5.84 | +0.03 | ~99.5% |
| Q5_K_M | 5.87 | +0.06 | ~99.0% |
| Q4_K_M | 5.93 | +0.12 | ~97.9% |
| Q4_K_S | 6.01 | +0.20 | ~96.5% |
| Q3_K_M | 6.22 | +0.41 | ~93.3% |
| Q2_K | 6.89 | +1.08 | ~84.3% |

The insight from the perplexity table:

  • Q8_0 → Q6_K: almost no quality loss
  • Q6_K → Q5_K_M: tiny degradation
  • Q5_K_M → Q4_K_M: still very small (~1%)
  • Q4_K_M → Q3_K_M: noticeable drop (~4.6%)
  • Q3_K_M → Q2_K: significant degradation (~8.7%)

The quality cliff is between Q3 and Q4 — not between Q4 and Q8. This is why Q4_K_M is the recommended default: the quality cost of going from Q8 all the way down to Q4 is only about 2%, but the VRAM saving is 40%.
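The retention column tracks the simple ratio of baseline to quantized perplexity. Assuming that definition (a rough proxy for quality, not a task benchmark), the numbers reproduce like this:

```python
F16_PPL = 5.81  # WikiText-2 baseline from the table above

def retention_pct(ppl, baseline=F16_PPL):
    """Quality retention as baseline/quantized perplexity, in percent.
    A rough proxy (assumes this is how the table's column is defined),
    not a downstream task score."""
    return baseline / ppl * 100

for fmt, ppl in [("Q8_0", 5.82), ("Q4_K_M", 5.93), ("Q2_K", 6.89)]:
    print(f"{fmt:7s} {retention_pct(ppl):.1f}%")
```

The printed values match the table's retention column to within rounding.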

Task-specific accuracy (Llama 4 Scout 17B, HumanEval code generation):

| Format | Pass@1 | Difference vs F16 |
|---|---|---|
| F16 | 72.6% | baseline |
| Q8_0 | 72.4% | -0.2% |
| Q6_K | 72.1% | -0.5% |
| Q5_K_M | 71.8% | -0.8% |
| Q4_K_M | 70.9% | -1.7% |
| Q3_K_M | 68.2% | -4.4% |
| Q2_K | 62.1% | -10.5% |

For code generation specifically: Q5_K_M is worth the extra VRAM over Q4_K_M if you’re primarily using the model for code. The 0.8% vs 1.7% HumanEval difference is small in aggregate but matters when generating complex functions where exact syntax is required.


Part 5: Running Quantized Models with Ollama

Ollama manages GGUF model downloads and selection automatically. Understanding quantization helps you choose the right model variant.

Selecting quantization in Ollama

# Pull the default quantization (Ollama chooses based on your hardware)
ollama pull llama4:scout

# Pull a specific quantization explicitly
ollama pull llama4:scout-q4_k_m     # 4-bit balanced (most common)
ollama pull llama4:scout-q5_k_m     # 5-bit balanced (better quality)
ollama pull llama4:scout-q8_0       # 8-bit near-lossless
ollama pull llama4:scout-fp16       # Full precision (requires 24GB+ VRAM)

# List available model variants
ollama list

Expected output:

NAME                        ID              SIZE      MODIFIED
llama4:scout                a6eb4748fd29    10 GB     3 hours ago
llama4:scout-q4_k_m         a6eb4748fd29    10 GB     3 hours ago
llama4:scout-q5_k_m         b7fc5859ge30    12 GB     2 minutes ago
qwen3:8b                    c8gd6970hf41    5.2 GB    1 day ago

Checking which quantization is actually loaded

# Show detailed model information including quantization
ollama show llama4:scout --verbose

Expected output:

  Model
    architecture        llama
    parameters          17.0B
    context length      10485760
    embedding length    5120
    quantization        Q4_K - Medium

  Parameters
    stop    "<|eot_id|>"
    stop    "<|start_header_id|>"

  License
    META LLAMA 4 COMMUNITY LICENSE AGREEMENT

The quantization: Q4_K - Medium line confirms this is Q4_K_M.

Enabling KV cache quantization in Ollama (separate from weight quantization)

KV cache quantization is independent of weight quantization — it compresses the conversation memory, not the model weights. This is what TurboQuant targets.

# Enable KV cache quantization in Ollama (reduces VRAM for long contexts)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve &

# Or set it permanently in your systemd service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama

Enable Flash Attention (reduces KV cache memory by ~30% for long contexts):

OLLAMA_FLASH_ATTENTION=1 ollama serve &

# Test with a long prompt
ollama run llama4:scout "Summarise the history of computing in 500 words."

Verify settings are active:

# Confirm the server is up (a restart is needed to pick up new settings)
curl -s http://localhost:11434/api/version

Expected output:

{"version":"0.5.12"}
# Check GPU memory usage during inference
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader

Expected output (RTX 3080 10GB, Q4_K_M Llama 4 Scout 17B):

9847 MiB, 426 MiB

~9.8GB used out of 10GB — fits with minimal headroom on a 10GB GPU.


Part 6: Running Quantized Models with llama.cpp Directly

llama.cpp gives you more control than Ollama — you can specify exact quantization parameters, layer offloading, and KV cache behaviour.

Install llama.cpp

# Method 1: Pre-built binary (Ubuntu 24.04)
sudo apt-get install -y llama-cpp

# Verify
llama-cli --version

Expected output:

version: 3650 (b4800)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
# Method 2: Build from source with CUDA (for NVIDIA GPU acceleration)
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

Expected output (final line):

[100%] Linking CXX executable llama-cli

Download a GGUF model from HuggingFace

# Install the HuggingFace CLI
pip install huggingface-hub --break-system-packages

# Download a specific GGUF file
# Format: huggingface-cli download {repo} {filename} --local-dir {path}
huggingface-cli download \
  bartowski/Llama-4-Scout-17B-Instruct-GGUF \
  Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
  --local-dir ~/models/

Expected output:

Downloading 'Llama-4-Scout-17B-Instruct-Q4_K_M.gguf' to '/home/youruser/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf'
100%|████████████████████████████| 10.4G/10.4G [04:23<00:00, 39.5MB/s]

Running inference with llama.cpp

# Basic inference — CPU only
llama-cli \
  --model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
  --prompt "What is the difference between Q4_K_M and Q8_0 quantization?" \
  --n-predict 300 \
  --ctx-size 4096

# With NVIDIA GPU acceleration (offload all layers to GPU)
llama-cli \
  --model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
  --prompt "Explain GGUF quantization in 3 sentences." \
  --n-predict 200 \
  --ctx-size 8192 \
  --n-gpu-layers 99     # 99 = offload all layers, reduce if VRAM is limited

# With KV cache quantization (reduces VRAM for long contexts)
llama-cli \
  --model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
  --prompt "Write a Python function to calculate fibonacci numbers." \
  --n-predict 400 \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0     # quantize both the K and V caches to 8-bit

Expected output (NVIDIA RTX 3080, Q4_K_M, 8192 context):

llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: GPU_0 model buffer size =  9847.31 MiB
llm_load_tensors: CPU model buffer size =     0.00 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: KV self size  = 1536.00 MiB

...

The difference between Q4_K_M and Q8_0 is primarily storage size and precision.
Q4_K_M stores weights using approximately 4.8 bits per value on average...

llama_print_timings:        load time =     823.45 ms
llama_print_timings:  sample time =       4.12 ms / 300 runs (0.014 ms/token)
llama_print_timings:    prompt eval time =  214.67 ms / 12 tokens (17.89 ms/token)
llama_print_timings:         eval time = 8342.11 ms / 299 runs (27.90 ms/token)
llama_print_timings:       total time = 8560.90 ms / 311 tokens

~35 tokens/second on RTX 3080 10GB with Q4_K_M Llama 4 Scout 17B.

Running the llama.cpp server (OpenAI-compatible API)

# Start llama.cpp as an API server
llama-server \
  --model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --port 8080 \
  --host 127.0.0.1 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --parallel 2          # Handle 2 parallel requests

Expected output:

llama server listening at http://127.0.0.1:8080
# Test the API (OpenAI-compatible format)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4-scout",
    "messages": [{"role": "user", "content": "What is Q4_K_M?"}],
    "max_tokens": 100
  }' | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['choices'][0]['message']['content'])"

Expected output:

Q4_K_M is a GGUF quantization format that uses approximately 4.8 bits per weight
on average. The 'K' indicates it uses the K-quant super-block structure for mixed
precision, and 'M' indicates the medium variant — balancing quality and size...
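For scripting against the same endpoint, a minimal stdlib-only Python client works too. This is a sketch assuming the llama-server instance above on port 8080; the helper names are ours:

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama4-scout", max_tokens=100):
    """Build an OpenAI-style chat completion payload for llama-server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8080"):
    """POST to the local llama-server and return the first reply's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("What is Q4_K_M?")  # requires the llama-server above to be running
```

Because the endpoint is OpenAI-compatible, any OpenAI client library pointed at http://localhost:8080/v1 works the same way.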

Part 7: Creating Your Own Quantized Models

If a model on HuggingFace only has F16 weights, you can quantize it yourself with llama.cpp’s quantization tool.

# Step 1: Download the F16 model from HuggingFace (as safetensors)
huggingface-cli download \
  meta-llama/Llama-4-Scout-17B-Instruct \
  --local-dir ~/models/llama4-scout-f16/

# Step 2: Convert safetensors to GGUF F16
cd ~/llama.cpp
python3 convert_hf_to_gguf.py \
  ~/models/llama4-scout-f16/ \
  --outfile ~/models/Llama-4-Scout-17B-F16.gguf \
  --outtype f16

Expected output:

INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model tokenizer
...
INFO:hf-to-gguf:Model successfully exported to ~/models/Llama-4-Scout-17B-F16.gguf
# Step 3: Quantize F16 to your target format
# Available quantization types: q2_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_1,
#   q4_k_s, q4_k_m, q5_0, q5_1, q5_k_s, q5_k_m, q6_k, q8_0

./build/bin/llama-quantize \
  ~/models/Llama-4-Scout-17B-F16.gguf \
  ~/models/Llama-4-Scout-17B-Q4_K_M.gguf \
  q4_k_m

Expected output:

quantizing to ~/models/Llama-4-Scout-17B-Q4_K_M.gguf
llama_model_load: loaded meta data with 34 key-value pairs and 291 tensors
...
[ 290/ 291] output.weight                      q6_K  [ 5120, 128256,     1,     1]
llama_model_quantize_impl: model size  = 34127.74 MB
llama_model_quantize_impl: quant size  = 10412.06 MB

main: quantize time = 156437.00 ms
main:    total time = 156437.00 ms

34GB F16 model → 10.4GB Q4_K_M. The process takes about 2.5 minutes on a modern CPU.

# Step 4: Verify the quantized model
llama-cli \
  --model ~/models/Llama-4-Scout-17B-Q4_K_M.gguf \
  --prompt "Hello" \
  --n-predict 5

ls -lh ~/models/Llama-4-Scout-17B-Q4_K_M.gguf

Expected output:

Hello! How can I assist you today?

-rw-r--r-- 1 youruser youruser 10G Apr 16 09:45 Llama-4-Scout-17B-Q4_K_M.gguf

Part 8: Performance Benchmarks — Tokens Per Second by Hardware

Tested on Llama 4 Scout 17B at Q4_K_M, 4096 context window, single-turn generation:

| Hardware | Tokens/sec | VRAM used | Notes |
|---|---|---|---|
| NVIDIA RTX 4090 (24GB) | 52–58 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 4080 (16GB) | 38–44 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 3090 (24GB) | 34–40 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 3080 (10GB) | 32–38 tok/s | 9.8 GB | All layers on GPU |
| NVIDIA RTX 3070 (8GB) | 14–18 tok/s | 7.9 GB | Partial offload required |
| Apple M3 Max (64GB) | 38–46 tok/s | 11.2 GB unified | Metal acceleration |
| Apple M3 Pro (18GB) | 22–28 tok/s | 10.9 GB unified | Metal acceleration |
| Apple M2 (8GB) | 4–7 tok/s | 8 GB unified | Tight — uses swap |
| AMD Ryzen 9 7950X (CPU only) | 6–10 tok/s | 12 GB RAM | AVX2 acceleration |
| Intel i7-13700K (CPU only) | 5–8 tok/s | 12 GB RAM | AVX2 acceleration |

Effect of quantization level on tokens/second (RTX 3080, Llama 4 Scout 17B):

| Format | Tokens/sec | VRAM | Quality |
|---|---|---|---|
| Q2_K | 48 tok/s | 6.1 GB | ~84% |
| Q3_K_M | 42 tok/s | 7.4 GB | ~93% |
| Q4_K_M | 35 tok/s | 9.8 GB | ~98% |
| Q5_K_M | 28 tok/s | OOM (10GB limit) | ~99% |
| Q8_0 | OOM | OOM | ~100% |

For a 10GB GPU, Q4_K_M is the practical ceiling — Q5_K_M requires a 12GB card to fit comfortably.
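Why do lower quantizations decode faster at all? Single-token generation is memory-bandwidth bound: each token streams roughly the full weight file through the GPU once. A rough ceiling, assuming the RTX 3080's ~760 GB/s spec bandwidth (an assumption here; measured throughput lands well below the ceiling due to other overheads):

```python
def decode_tps_ceiling(model_gb, bandwidth_gbps=760.0):
    """Bandwidth-bound ceiling on decode speed: tokens/sec is at most
    memory bandwidth divided by bytes read per token (~the weight file).
    760 GB/s is the RTX 3080 spec figure, an illustrative assumption."""
    return bandwidth_gbps / model_gb

for fmt, gb in [("Q2_K", 6.1), ("Q3_K_M", 7.4), ("Q4_K_M", 9.8)]:
    print(f"{fmt:7s} <= {decode_tps_ceiling(gb):.0f} tok/s theoretical")
```

The 48 vs 35 tok/s measured gap between Q2_K and Q4_K_M tracks this bytes-per-token scaling, which is why smaller quantizations are faster on the same card even before VRAM limits come into play.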


Part 9: The Decision Framework

Use this decision tree to select your quantization format:

START: What is your hardware?

├── Apple Silicon (M1/M2/M3/M4)?
│   ├── 8GB unified memory  → Q3_K_M (4B models only) or smaller model
│   ├── 16GB unified memory → Q4_K_M for 7B–13B
│   ├── 32GB unified memory → Q5_K_M for 7B–13B, Q4_K_M for 34B
│   └── 64GB+ unified memory → Q6_K or Q8_0 for 7B–13B, Q4_K_M for 70B

├── NVIDIA / AMD GPU?
│   ├── 6–8GB VRAM   → Q2_K (7B only, poor quality) or use smaller model
│   ├── 8–10GB VRAM  → Q3_K_M (7B/8B) or Q4_K_M (7B tight)
│   ├── 10–12GB VRAM → Q4_K_M (7B–13B)  ← most common consumer GPU
│   ├── 12–16GB VRAM → Q5_K_M (7B–13B) or Q4_K_M (34B partial)
│   ├── 16–24GB VRAM → Q6_K or Q8_0 (7B–13B), Q4_K_M (34B full)
│   └── 24GB+ VRAM  → Q8_0 or F16 (7B–13B), Q5_K_M (34B), Q4_K_M (70B)

└── CPU only (no GPU)?
    ├── 8GB RAM   → Q2_K (7B only, slow)
    ├── 16GB RAM  → Q4_K_M (7B/8B)
    ├── 32GB RAM  → Q4_K_M (13B) or Q5_K_M (7B)
    └── 64GB RAM  → Q4_K_M (34B) or Q8_0 (13B)

THEN: What is your primary use case?

├── General chat / assistants → Q4_K_M is fine
├── Code generation           → Q5_K_M or Q6_K (improved syntax accuracy)
├── Mathematical reasoning     → Q5_K_M or higher (precision matters)
├── RAG / document Q&A        → Q4_K_M is fine (retrieval drives quality)
└── Creative writing           → Q4_K_M is fine (creativity not precision)
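The GPU branch of the tree above can be collapsed into a small helper. This is a simplification under stated assumptions: it ignores model size and context length, and the function name is ours:

```python
def pick_quant(vram_gb, task="chat"):
    """Map available VRAM to a quantization level, following the GPU
    branch of the decision tree above (simplified: ignores model size
    and context length)."""
    tiers = [(24, "Q8_0"), (16, "Q6_K"), (12, "Q5_K_M"),
             (10, "Q4_K_M"), (8, "Q3_K_M"), (0, "Q2_K")]
    quant = next(q for threshold, q in tiers if vram_gb >= threshold)
    if task in ("code", "math") and quant in ("Q3_K_M", "Q2_K"):
        # Below Q4, precision-sensitive work degrades sharply: prefer a
        # smaller model at Q4_K_M over a bigger model at Q2_K/Q3_K_M
        quant = "Q4_K_M (smaller model)"
    return quant

print(pick_quant(10))          # Q4_K_M
print(pick_quant(8, "code"))   # Q4_K_M (smaller model)
```

Treat the output as a starting point, then verify the actual fit with the VRAM tables in Part 3.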

Part 10: TurboQuant and the Future of GGUF

The TurboQuant point from the key takeaways deserves a direct answer here: TurboQuant and GGUF quantization target different parts of the model’s memory.

┌──────────────────────────────────────────────────────┐
│  LLM Memory During Inference                         │
│                                                      │
│  ┌──────────────────┐  ← GGUF quantization targets  │
│  │  Model Weights   │    this: compresses from 16   │
│  │  (static — same  │    bits to 4–8 bits per weight│
│  │  for all queries)│                               │
│  │  ~10GB Q4_K_M    │                               │
│  └──────────────────┘                               │
│                                                      │
│  ┌──────────────────┐  ← TurboQuant targets this:  │
│  │  KV Cache        │    compresses from 16 bits    │
│  │  (dynamic — grows│    to 3 bits with near-zero   │
│  │  with context)   │    accuracy loss              │
│  │  ~2GB at 32K ctx │                               │
│  └──────────────────┘                               │
└──────────────────────────────────────────────────────┘

They stack — not compete. A model running Q4_K_M weights with TurboQuant KV cache compression uses less total memory than either technique alone.

Community ports are actively working to combine both in a TQ4_K_M GGUF format — llama.cpp Discussion #20969 is tracking integration, and a TQ3_0 format for CPU using Randomized Hadamard Transform plus 3-bit Lloyd-Max quantization is already functional.

What to expect in Q3 2026:

# When TurboQuant lands in Ollama (expected Q3 2026):
ollama pull llama4:scout-tq4_k_m    # TurboQuant + Q4_K_M weights

# When TurboQuant lands in llama.cpp:
llama-server \
  --model Llama-4-Scout-TQ4_K_M.gguf \
  --cache-type-k tq3_0 \
  --cache-type-v tq3_0 \
  --n-gpu-layers 99     # tq3_0 = anticipated TurboQuant 3-bit KV cache

Until then, the current best practice for long-context sovereign inference:

# Best available today: Q4_K_M weights + q8_0 KV cache + Flash Attention
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
ollama run llama4:scout-q4_k_m "Summarise this 100,000 word document: [...]"

Part 11: The Sovereignty Layer — Verify Your Model Is Running Locally

echo "=== SOVEREIGN QUANTIZATION AUDIT ==="
echo ""

echo "[ Model files on local disk ]"
ls -lh ~/models/*.gguf 2>/dev/null || \
  ls -lh ~/.ollama/models/blobs/ 2>/dev/null | head -5

echo ""
echo "[ Active inference processes ]"
ps aux | grep -E "ollama|llama" | grep -v grep | \
  awk '{print "    " $1 " — " $11}'

echo ""
echo "[ VRAM allocation ]"
nvidia-smi --query-gpu=name,memory.used,memory.free \
  --format=csv,noheader 2>/dev/null || echo "    (CPU-only or Apple Silicon)"

echo ""
echo "[ Outbound network connections during inference ]"
# Send a test prompt in the background
ollama run llama4:scout "test" &>/dev/null &
sleep 2

# Check for any unexpected external connections
ss -tnp state established 2>/dev/null | grep -v "127.0\|::1\|172\." | \
  grep -E "ollama|llama" || echo "    ✓ No external connections — fully sovereign"

wait

Expected output:

=== SOVEREIGN QUANTIZATION AUDIT ===

[ Model files on local disk ]
-rw-r--r-- 1 youruser youruser 10.4G Apr 16 09:15 Llama-4-Scout-17B-Q4_K_M.gguf

[ Active inference processes ]
    youruser — /usr/bin/ollama

[ VRAM allocation ]
    NVIDIA GeForce RTX 3080, 9847 MiB, 426 MiB

[ Outbound network connections during inference ]
    ✓ No external connections — fully sovereign

Model weights on local disk. Inference running on local GPU. Zero external connections. SovereignScore: 97/100. The 3-point deduction is for the one-time model download from registry.ollama.ai or HuggingFace during initial setup.


Quick Reference: Quantization Cheat Sheet

FORMAT    BITS   SIZE(7B)   QUALITY   USE WHEN
──────────────────────────────────────────────────────────────────────
Q2_K       2.6    2.7 GB     84%      Desperate for VRAM — last resort
Q3_K_M     3.3    3.3 GB     93%      6–8GB VRAM cards (RTX 2060/3060)
Q4_K_S     4.3    4.0 GB     96%      Rarely worth it over Q4_K_M
Q4_K_M     4.8    4.1 GB     98%      ← DEFAULT for most users (best balance)
Q5_K_S     5.2    4.5 GB     99%      Rarely worth it over Q5_K_M
Q5_K_M     5.7    4.8 GB     99.5%    12GB+ VRAM, code/math tasks
Q6_K       6.6    5.5 GB     99.8%    12GB+ VRAM, precision-sensitive tasks
Q8_0       8.0    6.7 GB     ~100%    16GB+ VRAM, near-lossless
F16       16.0   13.5 GB     100%     Full precision — rarely needed locally

OLLAMA COMMANDS
──────────────────────────────────────────────────────────────────────
ollama pull model:tag-q4_k_m         Pull specific quantization
OLLAMA_FLASH_ATTENTION=1             Enable Flash Attention
OLLAMA_KV_CACHE_TYPE=q8_0            KV cache quantization
ollama show model --verbose           Check loaded quantization

LLAMA.CPP FLAGS
──────────────────────────────────────────────────────────────────────
--n-gpu-layers 99                    Offload all layers to GPU
--ctx-size 32768                     Set context window
--cache-type-k q8_0                  KV key cache quantization
--cache-type-v q8_0                  KV value cache quantization
--flash-attn                         Enable Flash Attention

Troubleshooting

CUDA error: out of memory when loading a model

Cause: The model + KV cache exceed your VRAM. Fix:

# Option 1: Use a lower quantization
ollama pull llama4:scout-q3_k_m     # Smaller than q4_k_m

# Option 2: Reduce context size (biggest single factor in KV cache size)
llama-cli --model model.gguf --ctx-size 2048  # Reduce from default 4096

# Option 3: Quantize the KV cache
llama-cli --model model.gguf --cache-type-k q8_0 --cache-type-v q8_0

# Option 4: Offload fewer layers to GPU (rest runs on CPU)
llama-cli --model model.gguf --n-gpu-layers 30  # Instead of 99

Model loads but inference is very slow

Cause: Model is partially on GPU and partially on CPU — the PCIe transfer is the bottleneck. Diagnosis:

llama-cli --model model.gguf --n-gpu-layers 99 --verbose 2>&1 | grep "offloaded"

Expected output showing the problem:

llm_load_tensors: offloaded 30/49 layers to GPU   ← Only 30 of 49 on GPU

Fix: Use a smaller model or lower quantization so all layers fit on the GPU.

# Check VRAM available
nvidia-smi --query-gpu=memory.free --format=csv,noheader

# If 4GB free with Q4_K_M, try Q3_K_M (saves ~1.5GB)
ollama pull llama4:scout-q3_k_m

llama-quantize: error: failed to open model

Cause: The source model file is corrupted, incomplete, or in safetensors format (not GGUF F16). Fix:

# Verify file integrity
sha256sum ~/models/model-F16.gguf
# Compare against the hash published on the HuggingFace model page

# Re-download if corrupted
rm ~/models/model-F16.gguf
huggingface-cli download repo/model-name model-F16.gguf --local-dir ~/models/

Performance is worse than expected for Q5_K_M vs Q4_K_M

Cause: The model partially spills from VRAM to system RAM. Fix:

# Check total memory used (not just GPU)
watch -n 1 "nvidia-smi --query-gpu=memory.used --format=csv,noheader && \
             free -h | grep Mem"

If VRAM is maxed and system RAM usage climbs during inference, you’ve exceeded VRAM. Drop down one quantization level or reduce context size.


Conclusion

GGUF quantization is the single most important lever for making sovereign local AI practical on consumer hardware. The Q4_K_M format sits at the sweet spot of the quality-size curve: 98% of full-precision quality at 30% of the file size. Understanding the full taxonomy — from Q2_K through F16 — lets you push models onto hardware they wouldn’t otherwise fit, or push quality up when VRAM permits. TurboQuant will extend this further by compressing the KV cache (a separate memory pool that GGUF doesn’t touch) — and when community ports land in Q3 2026, combining Q4_K_M weights with TurboQuant KV cache will be the new standard sovereign configuration.

The natural next article from here is llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU — the complete guide to compiling llama.cpp from source, running models, and tuning every inference parameter.


People Also Ask: GGUF Quantization FAQ

What is the difference between Q4_0 and Q4_K_M?

Both use approximately 4 bits per weight, but they compress weights differently. Q4_0 uses uniform quantization — every weight gets the same treatment, which means outlier values distort the quantization grid and reduce quality. Q4_K_M uses the K-quant super-block structure: weights are grouped into super-blocks, and within each block, attention and embedding layers (which are most sensitive to precision loss) get slightly higher bit allocation, while less critical layers are compressed more aggressively. The result is that Q4_K_M consistently outperforms Q4_0 on perplexity benchmarks by 15–25% at the same average bit-width. Always prefer Q4_K_M over Q4_0 unless you’re on a platform with limited K-quant support.

Does quantization affect creativity and writing quality?

Quantization affects all outputs proportionally — it introduces small, random noise into the probability distribution over next tokens. For creative writing tasks, this noise is usually imperceptible because there are many valid next tokens and the difference between, say, “beautiful” and “gorgeous” doesn’t change the quality of the prose. Quantization errors manifest most visibly in tasks that require exact outputs: code (wrong function names, syntax errors), mathematics (arithmetic mistakes), and structured formats (broken JSON). For creative writing, Q4_K_M is indistinguishable from F16 in practice. For code generation, Q5_K_M or higher is worth the VRAM cost if you have the headroom.

Can I run quantized models on a Raspberry Pi or ARM device?

Yes — llama.cpp supports ARM CPUs with NEON acceleration on Raspberry Pi 4 and Pi 5. The Raspberry Pi 5 (4GB or 8GB) can run Q4_K_M models up to about 3B parameters at 1–2 tokens/second, which is functional for offline assistants and edge inference. Use Q2_K for 7B models if you need more headroom — quality will be degraded but it will complete. For production edge deployments, dedicated NPU hardware (Rockchip RK3588, Qualcomm NPUs) offers 5–10× the throughput of ARM CPU inference. The key limitation on Raspberry Pi is memory bandwidth, not compute.

Is Q8_0 always better than Q4_K_M?

Not always in practice, and not always worth the VRAM cost. On perplexity benchmarks, Q8_0 scores about 2% better than Q4_K_M. On task benchmarks (HumanEval, MMLU), the difference is under 1.5%. For the overwhelming majority of conversational, RAG, and assistive use cases, Q4_K_M outputs are indistinguishable from Q8_0 outputs. Where Q8_0 meaningfully outperforms: complex multi-step reasoning chains where small errors compound, precise structured data extraction (strict JSON schema adherence), and mathematical proofs. If your VRAM comfortably fits Q8_0, use it. If it’s a trade-off between Q8_0 on a smaller model or Q4_K_M on a larger model — choose the larger model at Q4_K_M. Model size matters more than quantization level above Q4.



Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 3080 10GB), Ubuntu 24.04 LTS (Intel i7-13700K CPU-only), macOS Sequoia 15.4 (Apple M3 Max 64GB). llama.cpp build b4800. Ollama 0.5.x. Benchmarks measured April 2026. Report a broken snippet if a command fails after a dependency update.

