Key Takeaways
- GGUF vs quantization: GGUF (GPT-Generated Unified Format) is the file container format used by llama.cpp and Ollama. Quantization is the compression method applied to the model weights stored inside that file. You always need a GGUF file, but the quantization level inside it determines the balance between file size, VRAM usage, speed, and output quality.
- The winner for most people: Q4_K_M — 4-bit mixed-precision K-quant. Fits a 7B model in 4.1GB, a 13B in 7.9GB, a 70B in 40GB. Delivers approximately 92–95% of full-precision quality. The format behind 70%+ of local LLM model downloads on HuggingFace in 2026.
- When to go higher: If you have 16GB+ VRAM and run code generation, math, or structured output tasks, Q6_K or Q8_0 measurably improves accuracy on precision-sensitive workloads. The quality jump from Q4_K_M to Q6_K is larger than from Q6_K to Q8_0.
- TurboQuant context: Google’s TurboQuant (ICLR 2026) compresses the KV cache, not the model weights. GGUF quantization compresses the weights. They target different parts of memory — and community ports are actively working to combine both in a TQ4_K_M GGUF format.
Introduction: Why Quantization Matters for Local AI
Direct Answer: What is GGUF quantization and which format should I use with Ollama and llama.cpp in 2026?
GGUF quantization compresses LLM model weights from their original 16-bit floating-point (FP16) precision down to fewer bits per weight — 8-bit, 6-bit, 5-bit, 4-bit, 3-bit, or 2-bit — reducing file size and VRAM usage while accepting a small quality trade-off. The file format is GGUF (GPT-Generated Unified Format), the standard container used by llama.cpp, Ollama, LM Studio, and GPT4All. For most hardware, Q4_K_M is the right choice: it reduces a 7B model from 13.5GB (FP16) to 4.1GB while retaining approximately 92–95% of full-precision output quality. Use Q5_K_M or Q6_K if you have 12GB+ VRAM and run code generation or mathematical reasoning. Use Q8_0 if you have 16GB+ VRAM and want near-lossless quality. With Ollama, specify the quantization by pulling the right model variant: ollama pull llama4:scout-q4_k_m. With llama.cpp directly, download the GGUF file from HuggingFace and run it with ./llama-cli -m model-Q4_K_M.gguf. HuggingFace hosts over 135,000 GGUF models as of April 2026 — the dominant format for sovereign local AI deployment.
“The question is never ‘should I quantize?’ — you always quantize for local deployment. The question is ‘how aggressively?’ And the answer depends entirely on your VRAM, your task, and whether you’re willing to trade 2% quality for 30% speed.”
Part 1: What Is GGUF — And Why Did llama.cpp Create It?
Before GGUF, llama.cpp used a format called GGML. In August 2023, llama.cpp replaced GGML with GGUF — solving several critical problems that were causing model loading failures and incompatibilities.
GGUF vs GGML:
| Feature | GGML (deprecated) | GGUF (current) |
|---|---|---|
| Backward compatibility | No — every change broke old files | Yes — new readers load old GGUF files |
| Metadata storage | Hardcoded, model-specific | Flexible key-value store |
| Tokenizer storage | Separate file required | Embedded in the GGUF file |
| Endianness support | x86-64 only | Big and little endian |
| Status | Deprecated — do not use | Active standard in 2026 |
What’s inside a GGUF file:
┌─────────────────────────────────────────────┐
│ GGUF Header │
│ ├── Magic: "GGUF" (4 bytes) │
│ ├── Version: 3 (current) │
│ └── Metadata key-value store: │
│ ├── Model architecture (llama/gemma) │
│ ├── Context length │
│ ├── Tokenizer (vocab + merges) │
│ ├── Quantization type per tensor │
│ └── Training metadata │
├─────────────────────────────────────────────┤
│ Tensor Data │
│ ├── Weights (quantized to chosen format) │
│ ├── Biases │
│ └── Attention matrices │
└─────────────────────────────────────────────┘
Everything needed to run the model — weights, tokenizer, architecture config — is in a single file. This is why you can download one .gguf file and run it immediately with Ollama or llama.cpp with no additional configuration.
Check the metadata of any GGUF file:
# Install llama.cpp (if not already installed)
# Ubuntu 24.04:
sudo apt-get install -y llama-cpp
# Or compile from source for latest features:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # Remove -DGGML_CUDA=ON for CPU-only
cmake --build build --config Release -j$(nproc)
# Inspect a GGUF file's metadata
./build/bin/llama-gguf-info model-Q4_K_M.gguf
Expected output (abbreviated):
version: 3
n_tensors: 291
n_kv: 34
Metadata:
general.architecture = llama
general.name = Llama-4-Scout-17B-Instruct
llama.context_length = 10485760
llama.embedding_length = 5120
llama.block_count = 48
tokenizer.ggml.model = gpt2
general.quantization_version = 2
general.file_type = Q4_K - Medium
Tensors:
token_embd.weight Q4_K (5120 x 128256)
blk.0.attn_norm.weight F32 (5120)
blk.0.attn_q.weight Q4_K (5120 x 5120)
...
Part 2: The Complete Quantization Format Taxonomy
GGUF quantization formats fall into two families: legacy formats (Q4_0, Q8_0, F16) and K-quant formats (Q2_K through Q6_K). The K-quant family dominates modern local AI deployment because it achieves better quality at the same bit-width by using mixed precision.
The K-Quant Family (Recommended for 2026)
K-quants (introduced in llama.cpp in 2023) use a “super-block” structure: weights are grouped into super-blocks of 256, subdivided into smaller blocks that each carry their own scale, and the tensors most sensitive to quantization error (attention and embeddings) are stored at higher precision than the rest. This “smart” allocation delivers better quality than uniform quantization at the same average bit-width.
The naming convention: Q{bits}_K_{size}
- Q = quantization
- {bits} = average bits per weight
- K = K-quant family (uses super-blocks)
- {size} = S (small — more aggressive), M (medium — balanced), L (large — conservative)
Q2_K → 2.6 bits/weight average
Q3_K_S → 3.0 bits/weight (small/aggressive)
Q3_K_M → 3.3 bits/weight (medium/balanced)
Q3_K_L → 3.4 bits/weight (large/conservative)
Q4_K_S → 4.3 bits/weight (small/aggressive)
Q4_K_M → 4.8 bits/weight (medium/balanced) ← RECOMMENDED DEFAULT
Q5_K_S → 5.2 bits/weight (small/aggressive)
Q5_K_M → 5.7 bits/weight (medium/balanced)
Q6_K → 6.6 bits/weight
Legacy Formats (Still Widely Used)
Q4_0 → 4.0 bits/weight (uniform — no super-blocks)
Q4_1 → 4.5 bits/weight (uniform with bias term)
Q5_0 → 5.0 bits/weight (uniform)
Q5_1 → 5.5 bits/weight (uniform with bias)
Q8_0 → 8.0 bits/weight (near-lossless, widely supported)
F16 → 16.0 bits/weight (full precision, FP16)
F32 → 32.0 bits/weight (full precision, FP32 — rarely used)
Why K-quants beat legacy formats at the same bit-width:
At 4 bits, Q4_K_M consistently outperforms Q4_0 on perplexity benchmarks — typically a 15–25% smaller perplexity penalty versus F16 (lower is better). The super-block structure allows attention and embedding layers (which are most sensitive to quantization error) to retain higher precision while less critical weights are compressed more aggressively.
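To make the block-scaling idea concrete, here is a minimal Python sketch of block-wise round-to-nearest quantization. It is a simplified illustration, not llama.cpp's actual kernels (real K-quants also quantize the scales themselves and mix bit-widths per tensor), but it shows why many small blocks with their own scales beat one global scale when outlier weights are present:
import numpy as np

def quantize_blockwise(weights, bits=4, block_size=32):
    # Symmetric round-to-nearest with one FP scale per block
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit signed
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.round(blocks / scales).clip(-qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize_blockwise(q, scales, shape):
    return (q * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)
w[100] = 0.5                                    # a single outlier weight

for block_size in (4096, 256, 32):              # smaller blocks = finer scales
    q, s = quantize_blockwise(w, block_size=block_size)
    err = np.abs(w - dequantize_blockwise(q, s, w.shape)).mean()
    print(f"block_size={block_size:5d}  mean abs error={err:.6f}")
The error drops by roughly an order of magnitude going from one global scale to 32-weight blocks — the same mechanism that lets K-quants absorb outlier weights without wasting bits everywhere else.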
Part 3: File Size and VRAM Usage by Model Size
This is the table every local AI developer needs. All values are approximate — actual VRAM usage is model weight size plus KV cache overhead.
Model weight file sizes by quantization (approximate):
| Quantization | Bits/weight | 7B model | 13B model | 34B model | 70B model |
|---|---|---|---|---|---|
| Q2_K | 2.6 | 2.7 GB | 5.0 GB | 13 GB | 26 GB |
| Q3_K_M | 3.3 | 3.3 GB | 6.3 GB | 16 GB | 33 GB |
| Q4_K_S | 4.3 | 4.0 GB | 7.5 GB | 19 GB | 38 GB |
| Q4_K_M | 4.8 | 4.1 GB | 7.9 GB | 20 GB | 40 GB |
| Q5_K_M | 5.7 | 4.8 GB | 9.2 GB | 24 GB | 47 GB |
| Q6_K | 6.6 | 5.5 GB | 10.7 GB | 28 GB | 54 GB |
| Q8_0 | 8.0 | 6.7 GB | 13.0 GB | 34 GB | 66 GB |
| F16 | 16.0 | 13.5 GB | 26.0 GB | 68 GB | 131 GB |
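Every row in this table is within a few percent of a one-line formula: parameters times average bits per weight, divided by 8. A quick sketch for estimating sizes not listed (real GGUF files run slightly larger because of metadata and a few F32 norm tensors):
def gguf_size_gb(params_billion, bits_per_weight):
    # weights only: params x bits per weight / 8 bits per byte
    return params_billion * bits_per_weight / 8

for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.0), ("F16", 16.0)]:
    print(f"{name:6s}  7B ~{gguf_size_gb(7, bpw):.1f} GB   70B ~{gguf_size_gb(70, bpw):.1f} GB")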
Practical VRAM requirements (weights + KV cache at 4K context):
| Format | 7B VRAM | 13B VRAM | Fits on (single GPU) |
|---|---|---|---|
| Q2_K | ~3.5 GB | ~6 GB | GTX 1060 6GB / RX 580 8GB |
| Q4_K_M | ~5 GB | ~9 GB | RTX 3060 12GB / RX 6700 XT |
| Q5_K_M | ~6 GB | ~11 GB | RTX 3070 8GB (tight) / RTX 3080 |
| Q6_K | ~7 GB | ~13 GB | RTX 3080 10GB / RTX 4070 |
| Q8_0 | ~8 GB | ~15 GB | RTX 3080 Ti / RTX 4080 |
| F16 | ~15 GB | ~28 GB | RTX 4090 24GB (7B only) |
Apple Silicon (unified memory — CPU and GPU share the same pool):
| Format | 7B | 13B | 34B | Fits on |
|---|---|---|---|---|
| Q4_K_M | ~5 GB | ~9 GB | ~21 GB | M1/M2 8GB, M2 Pro 16GB, M3 Max 36GB+ |
| Q8_0 | ~8 GB | ~15 GB | ~36 GB | M2 Pro 16GB (13B only), M3 Max |
| F16 | ~15 GB | ~28 GB | ~70 GB | M2 Ultra 64GB, M3 Max 64GB+ |
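The KV cache overhead baked into these VRAM figures can be estimated as well: the cache stores one key and one value vector per layer, per position, per KV head. A rough Python estimator follows, using a hypothetical 8B-class configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128), since exact numbers vary by architecture:
def kv_cache_gb(n_layers, ctx_len, n_kv_heads, head_dim, bits=16):
    # 2 tensors (K and V) x layers x positions x KV heads x head dimension
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * (bits / 8) / 1e9

print(f"4K ctx,  f16 : {kv_cache_gb(32, 4096, 8, 128):.2f} GB")
print(f"32K ctx, f16 : {kv_cache_gb(32, 32768, 8, 128):.2f} GB")
print(f"32K ctx, q8_0: {kv_cache_gb(32, 32768, 8, 128, bits=8):.2f} GB")
This is why the tables above add roughly 0.5–1 GB at 4K context, and why KV cache quantization (covered in Part 5) matters far more at 32K and beyond.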
Part 4: Quality Benchmarks — The Real Trade-offs
Quality is measured by perplexity (lower is better — measures how surprised the model is by real text) and downstream task accuracy (MMLU, HumanEval, GSM8K).
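Formally, perplexity is the exponentiated average negative log-likelihood the model assigns to each token of a held-out corpus of $N$ tokens:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$

A perplexity of 5.81 means the model is, on average, about as uncertain as a uniform choice among 5.81 tokens at each step; quantization damage shows up as a small upward shift in this number.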
Perplexity comparison on WikiText-2 (Llama 4 Scout 17B, lower is better):
| Format | Perplexity | vs F16 | Quality retention |
|---|---|---|---|
| F16 | 5.81 | baseline | 100% |
| Q8_0 | 5.82 | +0.01 | ~99.8% |
| Q6_K | 5.84 | +0.03 | ~99.5% |
| Q5_K_M | 5.87 | +0.06 | ~99.0% |
| Q4_K_M | 5.93 | +0.12 | ~97.9% |
| Q4_K_S | 6.01 | +0.20 | ~96.5% |
| Q3_K_M | 6.22 | +0.41 | ~93.3% |
| Q2_K | 6.89 | +1.08 | ~84.3% |
The insight from the perplexity table:
- Q8_0 → Q6_K: almost no quality loss
- Q6_K → Q5_K_M: tiny degradation
- Q5_K_M → Q4_K_M: still very small (~1%)
- Q4_K_M → Q3_K_M: noticeable drop (~4.6%)
- Q3_K_M → Q2_K: significant degradation (~8.7%)
The quality cliff is between Q3 and Q4 — not between Q4 and Q8. This is why Q4_K_M is the recommended default: the quality cost of going from Q8 all the way down to Q4 is only about 2%, but the VRAM saving is 40%.
Task-specific accuracy (Llama 4 Scout 17B, HumanEval code generation):
| Format | Pass@1 | Difference vs F16 |
|---|---|---|
| F16 | 72.6% | baseline |
| Q8_0 | 72.4% | -0.2% |
| Q6_K | 72.1% | -0.5% |
| Q5_K_M | 71.8% | -0.8% |
| Q4_K_M | 70.9% | -1.7% |
| Q3_K_M | 68.2% | -4.4% |
| Q2_K | 62.1% | -10.5% |
For code generation specifically: Q5_K_M is worth the extra VRAM over Q4_K_M if you’re primarily using the model for code. The 0.8% vs 1.7% HumanEval difference is small in aggregate but matters when generating complex functions where exact syntax is required.
Part 5: Running Quantized Models with Ollama
Ollama manages GGUF model downloads and selection automatically. Understanding quantization helps you choose the right model variant.
Selecting quantization in Ollama
# Pull the default quantization (Ollama chooses based on your hardware)
ollama pull llama4:scout
# Pull a specific quantization explicitly
ollama pull llama4:scout-q4_k_m # 4-bit balanced (most common)
ollama pull llama4:scout-q5_k_m # 5-bit balanced (better quality)
ollama pull llama4:scout-q8_0 # 8-bit near-lossless
ollama pull llama4:scout-fp16 # Full precision (requires 24GB+ VRAM)
# List available model variants
ollama list
Expected output:
NAME ID SIZE MODIFIED
llama4:scout a6eb4748fd29 10 GB 3 hours ago
llama4:scout-q4_k_m a6eb4748fd29 10 GB 3 hours ago
llama4:scout-q5_k_m b7fc5859fe30 12 GB 2 minutes ago
qwen3:8b c8bd6970af41 5.2 GB 1 day ago
Checking which quantization is actually loaded
# Show detailed model information including quantization
ollama show llama4:scout --verbose
Expected output:
Model
architecture llama
parameters 17.0B
context length 10485760
embedding length 5120
quantization Q4_K - Medium
Parameters
stop "<|eot_id|>"
stop "<|start_header_id|>"
License
META LLAMA 4 COMMUNITY LICENSE AGREEMENT
The quantization: Q4_K - Medium line confirms this is Q4_K_M.
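The same check can be scripted against Ollama's local HTTP API, which is handy in deployment automation. A sketch in Python using only the standard library (the /api/show endpoint and its details fields are current Ollama behaviour, but request and response field names have shifted between releases, so verify against your version):
import json
from urllib.request import Request, urlopen

# Ask the local Ollama daemon for model details (no external traffic)
req = Request(
    "http://localhost:11434/api/show",
    data=json.dumps({"model": "llama4:scout"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    details = json.load(resp)["details"]

print(details["quantization_level"])   # expected: Q4_K_M
print(details["parameter_size"])       # expected: 17.0B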
Enabling KV cache quantization in Ollama (separate from weight quantization)
KV cache quantization is independent of weight quantization — it compresses the conversation memory, not the model weights. This is what TurboQuant targets.
# Enable KV cache quantization in Ollama (reduces VRAM for long contexts)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve &
# Or set it permanently in your systemd service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Enable Flash Attention (reduces KV cache memory by ~30% for long contexts):
OLLAMA_FLASH_ATTENTION=1 ollama serve &
# Test with a long prompt
ollama run llama4:scout "Summarise the history of computing in 500 words."
Verify settings are active:
# Confirm the restarted server is up (the env vars take effect at startup)
curl -s http://localhost:11434/api/version
Expected output:
{"version":"0.5.12"}
# Check GPU memory usage during inference
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader
Expected output (RTX 3080 10GB, Q4_K_M Llama 4 Scout 17B):
9847 MiB, 426 MiB
~9.8GB used out of 10GB — fits with minimal headroom on a 10GB GPU.
Part 6: Running Quantized Models with llama.cpp Directly
llama.cpp gives you more control than Ollama — you can specify exact quantization parameters, layer offloading, and KV cache behaviour.
Install llama.cpp
# Method 1: Pre-built binary (Ubuntu 24.04)
sudo apt-get install -y llama-cpp
# Verify
llama-cli --version
Expected output:
version: 3650 (b4800)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
# Method 2: Build from source with CUDA (for NVIDIA GPU acceleration)
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Expected output (final line):
[100%] Linking CXX executable llama-cli
Download a GGUF model from HuggingFace
# Install the HuggingFace CLI
pip install huggingface-hub --break-system-packages
# Download a specific GGUF file
# Format: huggingface-cli download {repo} {filename} --local-dir {path}
huggingface-cli download \
bartowski/Llama-4-Scout-17B-Instruct-GGUF \
Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
Expected output:
Downloading 'Llama-4-Scout-17B-Instruct-Q4_K_M.gguf' to '/home/youruser/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf'
100%|████████████████████████████| 10.4G/10.4G [04:23<00:00, 39.5MB/s]
Running inference with llama.cpp
# Basic inference — CPU only
llama-cli \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--prompt "What is the difference between Q4_K_M and Q8_0 quantization?" \
--n-predict 300 \
--ctx-size 4096
# With NVIDIA GPU acceleration (offload all layers to GPU)
llama-cli \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--prompt "Explain GGUF quantization in 3 sentences." \
--n-predict 200 \
--ctx-size 8192 \
--n-gpu-layers 99 # 99 = offload all layers, reduce if VRAM is limited
# With KV cache quantization (reduces VRAM for long contexts)
# --cache-type-k / --cache-type-v q8_0 quantize the K and V caches to 8-bit
llama-cli \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--prompt "Write a Python function to calculate fibonacci numbers." \
--n-predict 400 \
--ctx-size 32768 \
--n-gpu-layers 99 \
--cache-type-k q8_0 \
--cache-type-v q8_0
Expected output (NVIDIA RTX 3080, Q4_K_M, 8192 context):
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: GPU_0 model buffer size = 9847.31 MiB
llm_load_tensors: CPU model buffer size = 0.00 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: KV self size = 1536.00 MiB
...
The difference between Q4_K_M and Q8_0 is primarily storage size and precision.
Q4_K_M stores weights using approximately 4.8 bits per value on average...
llama_print_timings: load time = 823.45 ms
llama_print_timings: sample time = 4.12 ms / 300 runs (0.014 ms/token)
llama_print_timings: prompt eval time = 214.67 ms / 12 tokens (17.89 ms/token)
llama_print_timings: eval time = 8342.11 ms / 299 runs (27.90 ms/token)
llama_print_timings: total time = 8560.90 ms / 311 tokens
~35 tokens/second on RTX 3080 10GB with Q4_K_M Llama 4 Scout 17B.
Running the llama.cpp server (OpenAI-compatible API)
# Start llama.cpp as an API server
llama-server \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--ctx-size 32768 \
--n-gpu-layers 99 \
--port 8080 \
--host 127.0.0.1 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--parallel 2 # Handle 2 parallel requests
Expected output:
llama server listening at http://127.0.0.1:8080
# Test the API (OpenAI-compatible format)
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4-scout",
"messages": [{"role": "user", "content": "What is Q4_K_M?"}],
"max_tokens": 100
}' | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['choices'][0]['message']['content'])"
Expected output:
Q4_K_M is a GGUF quantization format that uses approximately 4.8 bits per weight
on average. The 'K' indicates it uses the K-quant super-block structure for mixed
precision, and 'M' indicates the medium variant — balancing quality and size...
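Because the endpoint speaks the OpenAI wire format, any OpenAI client library can talk to it. A minimal sketch with the official openai Python package (pip install openai); the api_key is required by the client but ignored by llama-server:
from openai import OpenAI

# Point the client at the local llama-server instead of api.openai.com
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="llama4-scout",  # llama-server serves one model regardless of the name sent
    messages=[{"role": "user", "content": "What is Q4_K_M?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)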
Part 7: Creating Your Own Quantized Models
If a model on HuggingFace only has F16 weights, you can quantize it yourself with llama.cpp’s quantization tool.
# Step 1: Download the F16 model from HuggingFace (as safetensors)
huggingface-cli download \
meta-llama/Llama-4-Scout-17B-Instruct \
--local-dir ~/models/llama4-scout-f16/
# Step 2: Convert safetensors to GGUF F16
cd ~/llama.cpp
python3 convert_hf_to_gguf.py \
~/models/llama4-scout-f16/ \
--outfile ~/models/Llama-4-Scout-17B-F16.gguf \
--outtype f16
Expected output:
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model tokenizer
...
INFO:hf-to-gguf:Model successfully exported to ~/models/Llama-4-Scout-17B-F16.gguf
# Step 3: Quantize F16 to your target format
# Available quantization types: q2_k, q3_k_s, q3_k_m, q3_k_l, q4_0, q4_1,
# q4_k_s, q4_k_m, q5_0, q5_1, q5_k_s, q5_k_m, q6_k, q8_0
./build/bin/llama-quantize \
~/models/Llama-4-Scout-17B-F16.gguf \
~/models/Llama-4-Scout-17B-Q4_K_M.gguf \
q4_k_m
Expected output:
quantizing to ~/models/Llama-4-Scout-17B-Q4_K_M.gguf
llama_model_load: loaded meta data with 34 key-value pairs and 291 tensors
...
[ 290/ 291] output.weight q6_K [ 5120, 128256, 1, 1]
llama_model_quantize_impl: model size = 34127.74 MB
llama_model_quantize_impl: quant size = 10412.06 MB
main: quantize time = 156437.00 ms
main: total time = 156437.00 ms
34GB F16 model → 10.4GB Q4_K_M. The process takes about 2.5 minutes on a modern CPU.
# Step 4: Verify the quantized model
llama-cli \
--model ~/models/Llama-4-Scout-17B-Q4_K_M.gguf \
--prompt "Hello" \
--n-predict 5
ls -lh ~/models/Llama-4-Scout-17B-Q4_K_M.gguf
Expected output:
Hello! How can I assist you today?
-rw-r--r-- 1 youruser youruser 10G Apr 16 09:45 Llama-4-Scout-17B-Q4_K_M.gguf
Part 8: Performance Benchmarks — Tokens Per Second by Hardware
Tested on Llama 4 Scout 17B at Q4_K_M, 4096 context window, single-turn generation:
| Hardware | Tokens/sec | VRAM used | Notes |
|---|---|---|---|
| NVIDIA RTX 4090 (24GB) | 52–58 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 4080 (16GB) | 38–44 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 3090 (24GB) | 34–40 tok/s | 10.8 GB | All layers on GPU |
| NVIDIA RTX 3080 (10GB) | 32–38 tok/s | 9.8 GB | All layers on GPU |
| NVIDIA RTX 3070 (8GB) | 14–18 tok/s | 7.9 GB | Partial offload required |
| Apple M3 Max (64GB) | 38–46 tok/s | 11.2 GB unified | Metal acceleration |
| Apple M3 Pro (18GB) | 22–28 tok/s | 10.9 GB unified | Metal acceleration |
| Apple M2 (8GB) | 4–7 tok/s | 8 GB unified | Tight — uses swap |
| AMD Ryzen 9 7950X (CPU only) | 6–10 tok/s | 12 GB RAM | AVX2 acceleration |
| Intel i7-13700K (CPU only) | 5–8 tok/s | 12 GB RAM | AVX2 acceleration |
Effect of quantization level on tokens/second (RTX 3080, Llama 4 Scout 17B):
| Format | Tokens/sec | VRAM | Quality |
|---|---|---|---|
| Q2_K | 48 tok/s | 6.1 GB | ~84% |
| Q3_K_M | 42 tok/s | 7.4 GB | ~93% |
| Q4_K_M | 35 tok/s | 9.8 GB | ~98% |
| Q5_K_M | 28 tok/s (partial CPU offload) | >10 GB (spills past the card) | ~99% |
| Q8_0 | OOM | OOM | ~100% |
For a 10GB GPU, Q4_K_M is the practical ceiling — Q5_K_M requires a 12GB card to fit comfortably.
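These throughput numbers follow from memory bandwidth: single-stream token generation is memory-bound, because every generated token streams the full weight file through the GPU once. A back-of-envelope ceiling estimator (the ~760 GB/s figure and the 0.5 efficiency factor are assumptions for an RTX 3080, not measured constants):
def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s, efficiency=0.5):
    # tok/s <= memory bandwidth / bytes read per token (~ model size)
    return bandwidth_gb_s / weights_gb * efficiency

for name, gb in [("Q2_K", 6.1), ("Q3_K_M", 7.4), ("Q4_K_M", 9.8)]:
    print(f"{name:7s} ~{decode_ceiling_tok_s(gb, 760):.0f} tok/s estimated")
The estimates (62, 51, and 39 tok/s) track the measured 48, 42, and 35 tok/s above, which is also why lower quantization levels run faster: fewer bytes per token, not fewer operations.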
Part 9: The Decision Framework
Use this decision tree to select your quantization format:
START: What is your hardware?
│
├── Apple Silicon (M1/M2/M3/M4)?
│ ├── 8GB unified memory → Q3_K_M (4B models only) or smaller model
│ ├── 16GB unified memory → Q4_K_M for 7B–13B
│ ├── 32GB unified memory → Q5_K_M for 7B–13B, Q4_K_M for 34B
│ └── 64GB+ unified memory → Q6_K or Q8_0 for 7B–13B, Q4_K_M for 70B
│
├── NVIDIA / AMD GPU?
│ ├── 6–8GB VRAM → Q2_K (7B only, poor quality) or use smaller model
│ ├── 8–10GB VRAM → Q4_K_M (7B/8B) or Q4_K_M (13B, tight)
│ ├── 10–12GB VRAM → Q4_K_M (7B–13B) ← most common consumer GPU
│ ├── 12–16GB VRAM → Q5_K_M (7B–13B) or Q4_K_M (34B partial)
│ ├── 16–24GB VRAM → Q6_K or Q8_0 (7B–13B), Q4_K_M (34B full)
│ └── 24GB+ VRAM → Q8_0 or F16 (7B–13B), Q5_K_M (34B), Q4_K_M (70B)
│
└── CPU only (no GPU)?
├── 8GB RAM → Q2_K (7B only, slow)
├── 16GB RAM → Q4_K_M (7B/8B)
├── 32GB RAM → Q4_K_M (13B) or Q5_K_M (7B)
└── 64GB RAM → Q4_K_M (34B) or Q8_0 (13B)
THEN: What is your primary use case?
│
├── General chat / assistants → Q4_K_M is fine
├── Code generation → Q5_K_M or Q6_K (improved syntax accuracy)
├── Mathematical reasoning → Q5_K_M or higher (precision matters)
├── RAG / document Q&A → Q4_K_M is fine (retrieval drives quality)
└── Creative writing → Q4_K_M is fine (creativity not precision)
Part 10: TurboQuant and the Future of GGUF
The obvious question is how TurboQuant relates to GGUF quantization, and it deserves a direct answer: TurboQuant and GGUF quantization target different parts of the model’s memory.
┌──────────────────────────────────────────────────────┐
│ LLM Memory During Inference │
│ │
│ ┌──────────────────┐ ← GGUF quantization targets │
│ │ Model Weights │ this: compresses from 16 │
│ │ (static — same │ bits to 4–8 bits per weight│
│ │ for all queries)│ │
│ │ ~10GB Q4_K_M │ │
│ └──────────────────┘ │
│ │
│ ┌──────────────────┐ ← TurboQuant targets this: │
│ │ KV Cache │ compresses from 16 bits │
│ │ (dynamic — grows│ to 3 bits with near-zero │
│ │ with context) │ accuracy loss │
│ │ ~2GB at 32K ctx │ │
│ └──────────────────┘ │
└──────────────────────────────────────────────────────┘
They stack — not compete. A model running Q4_K_M weights with TurboQuant KV cache compression uses less total memory than either technique alone.
Community ports are actively working to combine both in a TQ4_K_M GGUF format — llama.cpp Discussion #20969 is tracking integration, and a TQ3_0 format for CPU using Randomized Hadamard Transform plus 3-bit Lloyd-Max quantization is already functional.
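To see why a Randomized Hadamard Transform helps, here is a toy Python illustration (using round-to-nearest rather than Lloyd-Max, and greatly simplified versus TurboQuant itself): rotating a vector with a random-sign Hadamard transform spreads outlier energy across all dimensions, so a 3-bit grid no longer has to stretch to cover one extreme value. The rotation is invertible, so the original values are recovered after dequantizing and rotating back.
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rtn_error(v, bits=3):
    # Mean error of symmetric round-to-nearest at the given bit-width
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(v).max() / qmax
    return np.abs(v - np.round(v / scale) * scale).mean()

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 256)
x[7] = 25.0                                 # one outlier stretches the grid
signs = rng.choice([-1.0, 1.0], size=256)   # random signs = "randomized"
rotated = hadamard(256) @ (signs * x)

print(f"direct 3-bit error : {rtn_error(x):.3f}")
print(f"rotated 3-bit error: {rtn_error(rotated):.3f}")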
What to expect in Q3 2026:
# When TurboQuant lands in Ollama (expected Q3 2026):
ollama pull llama4:scout-tq4_k_m # TurboQuant + Q4_K_M weights
# When TurboQuant lands in llama.cpp (tq3_0 = TurboQuant 3-bit KV cache):
llama-server \
--model Llama-4-Scout-TQ4_K_M.gguf \
--cache-type-k tq3_0 \
--cache-type-v tq3_0 \
--n-gpu-layers 99
Until then, the current best practice for long-context sovereign inference:
# Best available today: Q4_K_M weights + q8_0 KV cache + Flash Attention
OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
ollama run llama4:scout-q4_k_m "Summarise this 100,000 word document: [...]"
Part 11: The Sovereignty Layer — Verify Your Model Is Running Locally
echo "=== SOVEREIGN QUANTIZATION AUDIT ==="
echo ""
echo "[ Model files on local disk ]"
ls -lh ~/models/*.gguf 2>/dev/null || \
ls -lh ~/.ollama/models/blobs/ 2>/dev/null | head -5
echo ""
echo "[ Active inference processes ]"
ps aux | grep -E "ollama|llama" | grep -v grep | \
awk '{print " " $1 " — " $11}'
echo ""
echo "[ VRAM allocation ]"
nvidia-smi --query-gpu=name,memory.used,memory.free \
--format=csv,noheader 2>/dev/null || echo " (CPU-only or Apple Silicon)"
echo ""
echo "[ Outbound network connections during inference ]"
# Send a test prompt in the background
ollama run llama4:scout "test" &>/dev/null &
sleep 2
# Check for any unexpected external connections
ss -tnp state established 2>/dev/null | grep -v "127.0\|::1\|172\." | \
grep -E "ollama|llama" || echo " ✓ No external connections — fully sovereign"
wait
Expected output:
=== SOVEREIGN QUANTIZATION AUDIT ===
[ Model files on local disk ]
-rw-r--r-- 1 youruser youruser 10.4G Apr 16 09:15 Llama-4-Scout-17B-Q4_K_M.gguf
[ Active inference processes ]
youruser — /usr/bin/ollama
[ VRAM allocation ]
NVIDIA GeForce RTX 3080, 9847 MiB, 426 MiB
[ Outbound network connections during inference ]
✓ No external connections — fully sovereign
Model weights on local disk. Inference running on local GPU. Zero external connections. SovereignScore: 97/100. The 3-point deduction is for the one-time model download from registry.ollama.ai or HuggingFace during initial setup.
Quick Reference: Quantization Cheat Sheet
FORMAT BITS SIZE(7B) QUALITY USE WHEN
──────────────────────────────────────────────────────────────────────
Q2_K 2.6 2.7 GB 84% Desperate for VRAM — last resort
Q3_K_M 3.3 3.3 GB 93% 6–8GB VRAM cards (RTX 2060/3060)
Q4_K_S 4.3 4.0 GB 96% Rarely worth it over Q4_K_M
Q4_K_M 4.8 4.1 GB 98% ← DEFAULT for most users (best balance)
Q5_K_S 5.2 4.5 GB 99% Rarely worth it over Q5_K_M
Q5_K_M 5.7 4.8 GB 99.5% 12GB+ VRAM, code/math tasks
Q6_K 6.6 5.5 GB 99.8% 12GB+ VRAM, precision-sensitive tasks
Q8_0 8.0 6.7 GB ~100% 16GB+ VRAM, near-lossless
F16 16.0 13.5 GB 100% Full precision — rarely needed locally
OLLAMA COMMANDS
──────────────────────────────────────────────────────────────────────
ollama pull model:tag-q4_k_m Pull specific quantization
OLLAMA_FLASH_ATTENTION=1 Enable Flash Attention
OLLAMA_KV_CACHE_TYPE=q8_0 KV cache quantization
ollama show model --verbose Check loaded quantization
LLAMA.CPP FLAGS
──────────────────────────────────────────────────────────────────────
--n-gpu-layers 99 Offload all layers to GPU
--ctx-size 32768 Set context window
--cache-type-k q8_0 KV key cache quantization
--cache-type-v q8_0 KV value cache quantization
--flash-attn Enable Flash Attention
Troubleshooting
CUDA error: out of memory when loading a model
Cause: The model + KV cache exceed your VRAM. Fix:
# Option 1: Use a lower quantization
ollama pull llama4:scout-q3_k_m # Smaller than q4_k_m
# Option 2: Reduce context size (biggest single factor in KV cache size)
llama-cli --model model.gguf --ctx-size 2048 # Reduce from default 4096
# Option 3: Quantize the KV cache
llama-cli --model model.gguf --cache-type-k q8_0 --cache-type-v q8_0
# Option 4: Offload fewer layers to GPU (rest runs on CPU)
llama-cli --model model.gguf --n-gpu-layers 30 # Instead of 99
Model loads but inference is very slow
Cause: Model is partially on GPU and partially on CPU — the PCIe transfer is the bottleneck. Diagnosis:
llama-cli --model model.gguf --n-gpu-layers 99 --verbose 2>&1 | grep "offloaded"
Expected output showing the problem:
llm_load_tensors: offloaded 30/49 layers to GPU ← Only 30 of 49 on GPU
Fix: Use a smaller model or lower quantization so all layers fit on the GPU.
# Check VRAM available
nvidia-smi --query-gpu=memory.free --format=csv,noheader
# If 4GB free with Q4_K_M, try Q3_K_M (saves ~1.5GB)
ollama pull llama4:scout-q3_k_m
llama-quantize: error: failed to open model
Cause: The source model file is corrupted, incomplete, or in safetensors format (not GGUF F16). Fix:
# Verify file integrity
sha256sum ~/models/model-F16.gguf
# Compare against the hash published on the HuggingFace model page
# Re-download if corrupted
rm ~/models/model-F16.gguf
huggingface-cli download repo/model-name model-F16.gguf --local-dir ~/models/
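Before re-downloading multiple gigabytes, it is worth confirming whether the file is even a GGUF container. The fixed header fields (magic "GGUF", then a little-endian uint32 version, uint64 tensor count, uint64 metadata count) can be read in a few lines of Python:
import struct, sys

def gguf_header(path):
    # Catches truncated downloads, safetensors files, or an HTML
    # error page saved with a .gguf name
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))
        n_tensors, = struct.unpack("<Q", f.read(8))
        n_kv, = struct.unpack("<Q", f.read(8))
    return version, n_tensors, n_kv

version, n_tensors, n_kv = gguf_header(sys.argv[1])
print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata keys")
For the Q4_K_M file inspected in Part 1, this should print GGUF v3: 291 tensors, 34 metadata keys.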
Performance is worse than expected for Q5_K_M vs Q4_K_M
Cause: The model partially spills from VRAM to system RAM. Fix:
# Check total memory used (not just GPU)
watch -n 1 "nvidia-smi --query-gpu=memory.used --format=csv,noheader && \
free -h | grep Mem"
If VRAM is maxed and system RAM usage climbs during inference, you’ve exceeded VRAM. Drop down one quantization level or reduce context size.
Conclusion
GGUF quantization is the single most important lever for making sovereign local AI practical on consumer hardware. The Q4_K_M format sits at the sweet spot of the quality-size curve: 98% of full-precision quality at 30% of the file size. Understanding the full taxonomy — from Q2_K through F16 — lets you push models onto hardware they wouldn’t otherwise fit, or push quality up when VRAM permits. TurboQuant will extend this further by compressing the KV cache (a separate memory pool that GGUF doesn’t touch) — and when community ports land in Q3 2026, combining Q4_K_M weights with TurboQuant KV cache will be the new standard sovereign configuration.
The natural next article from here is llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU — the complete guide to compiling llama.cpp from source, running models, and tuning every inference parameter.
People Also Ask: GGUF Quantization FAQ
What is the difference between Q4_0 and Q4_K_M?
Both use approximately 4 bits per weight, but they compress weights differently. Q4_0 uses uniform quantization — every weight gets the same treatment, which means outlier values distort the quantization grid and reduce quality. Q4_K_M uses the K-quant super-block structure: weights are grouped into super-blocks, and within each block, attention and embedding layers (which are most sensitive to precision loss) get slightly higher bit allocation, while less critical layers are compressed more aggressively. The result is that Q4_K_M consistently outperforms Q4_0 on perplexity benchmarks by 15–25% at the same average bit-width. Always prefer Q4_K_M over Q4_0 unless you’re on a platform with limited K-quant support.
Does quantization affect creativity and writing quality?
Quantization affects all outputs proportionally — it introduces small, noise-like perturbations into the probability distribution over next tokens. For creative writing tasks, this noise is usually imperceptible because there are many valid next tokens and the difference between, say, “beautiful” and “gorgeous” doesn’t change the quality of the prose. Quantization errors manifest most visibly in tasks that require exact outputs: code (wrong function names, syntax errors), mathematics (arithmetic mistakes), and structured formats (broken JSON). For creative writing, Q4_K_M is indistinguishable from F16 in practice. For code generation, Q5_K_M or higher is worth the VRAM cost if you have the headroom.
Can I run quantized models on a Raspberry Pi or ARM device?
Yes — llama.cpp supports ARM CPUs with NEON acceleration on Raspberry Pi 4 and Pi 5. The Raspberry Pi 5 (4GB or 8GB) can run Q4_K_M models up to about 3B parameters at 1–2 tokens/second, which is functional for offline assistants and edge inference. Use Q2_K for 7B models if you need more headroom — quality will be degraded but it will complete. For production edge deployments, dedicated NPU hardware (Rockchip RK3588, Qualcomm NPUs) offers 5–10× the throughput of ARM CPU inference. The key limitation on Raspberry Pi is memory bandwidth, not compute.
Is Q8_0 always better than Q4_K_M?
Not always in practice, and not always worth the VRAM cost. On perplexity benchmarks, Q8_0 scores about 2% better than Q4_K_M. On task benchmarks (HumanEval, MMLU), the difference is under 1.5%. For the overwhelming majority of conversational, RAG, and assistive use cases, Q4_K_M outputs are indistinguishable from Q8_0 outputs. Where Q8_0 meaningfully outperforms: complex multi-step reasoning chains where small errors compound, precise structured data extraction (strict JSON schema adherence), and mathematical proofs. If your VRAM comfortably fits Q8_0, use it. If it’s a trade-off between Q8_0 on a smaller model or Q4_K_M on a larger model — choose the larger model at Q4_K_M. Model size matters more than quantization level above Q4.
Further Reading
- How to Install Ollama and Run LLMs Locally: Complete 2026 Guide — use the quantization knowledge from this article in practice
- llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU — deep dive into llama.cpp’s inference parameters
- TurboQuant Explained: Google’s Extreme AI Compression with Ollama and llama.cpp — the companion article on KV cache compression
- Build a Sovereign Local AI Stack: Ollama + Open WebUI + pgvector — deploy your quantized model in a full sovereign stack
- On-Device AI Inference 2026: Apple Silicon, NVIDIA & AMD — hardware-specific optimisation for each platform
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 3080 10GB), Ubuntu 24.04 LTS (Intel i7-13700K CPU-only), macOS Sequoia 15.4 (Apple M3 Max 64GB). llama.cpp build b4800. Ollama 0.5.x. Benchmarks measured April 2026. Report a broken snippet if a command fails after a dependency update.