llama.cpp Tutorial 2026: Run GGUF Models Locally on CPU and GPU
Complete llama.cpp tutorial for 2026. Install, compile with CUDA/Metal, run GGUF models, tune the inference flags, use the API server, enable speculative decoding, and benchmark your hardware.
Maximum sovereignty with llama.cpp: compiling from source, GGUF model formats, quantization levels (Q4_K_M, Q8_0), CLI inference, server mode, and performance tuning for CPU and GPU.
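As a taste of the workflow the tutorial walks through, here is a minimal build-and-run sketch. Model paths are placeholders; adjust the CMake flags and -ngl to your hardware (Metal is enabled by default on Apple Silicon, so the CUDA flag is only needed for NVIDIA GPUs):

  # clone and build llama.cpp from source; -DGGML_CUDA=ON enables NVIDIA GPU offload
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp
  cmake -B build -DGGML_CUDA=ON
  cmake --build build --config Release -j

  # run a GGUF model from the CLI: -m model file, -p prompt,
  # -ngl layers offloaded to the GPU, -c context size in tokens
  ./build/bin/llama-cli -m ./models/model-Q4_K_M.gguf -p "Hello" -ngl 99 -c 4096

  # or expose an OpenAI-compatible HTTP API on port 8080
  ./build/bin/llama-server -m ./models/model-Q4_K_M.gguf -ngl 99 --port 8080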
Speculative decoding can double local LLM inference speed with zero quality loss. How it works, how to enable it in Ollama and llama.cpp today, and which model pairs give the best speedup.
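In llama.cpp, speculative decoding works by loading a small draft model alongside the main one. A rough sketch with placeholder model names follows; the draft-length tuning flags vary between releases, so check --help on your build:

  # -m is the large target model, -md / --model-draft a small draft model from the same family
  ./build/bin/llama-server -m ./models/llama-3-70b-Q4_K_M.gguf \
    -md ./models/llama-3-8b-Q4_K_M.gguf -ngl 99 --port 8080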
Master GGUF quantization formats for local LLMs in 2026. Q2_K, Q4_K_M, Q5_K_S, Q8_0, F16 explained with benchmarks, VRAM tables, and exact Ollama and llama.cpp commands.
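For orientation, producing one of these quantization levels yourself goes through the bundled conversion and quantize tools. A minimal sketch, assuming a Hugging Face model directory at ./my-model:

  # convert safetensors weights to an F16 GGUF, then quantize down to Q4_K_M
  python convert_hf_to_gguf.py ./my-model --outtype f16 --outfile ./my-model-F16.gguf
  ./build/bin/llama-quantize ./my-model-F16.gguf ./my-model-Q4_K_M.gguf Q4_K_M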