Local AI & On-Device Inference
Build, deploy, and optimise AI that runs entirely on local hardware. This category covers sovereign model deployment, inference performance, and edge-friendly AI stacks.
Subtopics
Ollama
Run open-weight LLMs locally with Ollama 5.x: model pulling, Modelfile customisation, REST API, GPU acceleration, multi-model management, and zero-cloud inference verification.
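As a minimal sketch of the REST API side, the snippet below assumes the Ollama daemon is running on its default port (11434) and that a model named "llama3" has already been pulled; the model name and prompt are illustrative assumptions, not part of the topic above.

```python
import requests

# Assumes the Ollama server is listening on its default port (11434)
# and that a model named "llama3" has already been pulled locally.
OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str, model: str = "llama3") -> str:
    """Send a single non-streaming generation request to the local Ollama API."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    response.raise_for_status()
    # The non-streaming endpoint returns one JSON object with a "response" field.
    return response.json()["response"]

if __name__ == "__main__":
    print(generate("Explain GGUF quantisation in one sentence."))
```

Because the request never leaves localhost, verifying zero-cloud inference is as simple as watching that no outbound traffic occurs while this runs.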
llama.cpp & GGUF
Maximum sovereignty with llama.cpp: compiling from source, the GGUF model format, quantisation levels (Q4_K_M, Q8_0), CLI inference, server mode, and performance tuning for CPU and GPU.
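A sketch of loading a quantised GGUF file through the llama-cpp-python bindings rather than the raw CLI; the model path, context size, and layer-offload count below are assumptions to adapt to your own build and hardware.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical path to a Q4_K_M quantised GGUF file; point this at your own download.
MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

# n_gpu_layers=-1 offloads every layer to the GPU when a GPU-enabled build is
# installed; set it to 0 for pure CPU inference.
llm = Llama(model_path=MODEL_PATH, n_ctx=4096, n_gpu_layers=-1)

output = llm(
    "Q: What does Q4_K_M quantisation trade off? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```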
On-Device Inference
Run AI models entirely on local hardware: Apple Silicon with the MLX framework, NVIDIA CUDA with TensorRT, and AMD ROCm. Covers memory requirements, throughput benchmarks, and chip selection.
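The memory-requirement question is largely arithmetic: the weights alone take roughly parameter count times bits per weight divided by 8 bytes, plus headroom for the KV cache and runtime. The sketch below encodes that rule of thumb; the 20% overhead factor is an assumption, not a measured constant.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Rough memory needed just to hold the quantised weights, in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / (1024 ** 3)

def fits_in_memory(params_billion: float, bits_per_weight: float,
                   available_gib: float, overhead: float = 1.2) -> bool:
    """Apply an assumed ~20% overhead for KV cache and runtime buffers."""
    return weight_memory_gib(params_billion, bits_per_weight) * overhead <= available_gib

# Example: a 7B model at 4-bit quantisation on a 16 GB Apple Silicon machine.
print(f"{weight_memory_gib(7, 4):.1f} GiB of weights")    # ~3.3 GiB
print(fits_in_memory(7, 4, available_gib=16))              # True
print(fits_in_memory(70, 4, available_gib=16))             # False: needs a bigger chip
```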
Local AI Stack Builds
End-to-end sovereign AI stack builds: Ollama + Open WebUI + pgvector + LangChain running 100% on-device. Full companion repo, tested code, and sovereignty audit included.
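A minimal sketch of one link in that chain: generating an embedding with the local Ollama embeddings endpoint and storing it in a pgvector table. The connection string, table layout, embedding model (nomic-embed-text), and its 768-dimension output are assumptions for illustration; the full build layers Open WebUI and LangChain on top.

```python
import requests
import psycopg2  # pip install psycopg2-binary; assumes the pgvector extension is installed

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Get an embedding from the local Ollama embeddings endpoint (no cloud calls)."""
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["embedding"]

# Hypothetical connection string and table; adjust to your local Postgres setup.
conn = psycopg2.connect("dbname=localai user=localai host=localhost")
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id serial PRIMARY KEY, body text, embedding vector(768))"
    )
    text = "Sovereign AI stacks keep every byte on local hardware."
    # pgvector accepts the bracketed text format, so serialise the list and cast.
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in embed(text)) + "]"
    cur.execute(
        "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
        (text, vec_literal),
    )
```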
Benchmarking Local Models
Benchmark sovereign local LLM performance: tokens/second, memory usage, first-token latency, and quality metrics across Ollama models on Apple Silicon, NVIDIA, and AMD hardware.
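As a sketch of how two of those numbers can be collected, the script below streams a single generation from the local Ollama API, timing the first streamed token on the client side and reading the eval counters Ollama reports in its final chunk; the model name and prompt are assumptions.

```python
import json
import time
import requests

def benchmark(model: str = "llama3", prompt: str = "Summarise GGUF in 100 words.") -> dict:
    """Measure first-token latency and decode throughput for one streamed generation."""
    start = time.perf_counter()
    first_token_s = None
    stats = {}
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if first_token_s is None and chunk.get("response"):
                first_token_s = time.perf_counter() - start
            if chunk.get("done"):
                # Ollama reports eval durations in nanoseconds in the final chunk.
                eval_count = chunk.get("eval_count", 0)
                eval_duration_s = chunk.get("eval_duration", 0) / 1e9
                stats = {
                    "first_token_s": first_token_s,
                    "tokens_generated": eval_count,
                    "tokens_per_s": eval_count / eval_duration_s if eval_duration_s else 0.0,
                }
    return stats

if __name__ == "__main__":
    print(benchmark())
```

Running the same script against the same prompt on different chips gives a like-for-like tokens/second and first-token-latency comparison; memory usage and quality metrics need separate tooling.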