
On-Device AI Inference 2026: Apple Silicon, NVIDIA & AMD Guide

🟡Intermediate

Run LLMs entirely on local hardware without cloud APIs. Covers Apple Silicon with MLX, NVIDIA CUDA setup, AMD ROCm, memory requirements, throughput benchmarks, and model selection by hardware.

Key Takeaways

  • Unified memory wins for large models: Apple Silicon’s unified memory means the CPU and GPU share the same memory pool, so a 64GB M3 Max can load models that exceed the 24GB of a single consumer NVIDIA card and would otherwise need a multi-GPU setup.
  • Memory is the constraint, not TFLOPS: VRAM capacity decides which models fit, and memory bandwidth, not compute, limits token generation speed. More VRAM = larger models = better quality. More TFLOPS ≠ proportionally faster LLM inference.
  • Ollama handles the hardware abstraction: The same ollama run model command works on NVIDIA (CUDA), AMD (ROCm), Apple Silicon (Metal), and CPU. Ollama auto-detects the best backend.
  • Q4_K_M is the sweet spot: 4-bit quantisation with K-quants delivers ~95% of full-precision quality at roughly 30% of the FP16 memory footprint. Use it for all inference unless quality benchmarks show a specific task needs Q8.

Introduction

Direct Answer: What hardware do I need to run local LLMs on-device in 2026, and how do I set it up?

Any modern computer with 8GB+ RAM can run a local LLM, but GPU hardware dramatically improves speed.

  • NVIDIA GPUs: install the NVIDIA driver with CUDA 12.4 support (sudo ubuntu-drivers install on Ubuntu 24.04), then ollama run qwen3:8b runs at 40-55 tok/s on an RTX 3080 10GB.
  • Apple Silicon Macs: Ollama uses the Metal backend natively. After brew install ollama && brew services start ollama, ollama run qwen3:14b runs at 25-30 tok/s on an M3 Max 64GB.
  • AMD GPUs: install ROCm 6.x (sudo apt-get install rocm once AMD's repository is configured; see Part 4) and use ollama run qwen3:8b. ROCm support is production-ready in Ollama 0.5+.
  • CPU only: Ollama falls back to llama.cpp’s CPU inference, which is slower (2-5 tok/s) but functional for testing.

Choose your model based on available VRAM: 6GB → Qwen3 8B, 10GB → Qwen3 14B, 16GB → Gemma3 27B (a tight fit; 24GB is more comfortable), 24GB → Qwen3 32B.


Part 1: Hardware Selection Guide

MEMORY REQUIREMENTS BY MODEL (Q4_K_M quantisation):

Model Size  | Min VRAM  | Recommended | Best Model at this level
────────────|───────────|─────────────|─────────────────────────
1–3B        | 2–3 GB   | 4+ GB       | Qwen3 1.7B, Phi-4-mini
7–8B        | 5–6 GB   | 8+ GB       | Qwen3 8B
14B         | 9–10 GB  | 12+ GB      | Qwen3 14B ← sweet spot
27–32B      | 16–20 GB | 24+ GB      | Gemma3 27B, Qwen3 32B
70B         | 40–48 GB | 64+ GB      | Llama 3.3 70B

2026 hardware recommendations by budget:

Budget   | Hardware                  | VRAM          | Best Model
─────────|───────────────────────────|───────────────|─────────────────────────
$400     | RTX 3060 12GB             | 12GB          | Qwen3 14B Q4_K_M
$700     | RTX 3090 24GB             | 24GB          | Qwen3 32B Q4_K_M
$1,600   | RTX 4090 24GB             | 24GB          | Qwen3 32B Q4_K_M, faster
$2,000   | Mac Studio M3 Max 64GB    | 64GB unified  | Qwen3 32B
$4,000   | Mac Studio M2 Ultra 192GB | 192GB unified | Llama 3.3 70B
$600/mo  | 2× H100 (rent)            | 160GB         | 405B models

Part 2: NVIDIA CUDA Setup (Ubuntu 24.04)

# Check GPU is detected
lspci | grep -i nvidia

Expected output:

01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)

# Install NVIDIA drivers (Ubuntu 24.04 includes them via ubuntu-drivers)
sudo ubuntu-drivers install
# Or specify version:
sudo apt-get install -y nvidia-driver-550

# Verify driver installation (requires reboot first)
nvidia-smi | head -10

Expected output (abridged; the driver version, CUDA version, and GPU name will reflect your system):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03    Driver Version: 560.35.03    CUDA Version: 12.4   |
+-------------------------------+----------------------+----------------------+
| GPU  Name            | VRAM   |
| 0  NVIDIA RTX 4090   | 24564MiB |

# Ollama auto-detects CUDA: just install and run
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b
ollama run qwen3:14b "Summarise the CUDA memory hierarchy in two sentences"

# Verify GPU is being used
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader

Expected output during inference:

87 %, 10240 MiB     ← GPU at 87% util, 10GB VRAM used

Part 3: Apple Silicon Setup (macOS / MLX)

# Install Homebrew (if not installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Ollama (uses Metal backend automatically on Apple Silicon)
brew install ollama

# Start Ollama service
brew services start ollama

# Run a model — Metal backend is used automatically
ollama pull qwen3:14b
ollama run qwen3:14b "What is Metal in the context of Apple Silicon?"

Expected output:

Metal is Apple's graphics and compute API that allows software to directly access 
the GPU on Apple Silicon devices, enabling high-performance parallel computation 
for tasks like machine learning inference without CPU overhead.

# Verify Metal backend is being used
ollama ps

Expected output:

NAME          ID            SIZE     PROCESSOR    UNTIL
qwen3:14b     abc123def456  10 GB    100% GPU     4 minutes from now

100% GPU confirms Metal acceleration.

MLX for maximum Apple Silicon performance:

pip install mlx-lm

# MLX is Apple's ML framework optimised for Apple Silicon
# Often 10-30% faster than Ollama on the same model
python3 -c "
from mlx_lm import load, generate
model, tokenizer = load('mlx-community/Qwen3-14B-4bit')
response = generate(model, tokenizer, prompt='Hello, what are you?', max_tokens=50)
print(response)
"

Part 4: AMD ROCm Setup (Ubuntu 24.04)

# Install ROCm 6.x
wget https://repo.radeon.com/amdgpu-install/6.1.3/ubuntu/noble/amdgpu-install_6.1.60103-1_all.deb
sudo apt-get install ./amdgpu-install_6.1.60103-1_all.deb
sudo amdgpu-install --usecase=rocm

# Add user to render group
sudo usermod -aG render,video $USER
# Log out and back in

# Verify ROCm installation
rocm-smi

Expected output:

GPU  Temp   AvgPwr  SCLK    MCLK    Fan     Perf  PwrCap  VRAM%  GPU%
  0  42.0°   45.0W  1500Mhz 1000Mhz  34.0%  auto  355.0W    0%    0%

GPU[0] : GPU ID      : 0x744C       GFX Version: gfx1100
GPU[0] : Card Series : Radeon RX 7900 XTX

# Ollama 0.5+ supports ROCm natively
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen3:14b
ollama run qwen3:14b "Test ROCm inference"

# Verify ROCm backend
rocm-smi --showuse

Part 5: Benchmark Your Setup

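# Benchmark script: approximate tokens per second
cat > benchmark.sh << 'EOF'
#!/bin/bash
MODEL="${1:-qwen3:14b}"
PROMPT="Write a 200-word explanation of how transformers work in machine learning."

echo "Benchmarking: $MODEL"
echo "--------------------------------"

# Warm-up run so model load time is not counted in Run 1
ollama run "$MODEL" "Hi" > /dev/null 2>&1

for i in 1 2 3; do
    start=$SECONDS
    # Discard stderr so progress output is not counted as generated text
    output=$(ollama run "$MODEL" "$PROMPT" 2>/dev/null)
    elapsed=$((SECONDS - start))
    # Word count is only a rough proxy for the real token count
    tokens=$(echo "$output" | wc -w | tr -d ' ')
    tps=$((tokens / (elapsed > 0 ? elapsed : 1)))
    echo "Run $i: ~${tps} tok/s (${elapsed}s for ~${tokens} tokens)"
done
# For exact figures, `ollama run --verbose` prints the eval rate after each response
EOF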

chmod +x benchmark.sh
./benchmark.sh qwen3:14b

Expected output (RTX 4090):

Benchmarking: qwen3:14b
--------------------------------
Run 1: ~58 tok/s (14s for ~812 tokens)
Run 2: ~61 tok/s (13s for ~793 tokens)
Run 3: ~59 tok/s (14s for ~826 tokens)

Part 6: Performance Comparison Table

Real-world benchmarks (Qwen3 14B, Q4_K_M, ollama run):

Hardware                 | Tok/s  | VRAM/RAM Used | Power Draw
─────────────────────────|────────|───────────────|───────────
RTX 4090 24GB            | 58–65  | 10 GB         | 250W
RTX 3090 24GB            | 38–44  | 10 GB         | 200W
RTX 3080 10GB            | 32–38  | 9.8 GB        | 150W
RX 7900 XTX 24GB (ROCm)  | 50–56  | 10 GB         | 210W
M3 Max 64GB (Metal)      | 25–30  | 10 GB         | 25W
M3 Pro 18GB (Metal)      | 18–22  | 10 GB         | 15W
Apple M2 Ultra 192GB     | 28–35  | 10 GB         | 60W
AMD Ryzen 9 7950X (CPU)  | 3–5    | 10 GB RAM     | 130W

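By the figures above, Apple Silicon is roughly 4–5× more power-efficient per generated token than an RTX 4090 (about 1.1 tok/s per watt for the M3 Max against about 0.25 for the RTX 4090). For a home setup where the machine runs 24/7, an M3 Max saves on the order of $150/year in electricity versus an RTX 4090; the exact figure depends on hours under load and your electricity price.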
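
The efficiency comparison can be checked with a few lines of arithmetic. The sketch below uses the midpoints of the tok/s ranges and the power figures from the table; the electricity price and the hours under load per day are assumptions to replace with your own values, and idle power is ignored.

```python
# Midpoints of the tok/s ranges and power draw taken from the table above.
HARDWARE = {
    "RTX 4090": {"tok_s": 61.5, "watts": 250},
    "M3 Max": {"tok_s": 27.5, "watts": 25},
}

def tokens_per_joule(tok_s: float, watts: float) -> float:
    """Throughput per watt, i.e. tokens generated per joule of energy."""
    return tok_s / watts

def annual_cost_usd(watts: float, hours_per_day: float, usd_per_kwh: float) -> float:
    """Electricity cost of drawing `watts` for `hours_per_day`, every day of a year."""
    kwh_per_year = watts / 1000 * hours_per_day * 365
    return kwh_per_year * usd_per_kwh

PRICE = 0.15           # assumed electricity price in USD per kWh
HOURS_UNDER_LOAD = 12  # assumed hours per day spent generating tokens

ratio = tokens_per_joule(**HARDWARE["M3 Max"]) / tokens_per_joule(**HARDWARE["RTX 4090"])
saving = (annual_cost_usd(HARDWARE["RTX 4090"]["watts"], HOURS_UNDER_LOAD, PRICE)
          - annual_cost_usd(HARDWARE["M3 Max"]["watts"], HOURS_UNDER_LOAD, PRICE))

print(f"M3 Max efficiency advantage: {ratio:.1f}x per token")
print(f"Annual saving at {HOURS_UNDER_LOAD} h/day under load: ${saving:.0f}")
```

Under these assumptions the saving comes out close to the ~$150/year quoted above; a machine under load for more hours per day, or a higher electricity price, widens the gap.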


Conclusion

On-device AI inference is accessible in 2026 across all hardware platforms. NVIDIA delivers the highest throughput; Apple Silicon delivers the best efficiency; AMD ROCm 6.x is now a first-class alternative. Ollama abstracts the hardware differences — the same commands work regardless of whether you’re on CUDA, Metal, or ROCm.

See How to Install Ollama and Run LLMs Locally for the Ollama setup guide, and Best Local LLM Models for Coding in 2026 for model selection by hardware tier.


People Also Ask

Can I run local LLMs on a laptop without a discrete GPU?

Yes. On Apple Silicon laptops, Ollama uses the integrated GPU through Metal: a MacBook Pro with an M3 Pro (18GB unified memory) achieves 18–22 tok/s on Qwen3 14B, perfectly usable for development and Q&A. On machines with no usable GPU, Ollama falls back to llama.cpp’s CPU inference, which runs on any hardware: a Windows/Linux laptop with 16GB RAM achieves 2–5 tok/s, slow but functional for testing. For regular use on a CPU-only machine, Qwen3 8B (6GB RAM) balances quality and speed.

Does more VRAM always mean a better model?

More VRAM enables a larger model, and larger models generally produce better output. But there are diminishing returns: Qwen3 14B (10GB) is dramatically better than Qwen3 8B (6GB). Qwen3 32B (20GB) is noticeably better than Qwen3 14B. Llama 3.3 70B (40GB) is only marginally better than Qwen3 32B for most tasks. Beyond ~32B parameters, the quality improvement per GB of VRAM flattens significantly. For most practical tasks, Qwen3 14B on 10GB VRAM is the optimal point on the quality/VRAM curve.


Part 7: Model Quantisation and Memory Planning

Getting on-device inference right means balancing model quality with memory and compute availability.

7.1 Quantisation options

The most common quantisation formats in 2026 are:

  • Q4_K_M: 4-bit quantisation with group-wise scaling — best quality/memory tradeoff
  • Q6_K: 6-bit quantisation for slightly higher quality at a still compact size
  • F16: half-precision floating point — higher quality but much larger memory footprint

For most local inference setups, Q4_K_M is the recommended baseline. It reduces memory use by roughly 70% compared to FP16 and preserves the quality of 14B and 27B models well.
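
To make the trade-off concrete, the following sketch estimates the weight size of a model from its parameter count and an average number of bits per weight. The bits-per-weight values are assumed planning figures (K-quants mix block types and store scales, so Q4_K_M lands above 4.0 bits), and the estimate covers weights only, not the KV cache or runtime overhead.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

# Assumed average bits per weight for each format; treat as rough estimates.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "F16": 16.0}

for name, bits in BITS_PER_WEIGHT.items():
    size = weights_gb(14, bits)
    share = bits / BITS_PER_WEIGHT["F16"]
    print(f"14B at {name}: ~{size:.1f} GB of weights ({share:.0%} of F16)")
```

The gap between this weights-only estimate and the memory actually in use during inference is the context cache and runtime buffers, which grow with context length.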

7.2 VRAM and unified memory planning

A good rule of thumb is to allocate 1.2× the model size in RAM/VRAM for headroom. That means:

  • 10GB of usable memory → 8GB model
  • 24GB of usable memory → 20GB model
  • 64GB of usable memory → 53GB model
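
The rule of thumb above reduces to one division. A minimal sketch, where the function names are hypothetical and the 1.2× factor is the headroom multiplier from the text:

```python
HEADROOM = 1.2  # rule-of-thumb multiplier from the text above

def max_model_gb(usable_memory_gb: float, headroom: float = HEADROOM) -> float:
    """Largest model size that still leaves the recommended headroom free."""
    return usable_memory_gb / headroom

def fits(model_gb: float, usable_memory_gb: float, headroom: float = HEADROOM) -> bool:
    """True if the model plus headroom fits in the usable memory."""
    return model_gb * headroom <= usable_memory_gb

for mem in (10, 24, 64):
    print(f"{mem} GB usable -> up to ~{max_model_gb(mem):.0f} GB model")
```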

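On Apple Silicon, unified memory means the GPU and CPU share the same pool. This is why a 64GB M3 Max can run larger models than a 24GB discrete GPU in practice: there is no separate, fixed VRAM pool, although macOS keeps part of the unified memory for the system, so not all 64GB is available to the model.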

7.3 Batch size and inference latency

Local inference should usually use a batch size of 1 for interactive applications. Higher batch sizes improve throughput for bulk generation, but they increase latency.

For single-user tools, keep the model request as small as possible and use token streaming to show progress.

Part 8: Hardware-Specific Setup Notes

8.1 Apple Silicon tips

  • Use the Metal backend with Ollama or MLX.
  • Keep the machine cool — sustained Metal workloads can heat the system and reduce clock speed.
  • Use a sleep inhibitor if the app is intended to run continuously.

8.2 NVIDIA best practices

  • Enable persistence mode: sudo nvidia-smi -pm 1
  • Reserve enough GPU memory for the model plus OS and any other processes.
  • Use nvidia-smi and htop to monitor both GPU and CPU consumption.

8.3 AMD ROCm best practices

  • Use the latest ROCm release supported by your GPU.
  • Confirm the device appears in rocm-smi and that the backend is not in gfxUnknown mode.
  • If a model fails to load, reduce the memory footprint by switching to a smaller quantisation format or a smaller model.

Part 9: Software Stack and Version Pinning

A self-hosted AI system should pin software versions for reproducibility.

9.1 Ollama and model versions

Pin both the Ollama daemon version and the model tag in a deployment manifest.

ollama:
  version: '0.5.12'
models:
  - qwen3:14b
  - llama4:scout

9.2 Driver and OS versions

For NVIDIA, pin the driver version and CUDA version. For AMD, pin the ROCm stack. For Apple Silicon, pin the macOS version and the Ollama/Metal runtime.

9.3 Dependency verification

Keep checksums for model files and use them at deploy time.

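# Record the checksum once, from a trusted copy of the model file
sha256sum qwen3-14b.q4_k_m.bin > qwen3-14b.q4_k_m.sha256
# Verify at deploy time (on macOS, use: shasum -a 256 -c)
sha256sum -c qwen3-14b.q4_k_m.sha256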
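
The same check can run inside a deployment script. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the file path and expected hash would come from your own manifest.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so multi-GB model files never sit fully in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA-256 with the recorded value (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A deploy step would call verify() on each model file and stop before starting the inference service if any check returns False.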

Part 10: Performance Engineering and Real-World Throughput

Throughput depends on the entire chain: prompt size, tokenization, GPU utilisation, and output length.

10.1 Measuring effective tokens per second

Measure real prompt+response performance, not just raw model decode speed. A 200-token response from a 14B model with a 512-token prompt is the real user experience.
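
One way to measure this is to read the timing fields that Ollama returns with each non-streaming response from its local HTTP API. The sketch below assumes the default endpoint on localhost:11434; the helper names are hypothetical, and durations are reported by Ollama in nanoseconds.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming request and return the parsed JSON response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def throughput(stats: dict) -> dict:
    """Derive decode-only and end-to-end rates from the response's timing fields."""
    ns = 1e9
    decode = stats["eval_count"] / (stats["eval_duration"] / ns)
    # End-to-end rate includes model load and prompt processing time.
    effective = stats["eval_count"] / (stats["total_duration"] / ns)
    return {
        "prompt_tokens": stats.get("prompt_eval_count", 0),
        "output_tokens": stats["eval_count"],
        "decode_tok_s": decode,
        "effective_tok_s": effective,
    }
```

Calling throughput(generate("qwen3:14b", your_prompt)) with a real application prompt gives both numbers; the gap between them shows how much time goes to loading and prompt processing rather than decoding.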

10.2 Avoiding I/O bottlenecks

Keep the model files on a fast local SSD. Slow disk access can stall the model if the runtime swaps parts of the model or reads parameter shards lazily.

10.3 Multi-tenant inference

If multiple users share the same host, use a small inference server in front of the LLM and queue requests. Do not run multiple heavy models on the same GPU unless you have enough memory.
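
The queueing idea can be sketched with an asyncio semaphore that admits a fixed number of requests to the model at a time. InferenceQueue and fake_model are hypothetical names; in a real server, the callable passed in would forward the prompt to the local model.

```python
import asyncio

class InferenceQueue:
    """Let at most `max_parallel` requests reach the model; the rest wait in line."""

    def __init__(self, run_model, max_parallel: int = 1):
        self._run_model = run_model                    # async callable: prompt -> text
        self._slots = asyncio.Semaphore(max_parallel)

    async def submit(self, prompt: str) -> str:
        async with self._slots:                        # blocks while the GPU is busy
            return await self._run_model(prompt)

async def demo():
    active = 0
    peak = 0

    async def fake_model(prompt: str) -> str:
        """Stand-in for a model call; records how many calls overlap."""
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return prompt.upper()

    queue = InferenceQueue(fake_model, max_parallel=1)
    results = await asyncio.gather(*(queue.submit(p) for p in ["a", "b", "c"]))
    return results, peak

results, peak = asyncio.run(demo())
print(results, "peak concurrent requests:", peak)
```

With max_parallel=1 the three requests run one after another even though they were submitted together, which is the behaviour you want when a single model fills the GPU.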

Part 11: Use Cases and Application Patterns

11.1 Local assistant for developers

Run code generation, summarisation, and shell automation locally. Store prompts in a local database and keep the model offline.

11.2 Private document Q&A

Use a local LLM to answer queries over private documents. Combine with RAG and local vector search to stay sovereign.

11.3 Personal knowledge workspace

Run a personal notebook or knowledge assistant entirely on-device. Save the model weights and user notes locally.

Part 12: Final Hardware Decision Tree

Do you need raw throughput for many simultaneous requests? → NVIDIA RTX 4090 or better
Do you need efficiency and low power? → Apple Silicon M3 Max / M3 Ultra
Do you need vendor-neutral local GPU support? → AMD ROCm 6.x
Do you need the cheapest local inference? → CPU only with Qwen3 8B or 4B

12.1 What to buy first

For a local AI experiment: start with a 12–16GB GPU such as an RTX 3060 12GB or RTX 4060 Ti 16GB. For a production self-hosted deployment: use 24GB GPUs or 64GB Apple Silicon.

12.2 What to buy next

If you outgrow the 14B model, move to 27B on a larger host or a multi-GPU setup. If you need more efficiency, use a smaller quantised model and more carefully engineered prompts.

Part 13: Practical Checklist for On-Device AI

  • verify GPU/CPU backend compatibility before model selection
  • choose quantisation based on your VRAM/UMEM budget
  • pin driver, Ollama, and model versions
  • benchmark with real prompts, not synthetic loops
  • secure the local runtime with network restrictions
  • use local caches for models and embeddings
  • monitor GPU utilisation and temperature
  • maintain a hardware/driver compatibility matrix
  • document the exact inference workflow and recovery steps

A successful on-device AI deployment is one that is reproducible, measurable, and maintainable. Keep the hardware and software stack aligned, and treat the model itself as one more system component in your self-hosted infrastructure.

Part 14: Runtime Selection and Inference APIs

Different applications require different runtime APIs and inference modes.

14.1 CLI vs API vs library mode

On-device inference can run as:

  • a CLI tool for ad-hoc queries
  • a local HTTP API for web apps
  • a Python/JS library for direct integration

For production self-hosted apps, a local HTTP API is often the best balance because it isolates the model process, allows request queuing, and keeps the model behind a standard interface.

14.2 System-level supervision

Run the model server under a supervisor such as systemd, Docker, or a process manager. Ensure it restarts on failure and captures logs.

Example systemd unit:

[Unit]
Description=Ollama Local LLM Server
After=network.target

[Service]
Type=simple
Environment="OLLAMA_HOST=127.0.0.1:11434"
ExecStart=/usr/local/bin/ollama serve
Restart=on-failure
User=ollama
RestartSec=5

[Install]
WantedBy=multi-user.target

14.3 Resource-aware scheduling

If the host runs other services, schedule model-heavy batch jobs during off-peak hours. Use nice and cpulimit if necessary.

Part 15: Multi-GPU and Distributed Inference

For larger deployments, multiple GPUs may be necessary.

15.1 Single-node multi-GPU

If your machine has more than one GPU, use a local inference orchestration layer to route requests to the least-loaded GPU, or split model shards across GPUs if the runtime supports it.

15.2 Multi-node inference cluster

For very large models or high throughput, use a cluster with a front-end router and inference workers. Keep all communication on a private network.

15.3 Model placement strategy

Place the most frequently used models on the fastest GPU. Keep lower-priority or experimental models on slower GPUs or disk.

Part 16: Debugging and Failure Modes

A local inference system can fail in predictable ways.

16.1 OOM failures

If the model fails to load, reduce the batch size, switch to a smaller quantisation format, or choose a smaller model. Monitor dmesg for OOM killer events on Linux.

16.2 Driver incompatibilities

When using ROCm or CUDA, driver mismatches are the most common error. Pin driver versions and verify with nvidia-smi or rocm-smi.

16.3 Slow startup

Large models can take several seconds to load. For interactive apps, warm up the model at service startup so the first user does not wait for the load time.
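
With Ollama, a warm-up can be a request that names the model but carries no prompt, which loads the model into memory without generating text. The sketch below assumes the default local endpoint; the function names are hypothetical, and the keep_alive value of 30 minutes is an arbitrary choice.

```python
import json
import urllib.request

def warmup_payload(model: str, keep_alive: str = "30m") -> bytes:
    """Request body with no prompt: the server loads the model and returns."""
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode()

def warm_up(model: str, base_url: str = "http://localhost:11434") -> bool:
    """Load `model` into memory; True if the server answered with HTTP 200."""
    req = urllib.request.Request(f"{base_url}/api/generate",
                                 data=warmup_payload(model),
                                 headers={"Content-Type": "application/json"})
    # Generous timeout: the first load of a large model reads many GB from disk.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.status == 200
```

Calling warm_up("qwen3:14b") from the service's startup hook moves the load time from the first user request to boot.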

16.4 Inference hangs

If the service hangs, capture a stack trace or inspect the process with strace or debugger tools. Often the issue is a backend thread deadlock or a corrupted model file.

Part 17: Continuous Improvement and Benchmarks

Maintain a benchmark suite for your hardware and models.

17.1 Real workload benchmarks

Use actual prompts from your application instead of synthetic loops. Measure the end-to-end time from request to response, including tokenization and output processing.

17.2 Track quality vs performance

Record the model, quantisation, token count, latency, and subjective quality. This allows you to make decisions with data rather than assumptions.
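
A plain CSV log is enough to start with. This sketch shows one possible record layout; the class and column names are hypothetical, and the sample row holds illustrative values, not measurements.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class BenchmarkRecord:
    model: str
    quantisation: str
    tokens: int
    latency_s: float
    quality: int  # subjective score, e.g. 1 (poor) to 5 (excellent)

def to_csv(records: list[BenchmarkRecord]) -> str:
    """Serialise records to CSV text, adding a derived tok_s column."""
    columns = [f.name for f in fields(BenchmarkRecord)] + ["tok_s"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for rec in records:
        row = asdict(rec)
        row["tok_s"] = round(rec.tokens / rec.latency_s, 1)
        writer.writerow(row)
    return buffer.getvalue()

# Illustrative values only.
print(to_csv([BenchmarkRecord("qwen3:14b", "Q4_K_M", 200, 4.0, 4)]))
```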

17.3 Scheduled re-evaluation

Re-run benchmarks quarterly or after any hardware/software change. This ensures the stack remains tuned and avoids performance regressions.

Part 18: Final Result

A successful on-device AI deployment is one where the hardware, the model, and the application are all designed together. Choose the smallest model that meets your quality needs, pin your runtime versions, monitor your inference metrics, and keep the stack local. That is the essence of sovereignty in 2026: local control, repeatable performance, and predictable operations.

Part 19: Hybrid On-Device Architectures

Sometimes the best solution is a hybrid between local device inference and remote services.

19.1 Local-first, cloud-fallback

Run the model locally by default. If the local host lacks capacity or the request is too heavy, fall back to a trusted remote service. This pattern preserves sovereignty for most queries while providing reliability when needed.
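
The routing logic for this pattern is small. A minimal sketch, assuming both backends are passed in as plain callables; the function name, the prompt-length threshold, and the set of exceptions treated as "local unavailable" are all assumptions to adapt.

```python
from typing import Callable

def answer(prompt: str,
           local: Callable[[str], str],
           remote: Callable[[str], str],
           max_local_chars: int = 8000) -> tuple[str, str]:
    """Return (backend_used, response), preferring the local model."""
    if len(prompt) <= max_local_chars:
        try:
            return "local", local(prompt)
        except (ConnectionError, TimeoutError, MemoryError):
            pass  # local host is down or out of capacity: fall through
    return "remote", remote(prompt)
```

Returning the backend name alongside the response lets the application log how often the fallback is used, which is the number to watch if sovereignty is the goal.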

19.2 Local cache of remote results

For frequently repeated queries, cache the remote fallback result locally so the host can respond faster next time.

19.3 Offloading heavy workloads

Use local inference for interactive tasks and preserve batch or analytics workloads for a larger remote host. This keeps the local host responsive.

Part 20: Model Selection Workflow

Choose models based on the task, not just the hardware.

20.1 Qualitative selection

For conversational agents, select models with strong dialogue coherence. For summarisation, select models with trained summarisation behaviour. For code generation, choose code-capable models.

20.2 Quantised footprint

Use the lowest quantisation format that meets your quality bar. Start with Q4_K_M, and only move to Q6_K or F16 if the task absolutely requires it.

20.3 Model evaluation pipeline

Keep a small suite of evaluation prompts and human-reviewed expected outputs. Re-run them whenever you change the model or quantisation format.
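
Such a suite can start as a list of prompts with strings the answer must contain. The cases and function name below are hypothetical placeholders; model_fn stands for whatever function sends a prompt to your model and returns its text.

```python
from typing import Callable

# Hypothetical cases; replace with prompts and checks from your own application.
EVAL_CASES = [
    {"prompt": "What is 2 + 2? Answer with a number.", "must_contain": ["4"]},
    {"prompt": "Name the capital of France.", "must_contain": ["Paris"]},
]

def run_suite(model_fn: Callable[[str], str], cases: list[dict]) -> dict:
    """Run every case and report which required strings were missing."""
    failures = []
    for case in cases:
        output = model_fn(case["prompt"]).lower()
        missing = [s for s in case["must_contain"] if s.lower() not in output]
        if missing:
            failures.append({"prompt": case["prompt"], "missing": missing})
    return {"passed": len(cases) - len(failures), "failed": failures}
```

Substring checks are crude, so keep the human-reviewed expected outputs alongside them and use the automated pass count only as a regression signal.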

Part 21: Local Infrastructure Considerations

A self-hosted inference host is infrastructure.

21.1 Backup and recovery

Back up model files, Ollama configuration, and the underlying OS image. Use a local backup strategy that can restore the host in under an hour.

21.2 Hardware lifecycle

Track GPU firmware and driver updates. For AMD and NVIDIA, driver updates can improve performance and fix bugs, but they also carry risk. Test updates on a staging host first.

21.3 Monitoring and alerting

Track model service uptime, GPU memory pressure, GPU temperature, latency, and request failure rate. Alert on repeated failures or when memory usage approaches capacity.
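
On an NVIDIA host, the GPU side of this can be read from nvidia-smi in CSV form. The sketch below separates reading from parsing so the alert logic can be tested without a GPU; the function names are hypothetical and the 90% memory and 85 °C thresholds are assumptions, not vendor limits.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total,temperature.gpu"

def read_gpu_line() -> str:
    """Return the first GPU's metrics as one CSV line, e.g. '87, 10240, 24564, 71'."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return result.stdout.splitlines()[0]

def parse_gpu_line(line: str) -> dict:
    """Split one CSV line into named numeric metrics."""
    util, used, total, temp = (float(x) for x in line.split(","))
    return {"util_pct": util, "mem_used_mib": used,
            "mem_total_mib": total, "temp_c": temp}

def alerts(metrics: dict, mem_limit: float = 0.90, temp_limit_c: float = 85.0) -> list[str]:
    """Names of the thresholds that the current metrics exceed."""
    problems = []
    if metrics["mem_used_mib"] / metrics["mem_total_mib"] >= mem_limit:
        problems.append("memory")
    if metrics["temp_c"] >= temp_limit_c:
        problems.append("temperature")
    return problems
```

A cron job or monitoring agent can call alerts(parse_gpu_line(read_gpu_line())) and notify only when the list is non-empty on several consecutive checks.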

Part 22: Final On-Device Adoption Notes

On-device AI is a powerful sovereignty tool because it keeps your inference on local hardware. It also requires discipline: pin software versions, monitor resource usage, and remain conservative with model size. A successful on-device deployment is one where the hardware and software are aligned, the workloads are understood, and the host is treated as an operational asset rather than a toy.

Part 23: Local Inference Resilience

Resilience is essential for on-device inference when hardware can be unstable.

23.1 Graceful degradation

If a preferred GPU is unavailable, degrade to CPU inference or a smaller model. The service should still return an answer, even if it is slower.

23.2 Automatic fallback logic

Implement a startup check that tests the chosen model and the available device. If the device is not healthy, switch to a fallback model or service.
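
A startup check of this kind can walk an ordered list of candidates and keep the first one that passes a probe. In this sketch the function name is hypothetical, and probe stands for any health test you define, such as a one-token generation with a timeout.

```python
from typing import Callable, Optional, Sequence

def pick_model(candidates: Sequence[str],
               probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate whose probe succeeds, or None if all fail."""
    for model in candidates:
        try:
            if probe(model):
                return model
        except Exception:
            continue  # a crashing probe counts as an unhealthy candidate
    return None
```

Ordering the candidates from preferred to smallest, for example ["qwen3:14b", "qwen3:8b", "qwen3:4b"], gives the graceful degradation described above; a None result is the signal to switch to a fallback service.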

23.3 Checkpoint validation

Validate model files at startup using checksums. If a model file is corrupted, fail fast and restart or switch to a known-good backup.

Part 24: Usability and Developer Feedback

A good on-device AI host is also easy to use.

24.1 Local diagnostics page

Expose a simple diagnostics page or endpoint that reports model status, device health, and current memory usage. This helps developers verify the host quickly.
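
A diagnostics endpoint does not need a web framework. The sketch below uses Python's built-in HTTP server; the /healthz path, the port, and the fields returned by collect_status are hypothetical placeholders, and a real host would fill them from the runtime and GPU tools.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def collect_status() -> dict:
    """Placeholder values; query the model runtime and GPU here on a real host."""
    return {"model": "qwen3:14b", "model_loaded": True,
            "device": "gpu", "memory_used_mib": 10240}

class DiagnosticsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        body = json.dumps(collect_status()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

def serve(port: int = 8080) -> HTTPServer:
    """Start the endpoint on localhost in a background thread."""
    server = HTTPServer(("127.0.0.1", port), DiagnosticsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Binding to 127.0.0.1 keeps the page off the network by default, in line with the checklist item on restricting the local runtime.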

24.2 Version and capability discovery

Provide an endpoint that reports supported models, available devices, and runtime capabilities. Clients can use this information to choose the best inference path.

24.3 Documentation

Document the supported hardware, required drivers, and the recommended software stack. The documentation should be part of the repository, not just tribal knowledge.


Tested on: Ubuntu 24.04 LTS (RTX 4090, RX 7900 XTX), macOS Sequoia 15.4 (M3 Max, M3 Pro). Ollama 0.5.12, ROCm 6.1.3, CUDA 12.4. Last verified: April 29, 2026.

Kofi Mensah

About the Author

Inference Economics & Hardware Architect

Electrical Engineer | Hardware Systems Architect | 8+ Years in GPU/AI Optimization | ARM & x86 Specialist

Kofi Mensah is a hardware architect and AI infrastructure specialist focused on optimizing inference costs for on-device and local-first AI deployments. With expertise in CPU/GPU architectures, Kofi analyzes real-world performance trade-offs between commercial cloud AI services and sovereign, self-hosted models running on consumer and enterprise hardware (Apple Silicon, NVIDIA, AMD, custom ARM systems). He quantifies the total cost of ownership for AI infrastructure and evaluates which deployment models (cloud, hybrid, on-device) make economic sense for different workloads and use cases. Kofi's technical analysis covers model quantization, inference optimization techniques (llama.cpp, vLLM), and hardware acceleration for language models, vision models, and multimodal systems. At Vucense, Kofi provides detailed cost analysis and performance benchmarks to help developers understand the real economics of sovereign AI.
