Key Takeaways
- One-command install: `curl -fsSL https://ollama.com/install.sh | sh` on Linux; `brew install ollama` on macOS. That’s it — Ollama detects your GPU automatically.
- Pull and run any model: `ollama run llama4:scout` downloads and runs Llama 4 Scout; `ollama run qwen3:8b` runs Qwen 3 8B; `ollama run gemma3:27b` runs Gemma 3 27B. 135,000+ models available.
- OpenAI-compatible API: Ollama serves a REST API on `localhost:11434` that mirrors the OpenAI API spec. Point your existing OpenAI-compatible code at `http://localhost:11434/v1` — zero code changes.
- Zero per-query cost: after the one-time model download (typically 2–40GB), every query is free. At 50 tokens/second on an RTX 4090, the cost of inference is electricity — approximately $0.002 per 1,000 tokens vs $0.01–$0.06 per 1K tokens for cloud APIs.
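As a sanity check on that electricity figure, here is a back-of-envelope sketch. The 450W board power and $0.30/kWh tariff are assumptions, not values from this guide; substitute your own numbers.

# Back-of-envelope electricity cost per 1,000 tokens of local inference.
GPU_WATTS = 450          # RTX 4090 board power under load (assumption)
TOKENS_PER_SEC = 50      # throughput from the benchmark in Step 6
PRICE_PER_KWH = 0.30     # USD per kWh, example tariff (assumption)

seconds_per_1k = 1000 / TOKENS_PER_SEC                # 20 s of GPU time
kwh_per_1k = GPU_WATTS * seconds_per_1k / 3_600_000   # watt-seconds to kWh
print(f"~${kwh_per_1k * PRICE_PER_KWH:.4f} per 1,000 tokens")  # well under a tenth of a cent

Even doubling the draw to account for whole-system power keeps the result in the same ballpark as the $0.002 figure above.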
Introduction: Why Ollama Became the Standard
Direct Answer: How do I install Ollama and run LLMs locally in 2026?
Install Ollama on Ubuntu 24.04 with one command: curl -fsSL https://ollama.com/install.sh | sh. On macOS: brew install ollama. On Windows: download the installer from ollama.com/download. After installation, start the server with ollama serve (Linux: auto-starts as a systemd service), then run any model with ollama run llama4:scout — this downloads Llama 4 Scout (10GB for Q4_K_M quantisation) and opens an interactive chat. For the OpenAI-compatible API, Ollama listens on http://localhost:11434 — use curl -s http://localhost:11434/api/generate -d '{"model":"llama4:scout","prompt":"Hello"}' to test it programmatically. Ollama 0.5.x automatically uses NVIDIA GPU via CUDA, AMD GPU via ROCm, Apple Silicon via Metal, or falls back to CPU. No additional configuration required. Ollama reached 52 million monthly downloads in Q1 2026 and is the most widely used local LLM runtime.
“Two years ago, running a competitive language model locally required a PhD in MLOps and a $10,000 GPU cluster. Today it requires one command. Ollama is what made local AI normal.”
Ollama 0.5.x ships with improved multi-GPU support, Llama 4 MoE architecture support, Flash Attention enabled by default, and a redesigned model management system. This guide covers installation on all three platforms, the 12 most useful models available in 2026, complete API usage, performance tuning, and integrating Ollama with Open WebUI for a browser-based chat interface.
Step 1: Install Ollama
Ubuntu 24.04 LTS (NVIDIA GPU or CPU)
# Install Ollama with the official installer script
curl -fsSL https://ollama.com/install.sh | sh
Expected output:
>>> Installing ollama to /usr/local/bin
>>> Downloading Linux amd64 CLI
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
# Verify installation
ollama --version
Expected output:
ollama version is 0.5.12
# Verify the service is running
sudo systemctl status ollama --no-pager | head -6
Expected output:
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: enabled)
Active: active (running) since Thu 2026-04-17 11:05:00 UTC; 10s ago
Main PID: 18234 (ollama)
Tasks: 14 (limit: 154288)
Memory: 1.2G
NVIDIA GPU verification:
# Check the API is up
curl -s http://localhost:11434/api/version
# After running a model, confirm it loaded onto the GPU
# (the PROCESSOR column of ollama ps shows GPU vs CPU)
ollama ps
# Check GPU is being used
nvidia-smi 2>/dev/null | grep -E "Driver|CUDA|GPU Name"
Expected output (NVIDIA):
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
| GPU Name Persistence-M |
| 0 NVIDIA GeForce RTX 4090 Off |
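To check the same thing programmatically, the sketch below queries the server's /api/ps endpoint, which lists loaded models with their total size and the size_vram portion resident on the GPU. The exact fields assume a recent Ollama release; run a model first so that something is loaded.

# requirements: pip install requests
import requests

# Ask the local server which models are loaded and how much of each
# model's weights sit in VRAM (size_vram) vs system RAM.
resp = requests.get("http://localhost:11434/api/ps", timeout=5)
resp.raise_for_status()
for m in resp.json().get("models", []):
    total, vram = m["size"], m.get("size_vram", 0)
    pct = 100 * vram / total if total else 0
    print(f"{m['name']}: {total / 1e9:.1f} GB total, {pct:.0f}% in VRAM")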
macOS (Apple Silicon or Intel)
# Method 1: Homebrew (recommended — manages updates automatically)
brew install ollama
# Start the Ollama server in the background
ollama serve &
# Or let Homebrew manage it as a login service:
# brew services start ollama
# Method 2: Direct download
# Download from https://ollama.com/download/mac and run the .pkg installer
# Ollama starts automatically as a menu bar app
Verify Metal GPU is active:
# On Apple Silicon — check Ollama uses Metal
ollama run llama3.2:1b "Hello" 2>&1 | grep -i "metal\|gpu" || \
echo "Metal acceleration active (check via Activity Monitor → GPU History)"
Windows (WSL2 recommended for GPU)
- Download the Windows installer from https://ollama.com/download
- Run the .exe installer — Ollama starts automatically as a system tray app
- Open PowerShell or Command Prompt and verify:
ollama --version
# Expected: ollama version is 0.5.12
For NVIDIA GPU support on Windows, ensure the NVIDIA driver is installed (version 560+). CUDA WSL2 passthrough is required for GPU acceleration inside WSL2.
Step 2: Pull and Run Your First Model
# Pull and run Llama 4 Scout — Meta's flagship open model for 2026
# 17B active parameters (MoE), 10GB download (Q4_K_M)
ollama run llama4:scout
Expected output (during download):
pulling manifest
pulling 8eeb52dfb3bb... 100% ▕████████████████▏ 10 GB
pulling 966de95ca8a6... 100% ▕████████████████▏ 1.4 KB
verifying sha256 digest
writing manifest
success
Then the interactive prompt appears:
>>> Send a message (/? for help)
Type a message and press Enter:
>>> What is the capital of France?
The capital of France is Paris.
>>>
Exit the interactive session:
>>> /bye
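Everything the interactive prompt does is also available over HTTP. This minimal sketch streams the same question through the native /api/generate endpoint; each response line is a JSON object carrying a "response" fragment, and the final one sets "done" to true.

# requirements: pip install requests
import json
import requests

# Stream tokens from the native /api/generate endpoint, the same call
# the interactive `ollama run` session makes under the hood.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama4:scout", "prompt": "What is the capital of France?"},
    stream=True,
    timeout=120,
) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)      # one JSON object per line
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):         # final object carries timing stats
            print()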
Step 3: The 2026 Model Catalogue
These are the most useful models available via ollama pull in April 2026:
Best all-round models:
ollama pull llama4:scout # Meta — 17B active/109B total MoE, 10GB, best quality/size
ollama pull qwen3:8b # Alibaba — 8B dense, 5.2GB, strong code + multilingual
ollama pull qwen3:32b # Alibaba — 32B dense, 20GB, near-frontier quality
ollama pull gemma3:27b # Google — 27B, 17GB, excellent instruction following
ollama pull mistral-small:3.1 # Mistral — 22B, 13GB, fast and multilingual
Best coding models:
ollama pull qwen3:14b # Strong HumanEval, good at complex functions
ollama pull deepseek-coder-v2 # DeepSeek — dedicated code model, 16B, very fast
ollama pull starcoder2:15b # StarCoder2 — 600 programming languages
Lightweight / fast models:
ollama pull llama3.2:3b # Meta — 3B, 2.0GB, fast on any hardware
ollama pull qwen3:1.7b # Alibaba — 1.7B, 1.4GB, Raspberry Pi viable
ollama pull gemma3:4b # Google — 4B, 2.5GB, surprisingly capable
Embedding models (for RAG pipelines):
ollama pull nomic-embed-text:v1.5 # 274MB, 768 dimensions — standard for pgvector
ollama pull mxbai-embed-large # 670MB, 1024 dimensions — higher quality
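To see how these slot into a RAG pipeline, here is a minimal similarity-search sketch over the /api/embeddings endpoint: embed a query and a handful of documents, then rank by cosine similarity. The example documents are purely illustrative.

# requirements: pip install requests
import math
import requests

def embed(text: str) -> list[float]:
    # /api/embeddings takes "model" and "prompt" and returns
    # {"embedding": [...]} (768 dimensions for nomic-embed-text)
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text:v1.5", "prompt": text},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

docs = [
    "pgvector adds vector similarity search to PostgreSQL",
    "Paris is the capital of France",
]
query = embed("vector database for RAG")
best = max(docs, key=lambda d: cosine(query, embed(d)))
print("Closest document:", best)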
View all available models:
# List locally downloaded models
ollama list
Expected output:
NAME ID SIZE MODIFIED
llama4:scout a6eb4748fd29 10 GB 2 minutes ago
qwen3:8b b2c3d4e5f6a7 5.2 GB 1 hour ago
nomic-embed-text:v1.5 0a109f422b47 274 MB 3 hours ago
# Browse available models on the Ollama registry at
# https://ollama.com/library; the CLI has no search subcommand,
# so discover models on the website, then download with ollama pull
Step 4: Command Reference
# ── Model Management ──────────────────────────────────────────────────────
ollama pull llama4:scout # Download a model
ollama pull llama4:scout-q8_0 # Download specific quantization
ollama list # List downloaded models
ollama show llama4:scout # Show model details and parameters
ollama show llama4:scout --verbose # Show quantization, context length
ollama rm llama4:scout # Remove a model (frees disk space)
ollama cp llama4:scout mymodel # Copy/rename a model
# ── Running Models ────────────────────────────────────────────────────────
ollama run llama4:scout # Interactive chat
ollama run qwen3:8b "What is AI?" # One-shot prompt (no interactive mode)
ollama run gemma3:27b --verbose # Show tokens/sec and timing info
# ── Server Management ────────────────────────────────────────────────────
ollama serve # Start server (auto on Linux via systemd)
sudo systemctl restart ollama # Restart Ollama service (Linux)
sudo systemctl status ollama # Check service status (Linux)
# ── Performance Flags ─────────────────────────────────────────────────────
OLLAMA_FLASH_ATTENTION=1 ollama serve # Enable Flash Attention (recommended)
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve # KV cache quantization (saves VRAM)
OLLAMA_NUM_PARALLEL=4 ollama serve # Handle 4 simultaneous requests
OLLAMA_MAX_LOADED_MODELS=2 ollama serve # Keep 2 models loaded in memory
OLLAMA_KEEP_ALIVE=24h ollama serve # Keep model loaded for 24 hours
# ── Model Parameters (within interactive session) ─────────────────────────
/set parameter temperature 0.7 # Creativity (0=deterministic, 1=creative)
/set parameter top_p 0.9 # Nucleus sampling threshold
/set parameter num_ctx 32768 # Context window size
/set parameter num_predict 1000 # Max tokens to generate
/set system "You are a Python expert. Answer only in Python code."
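The /set values map one-to-one onto the "options" object of the native API (and /set system onto the top-level "system" field), so the same configuration can be applied per request. A sketch:

# requirements: pip install requests
import requests

# The interactive /set parameters expressed as a single API request.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama4:scout",
        "system": "You are a Python expert. Answer only in Python code.",
        "prompt": "Write a function that reverses a string.",
        "options": {
            "temperature": 0.7,   # /set parameter temperature 0.7
            "top_p": 0.9,         # /set parameter top_p 0.9
            "num_ctx": 32768,     # /set parameter num_ctx 32768
            "num_predict": 1000,  # /set parameter num_predict 1000
        },
        "stream": False,
    },
    timeout=300,
)
r.raise_for_status()
print(r.json()["response"])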
Step 5: The Ollama REST API
Ollama serves two HTTP APIs on http://localhost:11434: its native API under /api/ and an OpenAI-compatible layer under /v1. Point any OpenAI SDK at http://localhost:11434/v1.
Test the API
# Basic generation (Ollama native format)
curl -s http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:scout",
"prompt": "In one sentence, what is pgvector?",
"stream": false
}' | python3 -c "import json,sys; print(json.load(sys.stdin)['response'])"
Expected output:
pgvector is a PostgreSQL extension that enables efficient storage and similarity search of high-dimensional vector embeddings for AI applications.
# Chat completions (OpenAI-compatible format)
curl -s http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4:scout",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about local AI."}
]
}' | python3 -c "import json,sys; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
Expected output:
Weights on local disk,
No API call goes out—
Sovereign mind thinks.
# Generate embeddings (for RAG pipelines)
curl -s http://localhost:11434/api/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "nomic-embed-text:v1.5", "prompt": "sovereign local AI"}' | \
python3 -c "
import json, sys
d = json.load(sys.stdin)
emb = d['embedding']
print(f'Embedding dimensions: {len(emb)}')
print(f'First 5 values: {emb[:5]}')
"
Expected output:
Embedding dimensions: 768
First 5 values: [0.0234, -0.0187, 0.0412, -0.0098, 0.0315]
Use with Python (OpenAI SDK)
# requirements: pip install openai
from openai import OpenAI
# Point the OpenAI client at your local Ollama server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required by the SDK but not validated by Ollama
)
# Use exactly the same API as cloud OpenAI
response = client.chat.completions.create(
model="llama4:scout",
messages=[
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "Write a function to validate an email address."}
],
temperature=0.3,
max_tokens=300
)
print(response.choices[0].message.content)
print(f"\nTokens used: {response.usage.total_tokens}")
Expected output:
import re
def validate_email(email: str) -> bool:
"""
Validate an email address using a regular expression.
Args:
email: The email address to validate.
Returns:
True if the email is valid, False otherwise.
"""
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
return bool(re.match(pattern, email))
# Test
print(validate_email("[email protected]")) # True
print(validate_email("invalid-email")) # False
Tokens used: 127
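For chat UIs you usually want tokens as they are generated rather than one final blob. The same SDK supports this with stream=True; this sketch prints each delta as it arrives.

# requirements: pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# stream=True returns an iterator of chunks; each delta carries the
# next token(s), so output appears as the model generates it
stream = client.chat.completions.create(
    model="llama4:scout",
    messages=[{"role": "user", "content": "Explain MoE models in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()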
Step 6: Performance Tuning
Apply recommended settings permanently
# Create a systemd override with optimal settings for most hardware
sudo mkdir -p /etc/systemd/system/ollama.service.d/
sudo tee /etc/systemd/system/ollama.service.d/performance.conf << 'EOF'
[Service]
# Flash Attention — reduces VRAM for long contexts (~30% improvement)
Environment="OLLAMA_FLASH_ATTENTION=1"
# KV Cache quantization — reduces VRAM further for long contexts
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
# Keep models loaded longer (avoid re-loading between requests)
Environment="OLLAMA_KEEP_ALIVE=24h"
# Handle multiple parallel requests
Environment="OLLAMA_NUM_PARALLEL=2"
# Keep up to 2 models loaded simultaneously
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Verify settings are active:
sudo systemctl show ollama --property=Environment | tr ' ' '\n' | grep OLLAMA
Expected output:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_KEEP_ALIVE=24h
OLLAMA_NUM_PARALLEL=2
OLLAMA_MAX_LOADED_MODELS=2
Benchmark your hardware
# Quick benchmark — measures tokens/second for your hardware
time ollama run llama4:scout \
"Write a 500-word explanation of how transformers work." \
--verbose 2>&1 | tail -5
Expected output (RTX 4090):
eval count: 412 token(s)
eval duration: 7.234s
eval rate: 56.96 tokens/s
Hardware benchmark results (Llama 4 Scout Q4_K_M):
| Hardware | Tokens/sec |
|---|---|
| RTX 4090 (24GB) | 52–58 tok/s |
| RTX 3080 (10GB) | 32–38 tok/s |
| Apple M3 Max (64GB) | 38–46 tok/s |
| Apple M3 Pro (18GB) | 22–28 tok/s |
| CPU-only (i7-13700K, 32GB) | 4–8 tok/s |
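The --verbose numbers are also exposed programmatically: a non-streaming /api/generate response includes eval_count (tokens generated) and eval_duration (time spent generating, in nanoseconds), so a benchmark script is a few lines. A sketch:

# requirements: pip install requests
import requests

# Reproduce the --verbose eval rate over the API.
r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama4:scout",
        "prompt": "Write a 500-word explanation of how transformers work.",
        "stream": False,
    },
    timeout=600,
)
r.raise_for_status()
d = r.json()
tok_s = d["eval_count"] / (d["eval_duration"] / 1e9)
print(f"{d['eval_count']} tokens at {tok_s:.1f} tok/s")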
Step 7: Install Open WebUI — Browser-Based Chat Interface
Open WebUI provides a ChatGPT-like web interface for Ollama. If you’ve already followed the Sovereign Local AI Stack guide, you have this. Here’s the standalone single-container version:
# Run Open WebUI connected to Ollama
docker run -d \
--name open-webui \
--restart unless-stopped \
-p 127.0.0.1:3000:8080 \
-v open-webui-data:/app/backend/data \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e WEBUI_SECRET_KEY=$(openssl rand -hex 32) \
-e SCARF_NO_ANALYTICS=true \
-e DO_NOT_TRACK=true \
--add-host=host.docker.internal:host-gateway \
ghcr.io/open-webui/open-webui:main
Expected output:
Unable to find image 'ghcr.io/open-webui/open-webui:main' locally
main: Pulling from open-webui/open-webui
...
Status: Downloaded newer image for ghcr.io/open-webui/open-webui:main
a3b4c5d6e7f8a1b2c3d4e5f6a7b8c9d0
# Verify it's running
docker ps --filter "name=open-webui" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
Expected output:
NAMES STATUS PORTS
open-webui Up 30 seconds 127.0.0.1:3000->8080/tcp
Open http://localhost:3000 → create an account → select a model → start chatting.
Step 8: Create Custom Models with Modelfiles
A Modelfile is Ollama’s equivalent of a Dockerfile — it defines a custom model with a system prompt, parameters, and base model.
# Create a sovereign assistant with specific personality
cat > Modelfile.sovereign << 'EOF'
# Vucense Sovereign Assistant
# Base: Llama 4 Scout with sovereignty-focused system prompt
FROM llama4:scout
# System prompt — defines the assistant's persona
SYSTEM """
You are a sovereign AI assistant running entirely on the user's local hardware.
You prioritise data privacy, open-source alternatives, and self-hosted solutions.
When asked about software, always mention the sovereign self-hosted alternative.
You never suggest storing data in proprietary cloud services when a local option exists.
"""
# Parameters — tuned for helpful, precise responses
PARAMETER temperature 0.5
PARAMETER top_p 0.9
PARAMETER num_ctx 32768
# Template — matches Llama 4's chat template
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
EOF
# Build the custom model
ollama create sovereign-assistant -f Modelfile.sovereign
Expected output:
transferring model data
creating model layer
using existing layer sha256:a6eb4748fd29...
creating template layer
creating parameters layer
creating config layer
writing manifest
success
# Test the custom model
ollama run sovereign-assistant "What's the best way to manage passwords?"
Expected output:
For sovereign password management, I strongly recommend Vaultwarden — the self-hosted,
open-source alternative to Bitwarden. You can run it as a Docker container on your
own server: `docker run -d -p 80:80 vaultwarden/server:latest`. Your passwords stay
on your machine, encrypted with zero-knowledge architecture. No cloud dependency,
no subscription fee, and Bitwarden clients (mobile, browser extension, desktop) all
work with self-hosted Vaultwarden.
Step 9: Expose Ollama on Your Local Network
By default, Ollama only listens on 127.0.0.1. To access it from other machines on your network:
# Update the systemd service to bind to all interfaces
sudo tee -a /etc/systemd/system/ollama.service.d/performance.conf << 'EOF'
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
# Restrict access with UFW — allow only from your local network
sudo ufw allow from 192.168.1.0/24 to any port 11434 comment "Ollama local network"
Verify from another machine on the same network:
# From another machine
curl -s http://YOUR_SERVER_IP:11434/api/version
Expected output:
{"version":"0.5.12"}
Step 10: The Sovereignty Layer — Verify Zero Cloud Inference
echo "=== SOVEREIGN OLLAMA AUDIT ==="
echo ""
echo "[ Ollama version ]"
ollama --version 2>/dev/null
echo ""
echo "[ Models on local disk ]"
ollama list 2>/dev/null | awk 'NR>1 {printf " ✓ %-35s %s %s  %s %s %s\n", $1, $3, $4, $5, $6, $7}'
echo ""
echo "[ GPU utilization during inference ]"
# Run a prompt in background
ollama run llama4:scout "test" > /dev/null 2>&1 &
sleep 2
nvidia-smi --query-gpu=name,utilization.gpu,memory.used \
--format=csv,noheader 2>/dev/null | awk '{print " " $0}' || \
echo " (CPU inference or Apple Silicon — check Activity Monitor)"
wait
echo ""
echo "[ Outbound connections during inference ]"
ollama run llama4:scout "test" > /dev/null 2>&1 &
sleep 2
ss -tnp state established 2>/dev/null | \
grep -v "127.0\|::1" | grep ollama || \
echo " ✓ No external connections — all inference is local"
wait
echo ""
echo "[ API responding locally ]"
curl -s http://localhost:11434/api/version | \
python3 -c "import json,sys; d=json.load(sys.stdin); print(' ✓ Ollama API active: v' + d['version'])" \
2>/dev/null || echo " ✗ Ollama API not responding"
Expected output:
=== SOVEREIGN OLLAMA AUDIT ===
[ Ollama version ]
ollama version is 0.5.12
[ Models on local disk ]
✓ llama4:scout 10 GB 3 hours ago
✓ nomic-embed-text:v1.5 274 MB 1 day ago
[ GPU utilization during inference ]
NVIDIA GeForce RTX 4090, 87%, 10847 MiB
[ Outbound connections during inference ]
✓ No external connections — all inference is local
[ API responding locally ]
✓ Ollama API active: v0.5.12
Models on local disk. GPU at 87% utilisation. Zero outbound connections. SovereignScore: 95/100 — 5 points deducted for initial model downloads from Ollama registry. After download, all inference is fully offline.
Troubleshooting
Error: model 'llama4:scout' not found
Cause: Model name misspelling or the model hasn’t been pulled yet. Fix:
ollama list # See what's downloaded
ollama pull llama4:scout # Pull if missing
Ollama responds but inference is very slow (< 2 tok/s)
Cause: Model is running on CPU because GPU wasn’t detected, or VRAM exceeded and model is offloading to RAM. Fix:
# Check where the model actually runs: the PROCESSOR column of
# ollama ps shows the GPU/CPU split for each loaded model
ollama run llama4:scout "test" > /dev/null 2>&1
ollama ps
# If CPU only — check NVIDIA driver
nvidia-smi # Should show your GPU; if command not found, install drivers
# If partial offload (VRAM exceeded) — use smaller model or lower quantization
ollama pull llama3.2:3b # Much smaller: 2GB
Error: listen tcp 127.0.0.1:11434: bind: address already in use
Cause: Another Ollama instance is already running. Fix:
# On Linux, stop the systemd service first if it's running
sudo systemctl stop ollama
# Otherwise find and kill the stray process holding the port
sudo lsof -i :11434
sudo kill -9 $(sudo lsof -t -i:11434)
ollama serve
Out of memory when running large models on Apple Silicon
Cause: Model requires more unified memory than available. Fix:
# Check available memory
vm_stat | grep "Pages free"
# vm_stat's header reports the page size (16KB on Apple Silicon),
# so "Pages free: 500000" is roughly 8GB free
# Use a smaller model or quantization
ollama pull llama3.2:3b # 2GB — runs on any 8GB Mac
ollama pull qwen3:1.7b # 1.4GB — runs on any Mac
Conclusion
Ollama is installed and running on your hardware — pulling models in one command, serving them via a localhost OpenAI-compatible API, and maintaining zero external connections during inference. Every query runs on your GPU or CPU, costs nothing per token, and keeps your data local. The sovereignty audit confirmed no outbound connections during active inference.
The next step is integrating Ollama into a complete sovereign stack: see Build a Sovereign Local AI Stack: Ollama + Open WebUI + pgvector for the full Docker Compose deployment with persistent vector memory, or GGUF Quantization Explained to understand how to choose the right model format for your hardware.
People Also Ask: Ollama FAQ
How is Ollama different from llama.cpp?
llama.cpp is the inference engine — the C++ library that does the actual matrix multiplication to generate tokens. Ollama is a user-friendly wrapper around llama.cpp that adds: automatic model management (pull/push/list), a REST API server, a CLI, and simple model switching. Think of llama.cpp as the engine and Ollama as the car. Ollama is simpler and works for 90% of use cases. llama.cpp directly gives you more control over inference parameters, speculative decoding configuration, and custom quantization — at the cost of manual model management. For sovereign local AI deployments, Ollama is the recommended starting point; graduate to llama.cpp directly when you need parameter-level control.
How much disk space does Ollama use?
Each model download is stored in ~/.ollama/models/ on macOS/Linux or C:\Users\username\.ollama\models\ on Windows. Storage requirements by model size at Q4_K_M quantisation: 3B models ≈ 2GB, 7B ≈ 4GB, 8B ≈ 5GB, 13B ≈ 8GB, 27B ≈ 17GB, 32B ≈ 20GB, 70B ≈ 40GB. Llama 4 Scout (17B active/109B total MoE) is 10GB. Run du -sh ~/.ollama/ to check current usage, and ollama rm model-name to remove models you no longer use.
Can Ollama run multiple models simultaneously?
Yes — set OLLAMA_MAX_LOADED_MODELS=2 (or higher) and OLLAMA_NUM_PARALLEL=4 before starting Ollama. Multiple models will be loaded into VRAM and available simultaneously. Practical VRAM requirements: two 7B models loaded together need ~10GB of VRAM; two 13B models need ~18GB. On Apple Silicon with large unified memory, multiple models load efficiently. The Ollama API queues requests when all model slots are busy and processes them as capacity becomes available.
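A minimal concurrency sketch: with two models kept resident, two threads can query them at the same time. The model names follow the catalogue in Step 3; truncating the responses is just for tidy output.

# requirements: pip install requests
from concurrent.futures import ThreadPoolExecutor
import requests

def ask(model: str, prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return f"{model}: {r.json()['response'][:80]}..."

# Two threads, two models: with OLLAMA_MAX_LOADED_MODELS=2 both stay
# resident in VRAM and the requests are served in parallel.
jobs = [("llama4:scout", "What is a mixture-of-experts model?"),
        ("qwen3:8b", "What is retrieval-augmented generation?")]
with ThreadPoolExecutor(max_workers=2) as pool:
    for result in pool.map(lambda job: ask(*job), jobs):
        print(result)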
Is Ollama safe to use with sensitive data?
Yes, with one important caveat: Ollama itself makes no external connections during inference — your prompts and responses never leave your machine. However, the initial model download connects to registry.ollama.ai. Once models are cached locally, Ollama operates completely offline. For maximum security with highly sensitive data: download models once on a network-connected machine, then move them to an air-gapped machine via ~/.ollama/ directory copy. Verify zero outbound connections during inference using the sovereignty audit in Step 10.
Further Reading
- Build a Sovereign Local AI Stack: Ollama + Open WebUI + pgvector — complete Docker Compose stack with Ollama at the centre
- GGUF Quantization Explained: Q4_K_M vs Q8_0 vs F16 — choose the right model format for your VRAM
- Speculative Decoding: 2x Faster Local LLMs — double your Ollama throughput
- Private Document Q&A with pgvector — build a sovereign RAG pipeline on Ollama
- Ollama Official GitHub (73K+ stars) — source code, issues, and release notes
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 4090 24GB), Ubuntu 24.04 LTS (CPU-only i7-13700K 32GB), macOS Sequoia 15.4 (Apple M3 Max 64GB). Ollama version 0.5.12. Last verified: April 17, 2026.