Run Vision Models Locally with Ollama 2026: Images + LLMs

🟡Intermediate

Run local vision-language models on Ubuntu 24.04 with Ollama in 2026. Covers Llama 4 Scout, Gemma3 Vision, LLaVA, image analysis, document OCR, screenshot understanding, and Python API integration.

By Kofi Mensah

Feb 9, 2026

15 min

Key Takeaways

Llama 4 Scout (released April 2025, 17B active parameters) is the best local vision-language model in 2026: it understands images, screenshots, charts, and handwritten text at near-GPT-4o quality while running on a single RTX 4090 or Apple M3 Max.
Vision models in Ollama accept images as base64-encoded strings in the API, or as file paths in the CLI: start 'ollama run llama4:scout' and include '/path/to/image.jpg' in the prompt.
Sovereign vision AI removes a critical privacy risk: screenshots, document scans, and photos contain sensitive personal data, and sending them to cloud vision APIs (GPT-4V, Gemini Vision, Claude's vision API) places a copy on external servers that you do not control.
Vision models enable practical automation: screenshot-based UI testing, invoice and receipt parsing, diagram-to-code generation, and accessibility descriptions — all achievable locally in 2026 on consumer hardware.

Key Takeaways

Llama 4 Scout is the 2026 standard: Meta’s April 2025 multimodal MoE model understands images, charts, screenshots, and documents. It runs on an RTX 4090 or M3 Max and handles the same image analysis tasks as GPT-4V.
Same Ollama API, just add images: Vision models use the same ollama.chat() API as text models — you pass an additional images parameter with base64-encoded image data or file paths.
Images never leave the machine: Unlike GPT-4V (api.openai.com) or Gemini Vision (generativelanguage.googleapis.com), local vision inference has zero outbound data. Verify with ss -tnp state established | grep ollama.
Practical uses: Code review from screenshots, OCR on document scans, UI bug reports from screenshots, invoice/receipt parsing, accessibility alt-text generation, diagram-to-code.

Introduction

Direct Answer: How do I run vision-language models locally with Ollama on Ubuntu 24.04 in 2026?

Pull a vision-capable model with ollama pull llama4:scout (12GB, RTX 4090 / M3 Max 16GB+) or ollama pull gemma3:12b (8GB, RTX 3080). Send images to the model via Ollama’s Python SDK: import ollama; response = ollama.chat(model='llama4:scout', messages=[{'role':'user','content':'Describe this image','images':['path/to/image.jpg']}]); print(response['message']['content']). From the CLI: ollama run llama4:scout, then type your question and paste or reference the image. Llama 4 Scout understands natural images, screenshots, charts, diagrams, handwritten text, and documents — at near-GPT-4o quality for most tasks. All inference runs locally via Ollama; no data is sent to any external server.
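
The SDK call above wraps Ollama's REST API. As a minimal sketch, assuming the server listens on its default local address and that the /api/chat endpoint takes base64 strings in an images list, the same request can be built with only the standard library. The helper names build_chat_payload and post_chat are hypothetical, chosen for this illustration.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_payload(image_bytes: bytes, question: str,
                       model: str = "llama4:scout") -> dict:
    """Build the JSON body for a single-turn vision request."""
    return {
        "model": model,
        "stream": False,  # ask for one complete JSON response
        "messages": [{
            "role": "user",
            "content": question,
            # The REST API expects base64 text, not raw bytes
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }


def post_chat(payload: dict, url: str = OLLAMA_URL) -> str:
    """Send the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())["message"]["content"]
```

Building the payload separately from sending it makes the request easy to inspect or log before any image data goes over the local socket.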

Part 1: Pull a Vision Model

# Verify Ollama is running
ollama list

# Pull vision-capable models (choose based on VRAM)
# RTX 4090 / M3 Max 16GB+ (best quality):
ollama pull llama4:scout          # 12GB — best overall vision quality

# RTX 3080/3090, M2 Max, M3 Pro (good balance):
ollama pull gemma3:12b            # 8GB — strong vision, 128K context

# RTX 3060 12GB, M2/M3 base (accessible):
ollama pull llava:13b             # 8GB — original vision model, widely supported
ollama pull gemma3:4b             # 3GB — fast, basic vision capabilities

echo "Vision models ready"
ollama list | grep -E "llama4|gemma3|llava"

Expected output:

NAME              ID            SIZE   MODIFIED
llama4:scout      abc123def456  12 GB  2 minutes ago
gemma3:12b        def456abc123  8.1 GB 5 minutes ago
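
The VRAM tiers above can be turned into a small selection helper. This is a sketch only: the sizes are the download sizes quoted in this guide, the 2 GB headroom is an assumed allowance rather than a measured figure, and pick_vision_model is a hypothetical name.

```python
from typing import Optional

# (model tag, approximate size in GB) in this guide's order of preference
MODEL_PREFERENCE = [
    ("llama4:scout", 12.0),
    ("gemma3:12b", 8.0),
    ("llava:13b", 8.0),
    ("gemma3:4b", 3.0),
]


def pick_vision_model(vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the first preferred model whose size fits in VRAM minus headroom."""
    budget = vram_gb - headroom_gb
    for name, size_gb in MODEL_PREFERENCE:
        if size_gb <= budget:
            return name
    return None  # nothing in the list fits


if __name__ == "__main__":
    for vram in (24, 12, 6, 4):
        print(f"{vram} GB VRAM -> {pick_vision_model(vram)}")
```

Actual memory use also depends on context length and image resolution, so treat the result as a starting point and confirm with a real run.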

Part 2: CLI Usage

# Describe an image from the command line
ollama run llama4:scout

In the interactive prompt:

>>> What is in this image? /home/ubuntu/screenshot.png

Or pass directly:

# One-shot image analysis: include the image path in the prompt itself
ollama run llama4:scout "Describe what you see in detail: /path/to/image.jpg"

# With a question
ollama run llama4:scout \
  "What error is shown in this screenshot? How would you fix it? /path/to/error-screenshot.png"

Expected output:

The screenshot shows a Python AttributeError traceback. The error occurs on line 47 of
api_client.py: 'NoneType' object has no attribute 'get'. This means the variable
'response' is None when .get('data') is called.

To fix this: add a null check before accessing the attribute:
    if response is not None:
        data = response.get('data', {})
    else:
        raise ValueError("API returned None response")
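
To script the CLI from Python, one option is to shell out to ollama run with the image path placed inside the prompt. The sketch below assumes the one-shot form ollama run MODEL "PROMPT" and an image path without spaces. The names build_run_command and ask_cli are hypothetical.

```python
import subprocess
from typing import List


def build_run_command(model: str, question: str, image_path: str) -> List[str]:
    """Build an `ollama run` command with the image path inside the prompt."""
    return ["ollama", "run", model, f"{question} {image_path}"]


def ask_cli(model: str, question: str, image_path: str) -> str:
    """Run the command and return the model's reply from stdout."""
    completed = subprocess.run(
        build_run_command(model, question, image_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ollama exits with an error
    )
    return completed.stdout.strip()
```

Passing the command as a list avoids shell quoting problems with the question text. For anything beyond quick scripts, the Python SDK shown in Part 3 is the more direct route.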

Part 3: Python API Integration

# vision_analysis.py — analyse images with local Ollama vision models
import ollama
import base64
from pathlib import Path

def analyse_image(image_path: str, question: str, model: str = "llama4:scout") -> str:
    """
    Analyse an image with a local vision model.
    Returns the model's text response.
    """
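    image_path = Path(image_path)
    if not image_path.is_file():
        # Fail early with a clear message instead of a confusing client-side error
        raise FileNotFoundError(f"Image not found: {image_path}")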

    # Method 1: Pass file path directly (Ollama handles encoding)
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": question,
            "images": [str(image_path)]   # Path to image file
        }]
    )
    return response["message"]["content"]

def analyse_image_bytes(image_bytes: bytes, question: str, model: str = "llama4:scout") -> str:
    """Analyse image from bytes (e.g., from a screenshot or web download)."""
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": question,
            "images": [image_b64]         # Base64-encoded image
        }]
    )
    return response["message"]["content"]


# ── Example 1: Describe an image ──────────────────────────────────────────
result = analyse_image(
    "photo.jpg",
    "Describe this image in detail, including any text visible."
)
print("Description:")
print(result)
print()

# ── Example 2: Extract text from a screenshot (OCR) ───────────────────────
result = analyse_image(
    "screenshot.png",
    "Extract all text from this screenshot. Format it as plain text, preserving structure."
)
print("Extracted text:")
print(result)
print()

# ── Example 3: Analyse a chart or graph ───────────────────────────────────
result = analyse_image(
    "sales-chart.png",
    "What does this chart show? What are the key trends and the most important data points?"
)
print("Chart analysis:")
print(result)

Expected output (screenshot analysis):

Description:
The image shows a Python programming terminal window on Ubuntu 24.04 with a dark theme.
The terminal displays a traceback error in red text, indicating a KeyError at line 23...

Extracted text:
Traceback (most recent call last):
  File "app.py", line 23, in process_data
    value = data['result']['status']
KeyError: 'status'

Chart analysis:
This is a bar chart showing monthly sales revenue for Q1 2026. January shows the highest
revenue at approximately $142,000. February dipped to $98,000 (-31%). March recovered
to $127,000...
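
Bytes that come from a web download are not always an image; a failed request may return an HTML error page instead. A minimal sketch of a pre-check, using the well-known file signatures for PNG, JPEG, GIF, and WebP, could run before analyse_image_bytes. The function name sniff_image_format is hypothetical, and the check says nothing about which formats a given model accepts.

```python
from typing import Optional


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return 'png', 'jpeg', 'gif' or 'webp' from the leading bytes, else None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP is a RIFF container: 'RIFF', 4 size bytes, then 'WEBP'
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


if __name__ == "__main__":
    print(sniff_image_format(b"<html><body>404</body></html>"))  # None
```

Rejecting non-image bytes up front gives a clearer error than whatever the model returns when handed an HTML page.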

Part 4: Practical Use Cases

Invoice and Receipt Parser

# invoice_parser.py — extract structured data from invoice images
import ollama
import json

def parse_invoice(image_path: str) -> dict:
    """Extract structured data from an invoice or receipt image."""
    prompt = """Extract the following information from this invoice/receipt image.
Return ONLY a JSON object with these fields (use null if not found):
{
  "vendor_name": "",
  "date": "",
  "invoice_number": "",
  "line_items": [{"description": "", "quantity": 0, "unit_price": 0, "total": 0}],
  "subtotal": 0,
  "tax": 0,
  "total": 0,
  "currency": ""
}
Return only the JSON, no explanation."""

    response = ollama.chat(
        model="llama4:scout",
        messages=[{"role": "user", "content": prompt, "images": [image_path]}],
        format="json"    # Force JSON output
    )

    return json.loads(response["message"]["content"])

# Test with an invoice scan
invoice_data = parse_invoice("invoice.jpg")
print(f"Vendor: {invoice_data['vendor_name']}")
print(f"Total:  {invoice_data['currency']}{invoice_data['total']}")
print(f"Items:  {len(invoice_data['line_items'])}")

Expected output:

Vendor: Hetzner Online GmbH
Total:  EUR47.40
Items:  3
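
Vision models can misread digits, so it is worth checking the extracted numbers against each other before trusting them. The sketch below assumes the field names from the prompt above and numeric values (or null) in each money field. The function name check_invoice_totals and the one-cent tolerance are illustrative choices.

```python
from decimal import Decimal
from typing import List


def check_invoice_totals(invoice: dict, tolerance: str = "0.01") -> List[str]:
    """Return a list of arithmetic problems; an empty list means consistent."""
    problems = []
    tol = Decimal(tolerance)

    def money(value) -> Decimal:
        # Go through str() so floats such as 47.4 do not carry binary noise;
        # a missing (null) value is treated as zero
        return Decimal(str(value if value is not None else 0))

    line_items = invoice.get("line_items") or []
    items_sum = sum((money(item.get("total")) for item in line_items), Decimal("0"))
    subtotal = money(invoice.get("subtotal"))
    tax = money(invoice.get("tax"))
    total = money(invoice.get("total"))

    if abs(items_sum - subtotal) > tol:
        problems.append(f"line items sum to {items_sum}, subtotal is {subtotal}")
    if abs(subtotal + tax - total) > tol:
        problems.append(f"subtotal + tax is {subtotal + tax}, total is {total}")
    return problems
```

A non-empty result is a signal to re-run the extraction or route the invoice to manual review rather than a proof that the scan itself is wrong.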

Screenshot Bug Reporter

# bug_reporter.py — analyse error screenshots and generate bug reports
import ollama
import subprocess
from datetime import datetime

def screenshot_and_report(description: str = "Current screen state") -> str:
    """Take a screenshot and generate a bug report."""
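    # Take screenshot (requires scrot: sudo apt-get install scrot).
    # scrot targets X11; under a Wayland session (the Ubuntu 24.04 desktop default)
    # it may fail or capture a blank image, so use a Wayland-capable tool there.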
    screenshot_path = f"/tmp/screenshot_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
    subprocess.run(["scrot", screenshot_path], check=True)

    prompt = f"""Analyse this screenshot taken while: {description}

Generate a structured bug report with:
1. **Summary**: One-sentence description of the issue
2. **Observed behaviour**: What is shown in the screenshot
3. **Expected behaviour**: What should be happening
4. **Steps to reproduce**: Based on the visible UI state
5. **Severity**: Low / Medium / High / Critical"""

    response = ollama.chat(
        model="llama4:scout",
        messages=[{"role": "user", "content": prompt, "images": [screenshot_path]}]
    )
    return response["message"]["content"]

# Generate a bug report from the current screen
report = screenshot_and_report("Clicking the submit button on the user registration form")
print(report)
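
To file or triage the generated report automatically, the severity rating has to be pulled out of the free-text answer. A rough sketch, assuming the model follows the "Severity" heading requested in the prompt, is shown below. The name extract_severity is hypothetical, and if the model echoes the full "Low / Medium / High / Critical" list this pattern returns the first level it sees.

```python
import re
from typing import Optional

# Matches e.g. "5. **Severity**: High" or "Severity: critical"
SEVERITY_PATTERN = re.compile(
    r"severity\W*?(low|medium|high|critical)\b", re.IGNORECASE
)


def extract_severity(report: str) -> Optional[str]:
    """Return the severity level named after the 'Severity' heading, if any."""
    match = SEVERITY_PATTERN.search(report)
    return match.group(1).capitalize() if match else None


if __name__ == "__main__":
    sample = "1. **Summary**: Submit button unresponsive\n5. **Severity**: High"
    print(extract_severity(sample))
```

Returning None for reports without a recognisable rating lets the caller decide whether to retry the prompt or fall back to a default.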

Accessibility Alt-Text Generator

# alt_text.py — generate accessibility descriptions for images
import ollama
from pathlib import Path

def generate_alt_text(image_path: str, context: str = "") -> str:
    """Generate an accessibility alt text description for an image."""
    context_prompt = f"This image appears on a webpage about: {context}. " if context else ""

    prompt = f"""{context_prompt}Write a concise, descriptive alt text for this image suitable for screen readers.
Requirements:
- Maximum 125 characters
- Describe the essential content and context
- Do not start with "Image of" or "Photo of"
- Focus on what matters for someone who cannot see the image"""

    response = ollama.chat(
        model="gemma3:12b",    # Faster for simple descriptions
        messages=[{"role": "user", "content": prompt, "images": [image_path]}]
    )
    return response["message"]["content"].strip()

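# Process a directory of images
# Path.glob() does not expand braces, so filter on the file suffix instead
image_dir = Path("./website-images")
image_suffixes = {".jpg", ".jpeg", ".png", ".webp"}
for img in image_dir.iterdir():
    if img.suffix.lower() not in image_suffixes:
        continue
    alt = generate_alt_text(str(img), context="server infrastructure")
    print(f"{img.name}: {alt}")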

Expected output:

server-rack.jpg: Data center server rack with blinking LED indicators, cable management system, and cooling vents visible in a dimly lit facility
dashboard-screenshot.png: Grafana monitoring dashboard showing CPU usage at 34% and memory at 2.1GB, with green status indicators for all services
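
Models do not reliably respect a character limit stated in the prompt, and the first sample line above runs past 125 characters. A small post-processing step can enforce the requirements in code. This is a sketch with a hypothetical helper name, tidy_alt_text, which trims at a word boundary rather than mid-word.

```python
import re


def tidy_alt_text(text: str, limit: int = 125) -> str:
    """Strip boilerplate openers and trim to `limit` characters at a word boundary."""
    # Collapse whitespace and remove quotes the model may wrap around its answer
    cleaned = " ".join(text.split()).strip("\"'")
    # Drop openers the prompt forbids, such as "Image of" or "A photo of"
    cleaned = re.sub(r"^(?:an?\s+)?(?:image|photo|picture)\s+of\s+", "", cleaned,
                     flags=re.IGNORECASE)
    if cleaned:
        cleaned = cleaned[0].upper() + cleaned[1:]
    if len(cleaned) <= limit:
        return cleaned
    # Cut at the last space that fits, then remove trailing punctuation
    cut = cleaned[:limit].rsplit(" ", 1)[0]
    return cut.rstrip(" ,;:.")


if __name__ == "__main__":
    print(tidy_alt_text("Image of a server rack in a data centre"))
```

Wrapping the return value of generate_alt_text in this helper keeps the output within the limit even when the model ignores the instruction.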

Part 5: Sovereignty Verification

echo "=== VISION MODEL SOVEREIGNTY AUDIT ==="

echo ""
echo "[ Vision model loaded in Ollama ]"
ollama list | grep -E "llama4|gemma3|llava" | awk '{printf "  ✓ %s (%s)\n", $1, $3" "$4}'

echo ""
echo "[ Test: analyse image with zero external connections ]"
# Monitor connections WHILE running inference.
# Note: ss only shows process names for other users' processes when run as root,
# so run this audit with sudo if Ollama runs as a systemd service.
python3 -c "
import ollama, subprocess, threading, time

connections_found = set()
done = threading.Event()

def monitor():
    # Poll until inference has finished, not for a fixed number of iterations
    while not done.is_set():
        result = subprocess.run(
            ['ss', '-tnp', 'state', 'established'],
            capture_output=True, text=True
        )
        for line in result.stdout.splitlines():
            if 'ollama' in line and '127.0.0.1' not in line and '::1' not in line:
                connections_found.add(line.strip())
        time.sleep(0.5)

t = threading.Thread(target=monitor, daemon=True)
t.start()

# Run actual vision inference
try:
    response = ollama.chat(
        model='llama4:scout',
        messages=[{'role': 'user', 'content': 'What colours are in this logo?',
                   'images': ['/usr/share/pixmaps/ubuntu-logo.png']}]
    )
finally:
    done.set()
    t.join(timeout=5)

if connections_found:
    print('✗ External connections detected:')
    for c in sorted(connections_found):
        print(f'  {c}')
else:
    print('✓ Zero external connections during vision inference')
    text = response['message']['content'][:60]
    print(f'  Response: {text}...')
" 2>/dev/null || echo "  (test requires llama4:scout and a PNG file)"

Expected output:

=== VISION MODEL SOVEREIGNTY AUDIT ===

[ Vision model loaded in Ollama ]
  ✓ llama4:scout (12 GB  2 days ago)
  ✓ gemma3:12b   (8.1 GB 3 days ago)

[ Test: analyse image with zero external connections ]
✓ Zero external connections during vision inference
  Response: The Ubuntu logo primarily features orange and purple/aubergine colour...
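
Matching on the substrings 127.0.0.1 and ::1 is a coarse filter: an external IPv6 address that happens to contain ::1 would be skipped. A stricter variant parses the peer address column and asks the ipaddress module whether it is loopback. The sketch below assumes the column layout that ss -tnp state established prints (Recv-Q, Send-Q, local, peer, process). The function names are hypothetical.

```python
import ipaddress
from typing import List


def is_loopback(endpoint: str) -> bool:
    """True if an ss-style 'host:port' endpoint is a loopback address."""
    host = endpoint.rsplit(":", 1)[0].strip("[]")
    host = host.split("%", 1)[0]  # drop an interface suffix such as %lo
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unparseable hosts are treated as external


def external_connections(ss_output: str, process: str = "ollama") -> List[str]:
    """Return ss lines for `process` whose peer address is not loopback."""
    hits = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: Recv-Q Send-Q Local:Port Peer:Port Process
        if len(fields) < 5 or process not in " ".join(fields[4:]):
            continue
        if not is_loopback(fields[3]):
            hits.append(line.strip())
    return hits
```

Treating anything that cannot be parsed as external errs on the side of reporting too much, which is the safer failure mode for an audit.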

Troubleshooting

`Error: model does not support images`

Cause: Trying to pass images to a text-only model. Fix: Use a vision-capable model: llama4:scout, gemma3:12b, gemma3:27b, llava:13b, or llava:34b. Check: ollama show MODEL | grep vision.
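
The same check can run from Python before any image is sent. This sketch shells out to ollama show and looks for the word "vision" on any line, which is a slightly stricter version of the grep above. The exact layout of the ollama show output varies between Ollama versions, so treat the parsing as an assumption; the function names are hypothetical.

```python
import subprocess


def mentions_vision(show_output: str) -> bool:
    """True if any line of `ollama show` output contains the word 'vision'."""
    return any(
        "vision" in line.lower().split()
        for line in show_output.splitlines()
    )


def model_supports_images(model: str) -> bool:
    """Run `ollama show MODEL` and check its output for a vision capability."""
    completed = subprocess.run(
        ["ollama", "show", model],
        capture_output=True,
        text=True,
        check=True,  # raises if the model is not installed
    )
    return mentions_vision(completed.stdout)
```

Matching whole words avoids false positives from unrelated text such as "revision".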

Image too large — response truncated or slow

Cause: High-resolution images consume a large portion of the context window. Fix: Resize before sending: from PIL import Image; img = Image.open('large.jpg'); img.thumbnail((1024, 1024)); img.save('resized.jpg'). Most visual information is preserved at 1024px.
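
As a reusable version of that one-liner, the sketch below separates the size calculation from the file handling. It assumes Pillow is installed (pip install pillow) and that the destination format can store the source image's mode; for example, saving an image with transparency as JPEG would need a conversion first. The function names are hypothetical.

```python
from typing import Tuple


def scaled_size(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_for_model(src: str, dst: str, max_side: int = 1024) -> Tuple[int, int]:
    """Write a downscaled copy of src to dst and return its new size."""
    from PIL import Image  # imported lazily so scaled_size works without Pillow

    with Image.open(src) as img:
        new_size = scaled_size(img.width, img.height, max_side)
        resized = img.resize(new_size, Image.LANCZOS)
        resized.save(dst)
    return new_size
```

For screenshots with small text, a larger max_side may be needed to keep the text legible, at the cost of more context and slower inference.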

`base64` encoding error with large images

Fix: Use file path instead of base64: "images": ["/absolute/path/to/image.jpg"] — Ollama handles the encoding internally.

Conclusion

Local vision-language models via Ollama make image understanding a private capability with no per-request API cost. Llama 4 Scout handles the vision tasks covered here (screenshot analysis, document OCR, chart interpretation, and invoice parsing) at quality levels that until recently required a cloud API. Your images stay on your hardware.

Connect this to Open WebUI for a chat interface with image upload, or to LangChain and LangGraph with Ollama to build agents that can see and analyse visual inputs as part of multi-step workflows.
