
Python + Ollama 2026: Build Local AI Apps Without Cloud APIs


🟡Intermediate

Use the Ollama Python SDK to build sovereign AI applications. Covers async inference, streaming, structured JSON output, vision, embeddings, and OpenAI SDK migration.


By Anju Kushwaha ✓

Feb 28, 2026


Key Takeaways

The Ollama Python SDK connects to a local Ollama instance in a few lines: 'import ollama; r = ollama.chat(model="qwen3:14b", messages=[{"role":"user","content":"hello"}]); print(r["message"]["content"])'. No API key, account, or usage limit is involved.
Streaming with stream=True returns a generator of chunks. Iterate over it with 'for chunk in ollama.chat(..., stream=True): print(chunk["message"]["content"], end="")' to print output as the model generates it.
Structured JSON output uses format='json' plus a Pydantic schema in the system prompt; validating the reply with Pydantic then yields typed Python objects without hand-written parsing.
Ollama exposes an OpenAI-compatible endpoint: replace 'OpenAI(api_key=...)' with 'OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")' to point existing OpenAI SDK code at local inference.


Introduction

Direct Answer: How do I use Python to call local Ollama models without cloud APIs in 2026?

Install the SDK with pip install ollama. Make sure Ollama is running (ollama serve) and a model is pulled (ollama pull qwen3:14b). Basic usage: import ollama; r = ollama.chat(model='qwen3:14b', messages=[{'role':'user','content':'Hello'}]); print(r['message']['content']). For streaming: for chunk in ollama.chat(..., stream=True): print(chunk['message']['content'], end='', flush=True). For JSON output: add format='json', describe your schema in the system prompt, then call json.loads(r['message']['content']). For async: from ollama import AsyncClient; c = AsyncClient(); r = await c.chat(...) inside a coroutine. The same code runs on Ubuntu 24.04 and macOS. All inference happens locally through the Ollama daemon on port 11434, with no external API calls.

Setup

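python3 -m venv .venv && source .venv/bin/activate
pip install ollama pydantic numpy openai
python3 -c "from importlib.metadata import version; print('Ollama SDK:', version('ollama'))"
ollama list | head -3

Expected output (your version and model list will differ, and ollama list prints additional columns):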

Ollama SDK: 0.4.7
NAME              SIZE
qwen3:14b         9.3 GB

Part 1: Basic Chat and Multi-Turn Conversation

import ollama

# Single turn
r = ollama.chat(
    model="qwen3:14b",
    messages=[
        {"role": "system", "content": "Reply in one sentence maximum."},
        {"role": "user", "content": "What is Docker?"}
    ]
)
print(r["message"]["content"])
print(f"Speed: {r['eval_count'] / (r['eval_duration']/1e9):.1f} tok/s")

Expected output:

Docker is a containerisation platform that packages applications into portable, isolated containers.
Speed: 31.8 tok/s

# Multi-turn conversation — accumulate messages manually
messages = []

def chat(user_msg: str) -> str:
    messages.append({"role": "user", "content": user_msg})
    r = ollama.chat(model="qwen3:14b", messages=messages)
    reply = r["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

print(chat("My server has 8GB RAM. What shared_buffers should I set in PostgreSQL?"))
print(chat("And effective_cache_size?"))

Expected output:

Set shared_buffers = 2GB (25% of 8GB RAM).
Set effective_cache_size = 6GB (75% of RAM) — this is a query planner hint, not actual allocation.

Part 2: Streaming

import ollama

def stream_response(prompt: str, model: str = "qwen3:14b") -> str:
    full = ""
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ):
        token = chunk["message"]["content"]
        print(token, end="", flush=True)
        full += token
    print()
    return full

stream_response(
    "Write a Python function to read a CSV and return the top N rows by a column value."
)

Expected output (tokens appear in real time):

import csv

def top_n_rows(filepath: str, column: str, n: int = 10) -> list[dict]:
    """Return top N rows sorted descending by column value."""
    with open(filepath, newline='') as f:
        rows = list(csv.DictReader(f))
    return sorted(rows, key=lambda r: float(r[column]), reverse=True)[:n]

Part 3: Structured JSON Output with Pydantic

import ollama, json
from pydantic import BaseModel, Field
from typing import List, Literal

class SecurityIssue(BaseModel):
    line: int
    severity: Literal["critical", "high", "medium", "low"]
    description: str
    fix: str

class CodeReview(BaseModel):
    safe_to_deploy: bool
    score: int = Field(ge=0, le=100)
    issues: List[SecurityIssue]
    summary: str

def review_code(code: str) -> CodeReview:
    schema = json.dumps(CodeReview.model_json_schema(), indent=2)
    r = ollama.chat(
        model="qwen3:14b",
        messages=[
            {"role": "system", "content": f"Review code for security. Return ONLY JSON matching:\n{schema}"},
            {"role": "user", "content": code}
        ],
        format="json"
    )
    return CodeReview.model_validate_json(r["message"]["content"])

result = review_code("""
def get_user(uid: str):
    return db.execute(f"SELECT * FROM users WHERE id = '{uid}'")
""")

print(f"Safe: {result.safe_to_deploy} | Score: {result.score}/100")
for issue in result.issues:
    print(f"  [{issue.severity.upper()}] L{issue.line}: {issue.description}")
    print(f"    Fix: {issue.fix}")

Expected output:

Safe: False | Score: 20/100
  [CRITICAL] L2: SQL injection via f-string interpolation of user input
    Fix: db.execute("SELECT * FROM users WHERE id = ?", (uid,))

Part 4: Async Batch Processing

import asyncio
from ollama import AsyncClient

async def classify(client, text: str) -> str:
    r = await client.chat(
        model="qwen3:14b",
        messages=[
            {"role": "system", "content": "Classify sentiment: POSITIVE, NEGATIVE, or NEUTRAL. One word only."},
            {"role": "user", "content": text}
        ]
    )
    return r["message"]["content"].strip()

async def batch_classify(texts: list[str]) -> list[str]:
    # One shared client; gather() sends all requests without waiting for each reply
    client = AsyncClient()
    return await asyncio.gather(*[classify(client, t) for t in texts])

reviews = [
    "Amazing product, fast delivery!",
    "Broken on arrival, terrible support.",
    "Works as described, nothing special.",
]

sentiments = asyncio.run(batch_classify(reviews))
for rev, sent in zip(reviews, sentiments):
    print(f"  [{sent:8s}] {rev}")

Expected output:

  [POSITIVE] Amazing product, fast delivery!
  [NEGATIVE] Broken on arrival, terrible support.
  [NEUTRAL ] Works as described, nothing special.

Part 5: Vision — Analyse Images

import ollama

# Pass image file path — Ollama handles encoding
r = ollama.chat(
    model="llama4:scout",
    messages=[{
        "role": "user",
        "content": "What error is shown in this screenshot? Give the exact message.",
        "images": ["/tmp/error-screenshot.png"]
    }]
)
print(r["message"]["content"])

Part 6: Embeddings

import ollama, numpy as np

def embed(text: str) -> list[float]:
    return ollama.embeddings(model="nomic-embed-text:v1.5", prompt=text)["embedding"]

def cosine_sim(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [
    "Redis is an in-memory data store used for caching",
    "PostgreSQL is a relational database",
    "Docker packages apps into containers",
]

query = "What should I use for a fast cache?"
q_vec = embed(query)
ranked = sorted([(cosine_sim(q_vec, embed(d)), d) for d in docs], reverse=True)

for score, doc in ranked:
    print(f"  {score:.3f}  {doc}")

Expected output:

  0.887  Redis is an in-memory data store used for caching
  0.742  PostgreSQL is a relational database
  0.618  Docker packages apps into containers

Part 7: OpenAI SDK Migration

# Before (OpenAI cloud):
# from openai import OpenAI
# client = OpenAI(api_key="sk-...")

# After (local Ollama) — change two values:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"   # ignored by Ollama but required by the SDK
)

r = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(r.choices[0].message.content)

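The script prints the model's reply to the prompt.

Code that uses the OpenAI SDK's chat completions interface (model, temperature, max_tokens, system prompts) generally works unchanged against this endpoint. Features outside Ollama's compatibility layer may not be available, so test each code path you migrate. There is no per-query cost.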

Sovereignty Verification

# Monitor outbound connections from the Python client during inference
# (this checks the client process only, not the Ollama daemon itself)
python3 -c "
import ollama, subprocess, threading, time

found = []
def monitor():
    # Poll established TCP sockets; loopback traffic to Ollama is expected
    for _ in range(8):
        r = subprocess.run(['ss','-tnp','state','established'],
                           capture_output=True, text=True)
        for line in r.stdout.splitlines():
            is_loopback = '127.0.0.1' in line or '[::1]' in line
            if 'python' in line.lower() and not is_loopback:
                found.append(line)
        time.sleep(0.5)

t = threading.Thread(target=monitor, daemon=True)
t.start()
ollama.chat(model='qwen3:14b', messages=[{'role':'user','content':'ping'}])
t.join(timeout=5)
print('External connections:', found if found else 'None — fully sovereign ✓')
"

Expected output:

External connections: None — fully sovereign ✓

Troubleshooting

`ConnectionError: http://localhost:11434`

Ollama is not running. Start it with ollama serve & or, if it is installed as a systemd service, with sudo systemctl start ollama.

Model produces garbled JSON despite `format="json"`

Add explicit instructions: "Return ONLY a JSON object. No markdown fences, no explanation." Smaller models (7B) need more explicit format constraints than 14B+.

Async tasks all waiting instead of running concurrently

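asyncio.gather() sends the requests concurrently, but the server decides how many run at once. Requests beyond its parallel limit are queued, and that limit depends on available memory and the OLLAMA_NUM_PARALLEL setting. If throughput matters, raise that setting where memory allows, or run multiple Ollama instances on different ports.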
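When the server queues requests anyway, it can help to cap how many are in flight from the client side so that timeouts stay predictable. The following is a minimal sketch, assuming worker is any coroutine function (for example, a wrapper around the classify helper from Part 4). The names bounded_gather and limit are illustrative and not part of the Ollama SDK.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_gather(
    worker: Callable[[T], Awaitable[R]],
    items: list[T],
    limit: int = 2,
) -> list[R]:
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items
    return await asyncio.gather(*(guarded(item) for item in items))
```

A reasonable starting point is to set limit to the number of requests the server is configured to run in parallel, then adjust based on measured latency.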

Conclusion

The Ollama Python SDK makes local AI integration look much like cloud API integration: same patterns, same code structure, zero per-query cost, and full sovereignty. The OpenAI compatibility layer means existing code migrates by changing the base URL and API key.

See LangChain and LangGraph with Ollama for orchestrating these calls into multi-step agent workflows, and Prompt Engineering Guide 2026 for techniques that maximise output quality from these local models.

Part 8: Local Caching and Inference Efficiency

Make local Ollama apps responsive by caching model responses and embeddings.

8.1 Response caching

For repeated queries or common prompts, cache the full model output locally in Redis or a filesystem cache. Use a simple hash of the prompt and conversation context as the cache key.

import hashlib, json
from pathlib import Path

# /var/cache usually needs elevated permissions; use a user-writable path otherwise
cache_dir = Path('/var/cache/ollama')
cache_dir.mkdir(parents=True, exist_ok=True)

def cache_key(prompt: str, messages: list[dict]) -> str:
    payload = json.dumps({'prompt': prompt, 'messages': messages}, sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()
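The key function above only covers half of the cache. As a rough sketch of the lookup and store steps, the helper below wraps any chat callable with a filesystem cache. It assumes chat_fn has the same keyword interface and response shape as ollama.chat; the names cached_chat and chat_fn are illustrative, and the model name is added to the key so that different models do not share entries.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_chat(
    chat_fn: Callable[..., dict],
    model: str,
    messages: list[dict],
    cache_dir: Path,
) -> str:
    """Return the cached reply for (model, messages), calling chat_fn on a miss."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"

    # Cache hit: skip inference entirely
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["content"]

    # Cache miss: run inference, then persist the reply for next time
    response = chat_fn(model=model, messages=messages)
    content = response["message"]["content"]
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"content": content}), encoding="utf-8")
    return content
```

With the SDK installed, a call might look like cached_chat(ollama.chat, "qwen3:14b", messages, cache_dir). Caching full replies suits prompts where a repeated answer is acceptable; it is less appropriate when varied output is expected.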

8.2 Embedding caching

Compute embeddings once per text chunk and store them in a local cache. This saves the repeat cost of embedding the same content or prompt fragments.
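One way to do this, sketched under the assumption that embed_fn is a function like the embed helper from Part 6, is a small SQLite-backed cache keyed by a hash of the model name and text. The class name EmbeddingCache and the table layout are illustrative choices, not an Ollama feature.

```python
import hashlib
import json
import sqlite3
from typing import Callable


class EmbeddingCache:
    """Store embeddings in SQLite, keyed by a hash of model name and text."""

    def __init__(
        self,
        embed_fn: Callable[[str], list[float]],
        model: str,
        db_path: str = ":memory:",
    ) -> None:
        self._embed_fn = embed_fn
        self._model = model
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings "
            "(key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )

    def get(self, text: str) -> list[float]:
        # Include the model name so vectors from different models never mix
        key = hashlib.sha256(f"{self._model}\n{text}".encode("utf-8")).hexdigest()
        row = self._db.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])

        vector = list(self._embed_fn(text))
        self._db.execute(
            "INSERT INTO embeddings (key, vector) VALUES (?, ?)",
            (key, json.dumps(vector)),
        )
        self._db.commit()
        return vector
```

Passing a file path as db_path keeps the cache across restarts; the default in-memory database only lasts for the life of the process.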

8.3 Session-based storage

For chat applications, preserve conversation history in a local database instead of sending the entire history every request. Store only the summary or key facts when the history grows long.

Part 9: Hallucination Mitigation and Validation

Local models can still hallucinate. Build validation around the output.

9.1 Grounding with local data

Provide the model with local context and enforce source-based answers whenever possible. If the model cannot answer from the context, instruct it to say so.
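A simple way to apply this is to build the system prompt from the retrieved passages and spell out the refusal wording. The sketch below is one possible phrasing; the function name build_grounded_messages and the exact instruction text are assumptions to adapt to your model.

```python
def build_grounded_messages(question: str, passages: list[str]) -> list[dict]:
    """Build chat messages that restrict the model to the supplied passages."""
    # Number the passages so the model can cite them in its answer
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    system = (
        "Answer using only the numbered context passages below. "
        "Cite the passage numbers you relied on. "
        "If the context does not contain the answer, reply exactly: "
        "I cannot answer from the provided context.\n\n"
        f"Context:\n{numbered}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The returned list can be passed directly as the messages argument of ollama.chat. A fixed refusal sentence also makes it easy for calling code to detect the "not in context" case.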

9.2 Output filtering

Validate structured outputs with Pydantic or JSON schema. Reject responses that do not parse cleanly.
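Rejection usually pairs with a bounded retry. The sketch below uses only the standard library and a required-keys check to keep it self-contained; in practice the check could be replaced by a Pydantic model_validate_json call as in Part 3. The names generate_validated, generate_fn, and required_keys are illustrative.

```python
import json
from typing import Callable


def generate_validated(
    generate_fn: Callable[[], str],
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call generate_fn until it returns a JSON object containing required_keys."""
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = generate_fn()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value is not an object"
            continue
        missing = required_keys - data.keys()
        if missing:
            last_error = f"missing keys: {sorted(missing)}"
            continue
        return data
    # Give up loudly rather than passing an unvalidated response downstream
    raise ValueError(
        f"no valid response after {max_attempts} attempts ({last_error})"
    )
```

Here generate_fn would typically be a lambda that calls ollama.chat with format="json" and returns the message content.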

9.3 Human review and fallback

For high-risk workflows, include a human review step. If the model is uncertain, escalate to a human operator.

Part 10: Multi-Model Orchestration

Use multiple local models for different tasks.

10.1 Small models for classification and extraction

Use compact models for fast classification, extraction, or reranking. Reserve larger models for generation.

10.2 Local orchestrators

A simple orchestrator can route tasks:

classification → small 7B model
summarisation → 14B model
creative generation → 32B model
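The routing table above can be expressed as a plain dictionary lookup. In this sketch the 7B and 32B entries are placeholder tags, not real model names; substitute whatever models you have pulled. Only qwen3:14b comes from the earlier examples.

```python
# Task type -> model tag. The 7B and 32B tags are placeholders for illustration.
MODEL_ROUTES: dict[str, str] = {
    "classification": "example-7b-model",
    "summarisation": "qwen3:14b",
    "creative": "example-32b-model",
}
DEFAULT_MODEL = "qwen3:14b"


def pick_model(task_type: str) -> str:
    """Return the model tag for a task type, falling back to the default."""
    return MODEL_ROUTES.get(task_type, DEFAULT_MODEL)
```

The chosen tag is then passed as the model argument of the chat call, which keeps routing decisions in one place and out of the calling code.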

10.3 Model loading strategy

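If your server has limited VRAM, keep only one model loaded at a time. With more memory, Ollama can keep several models loaded (see the OLLAMA_MAX_LOADED_MODELS setting), or you can run multiple Ollama instances on different ports, for example 11434 for qwen3:14b and 11435 for llama4:scout.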

Part 11: Observability and Local Debugging

Track performance and errors for local inference.

11.1 Request timing

Log inference start/end times and token counts. This helps you spot slow responses.

11.2 Local inference metrics

Expose metrics for:

requests per minute
average latency
token usage
error rate
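These numbers can be collected with a thin wrapper around the chat call. The sketch below assumes chat_fn returns a dict-like response with an eval_count field, as in the Part 1 example; the class name InferenceMetrics is illustrative. Requests per minute can be derived by sampling the request count at intervals.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InferenceMetrics:
    """Accumulate latency, token, and error counts for chat calls."""

    latencies: list[float] = field(default_factory=list)
    tokens: int = 0
    errors: int = 0

    def record(self, chat_fn: Callable[..., dict], **kwargs) -> dict:
        """Call chat_fn(**kwargs), recording latency, tokens, and failures."""
        start = time.perf_counter()
        try:
            response = chat_fn(**kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            # Latency is recorded for failed calls too
            self.latencies.append(time.perf_counter() - start)
        self.tokens += response.get("eval_count", 0)
        return response

    def summary(self) -> dict:
        total = len(self.latencies)
        return {
            "requests": total,
            "avg_latency_s": sum(self.latencies) / total if total else 0.0,
            "tokens": self.tokens,
            "error_rate": self.errors / total if total else 0.0,
        }
```

The summary dictionary can be logged periodically or exposed through whatever metrics endpoint the application already has.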

11.3 Debug logs

Capture the model name, the prompt length, and the response length. Store prompt content only after it has been sanitized.

Part 12: Security and Operational Controls

A local AI stack is a local service and needs local security controls.

12.1 Network restrictions

Bind Ollama to localhost or a private interface. If you need remote access, use an SSH tunnel or VPN rather than exposing the service publicly.

12.2 Authentication

If the Ollama service is accessed from other local apps, add a reverse proxy with basic auth or token auth.

12.3 Secrets management

Keep API keys and secrets in environment variables or local secret files with strict permissions. Do not bake them into code or public repo files.

Part 13: Deployment Considerations

Deploying local Ollama apps is like deploying any other local service.

13.1 Use Docker Compose for local stacks

A small stack might include:

Ollama server container
app container
database or cache container

13.2 Systemd service for local host

If you are not using containers, run the app and Ollama as systemd services. Ensure they restart on failure and have proper logging.

13.3 Version pinning

Pin Ollama server and SDK versions in your deployment scripts. This avoids accidental breaking changes when the local inference stack is updated.

Part 14: Final Design Guidance

The best local Ollama integration is simple, observable, and auditable. Keep the same developer workflow as cloud-based code, but replace external endpoints with http://localhost:11434/v1. Build with the same patterns you already know — chat loops, async processing, streaming, structured output — and make the local service the trusted runtime instead of a remote cloud API.

Local AI should feel familiar to developers, but it should also preserve sovereignty by keeping requests, models, and logs on your infrastructure.

Part 15: Local Prompt Storage and Conversation History

When building a chat app, store user context locally, not in the model prompt history.

15.1 Session summarisation

If the chat history grows large, summarise earlier conversation segments and include only the summary in the prompt. This preserves context without exhausting the input window.

# long_history: the earlier turns joined into a single string
summary = ollama.chat(model='qwen3:14b', messages=[
    {'role':'system','content':'Summarise the following conversation in one paragraph.'},
    {'role':'user','content': long_history}
])['message']['content']
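To show where the summary fits into the message list, here is a small sketch that replaces older turns with one summary message and keeps the most recent turns verbatim. It assumes summarise_fn wraps a summarisation call like the one above; compact_history and keep_recent are illustrative names.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarise_fn: Callable[[str], str],
    keep_recent: int = 6,
) -> list[dict]:
    """Replace all but the last keep_recent messages with one summary message."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= keep_recent:
        return list(messages)

    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Flatten the older turns into a transcript for the summariser
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = summarise_fn(transcript)
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summary}",
    }
    return [summary_msg, *recent]
```

Calling this before each request keeps the prompt size roughly constant as the conversation grows.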

15.2 Privacy-first storage

Store conversation history in a local encrypted database if the content is sensitive. Use AES encryption with a locally managed key.

15.3 Pinning important facts

Keep a local facts table for verified user details or project-specific assertions. Add those facts to the prompt explicitly rather than relying on the model to remember them.

Part 16: Local Agent Patterns

Build local autonomous workflows with multiple steps.

16.1 Tool calling and action execution

Use the model to generate structured tool calls and execute them locally.

tools = [
    {
        'type': 'function',
        'function': {
            'name': 'search_docs',
            'description': 'Search local docs for a query.',
            'parameters': {
                'type': 'object',
                'properties': {
                    'query': {'type':'string'}
                },
                'required':['query']
            }
        }
    }
]

Then handle the tool call in Python and return the result to the model.
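The handling step might look like the sketch below. It assumes the assistant message carries a tool_calls list whose entries have function.name and function.arguments (a dict), and that results go back as tool-role messages; the exact field names on the tool message have varied between SDK versions, so check them against the version you run. The names run_tool_calls and registry are illustrative.

```python
from typing import Callable


def run_tool_calls(
    message: dict,
    registry: dict[str, Callable[..., object]],
) -> list[dict]:
    """Execute each tool call in an assistant message; return tool-role messages."""
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        arguments = call["function"].get("arguments") or {}
        handler = registry.get(name)
        if handler is None:
            # Report unknown tools back to the model instead of raising
            output = f"error: unknown tool {name!r}"
        else:
            output = str(handler(**arguments))
        results.append({"role": "tool", "name": name, "content": output})
    return results
```

The returned messages are appended to the conversation, followed by a second chat call so the model can use the tool output in its final answer. Only functions listed in the registry can be executed, which keeps the model from invoking arbitrary code.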

16.2 Chaining local tools

Chain tools for a multi-step local agent. For example:

Search local docs
Extract relevant facts
Generate a summary

This keeps the whole workflow on-premises.

Part 17: Model Fallbacks and Resilience

Provide fallback behaviour if the local model is unavailable.

17.1 Multi-model fallback

If the 14B model is busy or out of memory, fall back to a 7B model for quick responses. Keep the main model as the primary responder.

17.2 Degraded mode

If the model service is down, return a friendly message and offer a cached answer or a manual support channel.
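Both behaviours can live in one helper: try the models in order of preference, then fall back to a cached answer or a fixed notice. This is a sketch that assumes chat_fn has the same interface as ollama.chat; the function name, the source labels, and the notice text are illustrative.

```python
from typing import Callable, Optional


def chat_with_fallback(
    chat_fn: Callable[..., dict],
    models: list[str],
    messages: list[dict],
    cached_answer: Optional[str] = None,
) -> dict:
    """Try each model in order; fall back to a cached answer or a notice."""
    for model in models:
        try:
            response = chat_fn(model=model, messages=messages)
            return {"source": model, "content": response["message"]["content"]}
        except Exception:
            # Model busy, missing, or service unreachable: try the next one
            continue

    if cached_answer is not None:
        return {"source": "cache", "content": cached_answer}
    return {
        "source": "degraded",
        "content": "The assistant is temporarily unavailable. Please contact support.",
    }
```

Returning the source alongside the content lets the interface tell users when they are seeing a fallback or cached reply.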

Part 18: Local Access Control and Audit

Protect the local inference service from unauthorized use.

18.1 API key around localhost

Even if you expose the service only locally, still require a token. This prevents other local processes from using the service without authorization.
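Ollama itself does not check tokens, so the check belongs in your own app or reverse proxy in front of it. A minimal sketch of the token comparison, assuming clients send a standard Authorization: Bearer header; the function names are illustrative.

```python
import hmac
import secrets
from typing import Optional


def new_token() -> str:
    """Generate a random URL-safe token to hand to trusted local clients."""
    return secrets.token_urlsafe(32)


def is_authorized(authorization_header: Optional[str], expected_token: str) -> bool:
    """Check an 'Authorization: Bearer <token>' header in constant time."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking how many characters matched via timing
    return hmac.compare_digest(
        presented.strip().encode("utf-8"), expected_token.encode("utf-8")
    )
```

The expected token would normally be read from an environment variable or a permission-restricted file rather than generated on every start.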

18.2 Request logging and audit trails

Log request IDs, model names, and prompt lengths. Do not log the full prompt if it contains sensitive data, but log enough metadata for debugging.
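As an illustration, the record below keeps a request ID, sizes, and a hash of the prompt, but never the prompt text. The field names are assumptions. Note that a hash of a short or guessable prompt can still be matched by someone who tries candidates, so treat it as a correlation aid rather than a privacy guarantee.

```python
import hashlib
import json
import time
import uuid


def audit_record(model: str, messages: list[dict], response_text: str) -> str:
    """Return a JSON log line with request metadata and no prompt text."""
    prompt_text = "\n".join(m.get("content", "") for m in messages)
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "message_count": len(messages),
        "prompt_chars": len(prompt_text),
        # The hash lets you spot repeated prompts without storing them
        "prompt_sha256": hashlib.sha256(prompt_text.encode("utf-8")).hexdigest(),
        "response_chars": len(response_text),
    }
    return json.dumps(record)
```

Each line can be written to a standard logger or appended to a local audit file.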

Part 19: Packaging the Local AI App

Make the app easy to install on a host.

19.1 Python package distribution

Use pyproject.toml with entry points so the app can be installed as a package.

[project]
name = "local-ai-app"
version = "0.1.0"

[project.scripts]
run-ai = "local_ai_app.main:cli"

19.2 Docker packaging

Package the app with a local Ollama service in the same Compose stack. Keep the image small and pin the runtime.

19.3 Dev and prod parity

Use the same docker-compose.yml for both dev and prod, with environment overrides for secrets and volumes.

Part 20: Final Local AI Integration Checklist

A local Ollama integration should behave like any other production service: predictable, auditable, and maintainable. Keep the same quality standards you use for backend services, and the local AI stack becomes a trusted part of your sovereign infrastructure.

Part 21: Local API Design

When you build a local Ollama integration, design the API as if it were a public service.

21.1 Stable input/output contracts

Define stable JSON contracts for the local service. Use OpenAPI or a simple schema file to describe the expected request and response shapes.

21.2 Error handling

Return structured error details for local inference failures. For example:

{
  "error": "model_unavailable",
  "message": "Local Ollama service is not running",
  "retry_after": 120
}
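A payload like this is typically produced by one mapping function at the edge of the service. In the sketch below, the exception types and the inference_timeout and inference_failed codes are assumptions; which exception your client raises when the server is down depends on the SDK version, so adjust the mapping after checking.

```python
def to_error_payload(exc: Exception) -> dict:
    """Map an exception from the inference call to a structured error body."""
    if isinstance(exc, ConnectionError):
        return {
            "error": "model_unavailable",
            "message": "Local Ollama service is not running",
            "retry_after": 120,
        }
    if isinstance(exc, TimeoutError):
        return {
            "error": "inference_timeout",
            "message": "Local inference took too long",
            "retry_after": 30,
        }
    # Anything else: report it without promising that a retry will help
    return {
        "error": "inference_failed",
        "message": str(exc) or exc.__class__.__name__,
        "retry_after": None,
    }
```

Keeping the error codes in one function makes it easier to document them as part of the API contract.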

21.3 Version negotiation

If you support multiple local models, include a version field in the API response so clients can adapt.

Part 22: Local Agent Examples

Build local agents that use Ollama to drive small workflows.

22.1 Document search assistant

A local agent can:

embed a query
retrieve relevant docs from a local vector store
ask Ollama to summarise the top documents

This keeps the entire search and generation flow on-premises.
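The three steps can be sketched with injected functions so that the retrieval logic stays independent of any particular store. Here embed_fn stands in for an embedding call such as the embed helper from Part 6, chat_fn for a function that takes messages and returns the reply text, and a plain list replaces the vector store; all names are illustrative.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_from_docs(
    question: str,
    docs: list[str],
    embed_fn: Callable[[str], list[float]],
    chat_fn: Callable[[list[dict]], str],
    top_k: int = 2,
) -> str:
    """Embed the question, pick the top_k closest docs, and ask for a summary."""
    q_vec = embed_fn(question)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed_fn(d)), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    messages = [
        {
            "role": "system",
            "content": "Summarise the documents below to answer the question.\n\n"
            + context,
        },
        {"role": "user", "content": question},
    ]
    return chat_fn(messages)
```

For more than a handful of documents, the in-memory ranking would be replaced by a query against the vector store, with the rest of the flow unchanged.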

22.2 Local ticket triage

Use Ollama to classify incoming support tickets and assign priority. Keep the classification model local and the ticket metadata in your own database.

22.3 Automated report generation

Use Ollama to generate a local daily summary from log data or status dashboards. The generator runs entirely on your host and outputs the report to a local file.

Part 23: Model Lifecycle and Updates

Local inference models need lifecycle management.

23.1 Model retirement

When a model is replaced, keep the old model available for rollback for a short period. Do not delete it immediately.

23.2 Quality validation after updates

Whenever you update Ollama or a model version, run a small validation suite of prompts to ensure output quality has not regressed.
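Such a suite can be as small as a list of prompts paired with check functions. The sketch below assumes generate_fn takes a prompt string and returns the model's text; run_validation_suite is an illustrative name, and the checks you write will depend on what "regressed" means for your use case.

```python
from typing import Callable


def run_validation_suite(
    generate_fn: Callable[[str], str],
    cases: list[tuple[str, Callable[[str], bool]]],
) -> list[str]:
    """Return the prompts whose output failed its check (empty means all passed)."""
    failures = []
    for prompt, check in cases:
        try:
            ok = check(generate_fn(prompt))
        except Exception:
            # A crash during generation or checking counts as a failure
            ok = False
        if not ok:
            failures.append(prompt)
    return failures
```

Running this in a deployment script and refusing to switch models when the list is non-empty gives a basic regression gate.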

23.3 Dependency lifecycle

Pin Python dependencies and test them regularly. Use pip list --outdated in a controlled environment, not directly on a production host.

Part 24: Self-Hosted Trust Boundaries

A local Ollama service is a trust boundary in your environment.

24.1 Local vs remote data

Keep sensitive data inside the trust boundary. When the local service processes a prompt, assume the prompt is sensitive and protect it accordingly.

24.2 Auditing prompt usage

Log metadata about prompts without storing the full content, unless explicitly required. This gives you the ability to audit usage patterns without exposing sensitive text.

24.3 Revocation and secrets

If local keys or tokens are compromised, rotate them immediately and restart the service. Treat local secrets with the same care as remote secrets.

Part 25: Final Operational Checklist

local Ollama service has a stable API contract
request and response schemas are versioned
inference errors are returned in structured form
model updates are validated with a local test suite
agents and workflows are built as composable steps
prompts are audited for sensitive content exposure
logs capture service health without leaking secrets
local inference is treated as infrastructure, not a prototype

A self-hosted Ollama integration is successful when it feels like a local service that can be operated, monitored, and maintained by the team without relying on external cloud APIs. That is the true meaning of sovereign AI infrastructure.

Part 26: Local Performance Monitoring

Performance monitoring should be built into the application, not added later.

26.1 Latency metrics

Capture request latency, model response time, and tokenization time separately. This helps isolate whether a slowdown is in the model, the network, or the client.
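Ollama responses already carry per-phase timings in nanoseconds, which makes this separation straightforward. The sketch below assumes a dict-like response with total_duration, load_duration, prompt_eval_duration, eval_duration, and eval_count fields; timing_breakdown is an illustrative name, and missing fields are treated as zero.

```python
NS_PER_S = 1e9


def timing_breakdown(response: dict) -> dict:
    """Split a response's reported durations into per-phase seconds."""
    load_s = response.get("load_duration", 0) / NS_PER_S
    prompt_s = response.get("prompt_eval_duration", 0) / NS_PER_S
    gen_s = response.get("eval_duration", 0) / NS_PER_S
    total_s = response.get("total_duration", 0) / NS_PER_S
    tokens = response.get("eval_count", 0)
    return {
        "load_s": load_s,
        "prompt_eval_s": prompt_s,
        "generation_s": gen_s,
        # Whatever the server spent outside the three reported phases
        "other_s": max(total_s - load_s - prompt_s - gen_s, 0.0),
        "tokens_per_s": tokens / gen_s if gen_s else 0.0,
    }
```

Comparing total_duration with the wall-clock time measured in the client gives a rough estimate of network and client overhead.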

26.2 Throughput and QPS

For local services, track queries per second and concurrency. If throughput drops, investigate whether the local model is saturating the CPU/GPU or whether the service has reached file descriptor limits.

26.3 Resource usage

Log memory and CPU usage for the local Ollama process. If memory usage trends upward over time, investigate memory leaks or cache growth.

Part 27: Final System Safety Checklist

performance metrics are captured for every deployment
fallback paths are defined for low-memory conditions
API contracts are stable and documented
prompts are audited for sensitive data leakage
local secrets are rotated and stored securely
model files are checksummed and validated
developer diagnostics are available locally
monitoring alerts are configured for service health

Local Ollama integrations are production-grade when they include observability, resilience, and safety checks. Treat the local model like any critical backend service, and you can maintain trust in a fully self-hosted AI stack.

Part 28: Developer Productivity Tips

Keep your local Ollama integration easy to iterate on.

28.1 Reusable prompt library

Store reusable prompt snippets in a local library so developers can compose prompts consistently. This reduces prompt drift and improves maintainability.

28.2 Local dev tooling

Provide a small CLI or notebook that developers can use to query the local model and inspect outputs. This makes debugging much faster than guessing at hidden prompt behaviour.

