Key Takeaways
- Four lines to hello world: pip install ollama, call ollama.chat(), read response['message']['content']. No API key, no cloud.
- stream=True for real-time output: Iterate over the generator to receive tokens as they are generated, which is essential for any user-facing interface.
- format="json" for structured data: Combine with a Pydantic schema description in the system prompt to receive validated Python objects.
- OpenAI SDK compatibility: Ollama's API is OpenAI-compatible, so existing OpenAI code migrates to local inference by changing the base URL and API key.
Introduction
Direct Answer: How do I use Python to call local Ollama models without cloud APIs in 2026?
Install with pip install ollama. Ensure Ollama is running (ollama serve) with a model pulled (ollama pull qwen3:14b). Basic usage: import ollama; r = ollama.chat(model='qwen3:14b', messages=[{'role':'user','content':'Hello'}]); print(r['message']['content']). For streaming: for chunk in ollama.chat(..., stream=True): print(chunk['message']['content'], end='', flush=True). For JSON output: add format='json' and describe your schema in the system prompt, then json.loads(r['message']['content']). For async: from ollama import AsyncClient; client = AsyncClient(); r = await client.chat(...) inside an async function. The SDK runs on Ubuntu 24.04 and macOS with the same code. All inference happens locally via the running Ollama daemon on port 11434, with zero external API calls.
Setup
pip install ollama pydantic --break-system-packages
python3 -c "from importlib.metadata import version; print('Ollama SDK:', version('ollama'))"
ollama list | head -3
Expected output:
Ollama SDK: 0.4.7
NAME SIZE
qwen3:14b 9.3 GB
Part 1: Basic Chat and Multi-Turn Conversation
import ollama
# Single turn
r = ollama.chat(
    model="qwen3:14b",
    messages=[
        {"role": "system", "content": "Reply in one sentence maximum."},
        {"role": "user", "content": "What is Docker?"}
    ]
)
print(r["message"]["content"])
# eval_duration is reported in nanoseconds, hence the division by 1e9
print(f"Speed: {r['eval_count'] / (r['eval_duration']/1e9):.1f} tok/s")
Expected output:
Docker is a containerisation platform that packages applications into portable, isolated containers.
Speed: 31.8 tok/s
# Multi-turn conversation: accumulate messages manually
messages = []
def chat(user_msg: str) -> str:
    messages.append({"role": "user", "content": user_msg})
    r = ollama.chat(model="qwen3:14b", messages=messages)
    reply = r["message"]["content"]
    # Store the reply so the next turn sees the full history
    messages.append({"role": "assistant", "content": reply})
    return reply
print(chat("My server has 8GB RAM. What shared_buffers should I set in PostgreSQL?"))
print(chat("And effective_cache_size?"))
Expected output:
Set shared_buffers = 2GB (25% of 8GB RAM).
Set effective_cache_size = 6GB (75% of RAM) — this is a query planner hint, not actual allocation.
Part 2: Streaming
import ollama
def stream_response(prompt: str, model: str = "qwen3:14b") -> str:
    full = ""
    # With stream=True, chat() returns a generator of partial responses
    for chunk in ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ):
        token = chunk["message"]["content"]
        print(token, end="", flush=True)
        full += token
    print()
    return full
stream_response(
    "Write a Python function to read a CSV and return the top N rows by a column value."
)
Expected output (tokens appear in real time):
import csv
def top_n_rows(filepath: str, column: str, n: int = 10) -> list[dict]:
    """Return top N rows sorted descending by column value."""
    with open(filepath, newline='') as f:
        rows = list(csv.DictReader(f))
    return sorted(rows, key=lambda r: float(r[column]), reverse=True)[:n]
Part 3: Structured JSON Output with Pydantic
import ollama, json
from pydantic import BaseModel, Field
from typing import List, Literal
class SecurityIssue(BaseModel):
    line: int
    severity: Literal["critical", "high", "medium", "low"]
    description: str
    fix: str
class CodeReview(BaseModel):
    safe_to_deploy: bool
    score: int = Field(ge=0, le=100)
    issues: List[SecurityIssue]
    summary: str
def review_code(code: str) -> CodeReview:
    # Embed the JSON schema in the system prompt so the model knows the target shape
    schema = json.dumps(CodeReview.model_json_schema(), indent=2)
    r = ollama.chat(
        model="qwen3:14b",
        messages=[
            {"role": "system", "content": f"Review code for security. Return ONLY JSON matching:\n{schema}"},
            {"role": "user", "content": code}
        ],
        format="json"
    )
    # Raises pydantic.ValidationError if the reply does not match the schema
    return CodeReview.model_validate_json(r["message"]["content"])
result = review_code("""
def get_user(uid: str):
    return db.execute(f"SELECT * FROM users WHERE id = '{uid}'")
""")
print(f"Safe: {result.safe_to_deploy} | Score: {result.score}/100")
for issue in result.issues:
    print(f" [{issue.severity.upper()}] L{issue.line}: {issue.description}")
    print(f" Fix: {issue.fix}")
Expected output:
Safe: False | Score: 20/100
[CRITICAL] L2: SQL injection via f-string interpolation of user input
Fix: db.execute("SELECT * FROM users WHERE id = ?", (uid,))
Part 4: Async Batch Processing
import asyncio
from ollama import AsyncClient
async def classify(client: AsyncClient, text: str) -> str:
    r = await client.chat(
        model="qwen3:14b",
        messages=[
            {"role": "system", "content": "Classify sentiment: POSITIVE, NEGATIVE, or NEUTRAL. One word only."},
            {"role": "user", "content": text}
        ]
    )
    return r["message"]["content"].strip()
async def batch_classify(texts: list[str]) -> list[str]:
    # Reuse one client for every request in the batch
    client = AsyncClient()
    return await asyncio.gather(*[classify(client, t) for t in texts])
reviews = [
    "Amazing product, fast delivery!",
    "Broken on arrival, terrible support.",
    "Works as described, nothing special.",
]
sentiments = asyncio.run(batch_classify(reviews))
for rev, sent in zip(reviews, sentiments):
    print(f" [{sent:8s}] {rev}")
Expected output:
[POSITIVE] Amazing product, fast delivery!
[NEGATIVE] Broken on arrival, terrible support.
[NEUTRAL ] Works as described, nothing special.
Part 5: Vision — Analyse Images
import ollama
# Pass the image file path; Ollama handles the encoding
r = ollama.chat(
    model="llama4:scout",
    messages=[{
        "role": "user",
        "content": "What error is shown in this screenshot? Give the exact message.",
        "images": ["/tmp/error-screenshot.png"]
    }]
)
print(r["message"]["content"])
Part 6: Embeddings
import ollama, numpy as np
def embed(text: str) -> list[float]:
    return ollama.embeddings(model="nomic-embed-text:v1.5", prompt=text)["embedding"]
def cosine_sim(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
docs = [
    "Redis is an in-memory data store used for caching",
    "PostgreSQL is a relational database",
    "Docker packages apps into containers",
]
query = "What should I use for a fast cache?"
q_vec = embed(query)
# Rank documents by similarity to the query, highest first
ranked = sorted([(cosine_sim(q_vec, embed(d)), d) for d in docs], reverse=True)
for score, doc in ranked:
    print(f" {score:.3f} {doc}")
Expected output:
0.887 Redis is an in-memory data store used for caching
0.742 PostgreSQL is a relational database
0.618 Docker packages apps into containers
Part 7: OpenAI SDK Migration
# Before (OpenAI cloud):
# from openai import OpenAI
# client = OpenAI(api_key="sk-...")
# After (local Ollama), change two values:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # ignored by Ollama but required by the SDK
)
r = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(r.choices[0].message.content)
Expected output:
4
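Existing chat-completion code carries over with its parameters intact (temperature, max_tokens, system prompts); only the model name has to change to a locally pulled model such as qwen3:14b. Features outside Ollama's OpenAI-compatible endpoints may not be available. There is no per-query API cost.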
Sovereignty Verification
# Monitor outbound connections from the Python process during inference
python3 -c "
import ollama, subprocess, threading, time
found = []
def monitor():
    for _ in range(8):
        r = subprocess.run(['ss','-tnp','state','established'],
                           capture_output=True, text=True)
        for line in r.stdout.splitlines():
            if 'python' in line.lower() and '127.0.0.1' not in line:
                found.append(line)
        time.sleep(0.5)
t = threading.Thread(target=monitor, daemon=True)
t.start()
ollama.chat(model='qwen3:14b', messages=[{'role':'user','content':'ping'}])
t.join(timeout=5)
print('External connections:', found if found else 'None (fully sovereign ✓)')
"
Expected output:
External connections: None (fully sovereign ✓)
The ss tool ships with iproute2 on Linux, so run this check on the Ubuntu host. It inspects only sockets owned by the Python process; to cover the Ollama daemon as well, apply the same filter to its process name.
Troubleshooting
ConnectionError: http://localhost:11434
Ollama is not running. Fix: ollama serve & or sudo systemctl start ollama.
Model produces garbled JSON despite format="json"
Add explicit instructions: "Return ONLY a JSON object. No markdown fences, no explanation." Smaller models (7B) need more explicit format constraints than 14B+.
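Even with explicit instructions, a reply may still arrive wrapped in a markdown fence or with a sentence of preamble. The helper below is a minimal sketch of a tolerant parser; extract_json is a hypothetical name, not part of the Ollama SDK, and it assumes the reply contains a single top-level JSON object.

```python
import json
import re

# Build the three-backtick marker programmatically to keep this listing readable.
FENCE = "`" * 3
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def extract_json(reply: str) -> dict:
    """Parse one JSON object from a model reply, tolerating fences and surrounding prose."""
    match = _FENCED.search(reply)
    candidate = match.group(1) if match else reply
    # Keep only the outermost {...} span in case prose surrounds the object.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    return json.loads(candidate[start:end + 1])
```

Pass r["message"]["content"] through this function before handing the result to Pydantic; a ValueError or json.JSONDecodeError signals that the reply should be retried.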
Async tasks all waiting instead of running concurrently
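asyncio.gather() sends the requests concurrently, but the Ollama server decides how many run at once and queues the rest. The per-model limit is set by the OLLAMA_NUM_PARALLEL environment variable on the server; raising it requires enough memory for the extra context. Beyond that limit, parallelism requires multiple Ollama instances on different ports.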
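Because the server queues whatever it cannot run immediately, it can help to cap the number of requests in flight on the client side so that a large batch does not pile up behind a timeout. The following is a sketch using asyncio.Semaphore; gather_bounded and the worker parameter are illustrative names, and worker stands for any coroutine such as a wrapper around client.chat.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    items: list[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int = 2,
) -> list[str]:
    """Run worker over items with at most `limit` calls in flight; results keep input order."""
    sem = asyncio.Semaphore(limit)

    async def run_one(item: str) -> str:
        async with sem:  # wait here until a slot is free
            return await worker(item)

    return await asyncio.gather(*(run_one(i) for i in items))
```

A reasonable starting point is to set limit to the number of parallel requests the server is configured to handle.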
Conclusion
The Ollama Python SDK makes local AI integration look much like cloud API integration: same patterns, same code structure, no per-query cost, and full sovereignty. The OpenAI compatibility layer means existing code migrates by changing the base URL and API key.
See LangChain and LangGraph with Ollama for orchestrating these calls into multi-step agent workflows, and Prompt Engineering Guide 2026 for techniques that maximise output quality from these local models.
People Also Ask
Is the Ollama Python SDK the same as the OpenAI Python SDK?
They are different libraries but share a similar API design. The ollama SDK (pip install ollama) has Ollama-specific features like the images parameter for vision models and exposes Ollama-specific response fields (eval_count, eval_duration). The openai SDK (pip install openai) works with Ollama when base_url is pointed at localhost:11434/v1 — use this approach when migrating existing OpenAI code. For new projects, the native ollama SDK is recommended for access to all Ollama features.
Can I run multiple models concurrently with Ollama?
Ollama queues requests that exceed what a loaded model can serve in parallel. To isolate two models completely, start a second Ollama instance on another port, for example OLLAMA_HOST=127.0.0.1:11435 ollama serve, then instantiate separate AsyncClient objects pointing to each port. Alternatively, Ollama 0.5+ can load multiple models simultaneously if you have sufficient VRAM; it manages model swapping automatically based on available memory.
Part 8: Local Caching and Inference Efficiency
Make local Ollama apps responsive by caching model responses and embeddings.
8.1 Response caching
For repeated queries or common prompts, cache the full model output locally in Redis or a filesystem cache. Use a simple hash of the prompt and conversation context as the cache key.
import hashlib, json
from pathlib import Path
cache_dir = Path('/var/cache/ollama')
cache_dir.mkdir(parents=True, exist_ok=True)
def cache_key(prompt: str, messages: list[dict]) -> str:
    # sort_keys makes the serialisation, and therefore the hash, deterministic
    payload = json.dumps({'prompt': prompt, 'messages': messages}, sort_keys=True)
    return hashlib.sha256(payload.encode('utf-8')).hexdigest()
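One way to put such a key to work is a small read-through cache on disk. This is a sketch under stated assumptions: cached_reply and the generate callable are hypothetical names, generate stands in for a wrapper around ollama.chat that returns the reply text, and the key here also includes the model name so that replies from different models do not collide.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_reply(
    model: str,
    messages: list[dict],
    generate: Callable[[str, list[dict]], str],
    cache_dir: Path,
) -> str:
    """Return the cached reply for (model, messages), calling generate only on a miss."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["reply"]
    reply = generate(model, messages)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"reply": reply}), encoding="utf-8")
    return reply
```

Caching like this only suits prompts where a repeated answer is acceptable; anything sampled for variety should bypass it.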
8.2 Embedding caching
Compute embeddings once per text chunk and store them in a local cache. This saves the repeat cost of embedding the same content or prompt fragments.
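A minimal in-memory version of this idea might look like the sketch below. EmbeddingCache is a hypothetical class; embed_fn is whatever function produces the vector, for example the embed helper from Part 6. A persistent store would replace the dictionary in a real deployment.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Memoise embeddings by content hash so each distinct text is embedded once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times the underlying embed function was called

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```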
8.3 Session-based storage
For chat applications, preserve conversation history in a local database instead of sending the entire history every request. Store only the summary or key facts when the history grows long.
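As a rough illustration, conversation turns can be kept in SQLite and only the most recent ones loaded into the prompt. HistoryStore, its table layout, and the default limit of 20 are assumptions made for this sketch.

```python
import sqlite3


class HistoryStore:
    """Persist chat messages per session in SQLite and load only the most recent ones."""

    def __init__(self, path: str = ":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session TEXT NOT NULL, role TEXT NOT NULL, content TEXT NOT NULL)"
        )

    def append(self, session: str, role: str, content: str) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT INTO messages (session, role, content) VALUES (?, ?, ?)",
                (session, role, content),
            )

    def recent(self, session: str, limit: int = 20) -> list[dict]:
        rows = self._db.execute(
            "SELECT role, content FROM messages WHERE session = ? "
            "ORDER BY id DESC LIMIT ?",
            (session, limit),
        ).fetchall()
        # Rows arrive newest-first; reverse them into chronological order for the prompt.
        return [{"role": r, "content": c} for r, c in reversed(rows)]
```

The list returned by recent() has the same shape as the messages argument used throughout this guide, so it can be passed to ollama.chat directly.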
Part 9: Hallucination Mitigation and Validation
Local models can still hallucinate. Build validation around the output.
9.1 Grounding with local data
Provide the model with local context and enforce source-based answers whenever possible. If the model cannot answer from the context, instruct it to say so.
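A grounding prompt of this kind could be assembled as follows. The wording of the instruction and the function name grounded_messages are illustrative choices, not a prescribed format.

```python
def grounded_messages(question: str, passages: list[str]) -> list[dict]:
    """Build chat messages that restrict the model to numbered context passages."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    system = (
        "Answer using only the numbered context below and cite passage numbers. "
        "If the context does not contain the answer, reply exactly: "
        "I don't know based on the provided context.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

Asking for an exact refusal sentence makes the "cannot answer" case easy to detect in code with a plain string comparison.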
9.2 Output filtering
Validate structured outputs with Pydantic or JSON schema. Reject responses that do not parse cleanly.
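Rejection is more useful when paired with a bounded retry. The sketch below uses only the standard library and checks required keys and their types; in a project that already uses Pydantic, model_validate_json would take the place of the manual check. generate_validated and its parameters are hypothetical names.

```python
import json
from typing import Callable


def generate_validated(
    generate: Callable[[], str],
    required: dict[str, type],
    attempts: int = 3,
) -> dict:
    """Call generate until its output parses as a JSON object with the required typed keys."""
    last_error = "no attempts made"
    for _ in range(attempts):
        raw = generate()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value is not an object"
            continue
        bad = [k for k, t in required.items() if not isinstance(data.get(k), t)]
        if not bad:
            return data
        last_error = f"missing or mistyped keys: {bad}"
    raise ValueError(f"rejected after {attempts} attempts: {last_error}")
```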
9.3 Human review and fallback
For high-risk workflows, include a human review step. If the model is uncertain, escalate to a human operator.
Part 10: Multi-Model Orchestration
Use multiple local models for different tasks.
10.1 Small models for classification and extraction
Use compact models for fast classification, extraction, or reranking. Reserve larger models for generation.
10.2 Local orchestrators
A simple orchestrator can route tasks:
- classification → small 7B model
- summarisation → 14B model
- creative generation → 32B model
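The routing list above can be expressed as a lookup table. The model tags example-7b and example-32b are placeholders for whichever small and large models are installed locally; only qwen3:14b comes from this guide.

```python
DEFAULT_MODEL = "qwen3:14b"

# Task type -> model tag. The 7B and 32B tags are placeholders, not real model names.
ROUTES = {
    "classification": "example-7b",
    "summarisation": "qwen3:14b",
    "creative": "example-32b",
}


def pick_model(task: str, routes: dict[str, str] = ROUTES, default: str = DEFAULT_MODEL) -> str:
    """Return the model tag for a task type, falling back to the default model."""
    return routes.get(task.strip().lower(), default)
```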
10.3 Model loading strategy
If your server has limited VRAM, load only one model at a time or use multiple Ollama instances on different ports. For example, 11434 for qwen3:14b and 11435 for llama4:scout.
Part 11: Observability and Local Debugging
Track performance and errors for local inference.
11.1 Request timing
Log inference start/end times and token counts. This helps you spot slow responses.
11.2 Local inference metrics
Expose metrics for:
- requests per minute
- average latency
- token usage
- error rate
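The four metrics listed above can be tracked with a small in-process aggregator before reaching for a full metrics stack. InferenceMetrics is a hypothetical class; the latency and token counts would come from timing each call and from fields such as eval_count in the response.

```python
import time
from dataclasses import dataclass, field


@dataclass
class InferenceMetrics:
    """Accumulate request counts, latency, tokens, and errors for a running process."""

    started: float = field(default_factory=time.monotonic)
    requests: int = 0
    errors: int = 0
    tokens: int = 0
    total_latency: float = 0.0

    def record(self, latency_s: float, tokens: int = 0, ok: bool = True) -> None:
        self.requests += 1
        self.total_latency += latency_s
        self.tokens += tokens
        if not ok:
            self.errors += 1

    def snapshot(self) -> dict:
        # Guard against division by zero immediately after start-up.
        elapsed_min = max((time.monotonic() - self.started) / 60, 1e-9)
        n = self.requests
        return {
            "requests_per_minute": n / elapsed_min,
            "average_latency_s": self.total_latency / n if n else 0.0,
            "tokens": self.tokens,
            "error_rate": self.errors / n if n else 0.0,
        }
```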
11.3 Debug logs
Capture the prompt, the model name, and the response length. Do not store sensitive prompt content unless it is sanitized.
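One way to keep logs useful without storing prompt text is to record a short hash and the lengths instead. In this sketch, log_request and the local_ai logger name are assumptions; the hash lets identical prompts be correlated across log lines without revealing their content.

```python
import hashlib
import json
import logging

log = logging.getLogger("local_ai")


def log_request(model: str, prompt: str, response: str) -> dict:
    """Log metadata about a request without writing the prompt text itself."""
    record = {
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }
    log.info(json.dumps(record))
    return record
```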
Part 12: Security and Operational Controls
A local AI stack is a local service and needs local security controls.
12.1 Network restrictions
Bind Ollama to localhost or a private interface. If you need remote access, use an SSH tunnel or VPN rather than exposing the service publicly.
12.2 Authentication
If the Ollama service is accessed from other local apps, add a reverse proxy with basic auth or token auth.
12.3 Secrets management
Keep API keys and secrets in environment variables or local secret files with strict permissions. Do not bake them into code or public repo files.
Part 13: Deployment Considerations
Deploying local Ollama apps is like deploying any other local service.
13.1 Use Docker Compose for local stacks
A small stack might include:
- Ollama server container
- app container
- database or cache container
13.2 Systemd service for local host
If you are not using containers, run the app and Ollama as systemd services. Ensure they restart on failure and have proper logging.
13.3 Version pinning
Pin Ollama server and SDK versions in your deployment scripts. This avoids accidental breaking changes when the local inference stack is updated.
Part 14: Final Design Guidance
The best local Ollama integration is simple, observable, and auditable. Keep the same developer workflow as cloud-based code, but replace external endpoints with http://localhost:11434/v1. Build with the same patterns you already know — chat loops, async processing, streaming, structured output — and make the local service the trusted runtime instead of a remote cloud API.
Local AI should feel familiar to developers, but it should also preserve sovereignty by keeping requests, models, and logs on your infrastructure.
Part 15: Local Prompt Storage and Conversation History
When building a chat app, store user context locally, not in the model prompt history.
15.1 Session summarisation
If the chat history grows large, summarise earlier conversation segments and include only the summary in the prompt. This preserves context without exhausting the input window.
# long_history: the earlier turns joined into a single string
summary = ollama.chat(model='qwen3:14b', messages=[
    {'role':'system','content':'Summarise the following conversation in one paragraph.'},
    {'role':'user','content': long_history}
])['message']['content']
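The summarisation call can be wrapped in a compaction step that runs only once the history passes a threshold. This is a sketch: compact_history, keep_last, and max_messages are hypothetical names and defaults, and summarise stands for any function that turns a transcript into a paragraph, such as a wrapper around the call above.

```python
from typing import Callable


def compact_history(
    messages: list[dict],
    summarise: Callable[[str], str],
    keep_last: int = 4,
    max_messages: int = 12,
) -> list[dict]:
    """Replace older turns with one summary message once the history exceeds max_messages."""
    if len(messages) <= max_messages:
        return messages
    split = max(len(messages) - keep_last, 0)
    older, recent = messages[:split], messages[split:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarise(transcript),
    }
    # The most recent turns stay verbatim so the model keeps the immediate context.
    return [summary] + recent
```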
15.2 Privacy-first storage
Store conversation history in a local encrypted database if the content is sensitive. Use AES encryption with a locally managed key.
15.3 Pinning important facts
Keep a local facts table for verified user details or project-specific assertions. Add those facts to the prompt explicitly rather than relying on the model to remember them.
Part 16: Local Agent Patterns
Build local autonomous workflows with multiple steps.
16.1 Tool calling and action execution
Use the model to generate structured tool calls and execute them locally.
tools = [
    {
        'type': 'function',
        'function': {
            'name': 'search_docs',
            'description': 'Search local docs for a query.',
            'parameters': {
                'type': 'object',
                'properties': {
                    'query': {'type': 'string'}
                },
                'required': ['query']
            }
        }
    }
]
Then handle the tool call in Python and return the result to the model.
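Handling the call might look like the dispatcher below. It assumes each tool call is a mapping with a function entry holding name and arguments, and it returns messages with role tool for the next chat turn; both shapes are assumptions to check against the SDK version in use. The search_docs body is a stand-in for a real local search.

```python
import json
from typing import Callable


def search_docs(query: str) -> str:
    """Hypothetical local tool: a stand-in for a real document search."""
    return f"results for {query!r}"


REGISTRY: dict[str, Callable[..., str]] = {"search_docs": search_docs}


def run_tool_calls(tool_calls: list[dict], registry: dict = REGISTRY) -> list[dict]:
    """Execute each requested tool locally and return 'tool' messages for the next turn."""
    results = []
    for call in tool_calls:
        fn_info = call["function"]
        name, args = fn_info["name"], fn_info.get("arguments", {})
        if isinstance(args, str):  # some APIs deliver arguments as a JSON string
            args = json.loads(args)
        handler = registry.get(name)
        if handler is None:
            content = f"error: unknown tool {name}"
        else:
            try:
                content = handler(**args)
            except Exception as exc:  # report failures to the model instead of crashing
                content = f"error: {exc}"
        results.append({"role": "tool", "content": str(content)})
    return results
```

Only functions present in the registry can run, which keeps the model from invoking arbitrary code.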
16.2 Chaining local tools
Chain tools for a multi-step local agent. For example:
- Search local docs
- Extract relevant facts
- Generate a summary
This keeps the whole workflow on-premises.
Part 17: Model Fallbacks and Resilience
Provide fallback behaviour if the local model is unavailable.
17.1 Multi-model fallback
If the 14B model is busy or out of memory, fall back to a 7B model for quick responses. Keep the main model as the primary responder.
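A fallback chain can be sketched as a loop over model tags in priority order. chat_with_fallback is a hypothetical helper, and generate stands for a wrapper around ollama.chat that returns the reply text and raises on failure.

```python
from typing import Callable, Sequence


def chat_with_fallback(
    messages: list[dict],
    models: Sequence[str],
    generate: Callable[[str, list[dict]], str],
) -> tuple[str, str]:
    """Try each model in order; return (model_used, reply) from the first that succeeds."""
    errors = []
    for model in models:
        try:
            return model, generate(model, messages)
        except Exception as exc:  # e.g. connection refused or out of memory
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the reply lets the caller log when a degraded answer was served.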
17.2 Degraded mode
If the model service is down, return a friendly message and offer a cached answer or a manual support channel.
Part 18: Local Access Control and Audit
Protect the local inference service from unauthorized use.
18.1 API key around localhost
Even if you expose the service only locally, still require a token. This stops other local processes, whether malicious or merely misconfigured, from using the service.
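A token check for a local wrapper service might be sketched as follows. The environment variable name LOCAL_AI_TOKEN and the Bearer header format are assumptions made for illustration; Ollama itself does not perform this check, so it belongs in the proxy or application in front of it.

```python
import hmac
import os
from typing import Optional


def is_authorised(auth_header: Optional[str], expected_token: Optional[str] = None) -> bool:
    """Check an 'Authorization: Bearer <token>' header value against the configured token."""
    expected = expected_token or os.environ.get("LOCAL_AI_TOKEN", "")
    if not expected or not auth_header or not auth_header.startswith("Bearer "):
        return False
    supplied = auth_header[len("Bearer "):]
    # compare_digest avoids leaking how many characters matched through timing.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```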
18.2 Request logging and audit trails
Log request IDs, model names, and prompt lengths. Do not log the full prompt if it contains sensitive data, but log enough metadata for debugging.
Part 19: Packaging the Local AI App
Make the app easy to install on a host.
19.1 Python package distribution
Use pyproject.toml with entry points so the app can be installed as a package.
[project]
name = "local-ai-app"
version = "0.1.0"
[project.scripts]
run-ai = "local_ai_app.main:cli"
19.2 Docker packaging
Package the app with a local Ollama service in the same Compose stack. Keep the image small and pin the runtime.
19.3 Dev and prod parity
Use the same docker-compose.yml for both dev and prod, with environment overrides for secrets and volumes.
Part 20: Final Local AI Integration Checklist
- local Ollama service is version pinned
- model responses are cached when appropriate
- structured output is validated with Pydantic
- async workflows are implemented with AsyncClient
- prompt history is summarised for long conversations
- security and auth are applied even for localhost services
- model fallbacks exist for low-memory conditions
- logs and metrics are captured without leaking sensitive content
- deployment scripts are documented and reproducible
- local agent tools are safe and audited
A local Ollama integration should behave like any other production service: predictable, auditable, and maintainable. Keep the same quality standards you use for backend services, and the local AI stack becomes a trusted part of your sovereign infrastructure.
Part 21: Local API Design
When you build a local Ollama integration, design the API as if it were a public service.
21.1 Stable input/output contracts
Define stable JSON contracts for the local service. Use OpenAPI or a simple schema file to describe the expected request and response shapes.
21.2 Error handling
Return structured error details for local inference failures. For example:
{
    "error": "model_unavailable",
    "message": "Local Ollama service is not running",
    "retry_after": 120
}
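Such a payload can be produced from a small dataclass so that every failure path returns the same shape. InferenceError and error_from_exception are hypothetical names, and mapping the built-in ConnectionError to model_unavailable is an assumption about how the failure surfaces in the application.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class InferenceError:
    """Structured error body for a local inference API."""

    error: str
    message: str
    retry_after: Optional[int] = None  # seconds; omitted from the JSON when unknown

    def to_json(self) -> str:
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(body)


def error_from_exception(exc: Exception) -> InferenceError:
    """Map an exception to a stable error code that clients can branch on."""
    if isinstance(exc, ConnectionError):
        return InferenceError(
            "model_unavailable", "Local Ollama service is not running", retry_after=120
        )
    return InferenceError("inference_failed", str(exc))
```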
21.3 Version negotiation
If you support multiple local models, include a version field in the API response so clients can adapt.
Part 22: Local Agent Examples
Build local agents that use Ollama to drive small workflows.
22.1 Document search assistant
A local agent can:
- embed a query
- retrieve relevant docs from a local vector store
- ask Ollama to summarise the top documents
This keeps the entire search and generation flow on-premises.
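The three steps above can be sketched in a few lines. Here an in-memory list stands in for the vector store, and embed and generate are injected callables (for example, wrappers around the Ollama embedding and chat calls); answer_from_docs and the prompt wording are illustrative.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_from_docs(
    query: str,
    docs: list[str],
    embed: Callable[[str], list[float]],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Embed the query, keep the top_k most similar docs, and ask the model to summarise them."""
    q_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:top_k]
    prompt = (
        "Summarise these documents for the query.\n"
        f"Query: {query}\n" + "\n".join(f"- {d}" for d in ranked)
    )
    return generate(prompt)
```

In practice the document vectors would be computed once and stored, not recomputed on every query as this sketch does.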
22.2 Local ticket triage
Use Ollama to classify incoming support tickets and assign priority. Keep the classification model local and the ticket metadata in your own database.
22.3 Automated report generation
Use Ollama to generate a local daily summary from log data or status dashboards. The generator runs entirely on your host and outputs the report to a local file.
Part 23: Model Lifecycle and Updates
Local inference models need lifecycle management.
23.1 Model retirement
When a model is replaced, keep the old model available for rollback for a short period. Do not delete it immediately.
23.2 Quality validation after updates
Whenever you update Ollama or a model version, run a small validation suite of prompts to ensure output quality has not regressed.
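A validation suite can start as a list of prompts paired with substrings the reply must contain. The two cases below are placeholders to be replaced with prompts that matter for the application; run_suite and generate are hypothetical names, with generate wrapping the call to the updated model.

```python
from typing import Callable

# Placeholder checks: each prompt is paired with substrings the reply must contain.
CASES = [
    ("What is 2 + 2? Reply with the number only.", ["4"]),
    ("Which port does Ollama listen on by default?", ["11434"]),
]


def run_suite(generate: Callable[[str], str], cases: list = CASES) -> list[str]:
    """Run every case and return a description of each failure; an empty list means pass."""
    failures = []
    for prompt, must_contain in cases:
        reply = generate(prompt)
        missing = [s for s in must_contain if s not in reply]
        if missing:
            failures.append(f"{prompt!r}: missing {missing}")
    return failures
```

Substring checks are crude but deterministic, which makes them suitable as a gate in a deployment script.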
23.3 Dependency lifecycle
Pin Python dependencies and test them regularly. Use pip list --outdated in a controlled environment, not directly on a production host.
Part 24: Self-Hosted Trust Boundaries
A local Ollama service is a trust boundary in your environment.
24.1 Local vs remote data
Keep sensitive data inside the trust boundary. When the local service processes a prompt, assume the prompt is sensitive and protect it accordingly.
24.2 Auditing prompt usage
Log metadata about prompts without storing the full content, unless explicitly required. This gives you the ability to audit usage patterns without exposing sensitive text.
24.3 Revocation and secrets
If local keys or tokens are compromised, rotate them immediately and restart the service. Treat local secrets with the same care as remote secrets.
Part 25: Final Operational Checklist
- local Ollama service has a stable API contract
- request and response schemas are versioned
- inference errors are returned in structured form
- model updates are validated with a local test suite
- agents and workflows are built as composable steps
- prompts are audited for sensitive content exposure
- logs capture service health without leaking secrets
- local inference is treated as infrastructure, not a prototype
A self-hosted Ollama integration is successful when it feels like a local service that can be operated, monitored, and maintained by the team without relying on external cloud APIs. That is the true meaning of sovereign AI infrastructure.
Part 26: Local Performance Monitoring
Performance monitoring should be built into the application, not added later.
26.1 Latency metrics
Capture request latency, model response time, and tokenization time separately. This helps isolate whether a slowdown is in the model, the network, or the client.
26.2 Throughput and QPS
For local services, track queries per second and concurrency. If throughput drops, investigate whether the local model is saturating the CPU/GPU or whether the service has reached file descriptor limits.
26.3 Resource usage
Log memory and CPU usage for the local Ollama process. If memory usage trends upward over time, investigate memory leaks or cache growth.
Part 27: Final System Safety Checklist
- performance metrics are captured for every deployment
- fallback paths are defined for low-memory conditions
- API contracts are stable and documented
- prompts are audited for sensitive data leakage
- local secrets are rotated and stored securely
- model files are checksummed and validated
- developer diagnostics are available locally
- monitoring alerts are configured for service health
Local Ollama integrations are production-grade when they include observability, resilience, and safety checks. Treat the local model like any critical backend service, and you can maintain trust in a fully self-hosted AI stack.
Part 28: Developer Productivity Tips
Keep your local Ollama integration easy to iterate on.
28.1 Reusable prompt library
Store reusable prompt snippets in a local library so developers can compose prompts consistently. This reduces prompt drift and improves maintainability.
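A prompt library does not need more than the standard library to begin with. The sketch below uses string.Template; the prompt names and the render helper are illustrative, and the sentiment prompt reuses the wording from Part 4.

```python
from string import Template

# Named prompt templates; $placeholders are filled in at render time.
PROMPTS = {
    "classify_sentiment": Template(
        "Classify sentiment: POSITIVE, NEGATIVE, or NEUTRAL. One word only.\nText: $text"
    ),
    "summarise": Template("Summarise the following in $sentences sentences:\n$text"),
}


def render(name: str, **fields: str) -> str:
    """Render a named prompt; raises KeyError for an unknown name or a missing field."""
    return PROMPTS[name].substitute(**fields)
```

Because substitute raises on a missing field, a prompt that silently drops a variable fails early instead of reaching the model half-filled.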
28.2 Local dev tooling
Provide a small CLI or notebook that developers can use to query the local model and inspect outputs. This makes debugging much faster than guessing at hidden prompt behaviour.
Further Reading
- How to Install Ollama and Run LLMs Locally — install and configure Ollama before using this SDK
- LangChain and LangGraph with Ollama — build agent pipelines on top of this SDK
- Prompt Engineering Guide 2026 — maximise output quality from local models
- Best Local LLM Models for Coding in 2026 — choose the right model for your use case
Tested on: Ubuntu 24.04 LTS (RTX 4090), macOS Sequoia 15.4 (M3 Max 64GB). Ollama SDK 0.4.7, Ollama 0.5.12. Last verified: April 28, 2026.