Key Takeaways
- One SDK, any backend: The OpenAI Python SDK works with Ollama, vLLM, and llama.cpp; change only base_url and api_key.
- LiteLLM for multi-provider routing: A single endpoint routes to Ollama, vLLM, and cloud fallbacks, with logging and cost tracking.
- Streaming for UX: Always use stream=True for interactive chat so users see tokens as they are generated.
- Function calling is sovereign: Tool use works with local Llama 4 Scout and Qwen3 14B; no cloud needed.
Introduction
Direct Answer: How do I use the OpenAI Python SDK with local Ollama models in 2026?
Install: pip install openai. Start Ollama with a model: ollama pull qwen3:14b. Connect: from openai import OpenAI; client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"). Call as normal: response = client.chat.completions.create(model="qwen3:14b", messages=[{"role":"user","content":"Hello"}]). For streaming: add stream=True and iterate chunks. For function calling: pass a tools list and handle tool_calls in the response. Only base_url, api_key, and the model name differ from the production OpenAI configuration; the rest of your application is unchanged.
Part 1: Basic OpenAI SDK with Ollama
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama"
)
response = client.chat.completions.create(
model="qwen3:14b",
messages=[
{"role": "system", "content": "You are a concise technical assistant."},
{"role": "user", "content": "What is a Docker volume in one sentence?"}
],
temperature=0,
max_tokens=100
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} prompt + {response.usage.completion_tokens} completion")
Expected output:
A Docker volume is persistent storage managed by Docker that exists independently of containers, surviving restarts and deletions.
Tokens: 42 prompt + 28 completion
Part 2: Streaming
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
def stream_chat(prompt: str, model: str = "qwen3:14b") -> str:
    full = ""
    with client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ) as stream:
        for chunk in stream:
            if not chunk.choices:  # skip chunks that carry no choices
                continue
            delta = chunk.choices[0].delta.content
            if delta:
                print(delta, end="", flush=True)
                full += delta
    print()
    return full

stream_chat("Write a Python context manager that times a code block.")
Expected output (tokens appear in real time):
import time
from contextlib import contextmanager
@contextmanager
def timer(label: str = "Elapsed"):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.4f}s")
with timer("My operation"):
    time.sleep(0.5)
# My operation: 0.5003s
Part 3: Function Calling
import json
from openai import OpenAI
import subprocess
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
tools = [{
"type": "function",
"function": {
"name": "get_disk_usage",
"description": "Get disk usage for a directory",
"parameters": {
"type": "object",
"properties": {
"path": {"type": "string", "description": "Directory path to check"}
},
"required": ["path"]
}
}
}]
def execute_tool(name: str, args: dict) -> str:
    if name == "get_disk_usage":
        r = subprocess.run(["du", "-sh", args["path"]], capture_output=True, text=True)
        return r.stdout.strip() or r.stderr.strip()
    return f"Unknown: {name}"

def run_agent(user_msg: str) -> str:
    messages = [{"role": "user", "content": user_msg}]
    while True:
        response = client.chat.completions.create(
            model="llama4:scout",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = execute_tool(tc.function.name, args)
            print(f" [Tool] {tc.function.name}({args}) -> {result[:60]}")
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

answer = run_agent("How much space is /var/log using?")
print(f"\nAnswer: {answer}")
Expected output:
[Tool] get_disk_usage({'path': '/var/log'}) -> 1.2G /var/log
Answer: /var/log is using 1.2GB of disk space.
Part 4: LiteLLM Proxy for Multi-Model Routing
pip install 'litellm[proxy]' --break-system-packages
# Start proxy — routes to local Ollama
litellm --model ollama/qwen3:14b --port 4000 &
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"ollama/qwen3:14b","messages":[{"role":"user","content":"ping"}]}' \
| python3 -c "import json,sys; print(json.load(sys.stdin)['choices'][0]['message']['content'])"
Expected output (the exact wording depends on the model), for example: pong
# litellm_config.yaml — production multi-model routing
model_list:
  - model_name: fast
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434
  - model_name: smart
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434
  - model_name: vision
    litellm_params:
      model: ollama/llama4:scout
      api_base: http://localhost:11434
general_settings:
  master_key: sk-your-proxy-key
  request_timeout: 300
litellm --config litellm_config.yaml --port 4000
# All models now available via single endpoint
curl http://localhost:4000/v1/models \
-H "Authorization: Bearer sk-your-proxy-key"
Part 5: Migration Pattern — OpenAI to Sovereign
# BEFORE — OpenAI cloud
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(model="gpt-4o", messages=[...])
# AFTER — Sovereign (change 3 values, nothing else)
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1", # Add
api_key="ollama" # Change
)
response = client.chat.completions.create(
model="qwen3:14b", # Change model name
messages=[...] # Unchanged
)
# All other code: unchanged
Troubleshooting
Connection refused to Ollama
Ollama is not running. Fix: ollama serve or sudo systemctl start ollama.
Model name not found
Ollama model names are case-sensitive and include the tag: qwen3:14b not Qwen3-14B. Check with ollama list.
LiteLLM Bad Gateway
The backend (Ollama/vLLM) isn’t running or the api_base is wrong in the config. Test the backend directly with curl before routing through LiteLLM.
Conclusion
The OpenAI SDK is the universal client for sovereign LLM APIs in 2026 — change base_url to reach Ollama, vLLM, or llama.cpp with zero other code changes. LiteLLM adds multi-model routing, fallbacks, and logging. The same function calling and streaming patterns work locally as with the cloud API.
See Self-Host an LLM API Server 2026 for setting up the server this SDK connects to, and AI Agent Design Patterns 2026 for building agents with function calling.
People Also Ask
Is there a performance difference between the Ollama SDK and OpenAI SDK pointing at Ollama?
Negligible: both send HTTP/JSON requests to the same local Ollama server. The native ollama Python SDK gives direct access to Ollama-specific features such as the images parameter for vision models. The OpenAI SDK is better for portability, since the same code works against OpenAI, Anthropic (via LiteLLM), and Ollama. For new sovereign-only code, use the native ollama SDK; for code that might need to switch providers, use the OpenAI SDK.
Can I use LiteLLM to fall back to OpenAI if local GPU is busy?
Yes. LiteLLM supports fallback routing: add the cloud model to model_list, then map the local model name to it under litellm_settings, for example fallbacks: [{"smart": ["gpt-4o"]}]. If the primary local model fails or times out, requests automatically route to the fallback. This is useful for graceful degradation in production, but be aware that fallback requests leave your machine and incur OpenAI costs.
Part 2: Installing and Running Ollama Locally
The easiest sovereign path is to run Ollama on your local Ubuntu host. Install with the official installer or package manager and pull the model you need.
2.1 Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama --version
2.2 Pull a model
ollama pull qwen3:14b
2.3 Start the Ollama server
# Listens on 127.0.0.1:11434 by default; set OLLAMA_HOST to change the address or port
ollama serve
Verify the local endpoint:
curl http://localhost:11434/v1/models
If you are using a GPU, make sure the model fits in its memory and that the driver stack is configured correctly. For CPU-only environments, llama.cpp with a small quantized model is usually the more practical choice; vLLM is built primarily for GPU serving.
Part 3: LiteLLM Multi-Provider Proxy
LiteLLM lets you run a unified OpenAI-compatible API on top of multiple backends. This is ideal when you want the same client code to work with Ollama, vLLM, and hosted cloud providers.
3.1 Install LiteLLM
pip install 'litellm[proxy]' --break-system-packages
3.2 Start the proxy for Ollama
litellm --model ollama/qwen3:14b --port 4000
3.3 Use the same OpenAI SDK code
from openai import OpenAI
client = OpenAI(base_url='http://localhost:4000/v1', api_key='secret')
LiteLLM can also route to a mix of local and cloud backends, which is useful for hybrid sovereignty: keep sensitive requests local while routing low-risk requests to a trusted fallback.
Part 4: Streaming and Real-Time Output
Streaming improves user experience in chat apps and reduces wait times for long responses.
with client.chat.completions.create(
    model='qwen3:14b',
    messages=[{'role': 'user', 'content': 'Explain local LLM routing.'}],
    stream=True,
) as stream:
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end='', flush=True)
Handle partial chunks gracefully in the UI. If the stream ends unexpectedly, retry once and log the failure locally.
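The retry-once advice can be sketched as below. This is a minimal illustration under stated assumptions, not code from the original guide: stream_with_retry and open_stream are hypothetical names, and open_stream is assumed to be a zero-argument callable that starts a fresh streaming request (for example, a lambda wrapping client.chat.completions.create(..., stream=True)).

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("local_llm")


def stream_with_retry(open_stream: Callable[[], Iterable], retries: int = 1) -> str:
    """Collect streamed text, reopening the stream up to `retries` times on failure.

    `open_stream` is a hypothetical zero-argument callable that starts a new
    streaming request and returns an iterable of chunks.
    """
    for attempt in range(retries + 1):
        parts = []  # partial text from a failed attempt is discarded
        try:
            for chunk in open_stream():
                if not chunk.choices:  # skip chunks that carry no choices
                    continue
                delta = chunk.choices[0].delta.content
                if delta:
                    parts.append(delta)
            return "".join(parts)
        except Exception as exc:  # assumption: any mid-stream error is retryable
            logger.warning("stream failed on attempt %d: %s", attempt + 1, exc)
            if attempt == retries:
                raise
    return ""  # not reached; the loop either returns or raises
```

The sketch buffers text instead of printing it, because a retry restarts the answer from the beginning; a UI that has already shown partial tokens would need to clear them before the second attempt.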
Part 5: Function Calling with Local Models
Function calling lets the model request external tools while keeping the overall flow local. This is one of the strongest sovereign patterns because the model does not need external APIs to perform useful tasks.
5.1 Define tools as JSON schemas
tools=[{
'type': 'function',
'function': {
'name': 'lookup_user',
'description': 'Get details for a local user by username',
'parameters': {
'type': 'object',
'properties': {
'username': {'type': 'string', 'description': 'The login name of the user'}
},
'required': ['username'],
},
},
}]
5.2 Execute tool calls locally
from openai import OpenAI
import json
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
model='llama4:scout',
messages=[{'role': 'user', 'content': 'Look up user alice in the local directory.'}],
tools=tools,
)
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    # lookup_user is your own local function; it is not defined in this snippet
    result = lookup_user(args['username'])
    print(result)
This keeps the knowledge boundary local and ensures that tool execution is auditable.
Part 6: Token Counting and Cost Awareness
Even when the model is local, token efficiency matters for latency and memory. The SDK reports the token usage of each request in response.usage.
from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
model='qwen3:14b',
messages=[{'role': 'user', 'content': 'Summarize the server health checklist.'}],
max_tokens=150,
)
print(response.usage)
Track prompt and completion tokens when you are doing local capacity planning. For on-prem inference, token usage affects GPU memory and throughput just as it affects cloud cost.
Part 7: Multi-Model Routing and Fallbacks
A local proxy can route requests to different models based on intent, size, or sensitivity.
Create a lightweight routing table in LiteLLM or your own proxy:
- small, fast models for short prompts
- larger high-quality models for long-form generation
- a specific model for code tasks
- a privacy-preserving local model for confidential queries
This keeps responses in the right performance and privacy tier without changing the application code.
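One way to express such a routing table in application code is sketched below. The aliases are assumptions for illustration: "fast" and "smart" mirror the model_name entries used earlier in this guide, while "code" and "private" are hypothetical names you would have to define in your own proxy config. The 2000-character threshold is likewise arbitrary.

```python
# Hypothetical aliases: each value must match a model_name in your proxy config.
ROUTES = {
    "short": "fast",            # small, fast model for short prompts
    "long_form": "smart",       # larger model for long-form generation
    "code": "code",             # assumed alias for a code-capable model
    "confidential": "private",  # assumed alias for a local-only model
}


def pick_model(prompt: str, *, is_code: bool = False,
               confidential: bool = False, long_threshold: int = 2000) -> str:
    """Return the proxy alias for a request; sensitivity outranks every other rule."""
    if confidential:
        return ROUTES["confidential"]
    if is_code:
        return ROUTES["code"]
    if len(prompt) > long_threshold:
        return ROUTES["long_form"]
    return ROUTES["short"]
```

The application then passes model=pick_model(prompt) to client.chat.completions.create, so routing policy lives in one place and the call sites stay unchanged.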
Part 8: Secure Local API Configuration
Use environment variables and local secrets to avoid hard-coding endpoints.
export OPENAI_BASE_URL='http://localhost:11434/v1'
export OPENAI_API_KEY='ollama'
In Python:
import os
from openai import OpenAI
client = OpenAI(base_url=os.environ['OPENAI_BASE_URL'], api_key=os.environ['OPENAI_API_KEY'])
Use a .env file with restrictive permissions (for example, chmod 600) and exclude it from Git. This keeps the local endpoint and key under your control.
Part 9: Model Deployment and Scalability
For production local inference, run Ollama or vLLM on a dedicated inference server. Use systemd or Docker to supervise the process and restart it on failure.
Example systemd unit:
[Unit]
Description=Ollama inference service
After=network.target
[Service]
Type=simple
Environment="OLLAMA_HOST=127.0.0.1:11434"
ExecStart=/usr/local/bin/ollama serve
Restart=on-failure
User=ollama
[Install]
WantedBy=multi-user.target
This gives you a reliable local endpoint that can be monitored and restarted automatically.
Part 10: Troubleshooting Local SDK Compatibility
If the OpenAI SDK fails with a local backend, troubleshoot:
- verify base_url ends with /v1
- ensure api_key is accepted by the local server
- check the local server logs for decoding or schema errors
- confirm the model name is available with GET /v1/models
Common errors include missing Content-Type: application/json or using a path that the proxy does not support.
Part 11: Logging, Audit, and Sovereign Traceability
Log every inference request and response metadata locally. Do not log full user prompts if they contain sensitive data; instead log request IDs, model names, and token counts.
A log record might include:
- timestamp
- model
- prompt length
- completion length
- user ID (if applicable)
- tool usage
- latency
This local audit trail is part of your sovereign AI infrastructure.
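A metadata-only audit record along these lines could be produced by a thin wrapper around the chat call. This is a sketch under assumptions: build_log_record and logged_chat are hypothetical helpers, the field names are illustrative, token counts stand in for prompt and completion length, and the client is any object exposing chat.completions.create like the OpenAI SDK client.

```python
import json
import time
from datetime import datetime, timezone


def build_log_record(model, usage, latency_ms, user_id=None, tools_used=()):
    """Build a metadata-only record; prompt and completion text are left out."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        # getattr tolerates servers that return no usage object at all
        "prompt_tokens": getattr(usage, "prompt_tokens", None),
        "completion_tokens": getattr(usage, "completion_tokens", None),
        "user_id": user_id,
        "tools": list(tools_used),
        "latency_ms": round(latency_ms),
    }


def logged_chat(client, log_path, *, user_id=None, **kwargs):
    """Call the chat endpoint and append one JSON line of metadata to log_path."""
    start = time.perf_counter()
    response = client.chat.completions.create(**kwargs)
    latency_ms = (time.perf_counter() - start) * 1000
    tool_calls = response.choices[0].message.tool_calls or []
    record = build_log_record(
        kwargs["model"], response.usage, latency_ms, user_id,
        [tc.function.name for tc in tool_calls],
    )
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return response
```

Writing one JSON object per line keeps the file easy to tail and to parse later, and leaving the message text out of the record follows the advice above about sensitive prompts.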
Part 12: Offline and Air-Gapped Operation
For a truly air-gapped deployment, ensure your model files and SDK dependencies are all available locally. Use a private package mirror or a local wheelhouse for Python packages.
If you need to update models, transfer them via secure media and verify checksums before importing. This preserves the sovereignty of the environment.
Part 13: Local Security Hardening for LLM APIs
Protect the local API with a firewall and host-based access controls. Only expose the local endpoint to trusted clients.
If using a proxy, bind it to localhost or a private network interface:
litellm --model ollama/qwen3:14b --host 127.0.0.1 --port 4000
Then use SSH tunnels or local NAT rules to grant access to approved hosts.
Part 14: Integration Patterns for Local AI Apps
Use the same OpenAI-compatible client code for:
- chatbots
- summarization services
- extraction tools
- code assistants
Switch the backend by changing base_url and api_key only. This makes your app portable and sovereign across deployments.
Part 15: Local Model Maintenance and Updates
Keep track of model versions and update them intentionally. Record the model digest or version in your metadata:
ollama list
ollama show qwen3:14b
When you upgrade a model, run validation prompts and compare responses to ensure the new model still meets your requirements.
Part 17: Token Cost and Throughput for Local Models
Even though a local model incurs no cloud bill, token usage still affects performance and memory. Smaller prompts and concise instructions reduce latency.
17.1 Token estimation
Read response.usage after a call to see the actual token counts. To estimate usage before inference, run the prompt through a tokenizer library that matches the model.
from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
model='qwen3:14b',
messages=[{'role': 'user', 'content': 'What is a Docker volume?'}],
max_tokens=100
)
print(response.usage)
17.2 Prompt optimization
Prefer shorter prompts with explicit instructions. Use system messages to set behavior and avoid repeating context in each request.
Part 18: Multi-Model Rerouting Policies
Use LiteLLM or a custom router to send prompts to different models based on intent.
Example policy:
- classification tasks -> small, fast model
- summarization -> medium model
- coding -> specialized code-capable model
- confidential queries -> local privacy-first model
This improves resource utilization while preserving local sovereignty.
Part 19: Tool Security and Operation Boundaries
When the LLM calls local tools, clearly define the boundary between natural language and executable actions. Keep tool definitions minimal and audit the tool output.
Example tool safety note:
- accept only string input parameters
- sanitize paths and filenames
- do not expose secrets through tool output
A safe tool interface is one of the most critical parts of sovereign function calling.
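The path-sanitizing rule could look like the sketch below, applied to a tool argument before it reaches a subprocess. The allow-list and the safe_path helper are hypothetical and chosen only for illustration; adjust the roots to the directories your tools are meant to see.

```python
from pathlib import Path

# Hypothetical allow-list; resolved once so symlinked roots compare correctly.
ALLOWED_ROOTS = tuple(Path(p).resolve() for p in ("/var/log", "/srv/data"))


def safe_path(raw, roots=ALLOWED_ROOTS) -> Path:
    """Validate a model-supplied path and return it in resolved, absolute form."""
    if not isinstance(raw, str) or not raw or "\x00" in raw:
        raise ValueError("path must be a non-empty string without NUL bytes")
    resolved = Path(raw).resolve()  # collapses '..' and follows symlinks
    for root in roots:
        if resolved == root or root in resolved.parents:
            return resolved
    raise ValueError(f"path is outside the allowed roots: {resolved}")
```

Because the returned path is absolute, it always starts with a slash and cannot be mistaken for a command-line option when passed to a tool such as du.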
Part 20: Model Versioning and Metadata
Track the exact model version in your metadata and logs.
print(client.models.retrieve('qwen3:14b'))
Store the model identifier, server version, and local runtime details in the inference logs. This makes it possible to trace a generated response back to the exact model instance.
Part 21: Logging and Observability
Log inference requests, errors, and response times locally.
A simple JSON log entry:
{
"timestamp": "2026-05-22T12:00:00Z",
"model": "qwen3:14b",
"tokens": 143,
"latency_ms": 360,
"user_id": "alice"
}
Do not log sensitive prompt content unless you have consent and secure storage.
Part 22: Local Cache and Repeatable Responses
For repeated queries, use a local prompt cache. If the same query has already been answered, return the cached response instead of calling the model again.
This reduces load and provides deterministic behavior for repeated lookups.
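A minimal in-memory version of such a cache is sketched here. PromptCache is a hypothetical helper, not part of any SDK; it assumes messages are plain JSON-serializable dicts and that call is a function you supply which queries the model and returns its text.

```python
import hashlib
import json


class PromptCache:
    """In-memory cache keyed by a hash of the model name and the full message list."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, messages):
        # sort_keys makes the key independent of dict key order
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        """Return the cached answer, or invoke call(model, messages) and store the result."""
        key = self._key(model, messages)
        if key not in self._store:
            self._store[key] = call(model, messages)
        return self._store[key]
```

Including the model name in the key avoids serving an answer from one model for a request aimed at another. A production cache would also need an eviction policy and, most likely, persistent storage.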
Part 23: AI Safety and Guardrails
Implement guardrails such as prompt templates, model instructions, and output validation. For local AI, the guardrails should be part of your codebase, not an external service.
Example validation:
output = response.choices[0].message.content
if len(output) > 2000:
    raise ValueError('Response too long')
If the model is asked to generate code, validate that the output only contains allowed constructs.
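For generated Python, one possible check parses the output with the standard ast module and rejects constructs you have decided not to allow. The deny-list below is an illustrative assumption, and check_generated_code is a hypothetical helper; a static check like this is a first filter, not a sandbox.

```python
import ast

BANNED_CALLS = {"eval", "exec", "open", "compile", "__import__"}  # illustrative deny-list


def check_generated_code(source: str) -> list:
    """Return a list of policy violations; an empty list means the code passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append(f"line {node.lineno}: imports are not allowed")
        elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
              and node.func.id in BANNED_CALLS):
            problems.append(f"line {node.lineno}: call to {node.func.id}() is not allowed")
    return problems
```

Returning every violation, rather than raising on the first one, lets the caller log them all or feed them back to the model for another attempt.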
Part 24: Offline Dependency Management
When deploying local LLM services, ensure all Python dependencies are available without external network access.
Create a local wheelhouse:
pip download -r requirements.txt -d /tmp/wheelhouse
Then copy the wheelhouse to the air-gapped host and install from it with pip install --no-index --find-links /tmp/wheelhouse -r requirements.txt.
Part 25: Local API Authentication and Access Control
Add simple token-based authentication to the local OpenAI-compatible API.
If you are using LiteLLM or Ollama as a service, protect the endpoint with a reverse proxy and local API tokens. Store tokens in an environment file with restrictive permissions.
Part 26: Metrics for Local Inference
Collect core metrics:
- requests per minute
- average latency
- token consumption
- error rate
- tool usage rate
These metrics help you scale the local model server and detect regressions.
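These numbers can be derived from per-request log records. The sketch below assumes each record is a dict with a latency_ms value plus optional tokens, error, and tools fields; those field names, and the summarize helper itself, are illustrative rather than a fixed schema.

```python
from statistics import mean


def summarize(records, window_minutes):
    """Aggregate per-request records collected over a window of `window_minutes`."""
    total = len(records)
    if total == 0:
        return {"requests_per_minute": 0.0, "avg_latency_ms": 0.0,
                "total_tokens": 0, "error_rate": 0.0, "tool_usage_rate": 0.0}
    return {
        "requests_per_minute": total / window_minutes,
        "avg_latency_ms": mean(r["latency_ms"] for r in records),
        "total_tokens": sum(r.get("tokens", 0) for r in records),
        "error_rate": sum(1 for r in records if r.get("error")) / total,
        "tool_usage_rate": sum(1 for r in records if r.get("tools")) / total,
    }
```

Comparing these summaries across time windows is one simple way to spot the regressions mentioned above, for example a jump in average latency after a model upgrade.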
Part 27: Fallback Strategies and Degraded Mode
Define a degraded mode when the local model is unavailable. For example, return a cached response or a short fallback message rather than an error.
This keeps the application usable even if the local AI backend is restarting.
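A degraded mode of this kind might be sketched as follows. Everything here is hypothetical: ask stands for whatever function queries your model, cache is a plain dict of earlier answers, and the default transient_errors are Python built-ins. With the OpenAI SDK you would likely pass transient_errors=(openai.APIConnectionError,) instead.

```python
FALLBACK_MESSAGE = "The assistant is temporarily unavailable. Please try again shortly."


def answer_or_degrade(ask, prompt, cache,
                      transient_errors=(ConnectionError, TimeoutError)):
    """Return (text, degraded). `ask` is a hypothetical callable that queries the model."""
    try:
        text = ask(prompt)
    except transient_errors:
        # Backend unreachable: prefer a previously cached answer, else a short notice.
        return cache.get(prompt, FALLBACK_MESSAGE), True
    cache[prompt] = text  # remember good answers for later degraded responses
    return text, False
```

Returning the degraded flag alongside the text lets the UI show that an answer may be stale instead of silently presenting cached content as fresh.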
Part 28: Documentation for Developers
Include a local AI API guide in your repo with:
- how to configure base_url
- model names and supported backends
- sample Python code
- streaming usage examples
- function calling patterns
- token accounting notes
Developer documentation is essential for a sovereign system because it reduces dependence on external vendor docs.
Part 29: Local Audit and Compliance
If the local AI system handles regulated data, apply the same audit controls as for any other local service. Keep logs, access records, and model usage metadata in a secure repository.
Part 30: Closing the Local AI Loop
A fully sovereign local AI stack is not just the model. It is the data path, the API, the tooling, the logs, and the governance. Keep all those pieces under local version control and with clear operational ownership.
Part 31: Advanced Prompt Composition
Design prompts that separate instructions, context, and examples. This makes behavior more predictable and easier to debug.
Example prompt structure:
You are a local knowledge assistant.
Context: [user data or facts]
Examples: [show correct behavior]
Task: [what the model should do]
A consistent prompt structure also makes local logs easier to inspect and compare.
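The structure above can be assembled by a small helper so that every request follows the same layout. build_messages is a hypothetical function, and the section labels simply mirror the example structure; this is one possible arrangement rather than a required format.

```python
SYSTEM_PROMPT = "You are a local knowledge assistant."


def build_messages(task, context="", examples=()):
    """Assemble chat messages with instructions, context, examples, and task kept separate."""
    sections = []
    if context:
        sections.append(f"Context:\n{context}")
    if examples:
        sections.append("Examples:\n" + "\n".join(f"- {e}" for e in examples))
    sections.append(f"Task:\n{task}")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "\n\n".join(sections)},
    ]
```

The returned list can be passed directly as the messages argument of client.chat.completions.create.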
Part 32: Local Re-Ranking and Chaining
Combine an LLM with local deterministic logic for better results. Use the model to generate candidates and then rerank or filter them with a rule engine.
This hybrid approach preserves sovereignty while improving precision.
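As an illustration of the generate-then-filter idea, the sketch below takes candidate answers (assumed to come from several model calls), drops those that break hard rules, and orders the rest by how many required terms they mention. The rerank helper and its rules are hypothetical placeholders for your own rule engine.

```python
def rerank(candidates, required_terms, max_chars=500):
    """Drop candidates that break hard rules, then sort the rest by term coverage."""

    def score(text):
        lowered = text.lower()
        return sum(term.lower() in lowered for term in required_terms)

    # Hard rules: a candidate must be non-empty and within the length limit.
    kept = [c for c in candidates if c.strip() and len(c) <= max_chars]
    # sorted() is stable, so equally scored candidates keep their original order.
    return sorted(kept, key=score, reverse=True)
```

Because the filtering and scoring are deterministic, the same set of candidates always yields the same ranking, which makes the behavior easy to test and audit.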
Part 33: Data Privacy and Sensitive Prompts
Filter or redact sensitive user inputs before sending them to the model, even if the model is local. Treat the local inference server as a sensitive resource.
For example, strip personally identifiable information (PII) when generating summaries or explanations.
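A very small redaction pass could look like this. The regular expressions are illustrative assumptions and will miss many real-world formats; treat redact as a hypothetical starting point, not a complete PII detector.

```python
import re

# Order matters: IPv4 runs before PHONE so dotted addresses are not read as phone numbers.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running redact on user input before building the messages list keeps the raw values out of both the model context and any logs derived from it.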
Part 34: Multi-Tenancy and Access Policies
If the same local LLM service supports multiple users or teams, enforce strict access policies. Use separate API keys or tokens for each tenant and log requests by tenant identity.
Keep model configuration and prompt policies consistent across tenants to avoid drift.
Part 35: Testing the OpenAI-Compatible API
Create a local test suite for the API compatibility layer. Verify that requests shaped for OpenAI are accepted and that responses match the expected formats.
Key tests include:
- model listing endpoint returns consistent shapes
- completion responses include id, choices, and usage
- function call metadata is returned correctly
- error responses conform to OpenAI error schema
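A shape check for the completion test could be written as a plain function over the response payload, as sketched below. check_completion_shape is a hypothetical helper covering only the fields named in the list above; with the OpenAI SDK, a dict for it can usually be obtained from response.model_dump().

```python
REQUIRED_FIELDS = ("id", "choices", "usage")


def check_completion_shape(payload: dict) -> list:
    """Return a list of problems found in a chat completion payload (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in payload]
    choices = payload.get("choices")
    if choices is not None and len(choices) == 0:
        problems.append("choices is empty")
    for index, choice in enumerate(choices or []):
        message = choice.get("message") or {}
        if message.get("role") != "assistant":
            problems.append(f"choices[{index}]: role is not 'assistant'")
        if message.get("content") is None and not message.get("tool_calls"):
            problems.append(f"choices[{index}]: neither content nor tool_calls present")
    return problems
```

In a test suite, asserting that the returned list is empty gives a readable failure message that names every missing or malformed field at once.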
Part 36: Error Handling and Retries in Client Code
Client applications should handle transient failures gracefully.
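import time
from openai import APIConnectionError, APITimeoutError

def create_with_retry(client, **kwargs):
    for attempt in range(3):
        try:
            return client.chat.completions.create(**kwargs)
        except (APIConnectionError, APITimeoutError):
            if attempt == 2:
                raise  # out of retries; surface the error to the caller
            time.sleep(1)  # brief pause before the next attempt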
Log the failure reasons locally and expose meaningful error messages to developers.
Part 37: Local Model Lifecycle Management
Rotate models intentionally. Keep a deprecation schedule for older models and retire them once newer models prove better in your validation tests.
This avoids accumulating stale, unmaintained model instances.
Part 38: Final Local AI Production Checklist
- local API is OpenAI-compatible
- model metadata is versioned
- prompts are structured and audited
- logs and metrics are captured locally
- tool calling interfaces are limited and safe
- authentication protects the service
- fallback strategies exist
- sensitive inputs are redacted
- multi-tenant usage is isolated
- restore procedures for model service exist
Part 39: Data Serialization and Result Formats
When your API returns model output, standardize the format so downstream services can consume it reliably.
A JSON response should clearly separate metadata and generated content:
{
"id": "resp-123",
"object": "chat.completion",
"model": "qwen3:14b",
"usage": {"prompt_tokens": 42, "completion_tokens": 118},
"choices": [{"message": {"role": "assistant", "content": "..."}}]
}
Having stable response contracts makes client applications more resilient.
Part 40: Local Model Catalog and Inventory
Maintain a catalog of available models with details such as:
- name and version
- hardware requirements
- intended use cases
- date installed
- notes on quality and performance
A model catalog helps you choose the right backend and avoid deploying models that are outdated or unsupported.
Part 41: Cost and Resource Monitoring for Inference
Even if there is no external cost, local inference consumes compute. Track GPU or CPU utilization, memory pressure, and disk I/O during inference loads.
Use this data to decide when to scale up hardware, add more inference hosts, or change model sizes.
Part 42: Troubleshooting Local Model Deployment
Common local deployment issues include:
- model file not found
- insufficient GPU memory
- port conflicts
- API route mismatches
Keep a troubleshooting checklist and logs handy so you can resolve issues quickly.
Part 43: Local Service Hardening and Exposure Control
Only expose the AI API on private interfaces. If you need remote developer access, use SSH tunnels or a local VPN rather than opening the model port publicly.
This aligns with sovereign deployment principles and reduces attack surface.
Part 44: Final Configuration and Documentation
Document the expected configuration files, environment variables, and startup commands. A well-documented local AI service is easier to maintain and hand over.
Further Reading
- Self-Host an LLM API Server 2026 — set up the API servers this guide connects to
- AI Agent Design Patterns 2026 — build agents using function calling
- Python + Ollama 2026 — native Ollama SDK patterns
Tested on: Ubuntu 24.04 LTS (RTX 4090). openai 1.50.2, LiteLLM 1.50.4, Ollama 0.5.12. Last verified: May 1, 2026.