Vucense

OpenAI-Compatible LLM APIs 2026: Ollama, vLLM & LiteLLM Guide

🟡Intermediate

Use OpenAI-compatible APIs with sovereign local models. Covers Ollama API, vLLM server, LiteLLM proxy for multi-model routing, streaming responses, function calling, and token counting.

Key Takeaways

  • One SDK, any backend: The OpenAI Python SDK works with Ollama, vLLM, and llama.cpp — change only base_url and api_key.
  • LiteLLM for multi-provider routing: Single endpoint routing to Ollama, vLLM, cloud fallbacks, with logging and cost tracking.
  • Streaming for UX: Always use stream=True for interactive chat — users see tokens as they generate.
  • Function calling is sovereign: Tool use works with local Llama 4 Scout and Qwen3 14B — no cloud needed.

Introduction

Direct Answer: How do I use the OpenAI Python SDK with local Ollama models in 2026?

Install: pip install openai. Start Ollama with a model: ollama pull qwen3:14b. Connect: from openai import OpenAI; client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"). Call as normal: response = client.chat.completions.create(model="qwen3:14b", messages=[{"role":"user","content":"Hello"}]). For streaming: add stream=True and iterate chunks. For function calling: pass a tools list and handle tool_calls in the response. Only base_url and api_key differ from the production OpenAI configuration — the rest of your application is unchanged.


Part 1: Basic OpenAI SDK with Ollama

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is a Docker volume in one sentence?"}
    ],
    temperature=0,
    max_tokens=100
)

print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} prompt + {response.usage.completion_tokens} completion")

Expected output:

A Docker volume is persistent storage managed by Docker that exists independently of containers, surviving restarts and deletions.
Tokens: 42 prompt + 28 completion

Part 2: Streaming

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def stream_chat(prompt: str, model: str = "qwen3:14b") -> str:
    full = ""
    with client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True
    ) as stream:
        for chunk in stream:
            if not chunk.choices:  # some servers send a final chunk with no choices
                continue
            delta = chunk.choices[0].delta.content
            if delta:
                print(delta, end="", flush=True)
                full += delta
    print()
    return full

stream_chat("Write a Python context manager that times a code block.")

Expected output (tokens appear in real time):

import time
from contextlib import contextmanager

@contextmanager
def timer(label: str = "Elapsed"):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.4f}s")

with timer("My operation"):
    time.sleep(0.5)
# My operation: 0.5003s

Part 3: Function Calling

import json
from openai import OpenAI
import subprocess

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [{
    "type": "function",
    "function": {
        "name": "get_disk_usage",
        "description": "Get disk usage for a directory",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory path to check"}
            },
            "required": ["path"]
        }
    }
}]

def execute_tool(name: str, args: dict) -> str:
    if name == "get_disk_usage":
        r = subprocess.run(["du", "-sh", args["path"]], capture_output=True, text=True)
        return r.stdout.strip() or r.stderr.strip()
    return f"Unknown: {name}"

def run_agent(user_msg: str) -> str:
    messages = [{"role": "user", "content": user_msg}]
    while True:
        response = client.chat.completions.create(
            model="llama4:scout",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        msg = response.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = execute_tool(tc.function.name, args)
            print(f"  [Tool] {tc.function.name}({args}) -> {result[:60]}")
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

answer = run_agent("How much space is /var/log using?")
print(f"\nAnswer: {answer}")

Expected output:

  [Tool] get_disk_usage({'path': '/var/log'}) -> 1.2G    /var/log

Answer: /var/log is using 1.2GB of disk space.

Part 4: LiteLLM Proxy for Multi-Model Routing

pip install 'litellm[proxy]' --break-system-packages

# Start proxy — routes to local Ollama
litellm --model ollama/qwen3:14b --port 4000 &

curl http://localhost:4000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"ollama/qwen3:14b","messages":[{"role":"user","content":"ping"}]}' \
    | python3 -c "import json,sys; print(json.load(sys.stdin)['choices'][0]['message']['content'])"

Expected output (model-dependent, typically): pong

# litellm_config.yaml — production multi-model routing
model_list:
  - model_name: fast
    litellm_params:
      model: ollama/qwen3:8b
      api_base: http://localhost:11434

  - model_name: smart
    litellm_params:
      model: ollama/qwen3:14b
      api_base: http://localhost:11434

  - model_name: vision
    litellm_params:
      model: ollama/llama4:scout
      api_base: http://localhost:11434

general_settings:
  master_key: sk-your-proxy-key

litellm_settings:
  request_timeout: 300

litellm --config litellm_config.yaml --port 4000

# All models now available via single endpoint
curl http://localhost:4000/v1/models \
    -H "Authorization: Bearer sk-your-proxy-key"

Part 5: Migration Pattern — OpenAI to Sovereign

# BEFORE — OpenAI cloud
from openai import OpenAI
client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(model="gpt-4o", messages=[...])

# AFTER — Sovereign (change 3 values, nothing else)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",   # Add
    api_key="ollama"                         # Change
)
response = client.chat.completions.create(
    model="qwen3:14b",                       # Change model name
    messages=[...]                           # Unchanged
)
# All other code: unchanged

Troubleshooting

Connection refused to Ollama

Ollama is not running. Fix: ollama serve or sudo systemctl start ollama.

Model name not found

Ollama model names are case-sensitive and include the tag: qwen3:14b not Qwen3-14B. Check with ollama list.

LiteLLM Bad Gateway

The backend (Ollama/vLLM) isn’t running or the api_base is wrong in the config. Test the backend directly with curl before routing through LiteLLM.


Conclusion

The OpenAI SDK is the universal client for sovereign LLM APIs in 2026: point base_url at Ollama, vLLM, or llama.cpp, set the api_key and model name, and the rest of your code stays the same. LiteLLM adds multi-model routing, fallbacks, and logging. The same function calling and streaming patterns work locally as with the cloud API.

See Self-Host an LLM API Server 2026 for setting up the server this SDK connects to, and AI Agent Design Patterns 2026 for building agents with function calling.


People Also Ask

Is there a performance difference between the Ollama SDK and OpenAI SDK pointing at Ollama?

Negligible in practice: both clients talk HTTP/JSON to the same localhost server, and inference time dominates any client-side overhead. The native ollama Python SDK gives access to Ollama-specific features, such as the images field for vision models. The OpenAI SDK is better for portability, since the same code can target OpenAI, Anthropic (via LiteLLM), and Ollama. For new sovereign-only code, use the native ollama SDK; for code that might need to switch providers, use the OpenAI SDK.

Can I use LiteLLM to fall back to OpenAI if local GPU is busy?

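Yes. LiteLLM supports fallback routing through a fallbacks list that maps a model name to its backups, for example fallbacks: [{"smart": ["gpt-4o"]}] under litellm_settings, with gpt-4o also defined in model_list (check the LiteLLM documentation for the exact form in your version). If the primary local model fails or times out, requests automatically route to the fallback. This is useful for graceful degradation in production, but be aware that fallback requests do leave your machine and incur OpenAI costs.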


Part 2: Installing and Running Ollama Locally

The easiest sovereign path is to run Ollama on your local Ubuntu host. Install with the official installer or package manager and pull the model you need.

2.1 Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama --version

2.2 Pull a model

ollama pull qwen3:14b

2.3 Start the Ollama server

# 11434 is the default port; set OLLAMA_HOST to change the bind address or port
OLLAMA_HOST=127.0.0.1:11434 ollama serve

Verify the local endpoint:

curl http://localhost:11434/v1/models

If you are using a GPU, make sure the model fits in its memory and that the driver stack is configured correctly. For CPU-only environments, llama.cpp is usually the better fit; vLLM is designed primarily for GPU serving.

Part 3: LiteLLM Multi-Provider Proxy

LiteLLM lets you run a unified OpenAI-compatible API in front of multiple backends. This is ideal when you want the same client code to work with Ollama, vLLM, and hosted cloud providers.

3.1 Install LiteLLM

pip install 'litellm[proxy]' --break-system-packages

3.2 Start the proxy for Ollama

litellm --model ollama/qwen3:14b --port 4000

3.3 Use the same OpenAI SDK code

from openai import OpenAI
client = OpenAI(base_url='http://localhost:4000/v1', api_key='secret')  # use the proxy master_key if one is set

LiteLLM can also route to a mix of local and cloud backends, which is useful for hybrid sovereignty: keep sensitive requests local while routing low-risk requests to a trusted fallback.

Part 4: Streaming and Real-Time Output

Streaming improves user experience in chat apps and reduces wait times for long responses.

with client.chat.completions.create(
    model='qwen3:14b',
    messages=[{'role': 'user', 'content': 'Explain local LLM routing.'}],
    stream=True,
) as stream:
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end='', flush=True)

Handle partial chunks gracefully in the UI. If the stream ends unexpectedly, retry once and log the failure locally.

Part 5: Function Calling with Local Models

Function calling lets the model request external tools while keeping the overall flow local. This is one of the strongest sovereign patterns because the model does not need external APIs to perform useful tasks.

5.1 Define tools as JSON schemas

tools=[{
    'type': 'function',
    'function': {
        'name': 'lookup_user',
        'description': 'Get details for a local user by username',
        'parameters': {
            'type': 'object',
            'properties': {
                'username': {'type': 'string', 'description': 'The login name of the user'}
            },
            'required': ['username'],
        },
    },
}]

5.2 Execute tool calls locally

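from openai import OpenAI
import json

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

def lookup_user(username: str) -> dict:
    # Placeholder for illustration: replace with your real local directory lookup.
    return {'username': username, 'found': False}

# `tools` is the schema list defined in 5.1
response = client.chat.completions.create(
    model='llama4:scout',
    messages=[{'role': 'user', 'content': 'Look up user alice in the local directory.'}],
    tools=tools,
)

message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    # Only dispatch to tools you have explicitly defined
    if tool_call.function.name == 'lookup_user':
        args = json.loads(tool_call.function.arguments)
        result = lookup_user(args['username'])
        print(result)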

This keeps the knowledge boundary local and ensures that tool execution is auditable.

Part 6: Token Counting and Cost Awareness

Even when the model is local, token efficiency matters for latency and memory. The usage field on each response reports the prompt and completion token counts for that call.

from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='qwen3:14b',
    messages=[{'role': 'user', 'content': 'Summarize the server health checklist.'}],
    max_tokens=150,
)
print(response.usage)

Track prompt and completion tokens when you are doing local capacity planning. For on-prem inference, token usage affects GPU memory and throughput just as it affects cloud cost.

Part 7: Multi-Model Routing and Fallbacks

A local proxy can route requests to different models based on intent, size, or sensitivity.

Create a lightweight routing table in LiteLLM or your own proxy:

  • small, fast models for short prompts
  • larger high-quality models for long-form generation
  • a specific model for code tasks
  • a privacy-preserving local model for confidential queries

This keeps responses in the right performance and privacy tier without changing the application code.

Part 8: Secure Local API Configuration

Use environment variables and local secrets to avoid hard-coding endpoints.

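export OPENAI_BASE_URL='http://localhost:11434/v1'
export OPENAI_API_KEY='ollama'

In Python:

import os
from openai import OpenAI

# The SDK also reads both variables automatically, so OpenAI() with no arguments works too
client = OpenAI(base_url=os.environ['OPENAI_BASE_URL'], api_key=os.environ['OPENAI_API_KEY'])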

Use a .env file with local permissions and exclude it from Git. This keeps the local endpoint and key under your control.

Part 9: Model Deployment and Scalability

For production local inference, run Ollama or vLLM on a dedicated inference server. Use systemd or Docker to supervise the process and restart it on failure.

Example systemd unit:

[Unit]
Description=Ollama inference service
After=network.target

[Service]
Type=simple
Environment="OLLAMA_HOST=127.0.0.1:11434"
ExecStart=/usr/local/bin/ollama serve
Restart=on-failure
User=ollama

[Install]
WantedBy=multi-user.target

This gives you a reliable local endpoint that can be monitored and restarted automatically.

Part 10: Troubleshooting Local SDK Compatibility

If the OpenAI SDK fails with a local backend, troubleshoot:

  • verify base_url ends with /v1
  • ensure api_key is accepted by the local server
  • check the local server logs for decoding or schema errors
  • confirm the model name is available with GET /v1/models

Common errors include missing Content-Type: application/json or using a path that the proxy does not support.

Part 11: Logging, Audit, and Sovereign Traceability

Log every inference request and response metadata locally. Do not log full user prompts if they contain sensitive data; instead log request IDs, model names, and token counts.

A log record might include:

  • timestamp
  • model
  • prompt length
  • completion length
  • user ID (if applicable)
  • tool usage
  • latency

This local audit trail is part of your sovereign AI infrastructure.
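
The fields listed above could be assembled into a JSON log line roughly as follows. This is a minimal sketch: the function name build_log_record and the field names are assumptions chosen for illustration, not part of any SDK. It records lengths rather than the prompt text, in line with the advice not to log sensitive content.

```python
import json
from datetime import datetime, timezone

def build_log_record(model, prompt, completion, latency_ms, user_id=None, tools_used=()):
    """Return one JSON log line holding metadata only, never the prompt text."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt_chars": len(prompt),          # length only, not content
        "completion_chars": len(completion),  # length only, not content
        "user_id": user_id,
        "tools_used": list(tools_used),
        "latency_ms": latency_ms,
    }
    return json.dumps(record)

line = build_log_record("qwen3:14b", "my secret question", "an answer", 360, user_id="alice")
print(line)
```

Appending each line to a local file gives a simple audit trail that can be parsed later with any JSON tool.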

Part 12: Offline and Air-Gapped Operation

For a truly air-gapped deployment, ensure your model files and SDK dependencies are all available locally. Use a private package mirror or a local wheelhouse for Python packages.

If you need to update models, transfer them via secure media and verify checksums before importing. This preserves the sovereignty of the environment.
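
Checksum verification of a transferred model file might look like the sketch below. It assumes the sender published a SHA-256 hex digest alongside the file; the function names are illustrative.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_checksum(path, expected_hex):
    """Return True only if the file's SHA-256 matches the published digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Import the model only when verify_checksum returns True; a mismatch means the file was corrupted or altered in transit.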

Part 13: Local Security Hardening for LLM APIs

Protect the local API with a firewall and host-based access controls. Only expose the local endpoint to trusted clients.

If using a proxy, bind it to localhost or a private network interface:

litellm --model ollama/qwen3:14b --host 127.0.0.1 --port 4000

Then use SSH tunnels or local NAT rules to grant access to approved hosts.

Part 14: Integration Patterns for Local AI Apps

Use the same OpenAI-compatible client code for:

  • chatbots
  • summarization services
  • extraction tools
  • code assistants

Switch the backend by changing base_url and api_key only. This makes your app portable and sovereign across deployments.

Part 15: Local Model Maintenance and Updates

Keep track of model versions and update them intentionally. Record the model digest or version in your metadata:

ollama list
ollama show qwen3:14b

When you upgrade a model, run validation prompts and compare responses to ensure the new model still meets your requirements.


Part 17: Token Cost and Throughput for Local Models

Even though a local model incurs no cloud bill, token usage still affects performance and memory. Smaller prompts and concise instructions reduce latency.

17.1 Token estimation

The usage field on a response reports token counts after the call completes. To estimate usage before inference, run the prompt through a tokenizer library that matches your model.

from openai import OpenAI
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='qwen3:14b',
    messages=[{'role': 'user', 'content': 'What is a Docker volume?'}],
    max_tokens=100
)
print(response.usage)

17.2 Prompt optimization

Prefer shorter prompts with explicit instructions. Use system messages to set behavior and avoid repeating context in each request.

Part 18: Multi-Model Rerouting Policies

Use LiteLLM or a custom router to send prompts to different models based on intent.

Example policy:

  • classification tasks -> small, fast model
  • summarization -> medium model
  • coding -> specialized code-capable model
  • confidential queries -> local privacy-first model

This improves resource utilization while preserving local sovereignty.
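
The example policy could be expressed as a small lookup in a custom router. This is a sketch under assumptions: the aliases "fast" and "smart" follow the LiteLLM config shown earlier, "code" is a hypothetical alias for a code-capable model, and LOCAL_ONLY_MODEL stands for whichever alias you guarantee never leaves the machine.

```python
# Task type -> model alias exposed by the proxy
ROUTING_POLICY = {
    "classification": "fast",
    "summarization": "smart",
    "coding": "code",  # hypothetical alias for a code-capable model
}
LOCAL_ONLY_MODEL = "smart"  # assumed to be served locally with no cloud fallback
DEFAULT_MODEL = "smart"

def pick_model(task, confidential=False):
    """Choose a model alias; confidentiality overrides the task-based choice."""
    if confidential:
        return LOCAL_ONLY_MODEL
    return ROUTING_POLICY.get(task, DEFAULT_MODEL)

print(pick_model("classification"))
print(pick_model("coding", confidential=True))
```

The returned alias is then passed as the model argument of the chat completion call, so application code never names a concrete backend.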

Part 19: Tool Security and Operation Boundaries

When the LLM calls local tools, clearly define the boundary between natural language and executable actions. Keep tool definitions minimal and audit the tool output.

Example tool safety note:

  • accept only string input parameters
  • sanitize paths and filenames
  • do not expose secrets through tool output

A safe tool interface is one of the most critical parts of sovereign function calling.
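
One way to apply the first two safety notes is to validate a path argument before a tool touches the filesystem. The sketch below assumes tools are confined to a single allowed root directory; validate_tool_path is an illustrative name.

```python
from pathlib import Path

def validate_tool_path(raw, allowed_root):
    """Return a resolved path inside allowed_root, or raise on anything else."""
    if not isinstance(raw, str):
        raise TypeError("path argument must be a string")
    if "\x00" in raw:
        raise ValueError("path contains a null byte")
    root = Path(allowed_root).resolve()
    # Resolving collapses '..' segments and symlinks before the containment check
    candidate = (root / raw).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes allowed root: {raw!r}")
    return candidate
```

Calling this at the top of each tool handler rejects traversal attempts such as ../ sequences or absolute paths outside the allowed root before any command runs.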

Part 20: Model Versioning and Metadata

Track the exact model version in your metadata and logs.

print(client.models.retrieve('qwen3:14b'))

Store the model identifier, server version, and local runtime details in the inference logs. This makes it possible to trace a generated response back to the exact model instance.

Part 21: Logging and Observability

Log inference requests, errors, and response times locally.

A simple JSON log entry:

{
  "timestamp": "2026-05-22T12:00:00Z",
  "model": "qwen3:14b",
  "tokens": 143,
  "latency_ms": 360,
  "user_id": "alice"
}

Do not log sensitive prompt content unless you have consent and secure storage.

Part 22: Local Cache and Repeatable Responses

For repeated queries, use a local prompt cache. If the same query has already been answered, return the cached response instead of calling the model again.

This reduces load and provides deterministic behavior for repeated lookups.
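
A minimal in-memory version of such a cache is sketched below. It assumes exact-match lookups keyed on the model name plus the full message list; the generate argument is a placeholder for whatever function actually calls the model.

```python
import hashlib
import json

class PromptCache:
    """Exact-match cache keyed on a hash of the model name and messages."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model, messages):
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_generate(self, model, messages, generate):
        """Return the cached answer, calling generate(model, messages) only on a miss."""
        key = self._key(model, messages)
        if key not in self._store:
            self._store[key] = generate(model, messages)
        return self._store[key]
```

For persistence across restarts, the dictionary could be swapped for a local SQLite table or similar store without changing the interface.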

Part 23: AI Safety and Guardrails

Implement guardrails such as prompt templates, model instructions, and output validation. For local AI, the guardrails should be part of your codebase, not an external service.

Example validation:

output = response.choices[0].message.content
if len(output) > 2000:
    raise ValueError('Response too long')

If the model is asked to generate code, validate that the output only contains allowed constructs.
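
For generated Python, one possible check is to parse the output and reject constructs you have not allowed. The sketch below is an assumption about what "allowed" might mean: it rejects imports and direct calls to a few risky built-ins. It is a filter, not a sandbox, and does not make untrusted code safe to execute.

```python
import ast

BANNED_CALLS = {"eval", "exec", "__import__", "open"}

def is_allowed_python(source):
    """Return True if source parses and uses no imports or banned built-in calls."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            return False
    return True

print(is_allowed_python("x = 1 + 2"))
print(is_allowed_python("import os"))
```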

Part 24: Offline Dependency Management

When deploying local LLM services, ensure all Python dependencies are available without external network access.

Create a local wheelhouse:

pip download -r requirements.txt -d /tmp/wheelhouse

Then copy the wheelhouse to the air-gapped host and install from it with pip install --no-index --find-links=/tmp/wheelhouse -r requirements.txt.

Part 25: Local API Authentication and Access Control

Add simple token-based authentication to the local OpenAI-compatible API.

If you are using LiteLLM or Ollama as a service, protect the endpoint with a reverse proxy and local API tokens. Store tokens in an environment file with restrictive permissions.

Part 26: Metrics for Local Inference

Collect core metrics:

  • requests per minute
  • average latency
  • token consumption
  • error rate
  • tool usage rate

These metrics help you scale the local model server and detect regressions.
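
A few of these metrics can be aggregated in process before exporting them to a monitoring system. The class below is an illustrative sketch, not tied to any metrics library; its name and fields are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class InferenceMetrics:
    latencies_ms: list = field(default_factory=list)
    errors: int = 0
    tokens: int = 0

    def record(self, latency_ms, tokens=0, error=False):
        """Record one request's latency, token count, and whether it failed."""
        self.latencies_ms.append(latency_ms)
        self.tokens += tokens
        if error:
            self.errors += 1

    def summary(self):
        """Return request count, average latency, error rate, and total tokens."""
        n = len(self.latencies_ms)
        return {
            "requests": n,
            "avg_latency_ms": sum(self.latencies_ms) / n if n else 0.0,
            "error_rate": self.errors / n if n else 0.0,
            "tokens": self.tokens,
        }

metrics = InferenceMetrics()
metrics.record(100, tokens=10)
metrics.record(300, tokens=30, error=True)
print(metrics.summary())
```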

Part 27: Fallback Strategies and Degraded Mode

Define a degraded mode when the local model is unavailable. For example, return a cached response or a short fallback message rather than an error.

This keeps the application usable even if the local AI backend is restarting.
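
Degraded mode can be a thin wrapper around the model call. In this sketch, generate is a placeholder for the function that calls the backend, cached is an optional dictionary of earlier answers, and the fallback wording is an assumption.

```python
FALLBACK_MESSAGE = "The assistant is temporarily unavailable. Please try again shortly."

def answer_with_fallback(generate, prompt, cached=None):
    """Try the model first; on failure return a cached answer or a short fallback."""
    try:
        return generate(prompt)
    except Exception:
        # Broad catch for illustration; narrow this to connection and timeout
        # errors in real code so that programming bugs still surface.
        if cached and prompt in cached:
            return cached[prompt]
        return FALLBACK_MESSAGE
```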

Part 28: Documentation for Developers

Include a local AI API guide in your repo with:

  • how to configure base_url
  • model names and supported backends
  • sample Python code
  • streaming usage examples
  • function calling patterns
  • token accounting notes

Developer documentation is essential for a sovereign system because it reduces dependence on external vendor docs.

Part 29: Local Audit and Compliance

If the local AI system handles regulated data, apply the same audit controls as for any other local service. Keep logs, access records, and model usage metadata in a secure repository.

Part 30: Closing the Local AI Loop

A fully sovereign local AI stack is not just the model. It is the data path, the API, the tooling, the logs, and the governance. Keep all those pieces under local version control and with clear operational ownership.


Part 31: Advanced Prompt Composition

Design prompts that separate instructions, context, and examples. This makes behavior more predictable and easier to debug.

Example prompt structure:

You are a local knowledge assistant.
Context: [user data or facts]
Examples: [show correct behavior]
Task: [what the model should do]

A consistent prompt structure also makes local logs easier to inspect and compare.

Part 32: Local Re-Ranking and Chaining

Combine an LLM with local deterministic logic for better results. Use the model to generate candidates and then rerank or filter them with a rule engine.

This hybrid approach preserves sovereignty while improving precision.

Part 33: Data Privacy and Sensitive Prompts

Filter or redact sensitive user inputs before sending them to the model, even if the model is local. Treat the local inference server as a sensitive resource.

For example, strip personally identifiable information (PII) when generating summaries or explanations.
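
A simple redaction pass might look like the sketch below. The two regular expressions are deliberately rough assumptions that catch common email and phone formats only; real PII detection needs broader patterns or a dedicated library.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

print(redact_pii("Contact alice@example.com or +1 555-123-4567."))
# prints: Contact [EMAIL] or [PHONE].
```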

Part 34: Multi-Tenancy and Access Policies

If the same local LLM service supports multiple users or teams, enforce strict access policies. Use separate API keys or tokens for each tenant and log requests by tenant identity.

Keep model configuration and prompt policies consistent across tenants to avoid drift.

Part 35: Testing the OpenAI-Compatible API

Create a local test suite for the API compatibility layer. Verify that requests shaped for OpenAI are accepted and that responses match the expected formats.

Key tests include:

  • model listing endpoint returns consistent shapes
  • completion responses include id, choices, and usage
  • function call metadata is returned correctly
  • error responses conform to OpenAI error schema
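
The response-shape test could start from a small checker like the one below. It is a sketch that covers only the fields named in the list (id, choices, usage); the function name and error strings are illustrative.

```python
REQUIRED_TOP_LEVEL = ("id", "choices", "usage")
REQUIRED_USAGE = ("prompt_tokens", "completion_tokens")

def completion_shape_errors(payload):
    """Return a list of problems found in a chat completion payload (empty if OK)."""
    errors = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in payload]
    choices = payload.get("choices")
    if isinstance(choices, list):
        if not choices:
            errors.append("choices is empty")
        for i, choice in enumerate(choices):
            if "message" not in choice:
                errors.append(f"choices[{i}] has no message")
    elif "choices" in payload:
        errors.append("choices is not a list")
    usage = payload.get("usage")
    if isinstance(usage, dict):
        errors += [f"usage missing: {key}" for key in REQUIRED_USAGE if key not in usage]
    return errors
```

In a test suite, decode the JSON body returned by the local server and assert that completion_shape_errors returns an empty list.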

Part 36: Error Handling and Retries in Client Code

Client applications should handle transient failures gracefully.

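import time
from openai import OpenAI, APIConnectionError, APITimeoutError

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

def create_with_retry(**kwargs):
    for attempt in range(3):
        try:
            return client.chat.completions.create(**kwargs)
        except (APIConnectionError, APITimeoutError):
            if attempt == 2:
                raise  # out of retries: surface the error to the caller
            time.sleep(1)  # brief pause before the next attempt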

Log the failure reasons locally and expose meaningful error messages to developers.

Part 37: Local Model Lifecycle Management

Rotate models intentionally. Keep a deprecation schedule for older models and retire them once newer models prove better in your validation tests.

This avoids accumulating stale, unmaintained model instances.

Part 38: Final Local AI Production Checklist

  • local API is OpenAI-compatible
  • model metadata is versioned
  • prompts are structured and audited
  • logs and metrics are captured locally
  • tool calling interfaces are limited and safe
  • authentication protects the service
  • fallback strategies exist
  • sensitive inputs are redacted
  • multi-tenant usage is isolated
  • restore procedures for model service exist

Part 39: Data Serialization and Result Formats

When your API returns model output, standardize the format so downstream services can consume it reliably.

A JSON response should clearly separate metadata and generated content:

{
  "id": "resp-123",
  "object": "chat.completion",
  "model": "qwen3:14b",
  "usage": {"prompt_tokens": 42, "completion_tokens": 118},
  "choices": [{"message": {"role": "assistant", "content": "..."}}]
}

Having stable response contracts makes client applications more resilient.

Part 40: Local Model Catalog and Inventory

Maintain a catalog of available models with details such as:

  • name and version
  • hardware requirements
  • intended use cases
  • date installed
  • notes on quality and performance

A model catalog helps you choose the right backend and avoid deploying models that are outdated or unsupported.

Part 41: Cost and Resource Monitoring for Inference

Even if there is no external cost, local inference consumes compute. Track GPU or CPU utilization, memory pressure, and disk I/O during inference loads.

Use this data to decide when to scale up hardware, add more inference hosts, or change model sizes.

Part 42: Troubleshooting Local Model Deployment

Common local deployment issues include:

  • model file not found
  • insufficient GPU memory
  • port conflicts
  • API route mismatches

Keep a troubleshooting checklist and logs handy so you can resolve issues quickly.

Part 43: Local Service Hardening and Exposure Control

Only expose the AI API on private interfaces. If you need remote developer access, use SSH tunnels or a local VPN rather than opening the model port publicly.

This aligns with sovereign deployment principles and reduces attack surface.

Part 44: Final Configuration and Documentation

Document the expected configuration files, environment variables, and startup commands. A well-documented local AI service is easier to maintain and hand over.


Tested on: Ubuntu 24.04 LTS (RTX 4090). openai 1.50.2, LiteLLM 1.50.4, Ollama 0.5.12. Last verified: May 1, 2026.

Kofi Mensah

About the Author

Inference Economics & Hardware Architect

Electrical Engineer | Hardware Systems Architect | 8+ Years in GPU/AI Optimization | ARM & x86 Specialist

Kofi Mensah is a hardware architect and AI infrastructure specialist focused on optimizing inference costs for on-device and local-first AI deployments. With expertise in CPU/GPU architectures, Kofi analyzes real-world performance trade-offs between commercial cloud AI services and sovereign, self-hosted models running on consumer and enterprise hardware (Apple Silicon, NVIDIA, AMD, custom ARM systems). He quantifies the total cost of ownership for AI infrastructure and evaluates which deployment models (cloud, hybrid, on-device) make economic sense for different workloads and use cases. Kofi's technical analysis covers model quantization, inference optimization techniques (llama.cpp, vLLM), and hardware acceleration for language models, vision models, and multimodal systems. At Vucense, Kofi provides detailed cost analysis and performance benchmarks to help developers understand the real economics of sovereign AI.
