Key Takeaways
- Tokens, not words: LLMs process token sequences. 1 English word ≈ 1.3 tokens on average. Code is more token-dense. Rare words split into multiple tokens.
- Context window = working memory: The model can only “see” what’s in the context window. Old conversation turns get dropped when the window fills.
- Temperature = creativity dial: 0 = deterministic (always the top token). 0.7 = balanced. 1.0 = creative. Above roughly 1.2, quality degrades quickly; by 1.5 output is usually gibberish.
- Open-weight vs API-locked: Weights downloadable = your data stays local. API-only = your data leaves your machine.
Introduction
Direct Answer: How do large language models work in 2026?
A large language model is a neural network trained to predict the next token in a sequence. “Training” means adjusting billions of numerical weights so the model becomes better at this prediction task across a massive corpus of text. At inference time, you provide a sequence of tokens (your prompt), and the model produces a probability distribution over all possible next tokens, picks one (based on temperature), appends it to the sequence, and repeats until it generates a complete response or hits a stop token. The quality of responses comes from the model having compressed patterns from trillions of training tokens into its weights — patterns about language, facts, reasoning, and code. Context windows define how many tokens fit in a single inference call. Modern models range from 8K tokens (small, fast) to 10M tokens (Llama 4 Scout) — larger windows enable more sophisticated reasoning over longer documents. Open-weight models like Qwen3 14B and Llama 4 Scout run this inference process entirely on your local hardware; API models run it on cloud servers.
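The predict, pick, append, repeat loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table that conditions only on the previous token, whereas a real LLM conditions on the whole sequence and scores a vocabulary of many thousands of tokens.

```python
import random

# Hypothetical toy "model": previous token -> probabilities for the next token
TOY_MODEL = {
    "<start>": {"The": 1.0},
    "The": {"old": 0.6, "lighthouse": 0.4},
    "old": {"lighthouse": 1.0},
    "lighthouse": {"keeper": 0.9, "<stop>": 0.1},
    "keeper": {"<stop>": 1.0},
}


def generate(max_tokens=10, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sequence = ["<start>"]
    for _ in range(max_tokens):
        dist = TOY_MODEL[sequence[-1]]                        # 1. distribution over next tokens
        tokens, weights = zip(*dist.items())
        next_token = rng.choices(tokens, weights=weights)[0]  # 2. sample one token
        if next_token == "<stop>":                            # 3. stop token ends generation
            break
        sequence.append(next_token)                           # 4. append and repeat
    return sequence[1:]  # drop the <start> marker


print(" ".join(generate()))
```

The `max_tokens` cap plays the role of a generation limit: the loop ends either on the stop token or when the cap is reached.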
## Part 1: Tokens — The Building Block
How Tokenization Works (Visual Flow)
┌─────────────────────────────────────────────────────────────┐
│ INPUT TEXT: "Hello, world! I'm learning about LLMs"         │
└─────────────────────────────────────────────────────────────┘
                              ↓
                    ┌─────────────────────┐
                    │  TOKENIZER (GPT-4)  │
                    └─────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│ TOKENS (IDs → readable text; IDs are illustrative):         │
│ [9906]  → "Hello"     (common word = 1 token)               │
│ [11]    → ","         (punctuation = 1 token)               │
│ [1917]  → " world"    (space + word = 1 token)              │
│ [0]     → "!"         (punctuation = 1 token)               │
│ [358]   → " I"        (space + pronoun = 1 token)           │
│ [1866]  → "'m"        (contraction suffix = 1 token)        │
│ [4500]  → " learning" (verb = 1 token)                      │
│ [220]   → " about"    (preposition = 1 token)               │
│ [43826] → " LLMs"     (technical acronym = 1 token)         │
│                                                             │
│ TOTAL: 9 tokens for 37 characters                           │
│ Cost at $0.01/1K tokens: $0.00009 per message               │
└─────────────────────────────────────────────────────────────┘
Key insight: Token count is irregular. Common words are 1 token; rare words split into 2-3. This is why character count gives only a rough estimate of token count, never an exact one.
Tokenization in Action
```python
# Why tokenization matters: your API bill is charged per token, not per character.
# A "smart" prompt that uses 800 tokens costs 8x as much to send as a 100-token prompt.
# pip install tiktoken
import tiktoken

# cl100k_base is the GPT-4 encoding; other model families ship their own tokenizers
enc = tiktoken.get_encoding("cl100k_base")

# Test tokenization on real-world examples
examples = [
    "Hello, world!",                                 # Common greeting: a handful of tokens
    "unbelievable",                                  # Less common word: splits into sub-word tokens
    "PostgreSQL",                                    # Database name: technical terms split differently
    "The quick brown fox jumps over the lazy dog.",  # Full sentence
    "def calculate_fibonacci(n: int) -> int:",       # Code example
]

for text in examples:
    tokens = enc.encode(text)                    # Convert text → list of token IDs
    decoded = [enc.decode([t]) for t in tokens]  # Decode each ID back to its text piece
    print(f"\nInput: '{text}'")
    print(f"  Token IDs: {tokens}")
    print(f"  Decoded tokens: {decoded}")
    print(f"  Count: {len(tokens)} tokens for {len(text)} characters")
    # More characters per token means fewer tokens, which is better for cost
    print(f"  Efficiency: {len(text) / len(tokens):.1f} chars/token")
```
**Expected output** (exact IDs and splits depend on the tokenizer version):
```
Input: 'Hello, world!'
  Token IDs: [9906, 11, 1917, 0]
  Decoded tokens: ['Hello', ',', ' world', '!']
  Count: 4 tokens for 13 characters
  Efficiency: 3.2 chars/token

Input: 'unbelievable'
  Token IDs: [359, 43237, 481]
  Decoded tokens: ['un', 'believ', 'able']
  Count: 3 tokens for 12 characters
  Efficiency: 4.0 chars/token

Input: 'PostgreSQL'
  Token IDs: [6021, …]
  Decoded tokens: ['Post', 'greSQL']
  Count: 2 tokens for 10 characters
  Efficiency: 5.0 chars/token

Input: 'The quick brown fox jumps over the lazy dog.'
  Token IDs: [791, 4996, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]
  Decoded tokens: ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']
  Count: 10 tokens for 44 characters
  Efficiency: 4.4 chars/token

Input: 'def calculate_fibonacci(n: int) -> int:'
  Token IDs: [755, 11294, 43326, 1471, 25, 528, 8, 1492, 528, 25]
  Decoded tokens: ['def', ' calculate', '_fibonacci', '(n', ':', ' int', ')', ' ->', ' int', ':']
  Count: 10 tokens for 39 characters
  Efficiency: 3.9 chars/token
```
In this sample the code line comes out slightly denser than the prose sentence (3.9 vs 4.4 characters per token).
**Key insight:** "unbelievable" splits into 3 tokens because it's a less-common word. "The" is a single token because it's extremely common. Code tends to split around punctuation such as underscores and parentheses, because those pieces are what the tokenizer saw most often during its own training.
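To see why a vocabulary entry decides whether a word is one token or several, here is a toy greedy longest-match tokenizer. It is a simplification for illustration only: real BPE tokenizers such as `cl100k_base` apply learned merge rules rather than longest-match lookup, and the six-entry `vocab` below is made up.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try the longest candidate first
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])  # unknown character: fall back to a single-character token
            i += 1
    return tokens


# Hypothetical vocabulary: "The" is stored whole, "unbelievable" only as pieces
vocab = {"The", "un", "believ", "able", "Post", "greSQL"}
print(greedy_tokenize("The", vocab))           # ['The']
print(greedy_tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
```

A word that appears in the vocabulary costs one token; anything else is assembled from smaller pieces, which is where the irregular counts come from.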
---
## Part 2: Context Window — The Model's Working Memory
The context window holds everything the model can reference in a single inference call:
CONTEXT WINDOW (e.g., 128,000 tokens = ~96,000 words)
┌─────────────────────────────────────────────────────────┐
│ System prompt (~500 tokens)                             │
│ "You are a helpful assistant…"                          │
├─────────────────────────────────────────────────────────┤
│ Retrieved documents from RAG (~10,000 tokens)           │
│ [chunk 1] [chunk 2] [chunk 3] … [chunk N]               │
├─────────────────────────────────────────────────────────┤
│ Conversation history (~5,000 tokens)                    │
│ User: … | Assistant: … | User: … | Assistant: …         │
├─────────────────────────────────────────────────────────┤
│ Current user message (~200 tokens)                      │
│ "What does the third paragraph say about security?"     │
├─────────────────────────────────────────────────────────┤
│ Model's response (being generated)                      │
│ "The third paragraph states that…"                      │
└─────────────────────────────────────────────────────────┘
**2026 context window sizes:**
| Model | Context | Effective use |
|:---|:---|:---|
| Qwen3 8B | 32K | ~24K words |
| Qwen3 14B | 40K | ~30K words |
| Gemma3 27B | 128K | ~96K words |
| Llama 4 Scout | 10M | ~7.5M words (practical limit ~128K) |
| GPT-4o | 128K | ~96K words |
| Claude 3.5 Sonnet | 200K | ~150K words |
```python
# Practical context window test with Ollama
# This code prevents a common mistake: sending documents larger than the context window
import sys

import ollama


def estimate_tokens(text: str) -> int:
    """
    Rough token estimation: English text averages about 4 chars per token.
    This is a heuristic; use a real tokenizer (e.g. tiktoken for OpenAI models) for exact counts.

    Why this function exists: before sending a large document, you want to know
    if it will fit in the model's context window. Going over the limit means the
    request fails or the input is silently truncated.
    """
    return len(text) // 4


# Real-world scenario: user uploads a 50KB PDF, you extract text, want to analyze
with open("/path/to/long_document.txt", encoding="utf-8") as f:
    long_document = f.read()

estimated_tokens = estimate_tokens(long_document)
print(f"📄 Document size: ~{estimated_tokens:,} tokens (~{estimated_tokens // 300} pages)")

# Decision tree: choose model based on document size
if estimated_tokens < 8_000:
    # Small document: any model works fine
    print("✓ Fits in Qwen3 8B (32K context)")
    model = "qwen3:8b"
elif estimated_tokens < 32_000:
    # Medium document: need at least 32K context
    print("✓ Fits in Qwen3 14B (40K context) or larger")
    model = "qwen3:14b"
elif estimated_tokens < 128_000:
    # Large document: need 128K+ context
    print("✓ Fits in Gemma3 27B (128K context)")
    model = "gemma3:27b"
else:
    # Huge document: must use RAG or split into chunks
    print("❌ Document too large! Either:")
    print("   1) Use RAG: retrieve only relevant chunks, not entire document")
    print("   2) Split into chapters and process separately")
    print("   3) Use Llama 4 Scout (10M context, but impractical locally)")
    sys.exit(1)

# Send the document with a question
try:
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Answer questions about the provided document. Be concise.",
            },
            {
                "role": "user",
                "content": f"Document:\n{long_document}\n\nQuestion: What are the key findings?",
            },
        ],
        # Ollama's default num_ctx is far below the model maximum, and an over-long prompt
        # may be truncated instead of rejected. The 2,000-token headroom is an assumption.
        options={"num_ctx": estimated_tokens + 2_000},
    )
    print(f"✅ Success!\nAnswer: {response['message']['content']}")
except Exception as e:
    # Common error: "context length exceeded"
    if "context" in str(e).lower():
        print(f"❌ Context window exceeded: {e}")
        print("   → Use RAG, switch to larger model, or split document")
    else:
        print(f"❌ Error: {e}")
```
## Part 3: Temperature and Sampling
Temperature Visual: How Randomness Works
After "The old lighthouse keeper", the model computes next-token probabilities:
probability
↑
│ 45% ┌────────────┐
│     │ 'had'      │
│     └────────────┘
│ 30% ┌────────────┐
│     │ 'watched'  │
│     └────────────┘
│ 15% ┌────────────┐
│     │ 'stood'    │
│     └────────────┘
│ 10% ┌────────────┐
│     │'maintained'│
│     └────────────┘
└──────────────────→ tokens
Temperature rescales this distribution before sampling. It changes how peaked the distribution is, never the ranking of the tokens:
Temperature = 0.0 → Always "had" (greedy and deterministic, ideal for tests)
Temperature = 0.3 → Sharpened: almost always "had", occasionally "watched" (consistent)
Temperature = 0.7 → Mildly sharpened: a mix of "had", "watched", "stood" (balanced)
Temperature = 1.0 → Unmodified probabilities: 45% "had", 30% "watched", etc.
Temperature = 1.5 → Flattened: unlikely tokens are sampled much more often (incoherent over long outputs)
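The effect of temperature on the four example probabilities can be computed directly. This is a minimal sketch of the standard formulation, dividing log-probabilities by the temperature and renormalizing (equivalent to softmax over logits divided by T); treating temperature 0 as greedy selection is a common convention assumed here, and `apply_temperature` is an illustrative helper, not part of any library.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i ** (1 / T), then renormalize."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top token
        top = max(probs, key=probs.get)
        return {tok: (1.0 if tok == top else 0.0) for tok in probs}
    scaled = {tok: math.exp(math.log(p) / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: value / total for tok, value in scaled.items()}


next_token_probs = {"had": 0.45, "watched": 0.30, "stood": 0.15, "maintained": 0.10}
for t in [0.0, 0.3, 0.7, 1.0, 1.5]:
    dist = apply_temperature(next_token_probs, t)
    print(t, {tok: round(p, 3) for tok, p in dist.items()})
```

Low temperatures push mass toward "had"; temperatures above 1 move mass toward the tail, but "had" stays the single most likely token at every setting. With a real vocabulary of many thousands of tokens, that tail holds a lot of poor continuations, which is why high temperatures derail long outputs.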
Temperature in Action
```python
import ollama

prompt = "Complete this sentence creatively: The old lighthouse keeper"
print("=== TEMPERATURE IMPACT ON OUTPUT ===\n")

# Run the same prompt at different temperatures
for temp in [0.0, 0.3, 0.7, 1.0, 1.5]:
    print(f"Temperature {temp}:")
    response = ollama.chat(
        model="qwen3:14b",
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": temp},
    )
    # Truncate output to first 120 chars for readability
    output = response["message"]["content"][:120]
    print(f"  → {output}...\n")

# Key insight: what's actually happening inside the model
print("\n=== WHY TEMPERATURE MATTERS ===\n")
print("At each step, the model computes probabilities for the next token:")
print("  'The old lighthouse keeper' → Next token probabilities:")
print("    - 'had' (45% probability)")
print("    - 'watched' (30%)")
print("    - 'stood' (15%)")
print("    - 'maintained' (10%)")
print("\ntemperature=0.0 → Always pick highest: 'had' (deterministic, reproducible)")
print("temperature=0.7 → Biased toward highest, with some randomness: usually 'had', sometimes 'watched'")
print("temperature=1.0 → Unmodified: 'had' 45%, 'watched' 30%, etc. (true probabilities)")
print("temperature=1.5 → Flattened: lower-probability tokens become more likely (chaotic, unpredictable)")
print("\nProduction implication:")
print("  - Tests/QA: temperature=0 (repeatable, deterministic)")
print("  - API responses: temperature=0.7 (variety but coherent)")
print("  - Brainstorming: temperature=1.0 (creative but still sensible)")
print("  - Avoid > 1.2: output tends to become incoherent gibberish")
```
**Expected output** (illustrative; sampled text differs between runs and models):
```
temp=0.0: …had watched the same stretch of coastline for forty years, never once seeing a ship go down.
temp=0.0: …had watched the same stretch of coastline for forty years, never once seeing a ship go down. [identical on a second run]
temp=0.3: …had kept his lonely vigil for three decades, his weathered face as familiar to passing sailors as the light itself.
temp=0.7: …hadn't slept in three days, convinced that the fog had begun to whisper his name.
temp=1.0: …collected storm bottles — not driftwood or sea glass, but the bottles the drowned let go, still sealed tight against the salt.
temp=1.5: …trembled lighthouse light scatter past the waves inward consuming solitude ancient lantern soul salt-worn gull-screamed vigil… [incoherent at 1.5]
```
**When to use each temperature:**
- `0.0` — Code, SQL, structured output, factual Q&A (deterministic, reproducible)
- `0.3` — Technical explanations, summaries (consistent but not repetitive)
- `0.7` — General chat, documentation, moderate creativity
- `1.0` — Creative writing, brainstorming, story generation
- `> 1.2` — Avoid. Quality degrades rapidly.
---
## Part 4: Open-Weight vs API-Locked — The Sovereignty Divide
┌───────────────────────────────┬──────────────────────────────────────┐
│ OPEN-WEIGHT MODELS            │ API-LOCKED MODELS                    │
│ (Sovereign)                   │ (Cloud-dependent)                    │
├───────────────────────────────┼──────────────────────────────────────┤
│ Weights: Downloadable         │ Weights: Proprietary, hidden         │
│ Examples: Qwen3, Llama 4,     │ Examples: GPT-4o, Claude, Gemini     │
│           Gemma3, Mistral     │                                      │
├───────────────────────────────┼──────────────────────────────────────┤
│ Your prompt: Stays local      │ Your prompt: Sent to external API    │
│ Your data: Never leaves       │ Your data: Processed on cloud        │
│            your machine       │            servers                   │
├───────────────────────────────┼──────────────────────────────────────┤
│ Cost: Hardware (one-time)     │ Cost: Per token (ongoing)            │
│ Control: Complete             │ Control: None (API changes)          │
│ Availability: Always (local)  │ Availability: Dependent on vendor    │
└───────────────────────────────┴──────────────────────────────────────┘
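```python
# Verification sketch: look for non-local TCP connections while a local inference runs
import subprocess
import threading
import time

import ollama

external = []


def monitor():
    # Poll established TCP connections for about 5 seconds (Linux `ss`;
    # the -p flag may need root to show processes owned by other users)
    for _ in range(10):
        r = subprocess.run(['ss', '-tnp', 'state', 'established'], capture_output=True, text=True)
        for line in r.stdout.splitlines():
            # Watch this Python client and the Ollama server process that does the inference.
            # '172.' is a crude match for Docker bridge addresses and can hide real public IPs.
            if ('python' in line or 'ollama' in line) and not any(
                local in line for local in ['127.0.0.1', '::1', '172.']
            ):
                external.append(line)
        time.sleep(0.5)


t = threading.Thread(target=monitor, daemon=True)
t.start()
ollama.chat(model="qwen3:14b", messages=[{"role": "user", "content": "What is 2+2?"}])
t.join(timeout=6)
print("External connections during inference:", external if external else "None: your data is sovereign ✓")
```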
## Part 5: Key Inference Parameters
```python
# All major inference parameters with explanations
import ollama

response = ollama.chat(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
    options={
        "temperature": 0.7,     # 0=deterministic, 1=creative, >1.2=chaotic
        "top_p": 0.9,           # Nucleus sampling: consider only top 90% probability mass
        "top_k": 40,            # Consider only top 40 tokens at each step
        "repeat_penalty": 1.1,  # Penalise repeated phrases (1.0=off, 1.3=strong)
        "num_predict": 500,     # Max tokens to generate (-1 = unlimited)
        "seed": 42,             # Fixed seed for reproducibility (temperature=0 also reproducible)
        "num_ctx": 4096,        # Context window size for this call (up to model maximum)
    },
)
```
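How `top_k` and `top_p` narrow the candidate set can be shown on a small distribution. This is a simplified sketch, not Ollama's implementation: the helper name is made up, and applying top-k before top-p is an assumption about ordering.

```python
def top_k_top_p_filter(probs, top_k=40, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:  # nucleus reached: drop the remaining tail
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}  # renormalize the survivors


probs = {"had": 0.45, "watched": 0.30, "stood": 0.15, "maintained": 0.10}
print(top_k_top_p_filter(probs, top_k=40, top_p=0.8))  # tail token 'maintained' is dropped
print(top_k_top_p_filter(probs, top_k=2, top_p=1.0))   # only the two most likely tokens remain
```

Sampling then happens over the surviving tokens only, which is how these settings cut off the low-probability tail that high temperatures would otherwise expose.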
Token Efficiency Comparison: Why Some Models Use Fewer Tokens
| Model | "Hello, World!" | "PostgreSQL" | "The quick brown fox…" (9 words) | "def calc(n): return n" | Efficiency |
|---|---|---|---|---|---|
| GPT-4 / GPT-4o | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Qwen3 14B | 4 tokens | 2 tokens | 9 tokens | ~7 tokens | 12% better |
| Llama 3.1 70B | 4 tokens | 3 tokens | 11 tokens | ~9 tokens | 5% worse |
| Mistral 12B | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Phi-4 Q8 | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Cost at $0.01/1K tokens | $0.00004 | $0.00002 | $0.0001 | $0.00008 | 1 year = $7-$30 |

Key insight: A 12% token-efficiency improvement (Qwen3 vs GPT-4) saves roughly $30-50/year per production agent handling 1,000 queries/day, assuming about 100 tokens per query at $0.01/1K tokens. Across 10 agents that is $300-500/year, and local inference also removes the network round-trip to an API.
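The $30-50/year figure can be reproduced with a few lines of arithmetic. The query volume, the $0.01/1K price and the 12% figure come from the text; the 100 tokens per query is an assumption chosen here for illustration, since the article does not state it.

```python
def annual_token_cost(queries_per_day, tokens_per_query, price_per_1k_tokens):
    """Yearly spend in dollars for a steady daily query volume."""
    return queries_per_day * 365 * tokens_per_query / 1000 * price_per_1k_tokens


# 1,000 queries/day at an assumed 100 tokens per query and $0.01 per 1K tokens
baseline = annual_token_cost(1_000, 100, 0.01)
savings = baseline * 0.12  # 12% fewer tokens for the same text

print(f"Baseline: ${baseline:.2f}/year")
print(f"Savings at 12% fewer tokens: ${savings:.2f}/year per agent")
print(f"Across 10 agents: ${savings * 10:.2f}/year")
```

Savings scale linearly with tokens per query, so an agent that sends 1,000-token prompts would save ten times as much.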
## Part 6: Fine-Tuning Impact on Tokens & Inference Efficiency
Fine-tuning does not change the tokenizer: the same text maps to the same token IDs before and after training, so "SEC" or a complex legal phrase costs the same number of tokens either way. What changes is how much text you need to send and receive. A model fine-tuned on legal documents has already learned the domain's abbreviations, conventions, and expected output format, so prompts can drop instructions and examples that a base model would need, and responses tend to be shorter.
Fine-Tuning Token Economy
Base Model (Llama 3.1 70B):
"What are the key compliance requirements?" plus the guidance the base model needs
→ 10 tokens per query (illustrative)
Fine-Tuned on Compliance Docs:
"What are the key compliance requirements?" with less guidance
→ 9 tokens per query (illustrative 10% reduction)
Annual savings (1M queries):
Base: 1,000,000 queries × 10 tokens = 10M tokens = $100 at $0.01/1K
Fine-tuned: 1,000,000 queries × 9 tokens = 9M tokens = $90
Savings: $10/year per 1M queries in your domain
→ 10 fine-tuned agents = $100/year saved in token costs
Critical insight: Fine-tuning doesn't just improve accuracy. It can also reduce inference cost, on the order of 5-15%, by shortening prompts and responses; the tokenizer itself is unchanged. This compounds across millions of queries.
## Part 7: Quantization & Token Efficiency in Local Inference
Quantization (compressing model weights from FP32 to INT8 or INT4) doesn't change tokenization: the same text still maps to the same token IDs. However, quantization does affect inference speed and memory usage:
| Quantization | Weight Size (70B params) | Speed | Accuracy Loss | Token Throughput |
|---|---|---|---|---|
| FP32 (full precision) | 280GB | Baseline (1x) | 0% | ~10 tok/sec |
| FP16 (half precision) | 140GB | 2x faster | <1% | ~20 tok/sec |
| Q8_0 (8-bit) | 70GB | 4x faster | ~2% | ~40 tok/sec |
| Q4_K_M (4-bit, optimal) | ~40GB | 8x faster | ~3% | ~80 tok/sec |
| Q3_K_S (3-bit, aggressive) | ~30GB | 12x faster | ~8% | ~120 tok/sec |

Production implication: Q4_K_M quantization of Qwen3 14B runs on consumer GPUs (RTX 3060 12GB) while maintaining near-baseline quality. Using the throughput figures above, generating 10 tokens takes about 125ms at 80 tok/sec versus about 1 second at the FP32 baseline of 10 tok/sec, roughly 8x faster on the same hardware.
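Weight size follows directly from parameter count and bits per weight, and generation time from token count and throughput. The sketch below assumes pure N-bit storage with no overhead; K-quant formats such as Q4_K_M mix precisions and average somewhat more than 4 bits per weight, so real files come out larger than the 4-bit line suggests.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def generation_time_s(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady throughput."""
    return num_tokens / tokens_per_second


for label, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"70B at {label}: {weight_size_gb(70e9, bits):.0f} GB")

print(f"14B at 4-bit: {weight_size_gb(14e9, 4):.0f} GB")
print(f"10 tokens at 80 tok/sec: {generation_time_s(10, 80) * 1000:.0f} ms")
```

The 14B line shows why a 4-bit 14B model fits on a 12GB card: the weights take about 7GB, leaving room for the KV cache and activations.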
When to Quantize vs Keep Full Precision
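```python
# Decision tree for quantization strategy (thresholds are rules of thumb)
def choose_quantization(latency_requirement_ms: float, accuracy_critical: bool,
                        budget_available: bool, running_on_edge_device: bool) -> str:
    if latency_requirement_ms < 200:
        # Real-time applications (chat, RAG retrieval)
        return "Q4_K_M"        # ~80 tok/sec throughput
    elif accuracy_critical and budget_available:
        # Medical, legal, financial reasoning
        return "FP16 or Q8_0"  # up to ~2% accuracy loss acceptable
    elif running_on_edge_device:
        # Raspberry Pi, Jetson Nano, phone: smallest footprint, pick a model small enough to fit
        return "Q3_K_S"        # ~120 tok/sec
    else:
        # Batch processing, can tolerate latency
        return "FP32"          # Maximum quality, slowest


print(choose_quantization(150, accuracy_critical=False, budget_available=False,
                          running_on_edge_device=False))
```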
## Part 8: Retrieval-Augmented Generation (RAG) & Context Window Strategy
RAG systems retrieve relevant documents and inject them into the context window before querying the LLM. This changes your context window economics dramatically:
Simple Chat (No RAG):
System prompt: 50 tokens
Conversation history: 500 tokens (10 turns)
User query: 20 tokens
Available for generation: 3,430 tokens (on 4K window)
Ideal for: 1-2 turn conversations, lightweight queries
RAG-Enhanced Chat:
System prompt: 50 tokens
RAG context (10 documents × 400 tokens): 4,000 tokens ❌ EXCEEDS 4K WINDOW
Conversation history: 300 tokens (5 turns, truncated)
User query: 20 tokens
Required: 4,370 tokens on a 4K window = CONTEXT OVERFLOW
Solution: Use 8K or 128K context window
With 128K window: 50 + 4,000 + 300 + 20 = 4,370 tokens (3.4% utilization)
Available for generation: 123,630 tokens
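Cost implication:
API pricing is per token actually sent and generated, not per window size
A 4,370-token request costs the same on a 128K model as on an 8K model with the same per-token rate
Locally, a larger context window (num_ctx) costs extra memory for the KV cache and slows prompt processing
=> Size the window to what you actually inject: RAG needs room for the retrieved chunks, not the maximum the model supports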
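The budget arithmetic above is easy to automate before each request. A minimal sketch, with `context_budget` as an illustrative helper name and "4K" taken as 4,000 tokens to match the figures in the text:

```python
def context_budget(window, system, retrieved, history, query):
    """Summarize how a request's components fit into a context window."""
    used = system + retrieved + history + query
    return {
        "used": used,
        "fits": used <= window,
        "remaining": window - used,      # tokens left for the model's response
        "utilization": used / window,
    }


rag_request = dict(system=50, retrieved=10 * 400, history=300, query=20)

small = context_budget(4_000, **rag_request)
large = context_budget(128_000, **rag_request)
print(small)  # does not fit: 4,370 tokens needed
print(f"128K window: {large['used']:,} used, {large['remaining']:,} left, "
      f"{large['utilization']:.1%} utilization")
```

Running this check before sending lets you shrink the retrieved context or the history instead of discovering an overflow at inference time.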
Production RAG best practices:
- Chunk size: 300-500 tokens per document (balances context coverage with window size)
- Retrieval strategy: Hybrid BM25 + semantic search to find top-3 relevant chunks (900-1,500 tokens total at 300-500 tokens per chunk)
- Context compression: Use extractive summarization to reduce RAG context by 30-50% before injection
- Long-context models: Prefer models with 32K+ windows for RAG (Llama 4 Scout has 10M tokens—overkill for RAG but future-proof)
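A chunker that respects the 300-500 token guideline can be sketched with a word-based estimate. This is an approximation under stated assumptions: it uses the 1.3 tokens-per-word average from the Key Takeaways rather than a real tokenizer, splits on whitespace only, and ignores sentence or paragraph boundaries that a production chunker would respect.

```python
def chunk_text(text, max_tokens=400, tokens_per_word=1.3):
    """Split text into chunks of roughly max_tokens, estimated from word count."""
    words = text.split()
    words_per_chunk = max(1, int(max_tokens / tokens_per_word))
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


document = " ".join(f"word{i}" for i in range(1000))  # stand-in for a real document
chunks = chunk_text(document, max_tokens=400)
print(len(chunks), [len(c.split()) for c in chunks])
```

Retrieving the top 3 of these chunks injects about 1,200 estimated tokens, which fits comfortably in even a small context window.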
## Part 9: Real-World Token Patterns & Monitoring
Understanding actual token consumption in production requires instrumenting your LLM calls:
```python
# Production token monitoring: track where your tokens (and money) are going
from collections import defaultdict

import ollama

# Example rates in dollars per 1K tokens. Adjust to your model's pricing
# (local Ollama has no per-token fee; OpenAI pricing varies by model).
INPUT_PRICE_PER_1K = 0.0001
OUTPUT_PRICE_PER_1K = 0.0003


class TokenMonitor:
    """
    Instrument your LLM calls to understand token consumption in production.
    Why this matters: A 10% token reduction across your fleet saves $1000s/year.
    Without monitoring, you'll never know if your prompts are bloated.
    """

    def __init__(self):
        # Track per-model metrics: tokens_in, tokens_out, call count, error count
        self.metrics = defaultdict(lambda: {"tokens_in": 0, "tokens_out": 0, "calls": 0, "errors": 0})

    def log_inference(self, model: str, prompt_tokens: int, completion_tokens: int, error: bool = False):
        """
        Log token usage per model/endpoint after each LLM call.

        Args:
            model: Model identifier (e.g., "qwen3:14b", "gpt-4o-api")
            prompt_tokens: Tokens in the input/context (counted against context window)
            completion_tokens: Tokens in the model's response (usually priced higher than input)
            error: Set to True if this call failed, which helps identify retries and failures

        Real-world example:
            If prompt_tokens=800 and completion_tokens=200, most of the spend is input.
            Investigate what's in those 800 tokens (system prompt, history, RAG docs).
        """
        self.metrics[model]["tokens_in"] += prompt_tokens
        self.metrics[model]["tokens_out"] += completion_tokens
        self.metrics[model]["calls"] += 1
        if error:
            self.metrics[model]["errors"] += 1

    def report(self):
        """
        Generate production insights: average tokens per call, total cost, error rate.
        Run this daily to catch token creep (gradual increase over time).
        """
        for model, data in self.metrics.items():
            avg_in = data["tokens_in"] / data["calls"] if data["calls"] else 0
            avg_out = data["tokens_out"] / data["calls"] if data["calls"] else 0
            error_rate = (data["errors"] / data["calls"] * 100) if data["calls"] else 0
            total_cost = (data["tokens_in"] * INPUT_PRICE_PER_1K
                          + data["tokens_out"] * OUTPUT_PRICE_PER_1K) / 1000
            print(f"\n📊 {model}:")
            print(f"  Average input tokens: {avg_in:.0f} (how much context you're using)")
            print(f"  Average output tokens: {avg_out:.0f} (how long the response is)")
            print(f"  Total cost so far: ${total_cost:.4f}")
            print(f"  Total API calls: {data['calls']}")
            print(f"  Error rate: {error_rate:.1f}% (retries, failed calls)")
            # Developer advice: if avg_in > 2000, you're sending too much context
            if avg_in > 2000:
                print("  ⚠️ Context bloat detected! Avg input > 2000 tokens.")
                print("     Reduce conversation history or enable vector DB retrieval.")


# Usage in your agent: wrap every LLM call with token tracking
monitor = TokenMonitor()
messages = [{"role": "user", "content": "Explain quantum entanglement"}]  # placeholder conversation
prompt_tokens = len(str(messages)) // 4  # Rough estimate, computed first so the except branch can use it
try:
    response = ollama.chat(model="qwen3:14b", messages=messages)
    # Rough estimate; Ollama's response also reports prompt_eval_count and eval_count
    completion_tokens = len(response["message"]["content"]) // 4
    monitor.log_inference("qwen3:14b", prompt_tokens, completion_tokens)
except Exception:
    monitor.log_inference("qwen3:14b", prompt_tokens, 0, error=True)
    raise

# Check metrics daily to spot token creep
monitor.report()
```
Common Token Consumption Patterns
| Use Case | Avg Input Tokens | Avg Output Tokens | Cost/Query (at $0.01 input / $0.03 output per 1K tokens) |
|---|---|---|---|
| Simple Q&A | 50 | 100 | $0.004 |
| Code generation | 200 | 300 | $0.011 |
| RAG retrieval + reasoning | 800 | 150 | $0.012 |
| Multi-turn conversation (5 turns) | 500 | 100 | $0.008 |
| Document summarization (10K chars) | 2,500 | 500 | $0.040 |
| Agentic loop (3 tool calls) | 600 | 400 | $0.018 |
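The per-query costs in the table come from a simple two-rate formula. A small sketch for checking your own workloads, using the table's example rates as defaults (the function name is illustrative):

```python
def cost_per_query(input_tokens, output_tokens,
                   input_price_per_1k=0.01, output_price_per_1k=0.03):
    """Dollar cost of one call, with separate rates for input and output tokens."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


workloads = {
    "Simple Q&A": (50, 100),
    "RAG retrieval + reasoning": (800, 150),
    "Document summarization": (2_500, 500),
}
for name, (tokens_in, tokens_out) in workloads.items():
    print(f"{name}: ${cost_per_query(tokens_in, tokens_out):.4f}")
```

Because output tokens are priced at three times the input rate here, a verbose response can cost more than a long prompt; capping response length is often the cheapest optimization.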
## Part 10: Troubleshooting & Common Token Mistakes
Quick Troubleshooting Decision Tree
Problem: "LLM isn't working as expected"
↓
Is the output nonsense/incoherent?
├─ Yes → Is temperature > 1.2?
│ ├─ Yes → Reduce to 0.7 (too much randomness)
│ └─ No → Check context window (too much input?)
└─ No → Output is coherent but wrong/inconsistent?
├─ Same prompt, different answers? → Set temperature=0
└─ Always wrong answer? → Check prompt quality (unclear instructions)
Problem: "API bill is way too high"
↓
Run: print(average_input_tokens)
├─ > 2,000? → Context bloat (too much conversation history)
├─ 500-2,000? → Acceptable (check RAG chunks)
└─ < 500? → Check output token waste (model talking too much?)
Problem: "Local inference is very slow"
↓
Check token throughput: time curl... | jq ...
├─ < 10 tok/sec? → Model too large for available VRAM (spilling to CPU) or GPU not being used
│ ├─ Check: nvidia-smi (is GPU in use?)
│ └─ Switch to Q4_K_M (balanced quality + speed)
├─ 10-40 tok/sec? → Acceptable (maybe upgrade GPU/CPU)
└─ > 40 tok/sec? → Great (you're fine)
Common Mistakes & Fixes
| Mistake | Symptom | Fix |
|---|---|---|
| temperature > 1.2 | Output is gibberish/hallucinations | Set to 0.7 or lower |
| Huge context window | Slow inference, high token count | Truncate old conversation turns (keep recent 3-5 turns only) |
| No RAG, pure prompt | Answer goes out-of-date | Add vector DB retrieval for current info |
| temperature=0 always | Boring, repetitive responses | Use 0.7 for user-facing chat |
| Context limit exceeded | API error: "context_length_exceeded" | Use a larger-context model, or enable RAG + chunk retrieval to send less |
| Wrong token count estimate | Budget exceeded unexpectedly | Count with the model's real tokenizer (e.g. tiktoken for OpenAI models), not the character/4 estimate |
“My API bill is way too high — where are the tokens going?”
Debug checklist:
- Check context window inflation: Log `input_tokens` per request. If average input > 2,000, you have context bloat.
  ```python
  if prompt_tokens > 2000:
      print("WARNING: Large context detected. Review conversation history truncation.")
  ```
- Identify token-heavy operations: Summarization and RAG are the biggest token consumers (800+ input tokens each).
- Reduce conversation history: Instead of keeping entire conversation in context, store in vector DB and retrieve only relevant turns:
  ```python
  # Bad: Send entire 10-turn conversation
  context = all_previous_turns  # ~500 tokens
  # Good: Retrieve similar past turns
  relevant_turns = vector_db.search(current_query, top_k=2)  # ~200 tokens
  ```
- Use token budgets: Set hard limits per request:
  ```python
  max_input_tokens = 1000
  # Prune if exceeded (rough estimate: ~75 tokens per message)
  if len(messages) * 75 > max_input_tokens:
      prune_old_messages()
  ```
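One way the pruning step could look is a helper that keeps the system prompt and as many of the most recent turns as fit the budget. This is a sketch under assumptions: `trim_history` is a hypothetical name, messages use the role/content dictionaries seen elsewhere in this article, and token counts use the rough characters/4 estimate.

```python
def estimate_tokens(message):
    """Rough estimate: about 4 characters per token."""
    return len(message["content"]) // 4


def trim_history(messages, max_input_tokens):
    """Keep system messages plus the newest turns that fit within the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_input_tokens - sum(estimate_tokens(m) for m in system)
    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break                    # everything older is dropped
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))


history = [
    {"role": "system", "content": "s" * 40},      # ~10 tokens
    {"role": "user", "content": "a" * 400},       # ~100 tokens (oldest)
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 400},       # ~100 tokens (newest)
]
trimmed = trim_history(history, max_input_tokens=250)
print([m["role"] for m in trimmed])  # ['system', 'assistant', 'user']
```

Dropping whole turns from the oldest end preserves the system prompt and the latest exchange, which usually matter most for the next response.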
“My local Ollama inference is slow — how do I measure token throughput?”
```bash
# Measure tokens-per-second (eval_duration is reported in nanoseconds)
time curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "Write a Python function to calculate Fibonacci numbers",
  "stream": false
}' | jq '.eval_count / .eval_duration * 1e9'
# Compare across quantizations:
# FP16: ~20 tok/sec
# Q8_0: ~40 tok/sec
# Q4_K_M: ~80 tok/sec
```
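The same throughput calculation in Python, for use inside monitoring code. The field names `eval_count` and `eval_duration` are the ones read by the `jq` expression above; the sample values are made up for illustration.

```python
def tokens_per_second(eval_count, eval_duration_ns):
    """Generation throughput from a token count and a duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)


# Hypothetical values: 160 generated tokens in 2 seconds
sample = {"eval_count": 160, "eval_duration": 2_000_000_000}
rate = tokens_per_second(sample["eval_count"], sample["eval_duration"])
print(f"{rate:.1f} tok/sec")
```

This measures generation speed only; time spent loading the model or processing the prompt is reported in separate fields and is not included.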
Conclusion
LLMs are statistical next-token predictors that have compressed world knowledge into billions of numerical weights. Understanding tokens (the input unit), context windows (the working memory), temperature (randomness control), and the open-weight/API divide (the sovereignty question) gives you the mental model to use them effectively. The practical takeaway: for any application where input data is sensitive, open-weight models running locally via Ollama are the strong default, because no prompt data is transmitted externally and output quality is competitive for many tasks.
See Best Open-Weight AI Models 2026 for model selection, and LangChain Local Inference for advanced prompt techniques.
People Also Ask
What is the difference between parameters and tokens?
Parameters are the trainable numerical weights inside the model — a “14B model” has 14 billion floating-point numbers that were learned during training. Tokens are the discrete units of input and output text — the model processes and generates sequences of tokens. Parameters are fixed after training (unless you fine-tune). Tokens vary with every input: a short prompt uses few tokens; a long document uses many. More parameters generally means the model can represent more complex patterns; more context window tokens means the model can consider more input at once.
Why does a model sometimes give different answers to the same question?
Because temperature > 0 introduces randomness in token selection. At each step, the model produces a probability distribution over all possible next tokens, and temperature controls how randomly it samples from that distribution. At temperature 0, it always picks the highest-probability token (deterministic). At temperature 0.7, it sometimes picks lower-probability tokens, producing varied outputs. Run the same query with temperature=0 to get consistent, reproducible answers.
Further Reading
- Best Open-Weight AI Models 2026 — apply this knowledge to choose models
- Prompt Engineering Guide 2026 — use temperature and context windows effectively
- On-Device AI Inference 2026 — hardware for running these models locally
- RAG Tutorial 2026 — augment context windows with retrieved knowledge
Last verified: April 29, 2026.