Vucense

How Large Language Models Work 2026: Tokens, Context & Inference

Master how LLMs generate text: from tokenization to attention to context windows. Learn why Qwen uses fewer tokens than GPT-4, what temperature controls, and how open-weight models differ from API-locked ones.

Anju Kushwaha

Founder & Editorial Director


Key Takeaways

  • Tokens, not words: LLMs process token sequences. 1 English word ≈ 1.3 tokens on average. Code is more token-dense. Rare words split into multiple tokens.
  • Context window = working memory: The model can only “see” what’s in the context window. Old conversation turns get dropped when the window fills.
  • Temperature = creativity dial: 0 = deterministic. 0.7 = balanced. 1.0 = creative. > 1.5 = gibberish.
  • Open-weight vs API-locked: Downloadable weights let you run inference on your own hardware, so your data stays local. API-only = your data leaves your machine.

Introduction

Direct Answer: How do large language models work in 2026?

A large language model is a neural network trained to predict the next token in a sequence. “Training” means adjusting billions of numerical weights so the model becomes better at this prediction task across a massive corpus of text. At inference time, you provide a sequence of tokens (your prompt), and the model produces a probability distribution over all possible next tokens, picks one (based on temperature), appends it to the sequence, and repeats until it generates a complete response or hits a stop token. The quality of responses comes from the model having compressed patterns from trillions of training tokens into its weights — patterns about language, facts, reasoning, and code. Context windows define how many tokens fit in a single inference call. Modern models range from 8K tokens (small, fast) to 10M tokens (Llama 4 Scout) — larger windows enable more sophisticated reasoning over longer documents. Open-weight models like Qwen3 14B and Llama 4 Scout run this inference process entirely on your local hardware; API models run it on cloud servers.
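
The predict, pick, append, repeat loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_MODEL` is a hand-written lookup table that only looks at the last token, whereas a real LLM computes the distribution from the whole context with a neural network.

```python
import random

# Hypothetical toy "model": maps the last token to a probability
# distribution over possible next tokens.
TOY_MODEL = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<stop>": 1.0},
    "ran": {"<stop>": 1.0},
}

def generate(max_tokens: int = 10, greedy: bool = True, seed: int = 0) -> list[str]:
    """Autoregressive generation: predict, pick, append, repeat."""
    rng = random.Random(seed)
    sequence = ["<start>"]
    for _ in range(max_tokens):
        dist = TOY_MODEL[sequence[-1]]       # 1. distribution over next tokens
        if greedy:
            token = max(dist, key=dist.get)  # 2a. temperature 0: most likely token
        else:
            tokens, probs = zip(*dist.items())
            token = rng.choices(tokens, weights=probs)[0]  # 2b. sample by probability
        if token == "<stop>":                # 3. a stop token ends generation
            break
        sequence.append(token)               # 4. append and go around again
    return sequence[1:]                      # drop the <start> marker

print(generate())              # greedy decoding
print(generate(greedy=False))  # sampled decoding
```

Greedy decoding always follows the highest-probability path, which is why it is repeatable; the sampled variant can take the lower-probability branches.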


Part 1: Tokens — The Building Block

How Tokenization Works (Visual Flow)

┌─────────────────────────────────────────────────────────────┐
│  INPUT TEXT: "Hello, world! I'm learning about LLMs"       │
└─────────────────────────────────────────────────────────────┘

                  ┌─────────────────────┐
                  │  TOKENIZER (GPT-4)  │
                  └─────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  TOKENS (IDs → Readable):                                   │
│  [9906] → "Hello"        (common word = 1 token)            │
│  [11]   → ","            (punctuation = 1 token)            │
│  [1917] → " world"       (space + word = 1 token)           │
│  [0]    → "!"            (punctuation = 1 token)            │
│  [358]  → " I"           (space + word = 1 token)           │
│  [1866] → "'m"           (contraction suffix = 1 token)     │
│  [4500] → " learning"    (verb = 1 token)                   │
│  [220]  → " about"       (preposition = 1 token)            │
│  [43826] → " LLMs"       (technical acronym = 1 token)      │
│  (IDs are illustrative; exact values vary by tokenizer)     │
│  TOTAL: 9 tokens for 37 characters                          │
│  Cost at $0.01/1K tokens: $0.00009 per message             │
└─────────────────────────────────────────────────────────────┘

Key insight: Token count is irregular. Common words are 1 token; rare words split into 2-3. This is why you can’t predict tokens from character count.


Tokenization in Action

Why tokenization matters: Your API bill is charged per token, not per character.

A “smart” prompt that uses 800 tokens costs 8x as much as a 100-token prompt.

pip install tiktoken

import tiktoken

# GPT-4 tokenizer; other model families (Qwen, Llama, Mistral) ship their own vocabularies
enc = tiktoken.get_encoding("cl100k_base")

# Test tokenization on real-world examples
examples = [
    "Hello, world!",                                 # Common greeting: 4 tokens with cl100k_base
    "unbelievable",                                  # Less common word: splits into sub-word tokens
    "PostgreSQL",                                    # Database name: technical terms split differently
    "The quick brown fox jumps over the lazy dog.",  # Full sentence
    "def calculate_fibonacci(n: int) -> int:",       # Code example
]

for text in examples:
    tokens = enc.encode(text)                    # Convert text → list of token IDs
    decoded = [enc.decode([t]) for t in tokens]  # Convert each ID back → token string

    print(f"\nInput: '{text}'")
    print(f"  Token IDs: {tokens}")
    print(f"  Decoded tokens: {decoded}")
    print(f"  Count: {len(tokens)} tokens for {len(text)} characters")
    print(f"  Efficiency: {len(text) / len(tokens):.1f} chars/token (higher is better for cost)")

**Expected output:**

Input: 'Hello, world!'
  Token IDs: [9906, 11, 1917, 0]
  Decoded tokens: ['Hello', ',', ' world', '!']
  Count: 4 tokens for 13 characters
  Efficiency: 3.2 chars/token

Input: 'unbelievable'
  Token IDs: [359, 43237, 481]
  Decoded tokens: ['un', 'believ', 'able']
  Count: 3 tokens for 12 characters
  Efficiency: 4.0 chars/token

Input: 'PostgreSQL'
  Token IDs: [6021, …]
  Decoded tokens: ['Post', 'greSQL']
  Count: 2 tokens for 10 characters
  Efficiency: 5.0 chars/token

Input: 'The quick brown fox jumps over the lazy dog.'
  Token IDs: [791, 4996, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]
  Decoded tokens: ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']
  Count: 10 tokens for 44 characters
  Efficiency: 4.4 chars/token

Input: 'def calculate_fibonacci(n: int) -> int:'
  Token IDs: [755, 11294, 43326, 1471, 25, 528, 8, 1492, 528, 25]
  Decoded tokens: ['def', ' calculate', '_fibonacci', '(n', ':', ' int', ')', ' ->', ' int', ':']
  Count: 10 tokens for 39 characters
  Efficiency: 3.9 chars/token (code packs slightly fewer characters per token than the prose sentence)


**Key insight:** "unbelievable" splits into 3 tokens because it's a less-common word. "The" is a single token because it's extremely common. Code splits at semantic boundaries (underscores, parentheses).
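
To make the sub-word splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer over a tiny, hypothetical vocabulary. It is a simplification: real BPE tokenizers such as `cl100k_base` apply learned merge rules over bytes, so their splits will differ, but the effect is similar in spirit (strings that are in the vocabulary become one token, everything else is assembled from smaller pieces).

```python
# Hypothetical mini-vocabulary; real vocabularies hold on the order of 100K entries.
VOCAB = {"The", " the", "un", "believ", "able", "Post", "gre", "SQL"}

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match: at each position take the longest known piece."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Nothing matched: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

for word in ["The", "unbelievable", "PostgreSQL", "xyz"]:
    pieces = tokenize(word, VOCAB)
    print(f"{word!r}: {len(pieces)} token(s) {pieces}")
```

Adding `"unbelievable"` to `VOCAB` would turn it into a single token, which is essentially what happens to frequent strings when a real vocabulary is trained.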

---

## Part 2: Context Window — The Model's Working Memory

The context window holds everything the model can reference in a single inference call:

CONTEXT WINDOW (e.g., 128,000 tokens = ~96,000 words)
┌─────────────────────────────────────────────────────────┐
│ System prompt (~500 tokens)                             │
│ "You are a helpful assistant…"                          │
├─────────────────────────────────────────────────────────┤
│ Retrieved documents from RAG (~10,000 tokens)           │
│ [chunk 1] [chunk 2] [chunk 3] … [chunk N]               │
├─────────────────────────────────────────────────────────┤
│ Conversation history (~5,000 tokens)                    │
│ User: … | Assistant: … | User: … | Assistant: …         │
├─────────────────────────────────────────────────────────┤
│ Current user message (~200 tokens)                      │
│ "What does the third paragraph say about security?"     │
├─────────────────────────────────────────────────────────┤
│ Model's response (being generated)                      │
│ "The third paragraph states that…"                      │
└─────────────────────────────────────────────────────────┘


**2026 context window sizes:**

| Model | Context | Approx. capacity |
|:---|:---|:---|
| Qwen3 8B | 32K | ~24K words |
| Qwen3 14B | 40K | ~30K words |
| Gemma3 27B | 128K | ~96K words |
| Llama 4 Scout | 10M | ~7.5M words (practical limit ~128K) |
| GPT-4o | 128K | ~96K words |
| Claude 3.5 Sonnet | 200K | ~150K words |

```python
# Practical context window test with Ollama
# This code prevents a common mistake: sending documents larger than context window
import ollama

def estimate_tokens(text: str) -> int:
    """
    Rough token estimation: English text averages 4 chars per token.
    This is a heuristic; use len(enc.encode(text)) with tiktoken for exact counts.
    
    Why this function exists: Before sending a large document, you want to know
    if it'll fit in the model's context window. Going over the limit means the
    request fails or, with Ollama, the prompt may be silently truncated.
    """
    return len(text) // 4

# Real-world scenario: user uploads a 50KB PDF, you extract text, want to analyze
long_document = open("/path/to/long_document.txt").read()
estimated_tokens = estimate_tokens(long_document)
print(f"📄 Document size: ~{estimated_tokens:,} tokens (~{estimated_tokens // 300} pages)")

# Decision tree: choose model based on document size.
# Thresholds leave headroom for the system prompt, the question, and the answer.
if estimated_tokens < 8_000:
    # Small document: any model works fine
    print("✓ Fits in Qwen3 8B (32K context)")
    model, num_ctx = "qwen3:8b", 32_768
    
elif estimated_tokens < 32_000:
    # Medium document: need more than 32K context once overhead is included
    print("✓ Fits in Qwen3 14B (40K context) or larger")
    model, num_ctx = "qwen3:14b", 40_960
    
elif estimated_tokens < 120_000:
    # Large document: need 128K context
    print("✓ Fits in Gemma3 27B (128K context)")
    model, num_ctx = "gemma3:27b", 131_072
    
else:
    # Huge document: must use RAG or split into chunks
    print("❌ Document too large! Either:")
    print("   1) Use RAG: retrieve only relevant chunks, not entire document")
    print("   2) Split into chapters and process separately")
    print("   3) Use Llama 4 Scout (10M context, but impractical locally)")
    raise SystemExit(1)

# Send the document with a question
try:
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Answer questions about the provided document. Be concise."
            },
            {
                "role": "user",
                "content": f"Document:\n{long_document}\n\nQuestion: What are the key findings?"
            }
        ],
        # Ollama's default num_ctx is far below the model maximum; without this
        # option a long prompt may be truncated instead of raising an error.
        options={"num_ctx": num_ctx},
    )
    print(f"✅ Success!\nAnswer: {response['message']['content']}")
    
except Exception as e:
    # Some backends raise "context length exceeded"; others truncate silently
    if "context" in str(e).lower():
        print(f"❌ Context window exceeded: {e}")
        print("   → Use RAG, switch to larger model, or split document")
    else:
        print(f"❌ Error: {e}")
```
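
When a document is too large for the chosen window, the "split and process separately" option can be sketched as follows. This is a minimal illustration rather than a production chunker: it assumes paragraphs are separated by blank lines, reuses the rough 4-characters-per-token heuristic, and `split_into_chunks` is a hypothetical helper name.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text."""
    return len(text) // 4

def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most ~max_tokens each.

    A single paragraph larger than max_tokens is kept intact as its own
    oversized chunk; splitting inside paragraphs is left out of this sketch.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and estimate_tokens(candidate) > max_tokens:
            chunks.append(current)   # close the chunk before it overflows
            current = paragraph      # start a new chunk with this paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Five paragraphs of ~100 tokens each, packed into chunks of at most 250 tokens
document = "\n\n".join(["a" * 400] * 5)
for n, chunk in enumerate(split_into_chunks(document, max_tokens=250), start=1):
    print(f"chunk {n}: ~{estimate_tokens(chunk)} tokens")
```

Each chunk can then be sent in its own request, with the per-chunk answers combined in a final pass.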

Part 3: Temperature and Sampling

Temperature Visual: How Randomness Works

After "The old lighthouse keeper", model computes next-token probabilities:

                probability

                    │ 45%  ┌────────────┐
                    │      │   'had'    │  ← T = 0.0 always picks this
                    │      └────────────┘
                    │ 30%  ┌────────────┐
                    │      │ 'watched'  │  ← T = 0.3 sometimes lands here
                    │      └────────────┘
                    │ 15%  ┌────────────┐
                    │      │  'stood'   │  ← T = 0.7 reaches this now and then
                    │      └────────────┘
                    │ 10%  ┌────────────┐
                    │      │'maintained'│  ← T = 1.0 picks this 10% of the time;
                    │      └────────────┘    T = 1.5 flattens the odds, so more often
                    └──────────────────→ tokens

Temperature = 0.0 → Always "had"   (deterministic, good for tests)
Temperature = 0.3 → Mostly "had", sometimes "watched"  (consistent)
Temperature = 0.7 → Mix: "had", "watched", "stood" (balanced)
Temperature = 1.0 → Exact probabilities: 45% "had", 30% "watched", etc.
Temperature = 1.5 → Flattened: unlikely tokens gain probability, though "had" stays the most likely (output often turns chaotic)
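
The effect of temperature on the four-token example can be computed directly. Dividing logits by T before the softmax is equivalent to raising each probability to the power 1/T and renormalizing, which is what this sketch does. `apply_temperature` is a hypothetical helper, and the input probabilities are the illustrative ones from the diagram.

```python
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Rescale a next-token distribution: p_i ** (1/T), then renormalize."""
    if temperature <= 0:
        # Limit as T approaches 0: all probability mass on the most likely token
        best = max(probs, key=probs.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in probs}
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

base = {"had": 0.45, "watched": 0.30, "stood": 0.15, "maintained": 0.10}

for t in (0.0, 0.3, 0.7, 1.0, 1.5):
    dist = apply_temperature(base, t)
    print(f"T={t}: " + ", ".join(f"{tok}={p:.3f}" for tok, p in dist.items()))
```

Note that the ordering of the tokens never changes: low temperatures sharpen the distribution toward "had", and temperatures above 1 flatten it, giving unlikely tokens more of a chance without ever making them the most likely choice.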

Temperature in Action

import ollama

prompt = "Complete this sentence creatively: The old lighthouse keeper"

print("=== TEMPERATURE IMPACT ON OUTPUT ===\n")

# Run the same prompt at different temperatures
for temp in [0.0, 0.3, 0.7, 1.0, 1.5]:
    print(f"Temperature {temp}:")

    response = ollama.chat(
        model="qwen3:14b",
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": temp}
    )

    # Truncate output to first 120 chars for readability
    output = response["message"]["content"][:120]
    print(f"  → {output}...\n")

# Key insight: what's actually happening inside the model
print("\n=== WHY TEMPERATURE MATTERS ===\n")
print("At each step, the model computes probabilities for the next token:")
print("  'The old lighthouse keeper' → Next token probabilities:")
print("    - 'had' (45% probability)")
print("    - 'watched' (30%)")
print("    - 'stood' (15%)")
print("    - 'maintained' (10%)")
print("\ntemperature=0.0 → Always pick highest: 'had' (deterministic, reproducible)")
print("temperature=0.7 → Sharpened toward highest, with some randomness: usually 'had', sometimes 'watched'")
print("temperature=1.0 → Unmodified distribution: 'had' 45%, 'watched' 30%, etc. (true probabilities)")
print("temperature=1.5 → Flattened: lower-probability tokens become more likely (chaotic, unpredictable)")
print("\nProduction implication:")
print("  - Tests/QA: temperature=0 (repeatable, deterministic)")
print("  - API responses: temperature=0.7 (variety but coherent)")
print("  - Brainstorming: temperature=1.0 (creative but still sensible)")
print("  - Avoid > 1.2: output tends toward incoherent gibberish")


**Expected output:**

temp=0.0: …had watched the same stretch of coastline for forty years, never once seeing a ship go down.
temp=0.0: …had watched the same stretch of coastline for forty years, never once seeing a ship go down. [identical]

temp=0.3: …had kept his lonely vigil for three decades, his weathered face as familiar to passing sailors as the light itself.

temp=0.7: …hadn’t slept in three days, convinced that the fog had begun to whisper his name.

temp=1.0: …collected storm bottles — not driftwood or sea glass, but the bottles the drowned let go, still sealed tight against the salt.

temp=1.5: …trembled lighthouse light scatter past the waves inward consuming solitude ancient lantern soul salt-worn gull-screamed vigil… [incoherent at 1.5]


**When to use each temperature:**
- `0.0` — Code, SQL, structured output, factual Q&A (deterministic, reproducible)
- `0.3` — Technical explanations, summaries (consistent but not repetitive)
- `0.7` — General chat, documentation, moderate creativity
- `1.0` — Creative writing, brainstorming, story generation
- `> 1.2` — Avoid. Quality degrades rapidly.

---

## Part 4: Open-Weight vs API-Locked — The Sovereignty Divide

| | Open-weight models (sovereign) | API-locked models (cloud-dependent) |
|:---|:---|:---|
| Weights | Downloadable | Proprietary, hidden |
| Examples | Qwen3, Llama 4, Gemma3, Mistral | GPT-4o, Claude, Gemini |
| Your prompt | Stays local | Sent to external API |
| Your data | Never leaves your machine | Processed on cloud servers |
| Cost | Hardware (one-time) | Per token (ongoing) |
| Control | Complete | None (API changes) |
| Availability | Always (local) | Dependent on vendor |


```python
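# Spot check: watch for non-local TCP connections while a local inference runs.
# It polls every 0.5 s, so short-lived connections can be missed; treat it as a
# sanity check, not proof. Seeing other users' processes with `ss -p` may need root.
import subprocess, threading, time, ollama

external = []
def monitor():
    for _ in range(10):
        r = subprocess.run(['ss','-tnp','state','established'], capture_output=True, text=True)
        for line in r.stdout.splitlines():
            # Inference runs in the Ollama server process, so watch it as well as the Python client
            is_ours = 'python' in line or 'ollama' in line
            # Loopback only: a bare '172.' prefix would also match public address ranges
            is_local = any(local in line for local in ['127.0.0.1','::1'])
            if is_ours and not is_local:
                external.append(line)
        time.sleep(0.5)

t = threading.Thread(target=monitor, daemon=True)
t.start()
ollama.chat(model="qwen3:14b", messages=[{"role":"user","content":"What is 2+2?"}])
t.join(timeout=6)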

print("External connections during inference:", external if external else "None observed during this run ✓")
```

Part 5: Key Inference Parameters

# All major inference parameters with explanations
import ollama

response = ollama.chat(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
    options={
        "temperature": 0.7,       # 0=deterministic, 1=creative, >1.2=chaotic
        "top_p": 0.9,             # Nucleus sampling: consider only top 90% probability mass
        "top_k": 40,              # Consider only top 40 tokens at each step
        "repeat_penalty": 1.1,    # Penalise repeated phrases (1.0=off, 1.3=strong)
        "num_predict": 500,       # Max tokens to generate (-1 = unlimited)
        "seed": 42,               # Fixed seed for reproducibility (temperature=0 also reproducible)
        "num_ctx": 4096,          # Context window size (up to model maximum)
    }
)
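
To show what `top_k` and `top_p` do to a distribution before sampling, here is a minimal sketch on plain probabilities. It is an illustration of the idea, not Ollama's implementation: real runtimes work on logits over the full vocabulary and may apply the filters in a different order. `top_k_top_p` is a hypothetical helper.

```python
def top_k_top_p(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    kept = []
    cumulative = 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:   # nucleus reached: drop the remaining tail
            break

    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

base = {"had": 0.45, "watched": 0.30, "stood": 0.15, "maintained": 0.10}

print(top_k_top_p(base, top_k=40, top_p=0.9))  # drops only the least likely token
print(top_k_top_p(base, top_k=3, top_p=0.7))   # keeps the two most likely tokens
print(top_k_top_p(base, top_k=1, top_p=1.0))   # equivalent to greedy decoding
```

Both filters cut off the low-probability tail, which is why moderately high temperatures stay coherent when they are combined with `top_k` or `top_p`.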

Token Efficiency Comparison: Why Some Models Use Fewer Tokens

| Model | "Hello, World!" | "PostgreSQL" | "The quick brown fox…" (12 words) | "def calc(n): return n" | Efficiency |
|:---|:---|:---|:---|:---|:---|
| GPT-4 / GPT-4o | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Qwen3 14B | 4 tokens | 2 tokens | 9 tokens | ~7 tokens | 12% better |
| Llama 3.1 70B | 4 tokens | 3 tokens | 11 tokens | ~9 tokens | 5% worse |
| Mistral 12B | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Phi-4 Q8 | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Cost at $0.01/1K tokens | $0.00004 | $0.00002 | $0.0001 | $0.00008 | 1 year = $7-$30 |

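Key insight: A 12% token-efficiency improvement (Qwen3 vs GPT-4) saves roughly $30-50/year per production agent handling 1,000 queries/day, assuming about 70-115 tokens per query at $0.01 per 1K tokens. Multiply across 10 agents = $300-500/year saved. Running the model locally also removes the network round trip to an API.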
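
The savings figure depends on an assumption the text leaves implicit: how many tokens each query uses. A short worked calculation, assuming 100 tokens per query (an assumption, not a number from the comparison table), lands at roughly $365 per year in token cost and about $44 per year saved at 12%.

```python
def annual_token_cost(queries_per_day: int, tokens_per_query: int,
                      price_per_1k_tokens: float) -> float:
    """Yearly token spend for one agent at a flat per-token price."""
    tokens_per_year = queries_per_day * 365 * tokens_per_query
    return tokens_per_year / 1000 * price_per_1k_tokens

# Assumed workload: 1,000 queries/day, ~100 tokens/query, $0.01 per 1K tokens
baseline = annual_token_cost(1_000, 100, 0.01)
savings = baseline * 0.12          # 12% fewer tokens for the same text

print(f"Baseline cost per agent: ${baseline:.2f}/year")
print(f"Saved at 12% efficiency: ${savings:.2f}/year")
print(f"Across 10 agents:        ${savings * 10:.2f}/year")
```

Because cost scales linearly with tokens per query, an agent that averages 1,000 tokens per query would save ten times as much.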


Part 6: Fine-Tuning Impact on Tokens & Inference Efficiency

Fine-tuning on task-specific data does not change the tokenizer: the same text still maps to the same token IDs unless you explicitly extend the vocabulary (for example, adding a domain abbreviation as a new token) and train its embedding. What fine-tuning can change is how many tokens a task needs. A model fine-tuned on legal documents already knows what “SEC” means in your context and which output format you expect, so prompts can drop instructions and few-shot examples, and responses carry less padding.

Fine-Tuning Token Economy

Base Model (Llama 3.1 70B):
  Prompt including the instructions the model needs (illustrative)
  → 10 tokens
  → 10 tokens processed per query

Fine-Tuned on Compliance Docs:
  Same task, shorter prompt (less instruction needed)
  → 9 tokens
  → 9 tokens processed per query
  
Annual savings (1M queries):
  Base: 1,000,000 queries × 10 tokens = 10M tokens = $100 at $0.01/1K
  Fine-tuned: 1,000,000 queries × 9 tokens = 9M tokens = $90
  Savings: $10/year per 1M queries in your domain
  → 10 fine-tuned agents = $100/year saved in token costs

Critical insight: Fine-tuning can reduce inference cost as well as improve accuracy, but the saving comes from shorter prompts and outputs, not from a change in tokenization. A reduction in the 5-15% range, like the 10% in this illustrative example, compounds across millions of queries.


Part 7: Quantization & Token Efficiency in Local Inference

Quantization (compressing model weights from FP32 to INT8 or INT4) doesn’t change tokenization — the same text still maps to the same token IDs. However, quantization does affect inference speed and memory usage:

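| Quantization | Weight size (70B params) | Speed | Accuracy loss | Token throughput |
|:---|:---|:---|:---|:---|
| FP32 (full precision) | 280GB | Baseline (1x) | 0% | ~10 tok/sec |
| FP16 (half precision) | 140GB | 2x faster | <1% | ~20 tok/sec |
| Q8_0 (8-bit) | ~70GB | 4x faster | ~2% | ~40 tok/sec |
| Q4_K_M (4-bit, optimal) | ~35-42GB | 8x faster | ~3% | ~80 tok/sec |
| Q3_K_S (3-bit, aggressive) | ~26-31GB | 12x faster | ~8% | ~120 tok/sec |

Speed, accuracy, and throughput figures are illustrative and vary with hardware and model.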

Production implication: Q4_K_M quantization of Qwen3 14B runs on consumer GPUs (RTX 3060 12GB) while maintaining near-baseline quality. Using the throughput figures above, a 10-token response takes about 125 ms at ~80 tok/sec versus about 1 second at ~10 tok/sec on FP32, roughly 8x faster.
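
Weight size follows from simple arithmetic: parameters × bits per weight ÷ 8 gives bytes. The sketch below uses nominal bit widths; K-quant formats store scaling data alongside the weights, so real files are somewhat larger than the nominal figure, and the KV cache and activations need memory on top of the weights.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

for label, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4), ("3-bit", 3)]:
    print(f"{label:>5}: 70B model ≈ {weight_size_gb(70, bits):6.1f} GB, "
          f"14B model ≈ {weight_size_gb(14, bits):5.1f} GB")
```

At a nominal 4 bits, a 14B model needs about 7 GB for weights, which is why it fits on a 12GB consumer GPU, while a 70B model at the same precision needs about 35 GB.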

When to Quantize vs Keep Full Precision

# Decision tree for quantization strategy (thresholds are rules of thumb)

def choose_quantization(latency_ms: float, accuracy_critical: bool,
                        budget_available: bool, edge_device: bool) -> str:
    if latency_ms < 200:
        # Real-time applications (chat, RAG retrieval)
        return "Q4_K_M"        # ~80 tok/sec throughput
    if accuracy_critical and budget_available:
        # Medical, legal, financial reasoning
        return "FP16 or Q8_0"  # ~2% accuracy loss acceptable
    if edge_device:
        # Raspberry Pi, Jetson Nano, phone: smallest footprint, on a small model
        return "Q3_K_S"
    # Batch processing, can tolerate latency
    return "FP32"              # Maximum quality, slowest

print(choose_quantization(latency_ms=150, accuracy_critical=False,
                          budget_available=False, edge_device=False))

Part 8: Retrieval-Augmented Generation (RAG) & Context Window Strategy

RAG systems retrieve relevant documents and inject them into the context window before querying the LLM. This changes your context window economics dramatically:

Simple Chat (No RAG):
  System prompt: 50 tokens
  Conversation history: 500 tokens (10 turns)
  User query: 20 tokens
  Available for generation: 3,430 tokens (on 4K window)
  
Ideal for: 1-2 turn conversations, lightweight queries

RAG-Enhanced Chat:
  System prompt: 50 tokens
  RAG context (10 documents × 400 tokens): 4,000 tokens ❌ EXCEEDS 4K WINDOW
  Conversation history: 300 tokens (5 turns, truncated)
  User query: 20 tokens
  Required: 4,370 tokens on a 4K window = CONTEXT OVERFLOW
  
Solution: Use 8K or 128K context window
  With 128K window: 50 + 4,000 + 300 + 20 = 4,370 tokens (3.4% utilization)
  Available for generation: 123,630 tokens
  
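Cost implication:
  API models → you pay for the tokens you actually send and generate, not for the window size
  Local models → a larger num_ctx reserves more memory for the KV cache, even if unused
  => RAG trades a small window's overflow risk against a large window's memory and per-token cost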
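
The budget arithmetic above is easy to automate before every request. A minimal sketch, using the same component sizes as the example; `context_budget` is a hypothetical helper, and the part names are arbitrary labels.

```python
def context_budget(window: int, **parts: int) -> dict:
    """Sum the token counts of each prompt component against the window."""
    used = sum(parts.values())
    return {
        "used": used,
        "remaining": window - used,       # tokens left for generation
        "fits": used <= window,
        "utilization_pct": round(100 * used / window, 1),
    }

rag_chat = dict(system_prompt=50, rag_context=4_000, history=300, user_query=20)

print("4K window:  ", context_budget(4_000, **rag_chat))
print("128K window:", context_budget(128_000, **rag_chat))
```

A negative `remaining` value is the signal to trim history, retrieve fewer chunks, or move to a larger window before calling the model.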

Production RAG best practices:

  1. Chunk size: 300-500 tokens per document (balances context coverage with window size)
  2. Retrieval strategy: Hybrid BM25 + semantic search to find top-3 relevant chunks (600-900 tokens total)
  3. Context compression: Use extractive summarization to reduce RAG context by 30-50% before injection
  4. Long-context models: Prefer models with 32K+ windows for RAG (Llama 4 Scout has 10M tokens—overkill for RAG but future-proof)
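
For the hybrid retrieval step, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The source does not say which fusion method it has in mind, so treat this as one reasonable option. The chunk IDs and both rankings below are made up for illustration, and `k = 60` is the conventional default constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 3) -> list[str]:
    """Merge several ranked lists: each list contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)[:top_n]

# Hypothetical results from the two retrievers, best match first
bm25_ranking = ["c2", "c7", "c1", "c9"]
semantic_ranking = ["c7", "c3", "c2", "c5"]

top_chunks = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(top_chunks)
```

Chunks that rank well in both lists rise to the top, and only those top few are injected into the context window, which keeps the RAG portion of the prompt within the 600-900 token range mentioned above.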

Part 9: Real-World Token Patterns & Monitoring

Understanding actual token consumption in production requires instrumenting your LLM calls:

# Production token monitoring — track where your tokens (and money) are going
import time
from collections import defaultdict

class TokenMonitor:
    """
    Instrument your LLM calls to understand token consumption in production.
    Why this matters: A 10% token reduction across your fleet saves $1000s/year.
    Without monitoring, you'll never know if your prompts are bloated.
    """
    def __init__(self):
        # Track per-model metrics: tokens_in, tokens_out, call count
        self.metrics = defaultdict(lambda: {"tokens_in": 0, "tokens_out": 0, "calls": 0, "errors": 0})
    
    def log_inference(self, model: str, prompt_tokens: int, completion_tokens: int, error: bool = False):
        """
        Log token usage per model/endpoint after each LLM call.
        
        Args:
            model: Model identifier (e.g., "qwen3:14b", "gpt-4o-api")
            prompt_tokens: Tokens in the input/context (counted against context window)
            completion_tokens: Tokens in the model's response (usually cheaper than input)
            error: Set to True if this call failed — helps identify retries and failures
        
        Real-world example:
            If prompt_tokens=800 and completion_tokens=200, you have context bloat.
            Investigate what's in those 800 tokens (system prompt, history, RAG docs).
        """
        self.metrics[model]["tokens_in"] += prompt_tokens
        self.metrics[model]["tokens_out"] += completion_tokens
        self.metrics[model]["calls"] += 1
        if error:
            self.metrics[model]["errors"] += 1
    
    def report(self):
        """
        Generate production insights: average tokens per call, total cost, error rate.
        Run this daily to catch token creep (gradual increase over time).
        """
        for model, data in self.metrics.items():
            avg_in = data["tokens_in"] / data["calls"] if data["calls"] else 0
            avg_out = data["tokens_out"] / data["calls"] if data["calls"] else 0
            error_rate = (data["errors"] / data["calls"] * 100) if data["calls"] else 0
            
            # Cost calculation assumes $0.01 per 1K input tokens, $0.03 per 1K output tokens
            # Adjust based on your model's pricing (Ollama is free; OpenAI varies)
            total_cost = (data["tokens_in"] * 0.01 + data["tokens_out"] * 0.03) / 1000
            
            print(f"\n📊 {model}:")
            print(f"   Average input tokens: {avg_in:.0f} (how much context you're using)")
            print(f"   Average output tokens: {avg_out:.0f} (how long the response is)")
            print(f"   Total cost so far: ${total_cost:.2f}")
            print(f"   Total API calls: {data['calls']}")
            print(f"   Error rate: {error_rate:.1f}% (retries, failed calls)")
            
            # Developer advice: if avg_in > 2000, you're sending too much context
            if avg_in > 2000:
                print(f"   ⚠️  Context bloat detected! Avg input > 2000 tokens.")
                print(f"      Reduce conversation history or enable vector DB retrieval.")

# Usage in your agent: wrap every LLM call with token tracking
import ollama

monitor = TokenMonitor()
messages = [{"role": "user", "content": "Your prompt here"}]  # placeholder conversation

# Estimate before the call so the value also exists if the call fails
prompt_tokens = len(str(messages)) // 4  # Rough estimate

try:
    response = ollama.chat(model="qwen3:14b", messages=messages)
    completion_tokens = len(response["message"]["content"]) // 4
    monitor.log_inference("qwen3:14b", prompt_tokens, completion_tokens)
except Exception as e:
    monitor.log_inference("qwen3:14b", prompt_tokens, 0, error=True)
    raise

# Check metrics daily to spot token creep
monitor.report()

Common Token Consumption Patterns

| Use Case | Avg Input Tokens | Avg Output Tokens | Cost/Query (at $0.01 in / $0.03 out per 1K tokens) |
|:---|:---|:---|:---|
| Simple Q&A | 50 | 100 | $0.004 |
| Code generation | 200 | 300 | $0.011 |
| RAG retrieval + reasoning | 800 | 150 | $0.012 |
| Multi-turn conversation (5 turns) | 500 | 100 | $0.008 |
| Document summarization (10K chars) | 2,500 | 500 | $0.040 |
| Agentic loop (3 tool calls) | 600 | 400 | $0.018 |
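
Per-query cost with separate input and output prices is a one-line formula. The sketch below applies it to the token counts from the table, assuming the $0.01 and $0.03 rates are per 1K tokens; `cost_per_query` is a hypothetical helper.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   in_price_per_1k: float = 0.01,
                   out_price_per_1k: float = 0.03) -> float:
    """Cost of one request when input and output tokens are priced separately."""
    return (input_tokens / 1000 * in_price_per_1k
            + output_tokens / 1000 * out_price_per_1k)

use_cases = {
    "Simple Q&A": (50, 100),
    "Code generation": (200, 300),
    "RAG retrieval + reasoning": (800, 150),
    "Multi-turn conversation": (500, 100),
    "Document summarization": (2_500, 500),
    "Agentic loop": (600, 400),
}

for name, (tokens_in, tokens_out) in use_cases.items():
    print(f"{name:<28} ${cost_per_query(tokens_in, tokens_out):.4f}")
```

Because output tokens are priced at three times the input rate here, long responses dominate the bill even when prompts are short.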

Part 10: Troubleshooting & Common Token Mistakes

Quick Troubleshooting Decision Tree

Problem: "LLM isn't working as expected"

Is the output nonsense/incoherent?
  ├─ Yes → Is temperature > 1.2?
  │         ├─ Yes → Reduce to 0.7 (too much randomness)
  │         └─ No → Check context window (too much input?)
  └─ No → Output is coherent but wrong/inconsistent?
           ├─ Same prompt, different answers? → Set temperature=0
           └─ Always wrong answer? → Check prompt quality (unclear instructions)

Problem: "API bill is way too high"

Run: print(average_input_tokens)
  ├─ > 2,000? → Context bloat (too much conversation history)
  ├─ 500-2,000? → Acceptable (check RAG chunks)
  └─ < 500? → Check output token waste (model talking too much?)

Problem: "Local inference is very slow"

Check token throughput: time curl... | jq ...
  ├─ < 10 tok/sec? → Model may not fit in VRAM (layers spilling to CPU) or GPU not being used
  │  ├─ Check: nvidia-smi (is GPU in use?)
  │  └─ Switch to Q4_K_M (balanced quality + speed)
  ├─ 10-40 tok/sec? → Acceptable (maybe upgrade GPU/CPU)
  └─ > 40 tok/sec? → Great (you're fine)

Common Mistakes & Fixes

| Mistake | Symptom | Fix |
|:---|:---|:---|
| temperature > 1.2 | Output is gibberish/hallucinations | Set to 0.7 or lower |
| Huge context window | Slow inference, high token count | Truncate old conversation turns (keep recent 3-5 turns only) |
| No RAG, pure prompt | Answer goes out-of-date | Add vector DB retrieval for current info |
| temperature=0 always | Boring, repetitive responses | Use 0.7 for user-facing chat |
| Context limit exceeded | API error: “context_length_exceeded” | Use a larger-context model or enable RAG + chunk retrieval |
| Wrong token count estimate | Budget exceeded unexpectedly | Use len(enc.encode(text)) with tiktoken for an exact count (not character/4 estimate) |

“My API bill is way too high — where are the tokens going?”

Debug checklist:

  1. Check context window inflation: Log input_tokens per request. If average input > 2,000, you have context bloat.

    if prompt_tokens > 2000:
        print("WARNING: Large context detected. Review conversation history truncation.")
  2. Identify token-heavy operations: Summarization and RAG are the biggest token consumers (800+ input tokens each).

  3. Reduce conversation history: Instead of keeping entire conversation in context, store in vector DB and retrieve only relevant turns:

    # Bad: Send entire 10-turn conversation
    context = all_previous_turns  # ~500 tokens
    
    # Good: Retrieve similar past turns
    relevant_turns = vector_db.search(current_query, top_k=2)  # ~200 tokens
  4. Use token budgets: Set hard limits per request:

    max_input_tokens = 1000  # Fail if exceeded
    if len(messages) * 75 > max_input_tokens:  # Rough estimate
        prune_old_messages()
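
The `prune_old_messages()` call in step 4 is left undefined; one way it might look is sketched below. It assumes messages are dicts with `role` and `content` keys, keeps every system message, and drops the oldest turns first. Token counts use the rough 4-characters-per-token heuristic, so swap in a real tokenizer when you need exact limits.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token, minimum 1."""
    return max(1, len(text) // 4)

def prune_old_messages(messages: list[dict], max_input_tokens: int) -> list[dict]:
    """Keep system messages plus as many of the newest turns as fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_input_tokens - sum(estimate_tokens(m["content"]) for m in system)
    kept = []
    for message in reversed(turns):          # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break                            # this turn and everything older is dropped
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))     # restore chronological order

history = [{"role": "system", "content": "You are a concise assistant." * 2}]
history += [{"role": "user", "content": f"Question {i}: " + "x" * 380} for i in range(1, 6)]

pruned = prune_old_messages(history, max_input_tokens=250)
print(f"Kept {len(pruned)} of {len(history)} messages")
```

Dropping whole turns from the oldest end keeps the most recent exchange intact, which is usually what the next response depends on.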

“My local Ollama inference is slow — how do I measure token throughput?”

# Measure tokens-per-second
time curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "Write a Python function to calculate Fibonacci numbers",
  "stream": false
}' | jq '.eval_count / .eval_duration * 1e9'

# Compare across quantizations:
# FP16: ~20 tok/sec
# Q8_0: ~40 tok/sec
# Q4_K_M: ~80 tok/sec
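
The same calculation as the `jq` expression can be done in Python once you have the response. This sketch relies only on the `eval_count` and `eval_duration` fields used in the command above, with the duration in nanoseconds. The sample dictionary is made up for illustration, not a captured response.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput: generated tokens divided by generation time."""
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / eval_duration_ns * 1e9

# Hypothetical response fields: 120 tokens generated in 3 seconds
sample_response = {"eval_count": 120, "eval_duration": 3_000_000_000}

rate = tokens_per_second(sample_response["eval_count"], sample_response["eval_duration"])
print(f"{rate:.1f} tok/sec")
```

Logging this value per request makes regressions visible, for example when a model stops fitting in GPU memory after a context-size change.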

Conclusion

LLMs are statistical next-token predictors that have compressed world knowledge into billions of numerical weights. Understanding tokens (the input unit), context windows (the working memory), temperature (randomness control), and the open-weight/API divide (the sovereignty question) gives you the mental model to use them effectively. The practical takeaway: for any application where input data is sensitive, open-weight models running locally via Ollama are the safer choice. No prompt data is transmitted externally, and output quality is often comparable for everyday tasks.

See Best Open-Weight AI Models 2026 for model selection, and LangChain Local Inference for advanced prompt techniques.


People Also Ask

What is the difference between parameters and tokens?

Parameters are the trainable numerical weights inside the model — a “14B model” has 14 billion floating-point numbers that were learned during training. Tokens are the discrete units of input and output text — the model processes and generates sequences of tokens. Parameters are fixed after training (unless you fine-tune). Tokens vary with every input: a short prompt uses few tokens; a long document uses many. More parameters generally means the model can represent more complex patterns; more context window tokens means the model can consider more input at once.

Why does a model sometimes give different answers to the same question?

Because temperature > 0 introduces randomness in token selection. At each step, the model produces a probability distribution over all possible next tokens, and temperature controls how randomly it samples from that distribution. At temperature 0, it always picks the highest-probability token (deterministic). At temperature 0.7, it sometimes picks lower-probability tokens, producing varied outputs. Run the same query with temperature=0 to get consistent, reproducible answers.


Further Reading

Last verified: April 29, 2026.
