
How Large Language Models Work 2026: Tokens, Context & Inference

Master how LLMs generate text: from tokenization to attention to context windows. Learn why Qwen uses fewer tokens than GPT-4, what temperature controls, and how open-weight models differ from API-locked ones.

Author

Anju Kushwaha

Founder & Editorial Director

Published

May 17, 2026

Duration

Reading

13 min

Key Takeaways

Tokens are the fundamental unit of LLM input and output — not characters, not words. 'Hello' is one token; 'unbelievable' is three ('un', 'believ', 'able'). GPT-4 and Qwen3 14B process roughly 750 words per 1,000 tokens. Token count determines cost and context window usage.
The context window is the LLM's working memory — everything the model can 'see' at once, including your system prompt, conversation history, retrieved documents, and the current query. Models with 128K context windows can hold roughly a 100,000-word document; models with 1M+ can hold an entire codebase.
Temperature controls randomness: temperature=0 makes the model deterministic and consistent (always picks the highest-probability next token); temperature=1 adds randomness enabling creative variation; values above 1.5 produce incoherent output. Use 0 for code and factual tasks, 0.7 for creative writing.
Open-weight models (Llama 4, Qwen3, Gemma3) have publicly downloadable weights and can run locally, in which case your prompts never leave your machine. API-locked models (GPT-4o, Claude, Gemini) process your data on external servers and, depending on the provider's terms, may use it to improve future models.

Key Takeaways

Tokens, not words: LLMs process token sequences. 1 English word ≈ 1.3 tokens on average. Code is more token-dense. Rare words split into multiple tokens.
Context window = working memory: The model can only “see” what’s in the context window. Once the window fills, the application has to drop or summarize old conversation turns.
Temperature = creativity dial: 0 = deterministic. 0.7 = balanced. 1.0 = creative. > 1.5 = gibberish.
Open-weight vs API-locked: Weights downloadable = your data stays local. API-only = your data leaves your machine.

Introduction

Direct Answer: How do large language models work in 2026?

A large language model is a neural network trained to predict the next token in a sequence. “Training” means adjusting billions of numerical weights so the model becomes better at this prediction task across a massive corpus of text. At inference time, you provide a sequence of tokens (your prompt), and the model produces a probability distribution over all possible next tokens, picks one (based on temperature), appends it to the sequence, and repeats until it generates a complete response or hits a stop token. The quality of responses comes from the model having compressed patterns from trillions of training tokens into its weights — patterns about language, facts, reasoning, and code. Context windows define how many tokens fit in a single inference call. Modern models range from 8K tokens (small, fast) to 10M tokens (Llama 4 Scout) — larger windows enable more sophisticated reasoning over longer documents. Open-weight models like Qwen3 14B and Llama 4 Scout run this inference process entirely on your local hardware; API models run it on cloud servers.
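The predict, pick, append, repeat loop described above can be sketched with a toy model. This is a minimal illustration, not how a real LLM is implemented: the `TOY_LOGITS` table is a made-up stand-in for the neural network, and the names `softmax` and `generate` are chosen here for the example.

```python
import math
import random

# Hypothetical toy "model": next-token scores (logits) keyed by the previous token.
# A real LLM computes these scores from the whole context with a neural network.
TOY_LOGITS = {
    "<start>": {"the": 2.0, "a": 1.0},
    "the": {"cat": 1.5, "dog": 1.2},
    "a": {"cat": 1.0, "dog": 1.0},
    "cat": {"sat": 2.0, "<stop>": 0.5},
    "dog": {"sat": 1.0, "<stop>": 1.0},
    "sat": {"<stop>": 3.0},
}

def softmax(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn scores into probabilities; lower temperature sharpens the distribution."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(temperature: float = 1.0, max_tokens: int = 10, seed: int = 0) -> list[str]:
    """Autoregressive loop: score next tokens, pick one, append it, repeat."""
    rng = random.Random(seed)
    sequence = ["<start>"]
    for _ in range(max_tokens):
        logits = TOY_LOGITS[sequence[-1]]
        if temperature == 0:
            next_token = max(logits, key=logits.get)  # greedy: highest score wins
        else:
            probs = softmax(logits, temperature)
            next_token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if next_token == "<stop>":  # stop token ends generation
            break
        sequence.append(next_token)
    return sequence[1:]

print(generate(temperature=0))  # → ['the', 'cat', 'sat']
```

With `temperature=0` the loop always follows the highest-scoring path; any positive temperature samples from the softmax instead, so different seeds can yield different sentences.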

Part 1: Tokens — The Building Block

How Tokenization Works (Visual Flow)

┌─────────────────────────────────────────────────────────────┐
│  INPUT TEXT: "Hello, world! I'm learning about LLMs"       │
└─────────────────────────────────────────────────────────────┘
                            ↓
                  ┌─────────────────────┐
                  │  TOKENIZER (GPT-4)  │
                  └─────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│  TOKENS (IDs → Readable):                                   │
│  [9906] → "Hello"        (common word = 1 token)            │
│  [11]   → ","            (punctuation = 1 token)            │
│  [1917] → " world"       (space + word = 1 token)           │
│  [0]    → "!"            (punctuation = 1 token)            │
│  [358]  → " I"           (pronoun = 1 token)                │
│  [1866] → "'m"           (subword = 1 token)                │
│  [4500] → " learning"    (verb = 1 token)                   │
│  [220]  → " about"       (preposition = 1 token)            │
│  [43826] → " LLMs"       (technical acronym = 1 token)      │
│                                                              │
│  TOTAL: 9 tokens for 37 characters                          │
│  Cost at $0.01/1K tokens: $0.00009 per message             │
└─────────────────────────────────────────────────────────────┘

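Key insight: Token count is irregular. Common words are 1 token; rare words split into 2-3. This is why you can’t predict tokens from character count. Treat the IDs in the diagram as illustrative: exact IDs and splits depend on the tokenizer, so run the snippet below to see real values.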

Tokenization in Action

Why tokenization matters: Your API bill is charged per token, not per character.

A “smart” prompt that uses 800 tokens is 8x more expensive than a 100-token prompt.

pip install tiktoken

import tiktoken

# cl100k_base is the GPT-4 tokenizer; other model families (Qwen, Llama, Gemma) ship their own vocabularies
enc = tiktoken.get_encoding("cl100k_base")

# Test tokenization on real-world examples
examples = [
    "Hello, world!",                                 # Common greeting; punctuation adds tokens
    "unbelievable",                                  # Less common word: splits into sub-word tokens
    "PostgreSQL",                                    # Database name: technical terms split differently
    "The quick brown fox jumps over the lazy dog.",  # Full sentence
    "def calculate_fibonacci(n: int) -> int:",       # Code example
]

for text in examples:
    tokens = enc.encode(text)                    # Convert text → list of token IDs
    decoded = [enc.decode([t]) for t in tokens]  # Convert back → list of token strings

    print(f"\nInput: '{text}'")
    print(f"  Token IDs: {tokens}")
    print(f"  Decoded tokens: {decoded}")
    print(f"  Count: {len(tokens)} tokens for {len(text)} characters")
    print(f"  Efficiency: {len(text) / len(tokens):.1f} chars/token (higher is better for cost)")


**Expected output:**

Input: 'Hello, world!'
  Token IDs: [9906, 11, 1917, 0]
  Decoded tokens: ['Hello', ',', ' world', '!']
  Count: 4 tokens for 13 characters
  Efficiency: 3.2 chars/token (higher is better for cost)

Input: 'unbelievable'
  Token IDs: [359, 43237, 481]
  Decoded tokens: ['un', 'believ', 'able']
  Count: 3 tokens for 12 characters
  Efficiency: 4.0 chars/token (higher is better for cost)

Input: 'PostgreSQL'
  Token IDs: [6021, …]
  Decoded tokens: ['Post', 'greSQL']
  Count: 2 tokens for 10 characters
  Efficiency: 5.0 chars/token (higher is better for cost)

Input: 'The quick brown fox jumps over the lazy dog.'
  Token IDs: [791, 4996, 14198, 39935, 35308, 927, 279, 16053, 5679, 13]
  Decoded tokens: ['The', ' quick', ' brown', ' fox', ' jumps', ' over', ' the', ' lazy', ' dog', '.']
  Count: 10 tokens for 44 characters
  Efficiency: 4.4 chars/token (higher is better for cost)

Input: 'def calculate_fibonacci(n: int) -> int:'
  Token IDs: [755, 11294, 43326, 1471, 25, 528, 8, 1492, 528, 25]
  Decoded tokens: ['def', ' calculate', '_fibonacci', '(n', ':', ' int', ')', ' ->', ' int', ':']
  Count: 10 tokens for 39 characters
  Efficiency: 3.9 chars/token (higher is better for cost)


**Key insight:** "unbelievable" splits into 3 tokens because it's a less-common word. "The" is a single token because it's extremely common. Code splits at semantic boundaries (underscores, parentheses).
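Why common words stay whole while rarer ones split can be illustrated with a toy tokenizer. The sketch below uses greedy longest-match against a tiny made-up vocabulary; this is a simplification chosen for clarity (GPT-style tokenizers use byte-pair encoding with learned merge rules), and `VOCAB` and `toy_tokenize` are hypothetical names.

```python
# Hypothetical mini-vocabulary; real vocabularies hold on the order of 100K entries learned from data.
VOCAB = {"the", "un", "believ", "able", "hello"}

def toy_tokenize(word: str, vocab: set[str] = VOCAB) -> list[str]:
    """Greedy longest-match split; unknown characters fall back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible piece first, shrinking until one is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # nothing matched: emit one character
            i += 1
    return tokens

print(toy_tokenize("the"))           # → ['the']
print(toy_tokenize("unbelievable"))  # → ['un', 'believ', 'able']
```

A word that is in the vocabulary costs one token; anything else is assembled from smaller pieces, which is why token counts track word frequency rather than character count.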

---

## Part 2: Context Window — The Model's Working Memory

The context window holds everything the model can reference in a single inference call:

CONTEXT WINDOW (e.g., 128,000 tokens = ~96,000 words)
┌─────────────────────────────────────────────────────────┐
│ System prompt (~500 tokens)                             │
│ "You are a helpful assistant…"                          │
├─────────────────────────────────────────────────────────┤
│ Retrieved documents from RAG (~10,000 tokens)           │
│ [chunk 1] [chunk 2] [chunk 3] … [chunk N]               │
├─────────────────────────────────────────────────────────┤
│ Conversation history (~5,000 tokens)                    │
│ User: … | Assistant: … | User: … | Assistant: …         │
├─────────────────────────────────────────────────────────┤
│ Current user message (~200 tokens)                      │
│ "What does the third paragraph say about security?"     │
├─────────────────────────────────────────────────────────┤
│ Model's response (being generated)                      │
│ "The third paragraph states that…"                      │
└─────────────────────────────────────────────────────────┘


**2026 context window sizes:**

| Model | Context | Approx. capacity |
|:---|:---|:---|
| Qwen3 8B | 32K | ~24K words |
| Qwen3 14B | 40K | ~30K words |
| Gemma3 27B | 128K | ~96K words |
| Llama 4 Scout | 10M | ~7.5M words (practical limit ~128K) |
| GPT-4o | 128K | ~96K words |
| Claude 3.5 Sonnet | 200K | ~150K words |

```python
# Practical context window test with Ollama
# This code prevents a common mistake: sending documents larger than the context window
import ollama

def estimate_tokens(text: str) -> int:
    """
    Rough token estimation: English text averages 4 chars per token.
    This is a heuristic; use a real tokenizer (e.g. tiktoken) for exact counts.

    Why this function exists: before sending a large document, you want to know
    if it will fit in the model's context window. Hosted APIs typically reject
    oversized requests, while Ollama by default truncates the prompt.
    """
    return len(text) // 4

# Real-world scenario: user uploads a 50KB PDF, you extract text, want to analyze
with open("/path/to/long_document.txt", encoding="utf-8") as f:
    long_document = f.read()
estimated_tokens = estimate_tokens(long_document)
print(f"📄 Document size: ~{estimated_tokens:,} tokens (~{estimated_tokens // 300} pages)")

# Decision tree: choose model based on document size
if estimated_tokens < 8_000:
    # Small document: any model works fine
    print("✓ Fits in Qwen3 8B (32K context)")
    model = "qwen3:8b"

elif estimated_tokens < 32_000:
    # Medium document: need at least 32K context
    print("✓ Fits in Qwen3 14B (40K context) or larger")
    model = "qwen3:14b"

elif estimated_tokens < 128_000:
    # Large document: need 128K+ context
    print("✓ Fits in Gemma3 27B (128K context)")
    model = "gemma3:27b"

else:
    # Huge document: must use RAG or split into chunks
    print("❌ Document too large! Either:")
    print("   1) Use RAG: retrieve only relevant chunks, not entire document")
    print("   2) Split into chapters and process separately")
    print("   3) Use Llama 4 Scout (10M context, but impractical locally)")
    raise SystemExit(1)

# Ollama's default num_ctx is far below the model maximum, so request enough
# room for the document plus headroom for the question and the answer.
num_ctx = estimated_tokens + 2_000

# Send the document with a question
try:
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Answer questions about the provided document. Be concise."
            },
            {
                "role": "user",
                "content": f"Document:\n{long_document}\n\nQuestion: What are the key findings?"
            }
        ],
        options={"num_ctx": num_ctx},
    )
    print(f"✅ Success!\nAnswer: {response['message']['content']}")

except Exception as e:
    # Hosted APIs commonly report "context length exceeded"
    if "context" in str(e).lower():
        print(f"❌ Context window exceeded: {e}")
        print("   → Use RAG, switch to larger model, or split document")
    else:
        print(f"❌ Error: {e}")
```

Part 3: Temperature and Sampling

Temperature Visual: How Randomness Works

After "The old lighthouse keeper", the model computes next-token probabilities:

                probability
                    ↑
                    │ 45%  ┌────────────┐
                    │      │   'had'    │
                    │      └────────────┘
                    │ 30%  ┌────────────┐
                    │      │ 'watched'  │
                    │      └────────────┘
                    │ 15%  ┌────────────┐
                    │      │  'stood'   │
                    │      └────────────┘
                    │ 10%  ┌────────────┐
                    │      │'maintained'│
                    │      └────────────┘
                    └──────────────────→ tokens

Temperature reshapes this distribution before a token is sampled:

Temperature = 0.0 → Always "had"   (greedy and deterministic, perfect for tests)
Temperature = 0.3 → Mostly "had", sometimes "watched"  (consistent)
Temperature = 0.7 → Mix: "had", "watched", "stood" (balanced)
Temperature = 1.0 → Exact probabilities: 45% "had", 30% "watched", etc.
Temperature = 1.5 → Flattened: unlikely tokens gain probability (output drifts toward incoherence)
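The effect of temperature on the four-token distribution above can be computed directly. A minimal sketch, assuming the standard formulation in which logits are divided by the temperature before the softmax (equivalent to raising each probability to the power 1/T and renormalizing); the function name `apply_temperature` is made up for this example.

```python
def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Rescale a distribution: p_i ** (1/T), then renormalize.
    This matches dividing the logits by T before the softmax."""
    if temperature <= 0:
        # T → 0 is the greedy limit: all probability mass on the most likely token.
        best = max(probs, key=probs.get)
        return {tok: 1.0 if tok == best else 0.0 for tok in probs}
    powered = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(powered.values())
    return {tok: v / total for tok, v in powered.items()}

base = {"had": 0.45, "watched": 0.30, "stood": 0.15, "maintained": 0.10}
for t in (0.0, 0.3, 0.7, 1.0, 1.5):
    scaled = apply_temperature(base, t)
    print(t, {tok: round(p, 3) for tok, p in scaled.items()})
```

Running it shows that temperatures below 1 concentrate probability on "had", while 1.5 pulls the four tokens closer together. The ranking never flips: "had" stays the most likely token at every temperature, it simply wins less often.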

Temperature in Action

import ollama

prompt = "Complete this sentence creatively: The old lighthouse keeper"

print("=== TEMPERATURE IMPACT ON OUTPUT ===\n")

# Run the same prompt at different temperatures
for temp in [0.0, 0.3, 0.7, 1.0, 1.5]:
    print(f"Temperature {temp}:")

    response = ollama.chat(
        model="qwen3:14b",
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": temp}
    )

    # Truncate output to first 120 chars for readability
    output = response["message"]["content"][:120]
    print(f"  → {output}...\n")

# Key insight: what's actually happening inside the model
print("\n=== WHY TEMPERATURE MATTERS ===\n")
print("At each step, the model computes probabilities for the next token:")
print("  'The old lighthouse keeper' → Next token probabilities:")
print("    - 'had' (45% probability)")
print("    - 'watched' (30%)")
print("    - 'stood' (15%)")
print("    - 'maintained' (10%)")
print("\ntemperature=0.0 → Always pick highest: 'had' (deterministic, reproducible)")
print("temperature=0.7 → Biased toward highest, with some randomness: usually 'had', sometimes 'watched'")
print("temperature=1.0 → Unscaled: 'had' 45%, 'watched' 30%, etc. (true probabilities)")
print("temperature=1.5 → Flattened: lower-probability tokens become more likely (chaotic, unpredictable)")
print("\nProduction implication:")
print("  - Tests/QA: temperature=0 (repeatable, deterministic)")
print("  - API responses: temperature=0.7 (variety but coherent)")
print("  - Brainstorming: temperature=1.0 (creative but still sensible)")
print("  - Avoid > 1.2: output quality degrades quickly")


**Example output (exact text varies by model and run):**

Temperature 0.0:
  → …had watched the same stretch of coastline for forty years, never once seeing a ship go down.

Temperature 0.0 (second run):
  → …had watched the same stretch of coastline for forty years, never once seeing a ship go down. [identical]

Temperature 0.3:
  → …had kept his lonely vigil for three decades, his weathered face as familiar to passing sailors as the light itself.

Temperature 0.7:
  → …hadn't slept in three days, convinced that the fog had begun to whisper his name.

Temperature 1.0:
  → …collected storm bottles: not driftwood or sea glass, but the bottles the drowned let go, still sealed tight against the salt.

Temperature 1.5:
  → …trembled lighthouse light scatter past the waves inward consuming solitude ancient lantern soul salt-worn gull-screamed vigil… [incoherent at 1.5]


**When to use each temperature:**
- `0.0` — Code, SQL, structured output, factual Q&A (deterministic, reproducible)
- `0.3` — Technical explanations, summaries (consistent but not repetitive)
- `0.7` — General chat, documentation, moderate creativity
- `1.0` — Creative writing, brainstorming, story generation
- `> 1.2` — Avoid. Quality degrades rapidly.

---

## Part 4: Open-Weight vs API-Locked — The Sovereignty Divide

| | Open-weight models (sovereign) | API-locked models (cloud-dependent) |
|:---|:---|:---|
| Weights | Downloadable | Proprietary, hidden |
| Examples | Qwen3, Llama 4, Gemma3, Mistral | GPT-4o, Claude, Gemini |
| Your prompt | Stays local | Sent to external API |
| Your data | Never leaves your machine | Processed on cloud servers |
| Cost | Hardware (one-time) | Per token (ongoing) |
| Control | Complete | None (API changes) |
| Availability | Always (local) | Dependent on vendor |


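```python
# Spot check: look for non-local TCP connections while inference runs (Linux, needs `ss`).
# This is a heuristic, not proof: it samples every 0.5 s and matches on process names.
# Run it with sudo so `ss -p` can also show the Ollama server process.
import subprocess, threading, time, ollama

PROCESS_NAMES = ('python', 'ollama')        # the client script and the inference server
LOCAL_MARKERS = ['127.0.0.1', '::1', '172.']  # loopback plus typical Docker bridge addresses

external = []
def monitor():
    for _ in range(10):
        r = subprocess.run(['ss', '-tnp', 'state', 'established'], capture_output=True, text=True)
        for line in r.stdout.splitlines():
            if any(name in line for name in PROCESS_NAMES) and not any(local in line for local in LOCAL_MARKERS):
                external.append(line)
        time.sleep(0.5)

t = threading.Thread(target=monitor, daemon=True)
t.start()
ollama.chat(model="qwen3:14b", messages=[{"role": "user", "content": "What is 2+2?"}])
t.join(timeout=6)

print("External connections during inference:", external if external else "None observed ✓")
```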

Part 5: Key Inference Parameters

# All major inference parameters with explanations
import ollama

response = ollama.chat(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Explain quantum entanglement"}],
    options={
        "temperature": 0.7,       # 0=deterministic, 1=unscaled sampling, >1.2=chaotic
        "top_p": 0.9,             # Nucleus sampling: consider only top 90% probability mass
        "top_k": 40,              # Consider only top 40 tokens at each step
        "repeat_penalty": 1.1,    # Penalise repeated phrases (1.0=off, 1.3=strong)
        "num_predict": 500,       # Max tokens to generate (-1 = unlimited)
        "seed": 42,               # Fixed seed for reproducibility (temperature=0 also reproducible)
        "num_ctx": 4096,          # Context window size (up to model maximum)
    }
)
print(response["message"]["content"])
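To make `top_k` and `top_p` concrete, the sketch below filters a next-token distribution the way those two options are usually described: keep the k most likely tokens, then keep the smallest prefix whose cumulative probability reaches p, then renormalize. The order of operations and edge-case handling differ between runtimes, so treat this as an illustration rather than Ollama's actual implementation; `top_k_top_p_filter` is a name invented here.

```python
def top_k_top_p_filter(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept = []
    cumulative = 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:  # nucleus reached: drop the remaining tail
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalize so probabilities sum to 1

probs = {"had": 0.45, "watched": 0.30, "stood": 0.15, "maintained": 0.10}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.7)
print({tok: round(p, 2) for tok, p in filtered.items()})  # keeps only 'had' and 'watched'
```

Both options cut off the low-probability tail before sampling, which is why they limit the damage of a high temperature: tokens outside the kept set can never be picked.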

Token Efficiency Comparison: Why Some Models Use Fewer Tokens

| Model | "Hello, World!" | "PostgreSQL" | "The quick brown fox…" (9 words) | "def calc(n): return n" | Efficiency |
|:---|:---|:---|:---|:---|:---|
| GPT-4 / GPT-4o | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Qwen3 14B | 4 tokens | 2 tokens | 9 tokens | ~7 tokens | 12% better |
| Llama 3.1 70B | 4 tokens | 3 tokens | 11 tokens | ~9 tokens | 5% worse |
| Mistral 12B | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Phi-4 Q8 | 4 tokens | 2 tokens | 10 tokens | ~8 tokens | Baseline |
| Cost at $0.01/1K tokens | $0.00004 | $0.00002 | $0.0001 | $0.00008 | 1 year = $7-$30 |

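Key insight: A 12% token-efficiency improvement (Qwen3 vs GPT-4) saves roughly $30-50/year per production agent handling 1,000 queries/day, assuming about 100 tokens per query at $0.01/1K tokens. Multiply across 10 agents = $300-500/year saved. Running the model locally also removes the network round-trip to an API.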
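The savings figure depends heavily on how many tokens each query consumes, which the comparison does not state. The sketch below works the arithmetic under an assumed 100 tokens per query; that number and the helper name `annual_token_cost` are assumptions for illustration.

```python
def annual_token_cost(queries_per_day: int, tokens_per_query: int, price_per_1k: float) -> float:
    """Yearly spend in dollars for a fixed daily query volume."""
    return queries_per_day * 365 * tokens_per_query / 1000 * price_per_1k

baseline = annual_token_cost(1_000, 100, 0.01)  # assumed: 100 tokens per query
savings = baseline * 0.12                       # the 12% efficiency figure from the table
print(f"baseline ${baseline:,.2f}/year, savings ${savings:,.2f}/year")
# → baseline $365.00/year, savings $43.80/year
```

At 1,000 tokens per query the same 12% would be worth about ten times as much, so plug in your own average before drawing conclusions.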

Part 6: Fine-Tuning Impact on Tokens & Inference Efficiency

When you fine-tune a model on task-specific data, the model learns the patterns of your domain. Fine-tuning does not change the tokenizer: unless you explicitly extend the vocabulary, the same text still maps to the same token IDs, so “SEC” or “IAAL” (I Am A Lawyer) costs exactly as many tokens after fine-tuning as before. The savings come from elsewhere: a fine-tuned model already knows your abbreviations and conventions, so prompts need fewer instructions and examples, and responses tend to be more concise.

Fine-Tuning Token Economy

Base Model (Llama 3.1 70B):
  Prompt has to spell out the domain context
  → 10 prompt tokens per query (illustrative)

Fine-Tuned on Compliance Docs:
  Same request, minus the context the model has already learned
  → 9 prompt tokens per query (illustrative)
  
Annual savings (1M queries):
  Base: 1,000,000 queries × 10 tokens = 10M tokens = $100 at $0.01/1K
  Fine-tuned: 1,000,000 queries × 9 tokens = 9M tokens = $90
  Savings: $10 per 1M queries in your domain
  → 10 fine-tuned agents = $100/year saved in token costs

Critical insight: Fine-tuning doesn’t just improve accuracy. It can also cut inference cost, on the order of 5-15%, by shortening the prompts you need to send; the tokenizer itself stays the same. This compounds across millions of queries.

Part 7: Quantization & Token Efficiency in Local Inference

Quantization (compressing model weights from FP32 to INT8 or INT4) doesn’t change tokenization — the same text still maps to the same token IDs. However, quantization does affect inference speed and memory usage:

| Quantization | Weight size (70B params) | Speed | Accuracy loss | Token throughput (illustrative) |
|:---|:---|:---|:---|:---|
| FP32 (full precision) | 280GB | Baseline (1x) | 0% | ~10 tok/sec |
| FP16 (half precision) | 140GB | 2x faster | <1% | ~20 tok/sec |
| Q8_0 (8-bit) | ~75GB | 4x faster | ~2% | ~40 tok/sec |
| Q4_K_M (4-bit, optimal) | ~42GB | 8x faster | ~3% | ~80 tok/sec |
| Q3_K_S (3-bit, aggressive) | ~31GB | 12x faster | ~8% | ~120 tok/sec |

Production implication: Q4_K_M quantization of Qwen3 14B runs on consumer GPUs (RTX 3060 12GB) while maintaining near-baseline quality. Using the throughput figures above, generating 10 tokens takes about 125 ms at 80 tok/sec versus 1 second at 10 tok/sec, which is 8x faster on the same hardware.
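Weight sizes like those in the table follow from a simple rule of thumb: parameters times bits per weight, divided by 8 to get bytes. The sketch below applies it to 70B and 14B models. It covers weights only (the KV cache and activations need additional memory), and `weight_size_gb` is a name chosen for this example.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes): parameters × bits / 8."""
    return params_billions * bits_per_weight / 8

for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name}: 70B ≈ {weight_size_gb(70, bits):.0f} GB, 14B ≈ {weight_size_gb(14, bits):.0f} GB")
```

Real GGUF files come out somewhat larger than the nominal figure because formats such as Q4_K_M keep some tensors at higher precision, so they average more than 4 bits per weight.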

When to Quantize vs Keep Full Precision

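# Decision tree for quantization strategy (thresholds are rules of thumb)

def choose_quantization(latency_requirement_ms: float, accuracy_critical: bool,
                        budget_available: bool, running_on_edge_device: bool) -> str:
    if latency_requirement_ms < 200:
        # Real-time applications (chat, RAG retrieval)
        return "Q4_K_M"  # ~80 tok/sec throughput
    elif accuracy_critical and budget_available:
        # Medical, legal, financial reasoning
        return "Q8_0"  # or FP16; ~2% accuracy loss acceptable
    elif running_on_edge_device:
        # Raspberry Pi, Jetson Nano, phone: pair aggressive quantization with a small model
        return "Q3_K_S"
    else:
        # Batch processing, can tolerate latency
        # Most open weights are published in 16-bit, so FP32 adds size without adding quality
        return "FP16"

print(choose_quantization(150, accuracy_critical=False, budget_available=False, running_on_edge_device=False))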

Part 8: Retrieval-Augmented Generation (RAG) & Context Window Strategy

RAG systems retrieve relevant documents and inject them into the context window before querying the LLM. This changes your context window economics dramatically:

Simple Chat (No RAG):
  System prompt: 50 tokens
  Conversation history: 500 tokens (10 turns)
  User query: 20 tokens
  Available for generation: 3,430 tokens (on 4K window)
  
Ideal for: 1-2 turn conversations, lightweight queries

RAG-Enhanced Chat:
  System prompt: 50 tokens
  RAG context (10 documents × 400 tokens): 4,000 tokens ❌ EXCEEDS 4K WINDOW
  Conversation history: 300 tokens (5 turns, truncated)
  User query: 20 tokens
  Required: 4,370 tokens on a 4K window = CONTEXT OVERFLOW
  
Solution: Use 8K or 128K context window
  With 128K window: 50 + 4,000 + 300 + 20 = 4,370 tokens (3.4% utilization)
  Available for generation: 123,630 tokens
  
Cost implication:
  Hosted APIs → you pay for the tokens you actually send and generate, not for the window size
  Local inference → a larger window (num_ctx) reserves more memory for the KV cache
  => RAG cost scales with the tokens you inject, so retrieve fewer, better chunks
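The budget arithmetic in the scenarios above is easy to automate before each call. A minimal sketch, using the same token figures; `context_budget` is a hypothetical helper, and the counts passed in would come from your tokenizer or an estimate.

```python
def context_budget(window: int, system: int, rag: int, history: int, query: int) -> dict:
    """Report how many tokens the prompt uses and how many remain for the response."""
    used = system + rag + history + query
    return {
        "used": used,
        "remaining": window - used,  # negative means the prompt overflows the window
        "fits": used < window,
        "utilization_pct": round(100 * used / window, 1),
    }

simple_chat = context_budget(4_000, system=50, rag=0, history=500, query=20)
rag_small = context_budget(4_000, system=50, rag=4_000, history=300, query=20)
rag_large = context_budget(128_000, system=50, rag=4_000, history=300, query=20)

for label, result in [("simple chat", simple_chat), ("RAG on 4K", rag_small), ("RAG on 128K", rag_large)]:
    print(label, result)
```

A check like this catches the overflow case (a negative `remaining`) before the request is sent, which matters most with runtimes that truncate silently instead of raising an error.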

Production RAG best practices:

Chunk size: 300-500 tokens per document (balances context coverage with window size)
Retrieval strategy: Hybrid BM25 + semantic search to find top-3 relevant chunks (900-1,500 tokens total at the chunk size above)
Context compression: Use extractive summarization to reduce RAG context by 30-50% before injection
Long-context models: Prefer models with 32K+ windows for RAG (Llama 4 Scout has 10M tokens—overkill for RAG but future-proof)

Part 9: Real-World Token Patterns & Monitoring

Understanding actual token consumption in production requires instrumenting your LLM calls:

# Production token monitoring — track where your tokens (and money) are going
import ollama
from collections import defaultdict

class TokenMonitor:
    """
    Instrument your LLM calls to understand token consumption in production.
    Why this matters: A 10% token reduction across your fleet saves $1000s/year.
    Without monitoring, you'll never know if your prompts are bloated.
    """
    def __init__(self):
        # Track per-model metrics: tokens_in, tokens_out, call count
        self.metrics = defaultdict(lambda: {"tokens_in": 0, "tokens_out": 0, "calls": 0, "errors": 0})
    
    def log_inference(self, model: str, prompt_tokens: int, completion_tokens: int, error: bool = False):
        """
        Log token usage per model/endpoint after each LLM call.
        
        Args:
            model: Model identifier (e.g., "qwen3:14b", "gpt-4o-api")
            prompt_tokens: Tokens in the input/context (counted against context window)
            completion_tokens: Tokens in the model's response (usually priced higher than input)
            error: Set to True if this call failed — helps identify retries and failures
        
        Real-world example:
            If prompt_tokens=800 and completion_tokens=200, you have context bloat.
            Investigate what's in those 800 tokens (system prompt, history, RAG docs).
        """
        self.metrics[model]["tokens_in"] += prompt_tokens
        self.metrics[model]["tokens_out"] += completion_tokens
        self.metrics[model]["calls"] += 1
        if error:
            self.metrics[model]["errors"] += 1
    
    def report(self):
        """
        Generate production insights: average tokens per call, total cost, error rate.
        Run this daily to catch token creep (gradual increase over time).
        """
        for model, data in self.metrics.items():
            avg_in = data["tokens_in"] / data["calls"] if data["calls"] else 0
            avg_out = data["tokens_out"] / data["calls"] if data["calls"] else 0
            error_rate = (data["errors"] / data["calls"] * 100) if data["calls"] else 0
            
            # Cost calculation assumes $0.01 per 1K input tokens and $0.03 per 1K output tokens
            # Adjust based on your model's pricing (local Ollama has no per-token fee; OpenAI varies)
            total_cost = (data["tokens_in"] * 0.01 + data["tokens_out"] * 0.03) / 1000
            
            print(f"\n📊 {model}:")
            print(f"   Average input tokens: {avg_in:.0f} (how much context you're using)")
            print(f"   Average output tokens: {avg_out:.0f} (how long the response is)")
            print(f"   Total cost so far: ${total_cost:.2f}")
            print(f"   Total API calls: {data['calls']}")
            print(f"   Error rate: {error_rate:.1f}% (retries, failed calls)")
            
            # Developer advice: if avg_in > 2000, you're sending too much context
            if avg_in > 2000:
                print(f"   ⚠️  Context bloat detected! Avg input > 2000 tokens.")
                print(f"      Reduce conversation history or enable vector DB retrieval.")

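# Usage in your agent: wrap every LLM call with token tracking
monitor = TokenMonitor()
messages = [{"role": "user", "content": "Explain what a context window is."}]  # example message
prompt_tokens = sum(len(m["content"]) for m in messages) // 4  # Rough estimate, computed before the call

try:
    response = ollama.chat(model="qwen3:14b", messages=messages)
    completion_tokens = len(response["message"]["content"]) // 4  # Rough estimate
    monitor.log_inference("qwen3:14b", prompt_tokens, completion_tokens)
except Exception:
    # prompt_tokens is already defined here, so failed calls are still counted
    monitor.log_inference("qwen3:14b", prompt_tokens, 0, error=True)
    raise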

# Check metrics daily to spot token creep
monitor.report()

Common Token Consumption Patterns

| Use case | Avg input tokens | Avg output tokens | Cost/query (at $0.01 input / $0.03 output per 1K tokens) |
|:---|:---|:---|:---|
| Simple Q&A | 50 | 100 | $0.004 |
| Code generation | 200 | 300 | $0.011 |
| RAG retrieval + reasoning | 800 | 150 | $0.013 |
| Multi-turn conversation (5 turns) | 500 | 100 | $0.008 |
| Document summarization (10K chars) | 2,500 | 500 | $0.040 |
| Agentic loop (3 tool calls) | 600 | 400 | $0.018 |
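The cost column can be reproduced from the input and output token counts. A short sketch, assuming the prices in the table header ($0.01 per 1K input tokens, $0.03 per 1K output tokens); `cost_per_query` is a name chosen for this example.

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_1k: float = 0.01,
                   output_price_per_1k: float = 0.03) -> float:
    """Dollar cost of one request with separate input and output prices."""
    return (input_tokens * input_price_per_1k + output_tokens * output_price_per_1k) / 1000

patterns = {
    "Simple Q&A": (50, 100),
    "Code generation": (200, 300),
    "RAG retrieval + reasoning": (800, 150),
    "Multi-turn conversation": (500, 100),
    "Document summarization": (2_500, 500),
    "Agentic loop": (600, 400),
}
for name, (tokens_in, tokens_out) in patterns.items():
    print(f"{name}: ${cost_per_query(tokens_in, tokens_out):.4f}")
```

Because output tokens are priced three times higher here, a chatty response can cost more than a long prompt, which is worth checking before trimming context.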

Part 10: Troubleshooting & Common Token Mistakes

Quick Troubleshooting Decision Tree

Problem: "LLM isn't working as expected"
  ↓
Is the output nonsense/incoherent?
  ├─ Yes → Is temperature > 1.2?
  │         ├─ Yes → Reduce to 0.7 (too much randomness)
  │         └─ No → Check context window (too much input?)
  └─ No → Output is coherent but wrong/inconsistent?
           ├─ Same prompt, different answers? → Set temperature=0
           └─ Always wrong answer? → Check prompt quality (unclear instructions)

Problem: "API bill is way too high"
  ↓
Run: print(average_input_tokens)
  ├─ > 2,000? → Context bloat (too much conversation history)
  ├─ 500-2,000? → Acceptable (check RAG chunks)
  └─ < 500? → Check output token waste (model talking too much?)

Problem: "Local inference is very slow"
  ↓
Check token throughput: time curl... | jq ...
  ├─ < 10 tok/sec? → Model too large for VRAM (spilling to CPU) or GPU not being used
  │  ├─ Check: nvidia-smi (is GPU in use?)
  │  └─ Switch to Q4_K_M (balanced quality + speed)
  ├─ 10-40 tok/sec? → Acceptable (maybe upgrade GPU/CPU)
  └─ > 40 tok/sec? → Great (you're fine)

Common Mistakes & Fixes

| Mistake | Symptom | Fix |
|:---|:---|:---|
| temperature > 1.2 | Output is gibberish/hallucinations | Set to 0.7 or lower |
| Huge context window | Slow inference, high token count | Truncate old conversation turns (keep recent 3-5 turns only) |
| No RAG, pure prompt | Answers go out of date | Add vector DB retrieval for current info |
| temperature=0 always | Boring, repetitive responses | Use 0.7 for user-facing chat |
| Context limit exceeded | API error: "context_length_exceeded" | Use a larger-context model or enable RAG + chunk retrieval |
| Wrong token count estimate | Budget exceeded unexpectedly | Use a real tokenizer (e.g. tiktoken's encode()) for exact counts, not the characters/4 estimate |

“My API bill is way too high — where are the tokens going?”

Debug checklist:

Check context window inflation: Log input_tokens per request. If average input > 2,000, you have context bloat.

if prompt_tokens > 2000:
    print("WARNING: Large context detected. Review conversation history truncation.")

Identify token-heavy operations: Summarization and RAG are the biggest token consumers (800+ input tokens each).

Reduce conversation history: Instead of keeping entire conversation in context, store in vector DB and retrieve only relevant turns:

# Bad: Send entire 10-turn conversation
context = all_previous_turns  # ~500 tokens

# Good: Retrieve similar past turns
relevant_turns = vector_db.search(current_query, top_k=2)  # ~200 tokens

Use token budgets: Set hard limits per request:

max_input_tokens = 1000  # Budget per request
if len(messages) * 75 > max_input_tokens:  # Rough estimate: ~75 tokens per message
    prune_old_messages()  # Drop or summarize the oldest turns
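One way to implement the pruning step is to keep the system prompt and then add turns from newest to oldest until the budget runs out. The sketch below does this with the characters/4 estimate; `prune_history` and `estimate_tokens` are hypothetical helpers, and a production version would count tokens with a real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def prune_history(messages: list[dict], max_input_tokens: int) -> list[dict]:
    """Keep system messages plus the most recent turns that fit the token budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_input_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break                    # this turn and everything older is dropped
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order

history = [
    {"role": "system", "content": "s" * 40},      # ~10 tokens
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "user", "content": "c" * 80},        # ~20 tokens
]
pruned = prune_history(history, max_input_tokens=130)
print([m["role"] for m in pruned])  # → ['system', 'assistant', 'user']
```

Dropping whole turns from the oldest end keeps the most recent exchange intact; summarizing the dropped turns is a common refinement when earlier context still matters.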

“My local Ollama inference is slow — how do I measure token throughput?”

# Measure tokens-per-second
time curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "Write a Python function to calculate Fibonacci numbers",
  "stream": false
}' | jq '.eval_count / .eval_duration * 1e9'

# Compare across quantizations:
# FP16: ~20 tok/sec
# Q8_0: ~40 tok/sec
# Q4_K_M: ~80 tok/sec
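The same calculation as the `jq` expression can be done in Python once you have the response fields. A small sketch: `eval_count` and `eval_duration` are the field names used in the command above (duration in nanoseconds), while the sample values and the helper name `tokens_per_second` are made up for illustration.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput; eval_duration is in nanoseconds, hence the 1e9 factor."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return eval_count / eval_duration_ns * 1e9

# Hypothetical response fields: 240 tokens generated in 3 seconds
sample = {"eval_count": 240, "eval_duration": 3_000_000_000}
rate = tokens_per_second(sample["eval_count"], sample["eval_duration"])
print(f"{rate:.1f} tok/sec")  # → 80.0 tok/sec
```

Measuring from these fields excludes model load time and prompt processing, so it isolates generation speed better than wall-clock timing of the whole request.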

Conclusion

LLMs are statistical next-token predictors that have compressed world knowledge into billions of numerical weights. Understanding tokens (the input unit), context windows (the working memory), temperature (randomness control), and the open-weight/API divide (the sovereignty question) gives you the mental model to use them effectively. The practical takeaway: for any application where input data is sensitive, open-weight models running locally via Ollama are the correct choice, with output quality that is comparable for many tasks and zero data transmitted externally.

See Best Open-Weight AI Models 2026 for model selection, and LangChain Local Inference for advanced prompt techniques.
