Key Takeaways
- Try prompt engineering first, always. It’s free, instant, and solves most problems.
- RAG for knowledge problems: “The model doesn’t know our products” → RAG. “The model doesn’t know recent events” → RAG. “The model can’t access our docs” → RAG.
- Fine-tuning for behaviour problems: “The model won’t format output correctly” → fine-tuning. “The model doesn’t match our brand voice” → fine-tuning. “The model uses wrong domain terminology” → fine-tuning.
- They combine: Fine-tuned model + RAG + optimised prompts = maximum capability.
Introduction
Direct Answer: When should I use RAG vs fine-tuning vs prompt engineering for customising an LLM in 2026?
Start with prompt engineering — a well-designed system prompt with few-shot examples and format constraints solves most problems with zero cost and zero data. Use RAG when the model needs access to specific documents, up-to-date information, or proprietary knowledge that isn’t in the model’s training data — RAG retrieves relevant context at query time without modifying the model. Use fine-tuning when prompt engineering doesn’t achieve the required output style, format, or domain-specific behaviour — fine-tuning trains a new model version on 200–5,000 examples of your desired output. In practice: start with prompt engineering, add RAG if knowledge gaps are the problem, add fine-tuning only if consistent style/format is still the issue after optimising prompts. All three approaches work with local Ollama models at zero per-query cost.
The Decision Framework
START HERE: What is the actual problem?
│
├─► "The model gives wrong/hallucinated answers"
│ └─► Is the information in the model's training data?
│ ├─► YES → Better prompt + chain-of-thought → Prompt Engineering
│ └─► NO → The model doesn't have this knowledge → RAG
│
├─► "The model knows the information but formats it wrong"
│ └─► Prompt Engineering: specify format explicitly + few-shot examples
│ └─► Still wrong after 10 iterations? → Fine-Tuning
│
├─► "The model is too slow for my use case"
│ └─► Prompt Engineering: shorter prompts, smaller model, cached responses
│
├─► "The model uses wrong terminology / brand voice"
│ └─► Try system prompt first → if inconsistent → Fine-Tuning
│
└─► "I need the model to access real-time / private data"
└─► RAG (retrieval-augmented generation)
Detailed Comparison
| Dimension | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Setup time | Minutes | Hours–Days | Days–Weeks |
| Data required | None | Documents | 200–5000 labelled examples |
| Infrastructure | Just the LLM | LLM + vector DB + embedding model | GPU + training pipeline |
| Cost (one-time) | $0 | $0–$50 (storage) | $0 (local GPU) or $20–500 (cloud) |
| Cost (per query) | LLM inference only | LLM + retrieval overhead | LLM inference only |
| Knowledge freshness | Static (training cutoff) | Real-time (update the docs) | Static (training cutoff) |
| Can cite sources? | No | Yes (retrieved chunks) | No |
| Changes model weights? | No | No | Yes |
| Reversible? | Yes (edit prompt) | Yes (update/delete docs) | Requires retraining |
| Best for | Format, tone, behaviour | Knowledge, Q&A, grounding | Style, domain vocabulary, format |
Part 1: Prompt Engineering First
Before anything else, optimise the prompt:
import ollama
# ❌ Vague prompt — inconsistent results
bad = ollama.chat(model="qwen3:14b", messages=[
    {"role": "user", "content": "Tell me about our return policy"}
])
# ✅ Specific prompt with role, constraints, and format
good = ollama.chat(model="qwen3:14b", messages=[
    {"role": "system", "content": """You are a customer support agent for Acme Corp.
Answer questions about our 30-day return policy.
Rules:
- Return must be within 30 days of purchase
- Item must be unused and in original packaging
- Digital products are non-refundable
Format your answer in 2-3 sentences maximum.
If you don't know the answer, say: 'Please contact [email protected]'"""},
    {"role": "user", "content": "Tell me about our return policy"}
])
print("Bad:", bad["message"]["content"][:100])
print("Good:", good["message"]["content"][:100])
Expected output:
Bad: Our return policy allows customers to return most items within a reasonable timeframe...
Good: You can return unused items in original packaging within 30 days of purchase. Digital products are non-refundable. For assistance, contact [email protected].
As a rule of thumb, prompt engineering resolves most problems (roughly 80% by this guide's estimate). Try at least 10 prompt variations before moving to RAG or fine-tuning.
Part 2: When to Add RAG
Add RAG when the model lacks the necessary knowledge:
import ollama
# Without RAG: the model doesn't know your product catalogue
r = ollama.chat(model="qwen3:14b", messages=[
    {"role": "system", "content": "You are a support agent for Acme Corp."},
    {"role": "user", "content": "What are the specs for the ProMax 4000?"}
])
print("Without RAG:", r["message"]["content"])
# Output: "I don't have specific information about the ProMax 4000..."
# With RAG: retrieve from the product database
from your_rag_module import retrieve_context  # Your RAG implementation
context = retrieve_context("ProMax 4000 specifications")
r = ollama.chat(model="qwen3:14b", messages=[
    {"role": "system", "content": f"You are a support agent. Use ONLY this context:\n{context}"},
    {"role": "user", "content": "What are the specs for the ProMax 4000?"}
])
print("With RAG:", r["message"]["content"])
# Output: "The ProMax 4000 has 16GB RAM, 512GB SSD, Intel Core i9-14900K..."
RAG use cases:
- Product catalogues and documentation
- Company policy and procedure documents
- Research papers and knowledge bases
- Recent news and events (updated docs)
- Customer-specific data (their orders, history)
Part 3: When Fine-Tuning Is Warranted
Fine-tune when prompt engineering doesn’t achieve consistent behaviour:
# Problem: Model won't reliably output structured support tickets
# Even with a detailed system prompt, 20% of outputs are in the wrong format
# Solution: Fine-tune on 500 examples of correct ticket format
# Training data format (JSONL):
training_examples = [
    {
        "instruction": "Convert this support email to a ticket",
        "input": "Hi, my payment failed twice today",
        "output": '{"priority": "high", "category": "billing", "title": "Payment failure", "description": "Customer reports payment failed twice on same day"}'
    },
    # ... 499 more examples
]
# After fine-tuning: 98%+ correct format, consistent every time
# Before: 80% correct, 20% needed manual correction
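The snippet above holds the examples in a Python list, while JSONL stores one JSON object per line. A minimal sketch of writing and re-reading such a file follows; the helper names `write_jsonl` and `read_jsonl` and the file name `train.jsonl` are illustrative assumptions, and the field names your training tool expects may differ.

```python
import json
import os
import tempfile

# One illustrative record in the instruction/input/output shape used above.
examples = [
    {
        "instruction": "Convert this support email to a ticket",
        "input": "Hi, my payment failed twice today",
        "output": json.dumps({"priority": "high", "category": "billing"}),
    },
]

def write_jsonl(records, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read the file back, skipping blank lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_jsonl(examples, path)
loaded = read_jsonl(path)
print(len(loaded), "example(s) round-tripped")
```

Reading the file back before training is a cheap way to catch records whose `output` field is not valid JSON.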
Fine-tuning use cases:
- Consistent output format (JSON schema, specific structure)
- Domain-specific vocabulary (medical, legal, proprietary)
- Style and tone matching (brand voice, writing style)
- Instruction following for specific task types
Part 4: Combining All Three
The highest-quality production setup uses all three:
LAYER 1 — Fine-tuned model:
Model trained to always output JSON, use our terminology, match our tone
COST: One-time training (hours on GPU, $0 local)
LAYER 2 — RAG retrieval:
Each query retrieves relevant product docs, policies, customer data
COST: Embedding + vector search (<10ms, negligible)
LAYER 3 — Optimised system prompt:
Role, constraints, output format, edge case handling
COST: Included in inference
# Full stack: fine-tuned model + RAG + prompt
import ollama
from your_rag_module import retrieve_context  # Your RAG implementation
def answer_query(user_query: str) -> str:
    # Layer 2: Retrieve relevant context
    context = retrieve_context(user_query)
    # Layer 3: Optimised system prompt
    system = f"""You are AcmeBot, Acme Corp's support agent.
[Relevant context from our knowledge base]
{context}
Rules:
- Answer only from the provided context
- Format: JSON with keys: answer, confidence (0-1), needs_human (bool)
- If context is insufficient, set needs_human: true"""
    # Layer 1: Fine-tuned model (knows our format, terminology, tone)
    r = ollama.chat(
        model="acmebot:v2",  # Your fine-tuned Ollama model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user_query}
        ],
        format="json"
    )
    return r["message"]["content"]
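Because the function returns the model's reply as a JSON string, the caller still has to parse it. The sketch below shows one possible way to do that defensively, escalating to a human whenever the reply is malformed. `parse_reply` is a hypothetical helper, and the fallback policy is an assumption, not something the stack above prescribes.

```python
import json

# Keys the system prompt asks the model to return.
REQUIRED_KEYS = {"answer", "confidence", "needs_human"}

def parse_reply(raw: str) -> dict:
    """Parse the model's JSON reply; fall back to a human hand-off if it is malformed."""
    fallback = {"answer": "", "confidence": 0.0, "needs_human": True}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    # Reject valid JSON that is not an object or is missing a required key.
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return fallback
    return data

ok = parse_reply('{"answer": "30 days", "confidence": 0.9, "needs_human": false}')
bad = parse_reply("Sorry, I cannot help with that.")
print(ok["needs_human"], bad["needs_human"])
```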
Cost and Timeline Summary
SCENARIO: Build a customer support chatbot
Option A — Prompt Engineering only:
Timeline: 1 day
Data needed: None
Cost: $0
Quality: 70-80% correct responses
Option B — Prompt Engineering + RAG:
Timeline: 3-5 days
Data needed: Company docs and FAQs
Cost: $0-$20 (storage, pgvector)
Quality: 85-92% correct, grounded responses
Option C — All three (Prompt + RAG + Fine-tuning):
Timeline: 2-3 weeks
Data needed: 500+ labelled examples + company docs
Cost: $0 (local GPU) or $50-200 (cloud training)
Quality: 93-98% correct, consistent format
Recommendation: Start with Option B. Fine-tune (Option C) only
if format consistency remains a problem after 2 weeks of prompt iteration.
Conclusion
Prompt engineering solves most problems. RAG closes knowledge gaps. Fine-tuning fixes inconsistent behaviour. The three are complementary, not competitive: the best production systems use all three in layers. The key is sequencing: prompt first, RAG second, fine-tune last.
See RAG Tutorial 2026 for the RAG implementation and Fine-Tune Llama 4 with QLoRA and Unsloth for the fine-tuning implementation.
People Also Ask
How many examples do I need to fine-tune an LLM?
For instruction fine-tuning (teaching a specific output format or domain style): 200–1,000 high-quality examples are typically sufficient. For domain adaptation (teaching technical vocabulary): 1,000–5,000 examples. For full capability training (replicating GPT-4): billions of tokens — impractical without massive compute. Quality dramatically outweighs quantity: 300 carefully curated, consistent examples reliably outperform 5,000 scraped, noisy examples. The litmus test: if a human expert can’t consistently produce the target output from your training examples, the model won’t learn to either.
Is RAG or fine-tuning better for up-to-date information?
RAG is clearly better for up-to-date information. Fine-tuning creates a static model snapshot — knowledge has a cutoff at the training date, and updating requires retraining. RAG retrieves from a live document store — update the docs and the model instantly answers with current information. The only case where fine-tuning is preferable for “recent” knowledge is when you have a large static corpus that doesn’t change and needs to be always available without retrieval latency.
Part 6: Retrieval Architectures for RAG
RAG is not a single pattern — it consists of several architectural choices.
6.1 Hybrid retrieval: sparse + dense
Combine keyword search with vector search. Use a sparse search engine such as Elasticsearch or PostgreSQL full-text search to filter candidate documents, then rerank them with dense embeddings.
SELECT id, title, text
FROM docs
WHERE to_tsvector('english', text) @@ plainto_tsquery('english', $1)
ORDER BY ts_rank_cd(to_tsvector('english', text), plainto_tsquery('english', $1)) DESC
LIMIT 100;
Then compute dense embeddings for the top 100 documents and rerank by cosine similarity.
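The reranking step could be sketched as follows. This is a minimal illustration that assumes the query and candidate embeddings have already been computed by whichever embedding model you use; `rerank_by_cosine` and the toy two-dimensional vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rerank_by_cosine(query_vec, candidates, top_k=10):
    """candidates: list of (doc_id, embedding) pairs from the keyword stage."""
    scored = [(doc_id, cosine_similarity(query_vec, emb)) for doc_id, emb in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data standing in for the top keyword hits and their embeddings.
candidates = [("doc-a", [1.0, 0.0]), ("doc-b", [0.6, 0.8]), ("doc-c", [0.0, 1.0])]
ranked = rerank_by_cosine([0.0, 1.0], candidates, top_k=2)
print(ranked)
```

In a real pipeline the candidate embeddings would normally be precomputed and stored, so only the query needs embedding at request time.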
6.2 Local vector stores
For sovereignty, keep the vector store on-premises:
- pgvector inside PostgreSQL
- Qdrant in a private Docker container
- ChromaDB on local disk
This avoids sending embeddings or queries to any external vendor.
6.3 Document chunking and context windows
Split documents into chunks small enough that several retrieved chunks, the system prompt, and the answer all fit in your LLM context window. For a 14B model with a 4,096-token window, 300–500 token chunks work well.
Use overlapping chunks to preserve continuity:
- chunk 1: tokens 0–400
- chunk 2: tokens 320–720
- chunk 3: tokens 640–1040
The overlap means text that straddles a chunk boundary still appears intact in at least one chunk, so the retriever does not lose context at the seams.
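The overlap scheme listed above (400-token chunks that share 80 tokens with their neighbour) can be sketched like this. The function name `chunk_tokens` is illustrative, and the input is assumed to be an already tokenised list; real token counts depend on the tokenizer of the model you use.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=80):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap  # 320 with the defaults
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append((start, tokens[start:start + chunk_size]))
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Placeholder tokens standing in for a tokenised document.
tokens = [f"tok{i}" for i in range(1040)]
chunks = chunk_tokens(tokens)
starts = [start for start, _ in chunks]
print(starts)
```

With these defaults the chunk starts land on tokens 0, 320 and 640, matching the list above.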
6.4 Prompting with the retriever output
When assembling the prompt, provide the LLM with only the top K chunks plus an explicit instruction:
You are a knowledge assistant. Use ONLY the provided sources below when answering. If the answer is not in the sources, say "I don't know." Output in complete sentences.
Source 1:
<chunk text>
Source 2:
<chunk text>
This reduces hallucination and keeps the model grounded.
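One way to assemble that prompt from retrieved chunks is sketched below. `build_grounded_prompt` is a hypothetical helper, and the chunks are assumed to arrive already sorted by relevance.

```python
# Instruction text taken from the template above.
INSTRUCTION = (
    "You are a knowledge assistant. Use ONLY the provided sources below when answering. "
    'If the answer is not in the sources, say "I don\'t know." '
    "Output in complete sentences."
)

def build_grounded_prompt(chunks, k=3):
    """Number the top-k chunks as Source 1..k beneath the instruction."""
    lines = [INSTRUCTION]
    for i, chunk in enumerate(chunks[:k], start=1):
        lines.append(f"Source {i}:")
        lines.append(chunk.strip())
    return "\n".join(lines)

# Illustrative chunks; the third is dropped because k=2.
retrieved = [
    "Returns are accepted within 30 days of purchase.",
    "Digital products are non-refundable.",
    "Our office is closed on public holidays.",
]
prompt = build_grounded_prompt(retrieved, k=2)
print(prompt)
```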
Part 7: Evaluating RAG Quality
Measure retrieval quality, not just final answer quality.
7.1 Retrieval precision and recall
Precision: percentage of retrieved chunks that are actually relevant. Recall: percentage of relevant chunks included in the top results.
As a working target, aim for recall of 90% or higher within the top 10 chunks.
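Both metrics are simple to compute once you have a labelled set of relevant chunk IDs for each test query. The sketch below assumes chunk IDs are unique within a result list; the function name and the sample IDs are illustrative.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["c1", "c7", "c3", "c9", "c4"]  # ranked retriever output
relevant = ["c1", "c3", "c5"]               # hand-labelled ground truth
precision, recall = precision_recall_at_k(retrieved, relevant, k=5)
print(f"precision@5={precision:.2f} recall@5={recall:.2f}")
```

Averaging these values over a fixed set of test queries gives a number you can track between retrieval changes.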
7.2 Answer fidelity
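Compare the model’s answer to the source documents. If the answer contains claims the retrieved sources do not support, check retrieval first (was the relevant chunk returned?) and then the prompt (was the model told to stay within the sources?).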
7.3 Human-in-the-loop evaluation
Use domain experts to rate answers on correctness, completeness, and hallucination. Track the human score over time as you iterate.
7.4 Explainability
Store the source references and chunk IDs with each answer. Use them for audits and debugging.
Part 8: Fine-Tuning Data Strategy
Fine-tuning works best with clean, consistent examples.
8.1 Data quality over quantity
A dataset of 200 high-quality examples often beats 2,000 noisy ones. Each example should demonstrate the exact output pattern you want the model to learn.
8.2 Input/output pair design
For instruction fine-tuning, use a structure like:
{
  "instruction": "Summarize this paragraph in two sentences.",
  "input": "...",
  "output": "..."
}
For style tuning, the instruction can stay generic; the output itself should be written in the target style.
8.3 Validation and held-out prompts
Keep a validation set of prompts that the model has not seen during training. Use it to check whether the fine-tuned model generalises or simply memorises.
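A held-out split can be as simple as a seeded shuffle, so the same examples stay in the validation set across runs. This is a sketch under the assumption that the examples are independent of one another; `split_examples` and the 10% fraction are illustrative choices.

```python
import random

def split_examples(examples, validation_fraction=0.1, seed=42):
    """Return (train, validation) using a deterministic shuffle."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Placeholder records in the instruction/input/output shape.
examples = [{"instruction": f"task {i}", "input": "...", "output": "..."} for i in range(50)]
train, validation = split_examples(examples)
print(len(train), len(validation))
```

If many examples are near-duplicates of each other, deduplicate before splitting; otherwise the validation set overstates how well the model generalises.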
8.4 Iterating on the dataset
After the first round of fine-tuning, review outputs and add examples for failure cases. Focus on where the model still misses the required format or tone.
Part 9: Cost and Maintenance Comparison
A sovereign system should document not only the initial implementation but also the ongoing cost.
9.1 Maintenance burden
- Prompt engineering: low maintenance, easy to update
- RAG: moderate maintenance, requires document updates and index rebuilds
- Fine-tuning: higher maintenance, requires retraining when requirements change
9.2 Infrastructure cost
- Prompt engineering: only inference cost
- RAG: inference + storage + retrieval compute
- Fine-tuning: training compute + model version management
9.3 Operational risk
RAG adds retrieval complexity and a second data store. Fine-tuning adds model version drift and validation risk. Prompt engineering adds the least infrastructure risk.
9.4 Governance checklist
- Prompt templates are version controlled
- Retrieval documents are audited and timestamped
- Fine-tuned models are tracked with version metadata
- Answer provenance is stored with each response
- Fallback behaviours are documented
Part 10: Practical Rule of Thumb
For a sovereign local AI deployment:
- Start with prompt engineering.
- Add RAG when the model needs current or proprietary knowledge.
- Fine-tune only if output style or format still fails after prompt and retrieval iteration.
This sequence keeps the system manageable and avoids unnecessary training cycles.
Part 11: Embedding and Vector Quality
The quality of your RAG system depends heavily on the embeddings.
11.1 Embedding model selection
Choose an embedding model that matches your data type. For text, use an embedding model trained on semantic similarity. For code, use code-specific embeddings. For multilingual data, use a multilingual text embedding model.
11.2 Embedding caching and storage
Generate embeddings once and store them locally. Each document should have a metadata record with the embedding model version and a timestamp.
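A minimal sketch of that caching idea follows, using an in-memory dictionary keyed by a content hash. The class name `EmbeddingCache`, the `fake_embed` stand-in, and the version string `demo-embedder-v1` are all assumptions for illustration; a real deployment would persist the records and call an actual embedding model.

```python
import hashlib
from datetime import datetime, timezone

class EmbeddingCache:
    """In-memory cache keyed by content hash; a real store would persist to disk."""

    def __init__(self, embed_fn, model_version):
        self.embed_fn = embed_fn
        self.model_version = model_version
        self._store = {}

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        record = self._store.get(key)
        # Recompute if the text is new or was embedded by a different model version.
        if record is None or record["model_version"] != self.model_version:
            record = {
                "vector": self.embed_fn(text),
                "model_version": self.model_version,
                "created_at": datetime.now(timezone.utc).isoformat(),
            }
            self._store[key] = record
        return record

calls = []

def fake_embed(text):
    """Stand-in for a real embedding call; records how often it runs."""
    calls.append(text)
    return [float(len(text)), 0.0]

cache = EmbeddingCache(fake_embed, model_version="demo-embedder-v1")
first = cache.get("Return policy: 30 days.")
second = cache.get("Return policy: 30 days.")
print(len(calls))
```

Storing the model version alongside each vector matters because embeddings from different models are not comparable, so a model upgrade should invalidate old entries.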
11.3 Vector index tuning
The performance of HNSW indexes depends on parameters such as M, ef_construction, and ef_search. A typical config for production is:
M = 16
ef_construction = 200
ef_search = 200
These settings balance build time and query quality. For lower RAM systems, use smaller values and test recall.
11.4 Document chunk scoring
When you retrieve multiple chunks, score them not only by similarity but also by relevance heuristics such as document freshness, source trust level, and user intent match. A simple blended score can improve final answer accuracy.
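A blended score could look like the sketch below, which combines similarity with freshness and source trust. The weights, the 180-day half-life, and the field names are assumptions chosen for the example, not recommended values, and the user-intent signal mentioned above is left out for brevity.

```python
def blended_score(similarity, age_days, trust, weights=(0.7, 0.2, 0.1), half_life_days=180):
    """Weighted sum of similarity, exponentially decayed freshness, and trust (all 0-1)."""
    freshness = 0.5 ** (age_days / half_life_days)
    w_sim, w_fresh, w_trust = weights
    return w_sim * similarity + w_fresh * freshness + w_trust * trust

# Two illustrative chunks: an older, slightly closer match and a fresh one.
chunks = [
    {"id": "old-manual", "similarity": 0.82, "age_days": 720, "trust": 1.0},
    {"id": "new-faq", "similarity": 0.80, "age_days": 0, "trust": 0.8},
]
ranked = sorted(
    chunks,
    key=lambda c: blended_score(c["similarity"], c["age_days"], c["trust"]),
    reverse=True,
)
print([c["id"] for c in ranked])
```

Here the fresh FAQ outranks the slightly more similar but two-year-old manual. Whether that is the right trade-off depends on your content, so tune the weights against a labelled test set.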
Part 12: Prompt Templates and Guardrails
Use prompt templates for every RAG query.
You are an expert assistant. Use ONLY the following retrieved sources to answer the user's question.
If the answer is not contained in the sources, say "I don't know." Do not hallucinate.
Sources:
{sources}
Question: {question}
Answer:
Keep the template stable and use placeholders for the user query and source text. This makes your system predictable.
Part 13: Evaluating Fine-Tuned Outputs
After fine-tuning, measure quality with a consistent set of metrics.
13.1 Exact match and similarity
For structured outputs, use exact match or normalized string comparison. For free-form outputs, use semantic similarity against reference answers.
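For the structured case, the two comparisons might be sketched as follows. The helper names are illustrative; semantic similarity for free-form text needs an embedding model and is not shown.

```python
import json
import re

def normalize(text):
    """Collapse whitespace, trim, and lower-case before comparing."""
    return re.sub(r"\s+", " ", text).strip().lower()

def exact_match(prediction, reference):
    """Normalized string comparison."""
    return normalize(prediction) == normalize(reference)

def json_match(prediction, reference):
    """Compare parsed JSON so key order and spacing do not matter."""
    try:
        return json.loads(prediction) == json.loads(reference)
    except json.JSONDecodeError:
        return False

print(exact_match("  Payment   Failure ", "payment failure"))
print(json_match('{"a": 1, "b": 2}', '{"b":2,"a":1}'))
```

Comparing parsed objects rather than raw strings avoids penalising the model for harmless differences in key order or whitespace.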
13.2 Human evaluation
Use a checklist for human review:
- Does the output follow the required format?
- Is the tone appropriate?
- Does it use the correct terminology?
- Does it avoid hallucinations?
13.3 Regression testing
Keep a regression suite of prompts that previously failed. Re-run it after every new fine-tuning iteration.
Part 14: Operationalising your knowledge stack
A production RAG/fine-tuned system should have clear operational boundaries.
14.1 Document update workflows
When documents change, update the vector index and re-run retrieval tests. Keep a changelog of document refreshes.
14.2 Model versioning
Track fine-tuned model versions with metadata: training date, dataset hash, prompt template version, and evaluation scores.
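Such a metadata record might be built like this. The field names, the `build_model_record` helper, and the sample score are placeholders; the dataset hash is a SHA-256 over a canonical JSON dump, which is one reasonable choice among several.

```python
import hashlib
import json
from datetime import date

def dataset_hash(records):
    """SHA-256 of a canonical JSON dump, so identical datasets hash identically."""
    canonical = json.dumps(records, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def build_model_record(name, records, prompt_template_version, eval_scores, training_date=None):
    """Assemble the version metadata for one fine-tuned model (hypothetical helper)."""
    return {
        "model": name,
        "training_date": (training_date or date.today()).isoformat(),
        "dataset_hash": dataset_hash(records),
        "prompt_template_version": prompt_template_version,
        "evaluation_scores": eval_scores,
    }

dataset = [{"instruction": "Convert this support email to a ticket", "input": "...", "output": "..."}]
record = build_model_record(
    "acmebot:v2",
    dataset,
    prompt_template_version="v3",            # placeholder version label
    eval_scores={"format_accuracy": 0.98},   # placeholder score
    training_date=date(2026, 1, 15),         # placeholder date
)
print(json.dumps(record, indent=2))
```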
14.3 Rollback procedures
If a fine-tuned model performs worse in production, roll back to the previous version. Keep a stable version as the default and a candidate version for canary testing.
Part 15: Cost-saving patterns
Even when everything runs locally, compute costs matter.
15.1 Reduce document corpus size
Use document filtering to keep the vector store focused on relevant documents only. More data is not always better.
15.2 Use smaller models for retrieval-only tasks
For retrieval and reranking, a smaller model or embedder may be sufficient. Reserve the larger generative model for final answer composition.
15.3 Prune old vectors
If documents are stale or no longer relevant, remove them from the index instead of leaving them to pollute results.
Part 16: Governance and Auditability
Sovereignty means you can explain and audit every decision.
16.1 Answer provenance
Store the chunk IDs and sources that contributed to each answer. This creates a traceable path from question to response.
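A provenance record can be a small JSON document written next to each answer. The sketch below assumes each retrieved chunk is a dict with `chunk_id`, `document`, and `score` keys; those names, and `provenance_record` itself, are illustrative.

```python
import json
from datetime import datetime, timezone

def provenance_record(question, answer, retrieved_chunks):
    """Keep only the identifiers and scores, not the full chunk text."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "sources": [
            {"chunk_id": c["chunk_id"], "document": c["document"], "score": c["score"]}
            for c in retrieved_chunks
        ],
    }

record = provenance_record(
    "What is the return window?",
    "Unused items can be returned within 30 days.",
    [{"chunk_id": "policy-007", "document": "returns.md", "score": 0.91, "text": "..."}],
)
log_line = json.dumps(record)  # one line per answer, ready for an append-only log
print(log_line)
```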
16.2 Feedback loops
If users flag an answer as incorrect, store the feedback and use it to improve retrieval, prompt templates, or fine-tuning examples.
16.3 Local runbooks
Keep runbooks for:
- updating the document corpus
- refreshing the embedding index
- retraining or rolling back fine-tuned models
- handling hallucination incidents
Part 17: Practical Templates
Use templated prompts for the three approaches:
Prompt Engineering template
You are a helpful assistant.
Answer the user's question concisely.
Use a polite tone and avoid speculation.
User: {question}
Assistant:
RAG template
You are an expert assistant. Use ONLY the sources provided below.
If the answer cannot be found, say "I don't know."
Sources:
{sources}
Question: {question}
Answer:
Fine-tuning instruction template
Instruction: {instruction}
Input: {input}
Output:
{output}
A clean template makes training examples easier to author and review.
Part 18: Practical Risk Mitigation
Every AI system has risks. A sovereign AI system must be designed to mitigate them.
18.1 Hallucination containment
Use explicit instructions and grounding. If the model cannot answer from the provided sources, it should say so. Do not let it guess.
18.2 Sensitive data isolation
For RAG, keep sensitive documents in a separate index and only retrieve them when the user is authorised.
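The authorisation check can run before retrieval, so restricted indexes are never queried for users who lack access. The index names and roles below are invented for the example; in practice the roles would come from your identity provider.

```python
# Hypothetical mapping: index name -> roles allowed to query it (empty set = public).
INDEX_ROLES = {
    "public_docs": set(),
    "hr_records": {"hr"},
    "finance": {"finance", "exec"},
}

def allowed_indexes(user_roles):
    """Return the indexes this user may search, in a stable order."""
    roles = set(user_roles)
    return sorted(
        name for name, required in INDEX_ROLES.items()
        if not required or roles & required
    )

print(allowed_indexes([]))
print(allowed_indexes(["hr"]))
```

Filtering at the index level is coarser than per-document permissions, but it is simple to audit.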
18.3 Versioned prompts and templates
Keep prompt templates under version control. When you update a template, record the change and test the system to ensure answer quality does not regress.
Part 19: Low-Risk Testing Strategies
Validate the system in a staging environment that mirrors production.
19.1 Canary prompts
Create a set of representative prompts and run them against every new model or retrieval pipeline change. Compare the outputs to a baseline.
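A canary run can be sketched as a loop that compares each new output with a stored baseline. The example uses `difflib.SequenceMatcher` from the standard library as a crude surface-similarity measure, with a 0.8 threshold picked arbitrarily; the two model functions are stand-ins for real generation calls.

```python
from difflib import SequenceMatcher

def run_canaries(generate, baseline, threshold=0.8):
    """Return (prompt, similarity) for every canary whose output drifted from the baseline."""
    failures = []
    for prompt, expected in baseline.items():
        actual = generate(prompt)
        similarity = SequenceMatcher(None, expected, actual).ratio()
        if similarity < threshold:
            failures.append((prompt, round(similarity, 2)))
    return failures

baseline = {"What is the return window?": "Returns are accepted within 30 days of purchase."}

def stable_model(prompt):
    """Stand-in for a pipeline that still matches the baseline."""
    return "Returns are accepted within 30 days of purchase."

def drifted_model(prompt):
    """Stand-in for a pipeline whose behaviour has changed."""
    return "I am not sure."

print(run_canaries(stable_model, baseline))
print(len(run_canaries(drifted_model, baseline)))
```

Character-level similarity misses paraphrases, so for free-form answers an embedding-based comparison or a human spot check is usually a better fit.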
19.2 Regression prompts
Keep a set of prompts that previously exposed issues and rerun them after each change.
19.3 Data drift monitoring
Track the distribution of query types and retrieved sources. If the query mix changes, adjust the retrieval and prompt strategy accordingly.
Part 20: Local Tooling and Developer Experience
A self-hosted AI project must be easy for developers to work with.
20.1 Local development stack
Use local Docker Compose or local services to run the vector store, the embedding service, and the LLM. Developers should be able to spin up the full stack with one command.
20.2 Sample data and fixtures
Keep a small sample corpus and a test dataset in the repo. This enables quick local experiments without needing the full production data.
20.3 Reproducible experiments
Use an experiments/ folder for prompt templates, model versions, and result summaries. This creates a local research log.
Part 21: Final Production Readiness Checklist
- retrieval index is regularly refreshed
- answer provenance is stored with every result
- prompt templates are version controlled
- fine-tuned models are clearly tagged and tracked
- fallback behaviour is defined for unknown queries
- audit logs exist for retrieval and generation
- performance metrics are monitored for drift
- security boundaries are defined for sensitive content
- the system can be restored from backup quickly
A production-ready RAG/fine-tuning system is not just about better answers; it is about making the whole stack auditable, maintainable, and resilient.
Part 22: Handling Domain Drift
Domain drift occurs when the topic or terminology of the user’s queries changes over time.
22.1 Monitoring topic drift
Log the top topics and extract keywords from incoming queries. If the query distribution shifts, plan an index refresh or prompt update.
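One simple way to quantify such a shift is the total variation distance between the topic distributions of two time windows. The sketch below assumes each query has already been assigned a topic label (by keyword rules or a classifier); the labels and counts are invented.

```python
from collections import Counter

def distribution(labels):
    """Relative frequency of each topic label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """0.0 means identical distributions, 1.0 means no overlap at all."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Illustrative topic labels for two time windows.
last_month = ["billing"] * 6 + ["returns"] * 4
this_week = ["billing"] * 2 + ["returns"] * 3 + ["warranty"] * 5
shift = total_variation(distribution(last_month), distribution(this_week))
print(round(shift, 2))
```

What counts as a large shift is a judgment call; alerting when the value crosses a threshold you have calibrated on historical logs is one option.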
22.2 Adaptive retrieval
For drifting domains, use a tiered retrieval architecture. Keep a stable base index and a smaller fresh index for recent or rapidly changing content. Query both and merge the top results.
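Merging the two result lists might look like the following. The sketch assumes both indexes use the same embedding model and similarity metric, so their scores are directly comparable; the chunk IDs and scores are invented.

```python
def merge_results(base_hits, fresh_hits, top_k=5):
    """Merge (chunk_id, score) lists, keeping the best score per chunk."""
    best = {}
    for chunk_id, score in list(base_hits) + list(fresh_hits):
        if chunk_id not in best or score > best[chunk_id]:
            best[chunk_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

base = [("policy-2023", 0.71), ("faq-12", 0.65)]   # stable base index
fresh = [("policy-2026", 0.78), ("faq-12", 0.69)]  # small fresh index
merged = merge_results(base, fresh, top_k=2)
print(merged)
```

If the two indexes use different embedders, the raw scores are not comparable, and a rank-based merge would be the safer choice.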
22.3 Feedback-driven retraining
If users repeatedly mark answers as wrong or incomplete, surface those cases into a feedback dataset. Use that dataset to improve fine-tuning or to refine the prompt instructions.
Part 23: Explainability in Production
For sovereign systems, explainability is a key trust feature.
23.1 Source attribution
Always attach the retrieved sources to the final answer. Make it simple to map each statement back to the document chunk that produced it.
23.2 Transparent scoring
Record the similarity scores and the retrieval rationale. If a user asks why an answer was chosen, you can show which documents contributed and how.
23.3 Audit reports
Generate periodic audit reports that show the most common queries, the highest-ranked sources, and any hallucination incidents. This provides governance evidence for internal stakeholders.
Part 24: Continuous Prompt Calibration
Prompt calibration should be continuous, especially when your retrieval or fine-tuning data evolves.
24.1 A/B test prompt variants
Run A/B tests of different prompt formulations against the same retrieval results. Compare accuracy, hallucination rate, and user satisfaction.
24.2 Keep a prompt change log
Every prompt update should be logged with the rationale and the observed effect. This log is essential for teams to understand why one prompt variant replaced another.
Further Reading
- RAG Tutorial 2026 — build the RAG pipeline described in this guide
- Fine-Tune Llama 4 with QLoRA and Unsloth — implement fine-tuning
- Prompt Engineering Guide 2026 — master prompt engineering before RAG/fine-tuning
- pgvector vs Qdrant vs ChromaDB — choose the vector store for your RAG pipeline
Last verified: April 28, 2026.