Key Takeaways
- RAG = Retrieve then Generate: First retrieve relevant chunks from your documents using vector similarity search; then give those chunks to the LLM as context. The LLM answers based on the retrieved context, not its training data.
- Three local components: Ollama (embedding model + LLM), pgvector in PostgreSQL (vector store), Python (orchestration). Zero cloud.
- Chunk overlap is not optional: Without overlap, a sentence split across two chunks loses context. 50-token overlap is the minimum; 10–20% of chunk size is the standard.
- Evaluate with RAGAS: Don’t guess if RAG is working — measure Faithfulness and Answer Relevancy with the RAGAS library to quantify quality before and after parameter changes.
Introduction
Direct Answer: How do I build a local RAG pipeline with Ollama and pgvector in Python in 2026?
A local RAG pipeline has three phases. (1) Ingestion: load documents, split them into ~500-token chunks with 50-token overlap, embed each chunk with ollama.embeddings(model='nomic-embed-text:v1.5', prompt=chunk), and store the (chunk_text, embedding_vector) pairs in pgvector using CREATE TABLE docs (content TEXT, embedding vector(768)). (2) Retrieval: embed the user’s query with the same model, then run SELECT content FROM docs ORDER BY embedding <=> query_vector LIMIT 5 to get the most similar chunks. (3) Generation: inject the retrieved chunks into the LLM system prompt as context, then call ollama.chat(model='qwen3:14b', messages=[system_with_context, user_question]). The full pipeline runs locally: install with pip install ollama psycopg2-binary pgvector langchain-text-splitters and make sure both Ollama and PostgreSQL with pgvector are running.
Architecture
INGESTION PHASE (once, or when documents update):
┌──────────────┐    chunk     ┌──────────┐     embed     ┌──────────────────┐
│  Documents   │─────────────▶│  Chunks  │──────────────▶│ Embedding Model  │
│  PDF/MD/TXT  │              │ 500 tok  │  Ollama SDK   │ nomic-embed-text │
└──────────────┘              └──────────┘               └────────┬─────────┘
                                                                  │ vectors
                                                         ┌────────▼─────────┐
                                                         │   pgvector DB    │
                                                         │  PostgreSQL 17   │
                                                         └────────┬─────────┘
QUERY PHASE (every user question):
User question ──embed──▶ Query vector ──cosine search──▶ Top K chunks
                                                               │
                                                     ┌─────────▼──────────┐
                                                     │  LLM (Qwen3 14B)   │
                                                     │  System: context   │
                                                     │  User: question    │
                                                     └─────────┬──────────┘
                                                               │
                                                        Grounded answer
Part 1: Setup
pip install ollama psycopg2-binary pgvector langchain-text-splitters --break-system-packages
# Pull embedding and LLM models
ollama pull nomic-embed-text:v1.5 # 274MB — fast, high quality embeddings
ollama pull qwen3:14b # 9GB — best local LLM for Q&A
# Enable pgvector in PostgreSQL
sudo -u postgres psql -d myapp -c "CREATE EXTENSION IF NOT EXISTS vector;"
sudo -u postgres psql -d myapp -c "SELECT extversion FROM pg_extension WHERE extname='vector';"
Expected output:
extversion
------------
0.8.0
Part 2: Ingestion Pipeline
# ingest.py
import ollama
import psycopg2
from psycopg2.extras import execute_values
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path
# Database connection
conn = psycopg2.connect("postgresql://appuser:password@localhost:5432/myapp")

# Create the documents table and an HNSW index for cosine distance
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS rag_documents (
            id BIGSERIAL PRIMARY KEY,
            source TEXT NOT NULL,
            chunk_index INT NOT NULL,
            content TEXT NOT NULL,
            embedding vector(768) -- nomic-embed-text:v1.5 produces 768-dim vectors
        );
        CREATE INDEX IF NOT EXISTS rag_docs_embedding_idx
        ON rag_documents USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
    """)
conn.commit()
print("Table and HNSW index created")

# Splitter: chunk_size=500 with chunk_overlap=50.
# Note: RecursiveCharacterTextSplitter measures both values with len() by
# default, i.e. in characters, not tokens. Pass a token-counting
# length_function if you need true 500-token chunks.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
)


def embed(text: str) -> list[float]:
    """Return the embedding vector for one piece of text."""
    return ollama.embeddings(model="nomic-embed-text:v1.5", prompt=text)["embedding"]


def ingest_file(filepath: str) -> int:
    """Chunk one file, embed every chunk, and insert the rows. Returns the chunk count."""
    content = Path(filepath).read_text(encoding="utf-8", errors="ignore")
    chunks = splitter.split_text(content)
    source = Path(filepath).name
    rows = []
    for i, chunk in enumerate(chunks):
        vector = embed(chunk)
        # str(list) yields "[0.1, 0.2, ...]", which pgvector accepts via ::vector
        rows.append((source, i, chunk, str(vector)))
    with conn.cursor() as cur:
        execute_values(cur,
            "INSERT INTO rag_documents (source, chunk_index, content, embedding) VALUES %s",
            rows,
            template="(%s, %s, %s, %s::vector)"
        )
    conn.commit()
    print(f" Ingested: {source} → {len(chunks)} chunks")
    return len(chunks)


# Ingest documents
docs = list(Path("./docs").glob("*.md")) + list(Path("./docs").glob("*.txt"))
total = sum(ingest_file(str(d)) for d in docs)
print(f"\nTotal chunks ingested: {total}")
conn.close()
Expected output:
Table and HNSW index created
Ingested: ubuntu-setup.md → 47 chunks
Ingested: postgresql-guide.md → 63 chunks
Ingested: docker-tutorial.md → 38 chunks
Total chunks ingested: 148
Part 3: Retrieval and Generation
# rag_query.py
import ollama
import psycopg2
conn = psycopg2.connect("postgresql://appuser:password@localhost:5432/myapp")


def retrieve(query: str, k: int = 5) -> list[dict]:
    """Find the K most semantically similar chunks to the query."""
    query_vec = ollama.embeddings(model="nomic-embed-text:v1.5", prompt=query)["embedding"]
    with conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator; 1 - distance = cosine similarity
        cur.execute("""
            SELECT source, content,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM rag_documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (str(query_vec), str(query_vec), k))
        rows = cur.fetchall()
    return [{"source": r[0], "content": r[1], "similarity": r[2]} for r in rows]


def answer(question: str, k: int = 5) -> dict:
    """Retrieve relevant chunks and generate a grounded answer."""
    chunks = retrieve(question, k=k)
    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']} | Similarity: {c['similarity']:.3f}]\n{c['content']}"
        for c in chunks
    )
    response = ollama.chat(
        model="qwen3:14b",
        messages=[
            {
                "role": "system",
                "content": f"""You are a helpful technical assistant.
Answer the question using ONLY the information in the provided context.
If the context doesn't contain enough information to answer, say so clearly.
Do not use knowledge from outside the provided context.
CONTEXT:
{context}"""
            },
            {"role": "user", "content": question}
        ]
    )
    return {
        "question": question,
        "answer": response["message"]["content"],
        "sources": [c["source"] for c in chunks],
        "top_similarity": chunks[0]["similarity"] if chunks else 0
    }


# Test queries (guarded so that importing this module does not run them)
if __name__ == "__main__":
    questions = [
        "How do I configure PostgreSQL shared_buffers for 8GB RAM?",
        "What UFW commands do I need to allow HTTPS traffic?",
        "How do I check if a Docker container is healthy?",
    ]
    for q in questions:
        result = answer(q)
        print(f"\nQ: {result['question']}")
        print(f"A: {result['answer'][:200]}...")
        print(f" Sources: {result['sources'][:2]} | Top similarity: {result['top_similarity']:.3f}")
Expected output:
Q: How do I configure PostgreSQL shared_buffers for 8GB RAM?
A: Based on the provided context, set shared_buffers = 2GB (25% of 8GB RAM) in
/etc/postgresql/17/main/conf.d/performance.conf. Also set effective_cache_size = 6GB...
Sources: ['postgresql-guide.md', 'ubuntu-setup.md'] | Top similarity: 0.891
Q: What UFW commands do I need to allow HTTPS traffic?
A: From the context: sudo ufw allow https (allows port 443/tcp) and sudo ufw allow http
(port 80/tcp). Always run sudo ufw allow ssh first before enabling the firewall...
Sources: ['ubuntu-setup.md', 'docker-tutorial.md'] | Top similarity: 0.847
Part 4: Evaluation with RAGAS
pip install ragas --break-system-packages
# evaluate.py
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Reuse the pipeline functions defined in rag_query.py
from rag_query import answer, retrieve

# Build evaluation dataset
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}
test_cases = [
    {
        "question": "What is the recommended shared_buffers for 8GB RAM?",
        "ground_truth": "2GB (25% of total RAM)"
    },
    {
        "question": "How do I allow HTTPS in UFW?",
        "ground_truth": "sudo ufw allow https"
    }
]
for tc in test_cases:
    result = answer(tc["question"])
    chunks = retrieve(tc["question"])
    eval_data["question"].append(tc["question"])
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append([c["content"] for c in chunks])
    eval_data["ground_truth"].append(tc["ground_truth"])

dataset = Dataset.from_dict(eval_data)

# Note: unless llm= and embeddings= are passed to evaluate(), RAGAS uses OpenAI
# models as the judge, which needs OPENAI_API_KEY and sends data off the machine.
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])

print("\nRAGAS Evaluation:")
print(f" Faithfulness: {scores['faithfulness']:.3f} (answer supported by context?)")
print(f" Answer Relevancy: {scores['answer_relevancy']:.3f} (on-topic answer?)")
print(f" Context Precision: {scores['context_precision']:.3f} (chunks actually relevant?)")
Expected output:
RAGAS Evaluation:
Faithfulness: 0.912 (answer supported by context?)
Answer Relevancy: 0.884 (on-topic answer?)
Context Precision: 0.856 (chunks actually relevant?)
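As a rule of thumb, scores above about 0.8 suggest a well-functioning RAG pipeline, though the right threshold depends on your documents and questions. If Faithfulness is low, the LLM is making claims that the retrieved context does not support: tighten the system prompt. If Context Precision is low, irrelevant chunks are being retrieved: improve chunking or increase chunk overlap.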
Troubleshooting
Low similarity scores (< 0.5) for relevant queries
Cause: Chunking too coarse (chunks too large) or wrong embedding model.
Fix: Reduce chunk_size to 300 tokens. Ensure you’re using the same embedding model for both ingestion and querying.
LLM answers with information not in context
Cause: System prompt not firm enough about using only retrieved context.
Fix: Add: "If the answer is not in the context, respond with: 'I don't have that information in the provided documents.'" to the system prompt.
pgvector extension not found
Fix: sudo apt-get install postgresql-17-pgvector then CREATE EXTENSION vector; in the database.
Conclusion
A sovereign RAG pipeline: documents chunked and embedded locally with nomic-embed-text:v1.5, stored in pgvector, retrieved via cosine similarity, and answered by Qwen3 14B — all on your hardware with zero external API calls. RAGAS evaluation gives you objective metrics to tune chunk size, overlap, and retrieval depth.
See pgvector vs Qdrant vs ChromaDB 2026 for a deeper comparison of vector store options, and Advanced RAG Techniques for hybrid search, reranking, and multi-query retrieval.
People Also Ask
What is the difference between RAG and fine-tuning?
RAG retrieves relevant information at query time from a dynamic document store — good for up-to-date knowledge, specific documents, and grounded answers. Fine-tuning trains the model to internalize patterns, styles, and domain knowledge — good for consistent tone, specialised output formats, and domain vocabulary. RAG doesn’t modify the model; fine-tuning creates a new model version. Use RAG when your knowledge changes frequently or when you need to cite sources. Use fine-tuning when you need the model to behave differently (output format, tone, domain expertise). See RAG vs Fine-Tuning vs Prompt Engineering 2026 for the full decision framework.
What chunk size should I use for RAG?
Start with 500 tokens, 50-token overlap as a baseline. For technical documentation (code, API docs): try 300 tokens with 30-token overlap — technical content is denser and benefits from smaller chunks. For narrative text (books, articles): try 800 tokens with 80-token overlap — more context per chunk helps the LLM understand meaning. The best chunk size depends on your documents and questions — measure with RAGAS after tuning.
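To make the overlap arithmetic concrete, here is a minimal sketch of a sliding-window chunker. It treats each whitespace-separated word as one "token", which is an assumption for illustration only, since real token counts depend on the model's tokenizer. The function chunk_words is a hypothetical helper, not part of any library.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With chunk_size=500 and overlap=50 the window advances 450 words at a time, so the last 50 words of each chunk reappear at the start of the next one.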
Part 11: RAG System Design Patterns
A robust RAG system starts with clear architectural patterns.
11.1 Modular retrieval and generation
Keep retrieval and generation separate. The retrieval component should return source documents, while the generation component should compose the final answer from those documents.
This separation makes it easier to test and swap parts of the stack.
11.2 Incremental document updates
Design your ingestion pipeline to handle incremental updates. Recompute embeddings only for changed documents, and refresh the index without rebuilding the entire corpus when possible.
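One common way to find changed documents is to compare content hashes. The sketch below assumes you record a SHA-256 hash per source at ingest time (for example in an extra column, which the rag_documents table above does not have); plan_updates and content_hash are hypothetical helper names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(current: dict[str, str], stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Decide which sources need re-embedding and which should be deleted.

    current:       source name -> current text on disk
    stored_hashes: source name -> hash recorded at the last ingest
    """
    to_embed = sorted(
        source for source, text in current.items()
        if stored_hashes.get(source) != content_hash(text)  # new or changed
    )
    to_delete = sorted(source for source in stored_hashes if source not in current)
    return {"embed": to_embed, "delete": to_delete}
```

Only the sources listed under "embed" need new embeddings; unchanged documents keep their existing rows.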
11.3 Freshness and recency
If your content changes frequently, separate recent documents into a “hot” index. Query both the hot index and the larger archive, then merge results with a recency-aware ranking.
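A recency-aware merge could look like the sketch below. It blends cosine similarity with an exponential freshness decay; the 30-day half-life, the 0.2 weight, and the hit fields (id, similarity, updated_at) are all assumptions chosen for illustration, not values from a tested setup.

```python
from datetime import datetime, timezone


def recency_score(similarity: float, updated_at: datetime, now: datetime,
                  half_life_days: float = 30.0, weight: float = 0.2) -> float:
    """Blend similarity with a freshness term that halves every half_life_days."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    freshness = 0.5 ** (age_days / half_life_days)
    return (1 - weight) * similarity + weight * freshness


def merge_results(hot: list[dict], archive: list[dict], now: datetime, k: int = 5) -> list[dict]:
    """Merge hits from both indexes, keep the best score per id, return the top k."""
    best: dict = {}
    for hit in hot + archive:
        scored = {**hit, "score": recency_score(hit["similarity"], hit["updated_at"], now)}
        if hit["id"] not in best or scored["score"] > best[hit["id"]]["score"]:
            best[hit["id"]] = scored
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:k]
```

A fresh chunk with slightly lower similarity can outrank a year-old chunk; tune the weight down if recency should matter less.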
Part 12: Chunking and Document Quality
How you split documents matters more than most people realise.
12.1 Semantic chunking
Chunk by semantic boundaries: paragraphs, sections, or logical units. Avoid arbitrary fixed-size blocks that break meaning.
12.2 Chunk metadata
Store metadata with every chunk: source document, section title, author, publication date, and trust level. This metadata is crucial for filtering and provenance.
12.3 Chunk ranking and diversity
When returning multiple chunks, prefer diverse sections from different sources over many similar chunks from one document. Diversity reduces redundancy and improves answer quality.
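A simple way to enforce diversity is to cap how many chunks any single source may contribute. The sketch below assumes chunks shaped like the dicts returned by retrieve() above (source, content, similarity); diversify and max_per_source are hypothetical names.

```python
def diversify(chunks: list[dict], k: int = 5, max_per_source: int = 2) -> list[dict]:
    """Pick up to k chunks by similarity, allowing at most max_per_source per source."""
    picked: list[dict] = []
    per_source: dict[str, int] = {}
    for chunk in sorted(chunks, key=lambda c: c["similarity"], reverse=True):
        count = per_source.get(chunk["source"], 0)
        if count < max_per_source:
            picked.append(chunk)
            per_source[chunk["source"]] = count + 1
        if len(picked) == k:
            break
    return picked
```

In practice you would retrieve more candidates than you need (say 20) and then reduce them to k with this filter.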
Part 13: Retrieval Chain and Scoring
The retrieval chain is the heart of RAG.
13.1 Candidate generation
Use semantic search to generate candidate chunks. If you have a large corpus, add a lightweight keyword filter before the semantic step to narrow the search space.
13.2 Re-ranking
Re-rank candidates using a second-stage model or a relevance heuristic. Consider both similarity score and document quality metadata.
13.3 Token budget management
Limit the total number of tokens sent to the generator. This budget should include prompt text, retrieved chunks, and the expected answer. If a query is very broad, use a smaller number of higher-quality chunks.
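The budget arithmetic can be sketched as follows. The token estimate uses a rough heuristic of about 4 characters per token, which is an assumption that only loosely holds for English text; swap in the model's real tokenizer if you need accuracy. fit_to_budget is a hypothetical helper.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: list[str], context_window: int,
                  prompt_tokens: int, answer_tokens: int) -> list[str]:
    """Keep chunks, best first, until the remaining context budget is used up."""
    budget = context_window - prompt_tokens - answer_tokens
    selected: list[str] = []
    used = 0
    for chunk in chunks:  # assumed to be sorted best-first
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # too large to fit; a smaller chunk later may still fit
        selected.append(chunk)
        used += cost
    return selected
```

For example, with an 8,192-token window, a 300-token prompt, and 1,000 tokens reserved for the answer, roughly 6,900 tokens remain for retrieved chunks.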
Part 14: Prompt Engineering for RAG
Effective prompts are the final step.
14.1 Grounding instructions
Tell the model to rely only on retrieved sources.
Use only the information provided in the Sources section. If the answer cannot be found, say "I don't know."
14.2 Structured answer templates
Use answer templates to constrain the output.
Question: {question}
Sources:
{sources}
Answer:
1. Summary:
2. Supporting sources:
This helps with consistency and makes validation easier.
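Filling the template from retrieved chunks might look like this minimal sketch. It assumes chunks shaped like the dicts returned by retrieve() above; the numbering format for sources is an illustrative choice, and build_prompt is a hypothetical helper.

```python
ANSWER_TEMPLATE = """Question: {question}
Sources:
{sources}
Answer:
1. Summary:
2. Supporting sources:"""


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each chunk so the model can refer to sources by index."""
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return ANSWER_TEMPLATE.format(question=question, sources=sources)
```

Numbered sources also make validation simpler, because a reviewer or script can check that every cited index exists.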
14.3 Error recovery
Include instructions for uncertain cases.
If the source data is incomplete or conflicting, be transparent about the uncertainty and list the relevant sources.
Part 15: Evaluation and Feedback
Measure RAG quality with both automated and human reviews.
15.1 Ground-truth datasets
Build a test set of questions with expected answers and source provenance. Use it to validate retrieval recall and generation accuracy.
15.2 Human review
Have reviewers verify that model answers are supported by the cited sources. Flag hallucinations and wrong source attributions.
15.3 Continuous feedback loops
Collect user feedback and feed it back into the system. If a query frequently results in wrong answers, improve the retrieval data, prompt, or source corpus.
Part 16: Deployment and Service Patterns
Deploy RAG as a stable service with clear boundaries.
16.1 Local inference vs remote
A self-hosted RAG system can run entirely locally or use a local retrieval stage with a remote generator. For sovereignty, keep both retrieval and generation on-premises whenever possible.
16.2 API contract
Define a contract for RAG API responses.
{
"answer": "...",
"sources": ["doc1","doc2"],
"confidence": 0.82
}
Include source provenance and an optional confidence score.
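One way to enforce this contract in Python is a small dataclass that validates the confidence range before serialising. This is a sketch only; the class name RagResponse and the validation rule are assumptions, and a framework such as a web API layer would normally wrap it.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RagResponse:
    answer: str
    sources: list[str] = field(default_factory=list)
    confidence: Optional[float] = None  # optional, between 0 and 1 when present

    def __post_init__(self) -> None:
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        """Serialise to the JSON shape shown in the contract."""
        return json.dumps(asdict(self))
```

Clients can then rely on the three keys always being present, with confidence set to null when the service does not compute one.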
16.3 Rate limiting and quotas
Protect your local service with rate limits. RAG generation can be expensive, and unbounded usage can overwhelm CPU/GPU resources.
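A token bucket is one common rate-limiting technique. The sketch below is a single-process, in-memory version with an injectable clock for testing; it is not thread-safe and the rate and capacity values are placeholders you would tune to your hardware.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A service could keep one bucket per user or API key and reject requests with an HTTP 429 response when allow() returns False.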
Part 17: Observability and Debugging
Visibility into the retrieval and generation pipeline is essential.
17.1 Retrieval metrics
Track query volume, retrieval latency, number of chunks returned, and recall rates. Use these metrics to detect index degradation or stale data.
17.2 Generation metrics
Track generation latency, prompt lengths, and token usage. Monitor for slow queries and unexpected bursts of long answers.
17.3 Error tracing
Log errors at every stage: ingestion failures, embedding service errors, index timeouts, and generation failures. Correlate logs with request IDs.
Part 18: Security and Access Control
A RAG system can expose sensitive documents if not constrained.
18.1 Document-level access control
Protect sensitive documents with metadata-based filtering. Only retrieve them when the user is authorised.
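With pgvector, metadata filtering can be expressed as an ordinary WHERE clause next to the vector search. The sketch below assumes a hypothetical allowed_roles TEXT[] column on rag_documents, which the schema above does not define, and only builds the SQL and parameters so it can be inspected without a database.

```python
def build_filtered_query(query_vec: list[float], user_roles: list[str], k: int = 5):
    """Return (sql, params) for a cosine search limited to rows the user may see.

    Assumes a hypothetical column: allowed_roles TEXT[].
    The && operator is PostgreSQL's array-overlap test.
    """
    sql = """
        SELECT source, content, 1 - (embedding <=> %s::vector) AS similarity
        FROM rag_documents
        WHERE allowed_roles && %s::text[]
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    vec = str(query_vec)
    return sql, (vec, list(user_roles), vec, k)
```

The result can be passed straight to cur.execute(sql, params) with psycopg2. Filtering in SQL, rather than after retrieval in Python, means unauthorised chunks never leave the database.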
18.2 Prompt sanitization
Sanitize user queries before using them in prompt templates. Remove or encode control characters and dangerous payloads.
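A basic sanitising step might strip control and invisible format characters, collapse whitespace, and cap the length, as in the sketch below. The 2,000-character limit is an arbitrary assumption. This kind of cleanup reduces noise but does not by itself prevent prompt injection.

```python
import unicodedata


def sanitize_query(query: str, max_chars: int = 2000) -> str:
    """Drop control/format characters, collapse whitespace, and cap the length."""
    cleaned = "".join(
        ch for ch in query
        # Unicode categories starting with "C" are control, format, and similar
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    return " ".join(cleaned.split())[:max_chars]
```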
18.3 Auditability
Keep an audit log of queries, retrieved sources, and generated answers. This is indispensable for compliance and incident response.
Part 19: Scaling the Vector Store
Vector stores can grow quickly. Plan for scale from the start.
19.1 Partitioning
Partition large corpora by domain, date, or source. Query the most relevant partitions first to improve latency.
19.2 Index maintenance
Rebuild or compress your index periodically to remove stale vectors and improve search quality. Monitoring index size and query latency helps determine the right cadence.
19.3 Hybrid indexes
Combine dense vectors with sparse keyword search for a hybrid retrieval strategy. This can improve recall on long-tail queries.
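Reciprocal rank fusion (RRF) is one widely used way to merge a dense ranking with a keyword ranking without having to normalise their scores. The sketch below assumes each ranking is a list of document ids ordered best first; k=60 is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Documents that appear in both rankings accumulate two contributions, so they tend to rise above documents found by only one retriever.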
Part 20: Final RAG Operations Checklist
- retrieval and generation are decoupled
- chunking uses semantic boundaries and metadata
- prompts force grounding in sources
- evaluation includes both automated and human review
- security controls prevent unauthorized access to sensitive content
- metrics capture retrieval, generation, and infrastructure health
- index updates are incremental and repeatable
- audit logs preserve provenance and query history
A production RAG deployment is not simply about retrieval. It is about making search, generation, and governance work together in a way that is reliable, explainable, and maintainable.
Part 21: Vector Store Selection and Tradeoffs
Choosing the right vector store is critical for performance and cost.
21.1 Local vs remote indexes
Local vector stores give you sovereignty and low latency. Remote vector databases can simplify scaling but introduce external dependencies.
21.2 Approximate nearest neighbour settings
Tune the HNSW parameters for your workload: in pgvector, hnsw.ef_search is set at query time, while m and ef_construction are fixed when the index is built. Higher values increase recall at the cost of latency and memory.
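In pgvector the query-time knob is the hnsw.ef_search setting (default 40). The sketch below raises it for a single transaction with SET LOCAL before running the search; it assumes a psycopg2 cursor inside an open transaction and the rag_documents table from the ingestion step. search_with_ef is a hypothetical helper.

```python
def search_with_ef(cur, query_vec: list[float], k: int = 5, ef_search: int = 100):
    """Run a cosine search with a per-transaction hnsw.ef_search value."""
    # SET LOCAL only lasts until the current transaction ends
    cur.execute("SET LOCAL hnsw.ef_search = %s", (int(ef_search),))
    cur.execute(
        "SELECT source, content FROM rag_documents "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_vec), k),
    )
    return cur.fetchall()
```

A reasonable experiment is to measure recall and latency at a few settings (for example 40, 100, 200) on your own evaluation questions before choosing a value.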
21.3 Storage compression
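Compress vector data when RAM is limited, for example by storing half-precision vectors (pgvector 0.7 and later provide a halfvec type). Expect to trade a small loss in retrieval accuracy or latency for lower memory usage.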
Part 22: Query and Prompt Caching
Cache results to reduce repeated computation.
22.1 Query result caches
Cache retrieval results for repeated queries. Use a time-based expiration so stale data is refreshed.
22.2 Prompt template caches
Cache compiled prompt templates and injected metadata. This reduces overhead in high-throughput systems.
22.3 Answer cache invalidation
Invalidate caches when underlying documents change. Record document version or timestamp with the cache key.
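The caching ideas in this part can be combined in a small sketch: the cache key includes a corpus version, so any document change produces new keys, and entries also expire after a time limit. The corpus_version string and the TTLCache class are assumptions for illustration; a production system would more likely use an existing cache such as Redis.

```python
import hashlib
import time


def cache_key(query: str, corpus_version: str) -> str:
    """Key on the normalised query plus the corpus version."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(f"{corpus_version}:{normalized}".encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[object, float]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value) -> None:
        self._store[key] = (value, self.clock() + self.ttl)
```

Bumping corpus_version after each ingestion run invalidates every cached answer at once without having to track which documents each answer used.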
Part 23: Handling Unstructured and Multimodal Data
RAG systems often need to work beyond plain text.
23.1 OCR and scanned documents
Run OCR on scanned files, then chunk and embed the extracted text. Store source coordinates for provenance.
23.2 Image and audio embeddings
For multimodal retrieval, use embeddings that support images or audio. Keep the modality metadata alongside the text chunks.
23.3 Composite retrieval
Combine text and image matches in the retrieval stage. Use a weighted scoring strategy to balance modalities.
Part 24: User Experience and Answer Quality
How an answer is presented matters as much as whether it is correct.
24.1 Concise answer generation
Generate concise answers with source summaries. Lengthy, verbose responses are harder to verify and less useful in practice.
24.2 Answer confidence and transparency
Return confidence indicators and clearly cite the sources used. This helps users trust and evaluate the answer.
24.3 Handling unknowns
If the system cannot answer confidently, say so. A safe response is better than a convincing hallucination.
Part 25: Governance and Documentation
Keep your RAG system understandable and auditable.
25.1 System documentation
Document the retrieval pipeline, prompt templates, index refresh process, and access controls. Include a glossary of terms for the team.
25.2 Review cycles
Review RAG pipeline components periodically. Validate that embeddings, index quality, and prompt templates still match your use cases.
25.3 Change logs
Keep change logs for corpus updates, retrieval tuning, model changes, and prompt modifications. This is essential for debugging and accountability.
Part 26: Observability and Debugging
A RAG system is only maintainable if its behavior is observable.
26.1 Retrieval transparency
Log which chunks were retrieved for each query and their similarity scores. This helps diagnose why the model produced a particular answer.
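A retrieval trace can be as simple as one JSON log line per query. The sketch below assumes chunks shaped like the dicts returned by retrieve() above and uses a hypothetical logger name, rag.trace; it deliberately logs sources and scores but not chunk text.

```python
import json
import logging
import uuid
from typing import Optional

logger = logging.getLogger("rag.trace")


def retrieval_trace(query: str, chunks: list[dict], request_id: Optional[str] = None) -> dict:
    """Log which chunks were retrieved for a query and return the record."""
    record = {
        "request_id": request_id or uuid.uuid4().hex,
        "query": query,
        "retrieved": [
            {"source": c["source"], "similarity": round(c["similarity"], 3)}
            for c in chunks
        ],
    }
    logger.info(json.dumps(record))
    return record
```

Passing the same request_id to the generation stage lets you line up a surprising answer with the exact chunks that produced it.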
26.2 Prompt and answer tracing
Capture the final prompt, the retrieved sources, and the generated answer for debugging. Store these traces separately from production logs to avoid exposing sensitive content unnecessarily.
26.3 Query classification
Track query types, intents, and failure modes. Use this data to identify whether your system is underperforming on a particular class of questions.
Part 27: Iterative Improvement Workflows
Improve RAG through repeatable processes.
27.1 User feedback loops
Collect explicit user feedback on answer quality. Feed low-confidence or incorrect answers back into model tuning, retrieval adjustments, or prompt revisions.
27.2 Progressive corpus expansion
Add new documents incrementally and validate retrieval quality after each update. Avoid large batch refreshes unless necessary.
27.3 Guardrail evolution
When you change prompt templates or answer policies, keep the previous version as a fallback. Record the change rationale and evaluation results.
Part 28: Security Boundaries and Sensitive Content
Protect sensitive information at every layer.
28.1 Context filtering
Pre-filter user queries and documents to avoid retrieving or exposing private data. Use metadata tags and access controls in the retrieval stage.
28.2 Response sanitisation
Sanitise generated answers to remove or redact sensitive terms when required. This is particularly important for documents with personally identifiable information.
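A minimal redaction pass could apply regular expressions to the generated answer before it is returned. The sketch below only covers email addresses with a deliberately simple pattern; real PII detection needs more patterns or a dedicated tool, and the names EMAIL_RE and redact are hypothetical.

```python
import re

# Simple email pattern (assumption: good enough for illustration, not exhaustive)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str, patterns=(EMAIL_RE,), placeholder: str = "[REDACTED]") -> str:
    """Replace every match of each pattern with the placeholder."""
    for pattern in patterns:
        text = pattern.sub(placeholder, text)
    return text
```

Additional compiled patterns, for example for phone numbers or national ID formats, can be passed in through the patterns argument.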
28.3 Audit logs for source access
Log which source documents were used for each answer, without storing the full response if it contains regulated data. This provides traceability for audits.
Part 29: Performance Engineering
Tune your RAG pipeline for latency and throughput.
29.1 Retrieval cache warmup
Pre-warm the vector store and embedding cache before peak traffic. For local deployments, keep the index in memory for faster search.
29.2 Token budget optimization
Trim retrieved content to fit the generator’s input budget. Remove redundant text and prefer concise, high-value chunks.
29.3 Batch retrieval and generation
On high-throughput systems, batch multiple requests through the retrieval and generation stages. This can improve hardware utilization while keeping latency within targets.
Part 30: Final Product and Service Considerations
A production RAG service must feel solid to users.
30.1 Consistent answer format
Keep answer structure consistent across similar queries. This makes responses easier to consume and more reliable.
30.2 Source attribution UX
Present sources transparently, but not verbosely. Provide enough provenance for trust without overwhelming the user.
30.3 Service-level expectations
Define how the RAG service should behave under load, during updates, and on failure. Document and enforce those expectations in the operational runbook.
Part 31: Practical Prompt Templates
Well-structured templates can make your RAG answers more consistent and reliable.
31.1 Citations-first template
You are a knowledgeable assistant. Use only the sources listed below.
Cite the source number for each statement.
Sources:
{sources}
Question: {question}
Answer:
1. Summary:
2. Citations:
31.2 Safety-aware template
Answer the question directly. If the information is not present in the sources, say "I don't know." Do not fabricate details.
Sources:
{sources}
Question: {question}
Answer:
31.3 Evaluation template
Keep a small set of evaluation prompts with expected output structure. This makes it easier to detect prompt or retrieval regressions.
Further Reading
- pgvector vs Qdrant vs ChromaDB 2026 — choose the right vector store
- Private Document Q&A with Ollama and pgvector — production implementation of this pipeline
- LangChain and LangGraph with Ollama — LangChain’s RAG chain abstraction
- How to Install PostgreSQL 17 on Ubuntu 24.04 — prerequisite: PostgreSQL setup
Tested on: Ubuntu 24.04 LTS (RTX 4090). Ollama 0.5.12, pgvector 0.8.0, RAGAS 0.1.x. Last verified: April 28, 2026.