Vucense

RAG Tutorial 2026: Build a Local Retrieval-Augmented Generation Pipeline

🟡 Intermediate

Build a sovereign RAG pipeline from scratch with Ollama, pgvector, and Python. Covers document chunking, embedding generation, vector search, context injection, and RAGAS evaluation.

Key Takeaways

  • RAG = Retrieve then Generate: First retrieve relevant chunks from your documents using vector similarity search; then give those chunks to the LLM as context. The LLM answers based on the retrieved context, not its training data.
  • Three local components: Ollama (embedding model + LLM), pgvector in PostgreSQL (vector store), Python (orchestration). Zero cloud.
  • Chunk overlap is not optional: Without overlap, a sentence split across two chunks loses context. 50-token overlap is the minimum; 10–20% of chunk size is the standard.
  • Evaluate with RAGAS: Don’t guess if RAG is working — measure Faithfulness and Answer Relevancy with the RAGAS library to quantify quality before and after parameter changes.
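
To make the overlap point concrete, here is a minimal sketch of a sliding-window chunker. It treats whitespace-separated words as a stand-in for tokens (an assumption; real tokenizers split text differently), and chunk_with_overlap is a hypothetical helper, not part of any library used in this tutorial.

```python
def chunk_with_overlap(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a word list into windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Words stand in for tokens here (illustrative assumption).
words = "a b c d e f g h i j".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.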

Introduction

Direct Answer: How do I build a local RAG pipeline with Ollama and pgvector in Python in 2026?

A local RAG pipeline has three phases. (1) Ingestion: load documents, split them into ~500-token chunks with 50-token overlap, embed each chunk with ollama.embeddings(model='nomic-embed-text:v1.5', prompt=chunk), and store the (chunk_text, embedding_vector) pairs in pgvector using CREATE TABLE docs (content TEXT, embedding vector(768)). (2) Retrieval: embed the user’s query with the same model, then run SELECT content FROM docs ORDER BY embedding <=> query_vector LIMIT 5 to get the most similar chunks. (3) Generation: inject the retrieved chunks into the LLM system prompt as context, then call ollama.chat(model='qwen3:14b', messages=[system_with_context, user_question]). The full pipeline runs locally: install the dependencies with pip install ollama psycopg2-binary pgvector langchain-text-splitters and make sure both Ollama and PostgreSQL with pgvector are running.


Architecture

INGESTION PHASE (once, or when documents update):
┌──────────────┐    chunk     ┌──────────┐    embed     ┌──────────────────┐
│  Documents   │─────────────▶│  Chunks  │──────────────▶│ Embedding Model  │
│ PDF/MD/TXT   │              │ 500 tok  │  Ollama SDK   │ nomic-embed-text │
└──────────────┘              └──────────┘               └────────┬─────────┘
                                                                   │ vectors
                                                         ┌─────────▼────────┐
                                                         │    pgvector DB   │
                                                         │  PostgreSQL 17   │
                                                         └─────────┬────────┘

QUERY PHASE (every user question):
User question ──embed──▶ Query vector ──cosine search──▶ Top K chunks
                                                              │
                                                    ┌─────────▼──────────┐
                                                    │  LLM (Qwen3 14B)   │
                                                    │  System: context   │
                                                    │  User: question    │
                                                    └─────────┬──────────┘
                                                              ▼
                                                         Grounded answer

Part 1: Setup

pip install ollama psycopg2-binary pgvector langchain-text-splitters --break-system-packages

# Pull embedding and LLM models
ollama pull nomic-embed-text:v1.5   # 274MB — fast, high quality embeddings
ollama pull qwen3:14b               # 9GB, a capable local LLM for Q&A

# Enable pgvector in PostgreSQL
sudo -u postgres psql -d myapp -c "CREATE EXTENSION IF NOT EXISTS vector;"
sudo -u postgres psql -d myapp -c "SELECT extversion FROM pg_extension WHERE extname='vector';"

Expected output:

 extversion
------------
 0.8.0

Part 2: Ingestion Pipeline

# ingest.py
import ollama
import psycopg2
from psycopg2.extras import execute_values
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path

# Database connection
conn = psycopg2.connect("postgresql://appuser:password@localhost:5432/myapp")

# Create the documents table
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS rag_documents (
            id          BIGSERIAL PRIMARY KEY,
            source      TEXT NOT NULL,
            chunk_index INT NOT NULL,
            content     TEXT NOT NULL,
            embedding   vector(768)    -- nomic-embed-text:v1.5 produces 768-dim vectors
        );
        CREATE INDEX IF NOT EXISTS rag_docs_embedding_idx
            ON rag_documents USING hnsw (embedding vector_cosine_ops)
            WITH (m = 16, ef_construction = 64);
    """)
    conn.commit()
    print("Table and HNSW index created")

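# Splitter: RecursiveCharacterTextSplitter measures chunk_size and chunk_overlap
# in characters (length_function=len by default), not tokens. At roughly
# 4 characters per English token, 2000/200 characters approximates
# 500-token chunks with a 50-token overlap.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)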

def embed(text: str) -> list[float]:
    return ollama.embeddings(model="nomic-embed-text:v1.5", prompt=text)["embedding"]

def ingest_file(filepath: str) -> int:
    content = Path(filepath).read_text(encoding="utf-8", errors="ignore")
    chunks = splitter.split_text(content)
    source = Path(filepath).name

    rows = []
    for i, chunk in enumerate(chunks):
        vector = embed(chunk)
        rows.append((source, i, chunk, str(vector)))

    with conn.cursor() as cur:
        execute_values(cur,
            "INSERT INTO rag_documents (source, chunk_index, content, embedding) VALUES %s",
            rows,
            template="(%s, %s, %s, %s::vector)"
        )
        conn.commit()

    print(f"  Ingested: {source} → {len(chunks)} chunks")
    return len(chunks)

# Ingest documents
docs = list(Path("./docs").glob("*.md")) + list(Path("./docs").glob("*.txt"))
total = sum(ingest_file(str(d)) for d in docs)
print(f"\nTotal chunks ingested: {total}")

conn.close()

Expected output:

Table and HNSW index created
  Ingested: ubuntu-setup.md → 47 chunks
  Ingested: postgresql-guide.md → 63 chunks
  Ingested: docker-tutorial.md → 38 chunks

Total chunks ingested: 148

Part 3: Retrieval and Generation

# rag_query.py
import ollama
import psycopg2

conn = psycopg2.connect("postgresql://appuser:password@localhost:5432/myapp")

def retrieve(query: str, k: int = 5) -> list[dict]:
    """Find the K most semantically similar chunks to the query."""
    query_vec = ollama.embeddings(model="nomic-embed-text:v1.5", prompt=query)["embedding"]

    with conn.cursor() as cur:
        cur.execute("""
            SELECT source, content,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM rag_documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (str(query_vec), str(query_vec), k))
        rows = cur.fetchall()

    return [{"source": r[0], "content": r[1], "similarity": r[2]} for r in rows]

def answer(question: str, k: int = 5) -> dict:
    """Retrieve relevant chunks and generate a grounded answer."""
    chunks = retrieve(question, k=k)

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']} | Similarity: {c['similarity']:.3f}]\n{c['content']}"
        for c in chunks
    )

    response = ollama.chat(
        model="qwen3:14b",
        messages=[
            {
                "role": "system",
                "content": f"""You are a helpful technical assistant.
Answer the question using ONLY the information in the provided context.
If the context doesn't contain enough information to answer, say so clearly.
Do not use knowledge from outside the provided context.

CONTEXT:
{context}"""
            },
            {"role": "user", "content": question}
        ]
    )

    return {
        "question": question,
        "answer": response["message"]["content"],
        "sources": [c["source"] for c in chunks],
        "top_similarity": chunks[0]["similarity"] if chunks else 0
    }

# Test queries. The __main__ guard lets other scripts import retrieve() and
# answer() from this file without running the queries below.
if __name__ == "__main__":
    questions = [
        "How do I configure PostgreSQL shared_buffers for 8GB RAM?",
        "What UFW commands do I need to allow HTTPS traffic?",
        "How do I check if a Docker container is healthy?",
    ]

    for q in questions:
        result = answer(q)
        print(f"\nQ: {result['question']}")
        print(f"A: {result['answer'][:200]}...")
        print(f"   Sources: {result['sources'][:2]} | Top similarity: {result['top_similarity']:.3f}")

Expected output:

Q: How do I configure PostgreSQL shared_buffers for 8GB RAM?
A: Based on the provided context, set shared_buffers = 2GB (25% of 8GB RAM) in 
/etc/postgresql/17/main/conf.d/performance.conf. Also set effective_cache_size = 6GB...
   Sources: ['postgresql-guide.md', 'ubuntu-setup.md'] | Top similarity: 0.891

Q: What UFW commands do I need to allow HTTPS traffic?
A: From the context: sudo ufw allow https (allows port 443/tcp) and sudo ufw allow http 
(port 80/tcp). Always run sudo ufw allow ssh first before enabling the firewall...
   Sources: ['ubuntu-setup.md', 'docker-tutorial.md'] | Top similarity: 0.847
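
The retrieval query orders by embedding <=> query, which is pgvector's cosine distance operator, and reports 1 minus that distance as the similarity. The sketch below reproduces the arithmetic in plain Python on toy 2-dimensional vectors (illustrative values, not real embeddings) and assumes non-zero vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance: 1 - (a . b) / (|a| * |b|). Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],   # length differs, direction matches
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Sort ascending by distance, as ORDER BY embedding <=> query does.
for name, vec in sorted(candidates.items(), key=lambda kv: cosine_distance(query, kv[1])):
    d = cosine_distance(query, vec)
    print(f"{name}: distance={d:.1f} similarity={1 - d:.1f}")
```

Because cosine distance ignores vector length, the chunk pointing in the same direction as the query ranks first even though its magnitude is different.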

Part 4: Evaluation with RAGAS

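pip install ragas --break-system-packages

# evaluate.py
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Reuse the retrieval and generation functions defined in rag_query.py (Part 3)
from rag_query import answer, retrieve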

# Build evaluation dataset
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

test_cases = [
    {
        "question": "What is the recommended shared_buffers for 8GB RAM?",
        "ground_truth": "2GB (25% of total RAM)"
    },
    {
        "question": "How do I allow HTTPS in UFW?",
        "ground_truth": "sudo ufw allow https"
    }
]

for tc in test_cases:
    result = answer(tc["question"])
    chunks = retrieve(tc["question"])

    eval_data["question"].append(tc["question"])
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append([c["content"] for c in chunks])
    eval_data["ground_truth"].append(tc["ground_truth"])

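dataset = Dataset.from_dict(eval_data)
# Note: unless you pass llm= and embeddings= to evaluate(), RAGAS uses OpenAI
# models as the judge, which needs an API key and sends data off your machine.
# Supply locally hosted models to keep the evaluation step local as well.
scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])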
print("\nRAGAS Evaluation:")
print(f"  Faithfulness:      {scores['faithfulness']:.3f}   (answer supported by context?)")
print(f"  Answer Relevancy:  {scores['answer_relevancy']:.3f}   (on-topic answer?)")
print(f"  Context Precision: {scores['context_precision']:.3f}   (chunks actually relevant?)")

Expected output:

RAGAS Evaluation:
  Faithfulness:      0.912   (answer supported by context?)
  Answer Relevancy:  0.884   (on-topic answer?)
  Context Precision: 0.856   (chunks actually relevant?)

Scores above 0.8 indicate a well-functioning RAG pipeline. If Faithfulness is low, the LLM is hallucinating beyond the context — tighten the system prompt. If Context Precision is low, improve chunking or increase chunk overlap.


Troubleshooting

Low similarity scores (< 0.5) for relevant queries

Cause: Chunking too coarse (chunks too large) or wrong embedding model. Fix: Reduce chunk_size to 300 tokens. Ensure you’re using the same embedding model for both ingestion and querying.

LLM answers with information not in context

Cause: System prompt not firm enough about using only retrieved context. Fix: Add: "If the answer is not in the context, respond with: 'I don't have that information in the provided documents.'" to the system prompt.

pgvector extension not found

Fix: sudo apt-get install postgresql-17-pgvector then CREATE EXTENSION vector; in the database.


Conclusion

A sovereign RAG pipeline: documents chunked and embedded locally with nomic-embed-text:v1.5, stored in pgvector, retrieved via cosine similarity, and answered by Qwen3 14B — all on your hardware with zero external API calls. RAGAS evaluation gives you objective metrics to tune chunk size, overlap, and retrieval depth.

See pgvector vs Qdrant vs ChromaDB 2026 for a deeper comparison of vector store options, and Advanced RAG Techniques for hybrid search, reranking, and multi-query retrieval.


People Also Ask

What is the difference between RAG and fine-tuning?

RAG retrieves relevant information at query time from a dynamic document store — good for up-to-date knowledge, specific documents, and grounded answers. Fine-tuning trains the model to internalize patterns, styles, and domain knowledge — good for consistent tone, specialised output formats, and domain vocabulary. RAG doesn’t modify the model; fine-tuning creates a new model version. Use RAG when your knowledge changes frequently or when you need to cite sources. Use fine-tuning when you need the model to behave differently (output format, tone, domain expertise). See RAG vs Fine-Tuning vs Prompt Engineering 2026 for the full decision framework.

What chunk size should I use for RAG?

Start with 500 tokens, 50-token overlap as a baseline. For technical documentation (code, API docs): try 300 tokens with 30-token overlap — technical content is denser and benefits from smaller chunks. For narrative text (books, articles): try 800 tokens with 80-token overlap — more context per chunk helps the LLM understand meaning. The best chunk size depends on your documents and questions — measure with RAGAS after tuning.


Part 11: RAG System Design Patterns

A robust RAG system starts with clear architectural patterns.

11.1 Modular retrieval and generation

Keep retrieval and generation separate. The retrieval component should return source documents, while the generation component should compose the final answer from those documents.

This separation makes it easier to test and swap parts of the stack.

11.2 Incremental document updates

Design your ingestion pipeline to handle incremental updates. Recompute embeddings only for changed documents, and refresh the index without rebuilding the entire corpus when possible.
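
One way to detect changed documents is to compare a content hash against the hash stored at the last ingestion. The sketch below assumes you keep one hash per source somewhere (for example in an extra column, which the rag_documents table in this tutorial does not have); content_hash and plan_updates are hypothetical helper names.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(stored_hashes: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes with current text and list sources to embed or delete."""
    to_embed = sorted(
        source for source, text in current_docs.items()
        if stored_hashes.get(source) != content_hash(text)  # new or changed
    )
    to_delete = sorted(source for source in stored_hashes if source not in current_docs)
    return {"embed": to_embed, "delete": to_delete}


# Illustrative state: one changed file, one unchanged, one removed, one new.
stored = {
    "a.md": content_hash("old text"),
    "b.md": content_hash("same"),
    "gone.md": content_hash("x"),
}
current = {"a.md": "new text", "b.md": "same", "c.md": "brand new"}
print(plan_updates(stored, current))
```

Only the sources listed under "embed" need their chunks deleted and re-embedded; unchanged files are skipped entirely.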

11.3 Freshness and recency

If your content changes frequently, separate recent documents into a “hot” index. Query both the hot index and the larger archive, then merge results with a recency-aware ranking.

Part 12: Chunking and Document Quality

How you split documents matters more than most people realise.

12.1 Semantic chunking

Chunk by semantic boundaries: paragraphs, sections, or logical units. Avoid arbitrary fixed-size blocks that break meaning.
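
As a minimal sketch of boundary-aware chunking, the function below packs whole paragraphs into chunks up to a character limit instead of cutting at fixed offsets. It assumes paragraphs are separated by blank lines; chunk_by_paragraph is a hypothetical helper, and a paragraph longer than the limit is kept whole rather than split.

```python
def chunk_by_paragraph(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars where possible."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either it fits, or this is an oversized paragraph starting a chunk.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "Intro line.\n\nSecond paragraph here.\n\nThird."
for chunk in chunk_by_paragraph(sample, max_chars=40):
    print(repr(chunk))
```

No chunk ever starts or ends in the middle of a paragraph, which is the property fixed-size splitting cannot guarantee.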

12.2 Chunk metadata

Store metadata with every chunk: source document, section title, author, publication date, and trust level. This metadata is crucial for filtering and provenance.

12.3 Chunk ranking and diversity

When returning multiple chunks, prefer diverse sections from different sources over many similar chunks from one document. Diversity reduces redundancy and improves answer quality.
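
A simple way to encourage diversity is to cap how many chunks any single source may contribute. This is a sketch of that per-source cap, not a full diversity algorithm such as maximal marginal relevance; diversify is a hypothetical helper and it assumes the input is already sorted best-first, as retrieve() returns it.

```python
def diversify(chunks: list[dict], k: int, max_per_source: int = 2) -> list[dict]:
    """Pick up to k chunks in ranked order, capping how many come from one source."""
    if k <= 0:
        return []
    picked, per_source = [], {}
    for chunk in chunks:  # assumed sorted best-first
        count = per_source.get(chunk["source"], 0)
        if count < max_per_source:
            picked.append(chunk)
            per_source[chunk["source"]] = count + 1
        if len(picked) == k:
            break
    return picked


# Illustrative ranked results: three near-duplicates from one file lead the list.
ranked = [
    {"source": "a.md", "similarity": 0.91},
    {"source": "a.md", "similarity": 0.90},
    {"source": "a.md", "similarity": 0.89},
    {"source": "b.md", "similarity": 0.80},
    {"source": "c.md", "similarity": 0.75},
]
print([c["source"] for c in diversify(ranked, k=3, max_per_source=1)])
```

In practice you would over-fetch (for example LIMIT 20) and then apply the cap, so that skipped duplicates can be replaced by chunks from other documents.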

Part 13: Retrieval Chain and Scoring

The retrieval chain is the heart of RAG.

13.1 Candidate generation

Use semantic search to generate candidate chunks. If you have a large corpus, add a lightweight keyword filter before the semantic step to narrow the search space.

13.2 Re-ranking

Re-rank candidates using a second-stage model or a relevance heuristic. Consider both similarity score and document quality metadata.

13.3 Token budget management

Limit the total number of tokens sent to the generator. This budget should include prompt text, retrieved chunks, and the expected answer. If a query is very broad, use a smaller number of higher-quality chunks.
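
The budget arithmetic can be sketched as follows. The 4-characters-per-token estimate is a rough assumption (use the model's tokenizer for real counts), and estimate_tokens and fit_to_budget are hypothetical helper names.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: list[str], context_window: int,
                  prompt_tokens: int, answer_tokens: int) -> list[str]:
    """Keep best-first chunks until the space left for retrieved context is used up."""
    budget = context_window - prompt_tokens - answer_tokens
    kept, used = [], 0
    for chunk in chunks:  # assumed sorted best-first
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


# Three chunks of ~100 tokens each; only 200 tokens remain for context.
candidate_chunks = ["a" * 400, "b" * 400, "c" * 400]
kept = fit_to_budget(candidate_chunks, context_window=1000, prompt_tokens=300, answer_tokens=500)
print(len(kept), "of", len(candidate_chunks), "chunks fit")
```

Reserving room for the answer up front prevents the retrieved context from crowding out the model's own output.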

Part 14: Prompt Engineering for RAG

Effective prompts are the final step.

14.1 Grounding instructions

Tell the model to rely only on retrieved sources.

Use only the information provided in the Sources section. If the answer cannot be found, say "I don't know."

14.2 Structured answer templates

Use answer templates to constrain the output.

Question: {question}
Sources:
{sources}
Answer:
1. Summary:
2. Supporting sources:

This helps with consistency and makes validation easier.
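
A minimal sketch of filling that template in Python. It assumes chunks shaped like the dictionaries returned by retrieve() in Part 3; build_prompt and the numbered "[n] (source)" line format are illustrative choices, not a fixed convention.

```python
ANSWER_TEMPLATE = """Question: {question}
Sources:
{sources}
Answer:
1. Summary:
2. Supporting sources:"""


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can refer to it by index."""
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return ANSWER_TEMPLATE.format(question=question, sources=sources)


prompt = build_prompt(
    "How do I allow HTTPS in UFW?",
    [{"source": "ubuntu-setup.md", "content": "sudo ufw allow https"}],
)
print(prompt)
```

Numbering the sources also gives a validator something concrete to check: every index cited in the answer must exist in the prompt.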

14.3 Error recovery

Include instructions for uncertain cases.

If the source data is incomplete or conflicting, be transparent about the uncertainty and list the relevant sources.

Part 15: Evaluation and Feedback

Measure RAG quality with both automated and human reviews.

15.1 Ground-truth datasets

Build a test set of questions with expected answers and source provenance. Use it to validate retrieval recall and generation accuracy.
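
Retrieval quality can be checked separately from generation with a simple hit rate: the share of test questions whose expected source shows up in the top K results. The sketch below takes the retriever as a parameter so it can be exercised with a stub; the test-set field names and fake_retrieve are hypothetical, and in the real pipeline you would pass the retrieve() function from Part 3.

```python
def hit_rate_at_k(test_set: list[dict], retrieve_fn, k: int = 5) -> float:
    """Fraction of questions whose expected source appears in the top-k chunks."""
    if not test_set:
        return 0.0
    hits = 0
    for case in test_set:
        sources = [c["source"] for c in retrieve_fn(case["question"], k)]
        if case["expected_source"] in sources:
            hits += 1
    return hits / len(test_set)


def fake_retrieve(question: str, k: int) -> list[dict]:
    """Stub retriever for illustration; deliberately wrong for shared_buffers."""
    index = {"ufw": ["ubuntu-setup.md"], "shared_buffers": ["docker-tutorial.md"]}
    for keyword, sources in index.items():
        if keyword in question:
            return [{"source": s} for s in sources][:k]
    return []


test_set = [
    {"question": "How do I allow HTTPS in ufw?", "expected_source": "ubuntu-setup.md"},
    {"question": "What should shared_buffers be?", "expected_source": "postgresql-guide.md"},
]
print(f"hit rate@5: {hit_rate_at_k(test_set, fake_retrieve):.2f}")
```

A low hit rate points at chunking, embeddings, or the index; a high hit rate with poor answers points at the prompt or the generator.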

15.2 Human review

Have reviewers verify that model answers are supported by the cited sources. Flag hallucinations and wrong source attributions.

15.3 Continuous feedback loops

Collect user feedback and feed it back into the system. If a query frequently results in wrong answers, improve the retrieval data, prompt, or source corpus.

Part 16: Deployment and Service Patterns

Deploy RAG as a stable service with clear boundaries.

16.1 Local inference vs remote

A self-hosted RAG system can run entirely locally or use a local retrieval stage with a remote generator. For sovereignty, keep both retrieval and generation on-premises whenever possible.

16.2 API contract

Define a contract for RAG API responses.

{
  "answer": "...",
  "sources": ["doc1","doc2"],
  "confidence": 0.82
}

Include source provenance and an optional confidence score.
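
One way to pin the contract down in code is a small dataclass that serialises to the JSON shape above. RagResponse is a hypothetical name, and omitting the confidence key when no score is available is a design assumption, not something the contract mandates.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RagResponse:
    answer: str
    sources: list[str] = field(default_factory=list)
    confidence: Optional[float] = None  # optional, as in the contract

    def to_json(self) -> str:
        payload = asdict(self)
        if payload["confidence"] is None:
            del payload["confidence"]  # leave the key out rather than send null
        return json.dumps(payload)


response = RagResponse(answer="...", sources=["doc1", "doc2"], confidence=0.82)
print(response.to_json())
```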

16.3 Rate limiting and quotas

Protect your local service with rate limits. RAG generation can be expensive, and unbounded usage can overwhelm CPU/GPU resources.
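
A token bucket is one common way to implement such a limit. The sketch below is a single-process, in-memory version (no locking, so not thread-safe as written); TokenBucket is a hypothetical class, and the injectable clock exists only to make the behaviour easy to demonstrate.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Demonstration with a fake clock: 1 request/second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=1.0, capacity=2, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_wait = bucket.allow()
print(burst, after_wait)
```

Call allow() before invoking answer() and return an HTTP 429 or equivalent when it is False.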

Part 17: Observability and Debugging

Visibility into the retrieval and generation pipeline is essential.

17.1 Retrieval metrics

Track query volume, retrieval latency, number of chunks returned, and recall rates. Use these metrics to detect index degradation or stale data.

17.2 Generation metrics

Track generation latency, prompt lengths, and token usage. Monitor for slow queries and unexpected bursts of long answers.

17.3 Error tracing

Log errors at every stage: ingestion failures, embedding service errors, index timeouts, and generation failures. Correlate logs with request IDs.

Part 18: Security and Access Control

A RAG system can expose sensitive documents if not constrained.

18.1 Document-level access control

Protect sensitive documents with metadata-based filtering. Only retrieve them when the user is authorised.
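
A minimal sketch of metadata-based filtering, assuming each chunk carries an allowed_groups list (a hypothetical metadata field; the tutorial's table does not store one). Chunks without that metadata are denied by default.

```python
def filter_authorised(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep chunks whose allowed_groups overlap the user's groups; deny if missing."""
    return [
        c for c in chunks
        if set(c.get("allowed_groups", ())) & user_groups
    ]


retrieved = [
    {"source": "handbook.md", "allowed_groups": ["staff", "contractors"]},
    {"source": "payroll.md", "allowed_groups": ["hr"]},
    {"source": "untagged.md"},  # no metadata, so it is filtered out
]
visible = filter_authorised(retrieved, user_groups={"staff"})
print([c["source"] for c in visible])
```

Filtering after retrieval is the simplest form to illustrate; applying the same condition in the SQL WHERE clause keeps restricted rows from leaving the database at all.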

18.2 Prompt sanitization

Sanitize user queries before using them in prompt templates. Remove or encode control characters and dangerous payloads.
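
A minimal sketch of the character-level part of this step: it drops Unicode control and format characters, collapses whitespace, and caps the length. sanitize_query and the 2000-character default are illustrative assumptions, and this kind of cleanup does not by itself stop prompt injection written in ordinary text.

```python
import unicodedata


def sanitize_query(query: str, max_len: int = 2000) -> str:
    """Remove control/format characters, collapse whitespace runs, cap the length."""
    cleaned = "".join(
        ch for ch in query
        # Newlines and tabs are kept here so they become spaces below
        # instead of gluing neighbouring words together.
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return " ".join(cleaned.split())[:max_len]


raw = "hello\x00 wor\u200bld\n\nnext"
print(repr(sanitize_query(raw)))
```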

18.3 Auditability

Keep an audit log of queries, retrieved sources, and generated answers. This is indispensable for compliance and incident response.

Part 19: Scaling the Vector Store

Vector stores can grow quickly. Plan for scale from the start.

19.1 Partitioning

Partition large corpora by domain, date, or source. Query the most relevant partitions first to improve latency.

19.2 Index maintenance

Rebuild or compress your index periodically to remove stale vectors and improve search quality. Monitoring index size and query latency helps determine the right cadence.

19.3 Hybrid indexes

Combine dense vectors with sparse keyword search for a hybrid retrieval strategy. This can improve recall on long-tail queries.
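
One common way to merge a dense ranking with a keyword ranking is reciprocal rank fusion (RRF), sketched below. It needs only the two ranked lists of chunk IDs, not comparable scores. The IDs are illustrative, k = 60 is a conventional default rather than a tuned value, and where the keyword ranking comes from (for example PostgreSQL full-text search) is left as an assumption.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ranking = ["c1", "c2", "c3"]    # from vector search
keyword_ranking = ["c3", "c1", "c4"]  # from keyword search
fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, while a chunk found only by keyword search (c4 here) still makes it into the candidate set.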

Part 20: Final RAG Operations Checklist

  • retrieval and generation are decoupled
  • chunking uses semantic boundaries and metadata
  • prompts force grounding in sources
  • evaluation includes both automated and human review
  • security controls prevent unauthorized access to sensitive content
  • metrics capture retrieval, generation, and infrastructure health
  • index updates are incremental and repeatable
  • audit logs preserve provenance and query history

A production RAG deployment is not simply about retrieval. It is about making search, generation, and governance work together in a way that is reliable, explainable, and maintainable.

Part 21: Vector Store Selection and Tradeoffs

Choosing the right vector store is critical for performance and cost.

21.1 Local vs remote indexes

Local vector stores give you sovereignty and low latency. Remote vector databases can simplify scaling but introduce external dependencies.

21.2 Approximate nearest neighbour settings

Tune ef_search, M, and other HNSW parameters for your workload. Higher values increase recall at the cost of latency and memory.

21.3 Storage compression

Compress vector data when RAM is limited. Trade off a small retrieval latency increase for lower memory usage.

Part 22: Query and Prompt Caching

Cache results to reduce repeated computation.

22.1 Query result caches

Cache retrieval results for repeated queries. Use a time-based expiration so stale data is refreshed.

22.2 Prompt template caches

Cache compiled prompt templates and injected metadata. This reduces overhead in high-throughput systems.

22.3 Answer cache invalidation

Invalidate caches when underlying documents change. Record document version or timestamp with the cache key.
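
The time-based expiry and version-keyed invalidation described in this part can be combined in one small cache. This is an in-memory, single-process sketch; RetrievalCache and the corpus_version string are hypothetical, and the injectable clock is there only for demonstration.

```python
import time


class RetrievalCache:
    """Cache retrieval results keyed by (query, corpus_version) with a TTL."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[tuple[str, str], tuple[float, list]] = {}

    def get(self, query: str, corpus_version: str):
        key = (query, corpus_version)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # expired: force a fresh retrieval
            return None
        return results

    def put(self, query: str, corpus_version: str, results: list) -> None:
        self._store[(query, corpus_version)] = (self.clock(), results)


fake_now = [0.0]
cache = RetrievalCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.put("ufw https", "v1", ["ubuntu-setup.md"])
fresh = cache.get("ufw https", "v1")        # hit
other_version = cache.get("ufw https", "v2")  # miss: documents changed
fake_now[0] = 61.0
expired = cache.get("ufw https", "v1")      # miss: TTL elapsed
print(fresh, other_version, expired)
```

Bumping the corpus version after each ingestion run makes every old entry unreachable without having to enumerate and delete them.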

Part 23: Handling Unstructured and Multimodal Data

RAG systems often need to work beyond plain text.

23.1 OCR and scanned documents

Run OCR on scanned files, then chunk and embed the extracted text. Store source coordinates for provenance.

23.2 Image and audio embeddings

For multimodal retrieval, use embeddings that support images or audio. Keep the modality metadata alongside the text chunks.

23.3 Composite retrieval

Combine text and image matches in the retrieval stage. Use a weighted scoring strategy to balance modalities.

Part 24: User Experience and Answer Quality

The quality of the result matters as much as the correctness.

24.1 Concise answer generation

Generate concise answers with source summaries. Lengthy, verbose responses are harder to verify and less useful in practice.

24.2 Answer confidence and transparency

Return confidence indicators and clearly cite the sources used. This helps users trust and evaluate the answer.

24.3 Handling unknowns

If the system cannot answer confidently, say so. A safe response is better than a convincing hallucination.

Part 25: Governance and Documentation

Keep your RAG system understandable and auditable.

25.1 System documentation

Document the retrieval pipeline, prompt templates, index refresh process, and access controls. Include a glossary of terms for the team.

25.2 Review cycles

Review RAG pipeline components periodically. Validate that embeddings, index quality, and prompt templates still match your use cases.

25.3 Change logs

Keep change logs for corpus updates, retrieval tuning, model changes, and prompt modifications. This is essential for debugging and accountability.

Part 26: Observability and Debugging

A RAG system is only maintainable if its behavior is observable.

26.1 Retrieval transparency

Log which chunks were retrieved for each query and their similarity scores. This helps diagnose why the model produced a particular answer.

26.2 Prompt and answer tracing

Capture the final prompt, the retrieved sources, and the generated answer for debugging. Store these traces separately from production logs to avoid exposing sensitive content unnecessarily.

26.3 Query classification

Track query types, intents, and failure modes. Use this data to identify whether your system is underperforming on a particular class of questions.

Part 27: Iterative Improvement Workflows

Improve RAG through repeatable processes.

27.1 User feedback loops

Collect explicit user feedback on answer quality. Feed low-confidence or incorrect answers back into model tuning, retrieval adjustments, or prompt revisions.

27.2 Progressive corpus expansion

Add new documents incrementally and validate retrieval quality after each update. Avoid large batch refreshes unless necessary.

27.3 Guardrail evolution

When you change prompt templates or answer policies, keep the previous version as a fallback. Record the change rationale and evaluation results.

Part 28: Security Boundaries and Sensitive Content

Protect sensitive information at every layer.

28.1 Context filtering

Pre-filter user queries and documents to avoid retrieving or exposing private data. Use metadata tags and access controls in the retrieval stage.

28.2 Response sanitisation

Sanitise generated answers to remove or redact sensitive terms when required. This is particularly important for documents with personally identifiable information.

28.3 Audit logs for source access

Log which source documents were used for each answer, without storing the full response if it contains regulated data. This provides traceability for audits.

Part 29: Performance Engineering

Tune your RAG pipeline for latency and throughput.

29.1 Retrieval cache warmup

Pre-warm the vector store and embedding cache before peak traffic. For local deployments, keep the index in memory for faster search.

29.2 Token budget optimization

Trim retrieved content to fit the generator’s input budget. Remove redundant text and prefer concise, high-value chunks.

29.3 Batch retrieval and generation

On high-throughput systems, batch multiple requests through the retrieval and generation stages. This can improve hardware utilization while keeping latency within targets.

Part 30: Final Product and Service Considerations

A production RAG service must feel solid to users.

30.1 Consistent answer format

Keep answer structure consistent across similar queries. This makes responses easier to consume and more reliable.

30.2 Source attribution UX

Present sources transparently, but not verbosely. Provide enough provenance for trust without overwhelming the user.

30.3 Service-level expectations

Define how the RAG service should behave under load, during updates, and on failure. Document and enforce those expectations in the operational runbook.

Part 31: Practical Prompt Templates

Well-structured templates can make your RAG answers more consistent and reliable.

31.1 Citations-first template

You are a knowledgeable assistant. Use only the sources listed below.
Cite the source number for each statement.

Sources:
{sources}

Question: {question}

Answer:
1. Summary:
2. Citations:

31.2 Safety-aware template

Answer the question directly. If the information is not present in the sources, say "I don't know." Do not fabricate details.

Sources:
{sources}

Question: {question}

Answer:

31.3 Evaluation template

Keep a small set of evaluation prompts with expected output structure. This makes it easier to detect prompt or retrieval regressions.

Tested on: Ubuntu 24.04 LTS (RTX 4090). Ollama 0.5.12, pgvector 0.8.0, RAGAS 0.1.x. Last verified: April 28, 2026.

Kofi Mensah

About the Author

Inference Economics & Hardware Architect

Electrical Engineer | Hardware Systems Architect | 8+ Years in GPU/AI Optimization | ARM & x86 Specialist

Kofi Mensah is a hardware architect and AI infrastructure specialist focused on optimizing inference costs for on-device and local-first AI deployments. With expertise in CPU/GPU architectures, Kofi analyzes real-world performance trade-offs between commercial cloud AI services and sovereign, self-hosted models running on consumer and enterprise hardware (Apple Silicon, NVIDIA, AMD, custom ARM systems). He quantifies the total cost of ownership for AI infrastructure and evaluates which deployment models (cloud, hybrid, on-device) make economic sense for different workloads and use cases. Kofi's technical analysis covers model quantization, inference optimization techniques (llama.cpp, vLLM), and hardware acceleration for language models, vision models, and multimodal systems. At Vucense, Kofi provides detailed cost analysis and performance benchmarks to help developers understand the real economics of sovereign AI.
