RAG Tutorial 2026: Build a Local Retrieval-Augmented Generation Pipeline

🟡Intermediate

Build a sovereign RAG pipeline from scratch with Ollama, pgvector, and Python. Covers document chunking, embedding generation, vector search, context injection, and RAGAS evaluation.

By Kofi Mensah ✓

Mar 4, 2026

Key Takeaways

RAG (Retrieval-Augmented Generation) grounds LLM answers in your documents — the model retrieves relevant text chunks at query time and generates answers based only on that retrieved context, reducing hallucination and enabling up-to-date knowledge.
The sovereign RAG stack requires three components: an embedding model (nomic-embed-text:v1.5 via Ollama), a vector store (pgvector in PostgreSQL), and an LLM (Qwen3 14B via Ollama) — all running locally with zero cloud API calls after setup.
Chunk size is the most important RAG tuning parameter — 500 tokens with 50-token overlap is a good starting point; too large and retrieved context becomes noisy, too small and important context gets cut off mid-sentence.
Evaluate your RAG pipeline with RAGAS — it measures Faithfulness (does the answer match the retrieved context?), Answer Relevancy (is the answer on-topic?), and Context Precision (were the retrieved chunks actually relevant?) to objectively compare chunking and retrieval strategies.

Key Takeaways

RAG = Retrieve then Generate: First retrieve relevant chunks from your documents using vector similarity search; then give those chunks to the LLM as context. The LLM answers based on the retrieved context, not its training data.
Three local components: Ollama (embedding model + LLM), pgvector in PostgreSQL (vector store), Python (orchestration). Zero cloud.
Chunk overlap is not optional: Without overlap, a sentence split across two chunks loses context. 50-token overlap is the minimum; 10–20% of chunk size is the standard.
Evaluate with RAGAS: Don’t guess if RAG is working — measure Faithfulness and Answer Relevancy with the RAGAS library to quantify quality before and after parameter changes.

Introduction

Direct Answer: How do I build a local RAG pipeline with Ollama and pgvector in Python in 2026?

A local RAG pipeline has three phases:

(1) Ingestion: load documents, split them into ~500-token chunks with 50-token overlap, embed each chunk with ollama.embeddings(model='nomic-embed-text:v1.5', prompt=chunk), and store the (chunk_text, embedding_vector) pairs in a pgvector table such as CREATE TABLE docs (content TEXT, embedding vector(768)).
(2) Retrieval: embed the user’s query with the same model, then run SELECT content FROM docs ORDER BY embedding <=> query_vector LIMIT 5 to fetch the most similar chunks by cosine distance.
(3) Generation: inject the retrieved chunks into the LLM system prompt as context, then call ollama.chat(model='qwen3:14b', messages=[system_with_context, user_question]).

The full pipeline runs locally. Install the dependencies with pip install ollama psycopg2-binary pgvector langchain-text-splitters and make sure both Ollama and PostgreSQL with pgvector are running.

Architecture

INGESTION PHASE (once, or when documents update):
┌──────────────┐    chunk     ┌──────────┐    embed     ┌──────────────────┐
│  Documents   │─────────────▶│  Chunks  │──────────────▶│ Embedding Model  │
│ PDF/MD/TXT   │              │ 500 tok  │  Ollama SDK   │ nomic-embed-text │
└──────────────┘              └──────────┘               └────────┬─────────┘
                                                                   │ vectors
                                                         ┌─────────▼────────┐
                                                         │    pgvector DB   │
                                                         │  PostgreSQL 17   │
                                                         └─────────┬────────┘

QUERY PHASE (every user question):
User question ──embed──▶ Query vector ──cosine search──▶ Top K chunks
                                                              │
                                                    ┌─────────▼──────────┐
                                                    │  LLM (Qwen3 14B)   │
                                                    │  System: context   │
                                                    │  User: question    │
                                                    └─────────┬──────────┘
                                                              │
                                                         Grounded answer

Part 1: Setup

pip install ollama psycopg2-binary pgvector langchain-text-splitters --break-system-packages

# Pull embedding and LLM models
ollama pull nomic-embed-text:v1.5   # 274MB — fast, high quality embeddings
ollama pull qwen3:14b               # 9GB — best local LLM for Q&A

# Enable pgvector in PostgreSQL
sudo -u postgres psql -d myapp -c "CREATE EXTENSION IF NOT EXISTS vector;"
sudo -u postgres psql -d myapp -c "SELECT extversion FROM pg_extension WHERE extname='vector';"

Expected output:

 extversion
------------
 0.8.0

Part 2: Ingestion Pipeline

# ingest.py
import ollama
import psycopg2
from psycopg2.extras import execute_values
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pathlib import Path

# Database connection
conn = psycopg2.connect("postgresql://appuser:password@localhost:5432/myapp")

# Create the documents table
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS rag_documents (
            id          BIGSERIAL PRIMARY KEY,
            source      TEXT NOT NULL,
            chunk_index INT NOT NULL,
            content     TEXT NOT NULL,
            embedding   vector(768)    -- nomic-embed-text:v1.5 produces 768-dim vectors
        );
        CREATE INDEX IF NOT EXISTS rag_docs_embedding_idx
            ON rag_documents USING hnsw (embedding vector_cosine_ops)
            WITH (m = 16, ef_construction = 64);
    """)
    conn.commit()
    print("Table and HNSW index created")

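# Splitter: ~500-token chunks with ~50-token overlap.
# RecursiveCharacterTextSplitter counts characters by default, not tokens,
# so the sizes below assume roughly 4 characters per token for English text.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # ~500 tokens
    chunk_overlap=200,  # ~50 tokens
    separators=["\n\n", "\n", ". ", " ", ""],
)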

def embed(text: str) -> list[float]:
    return ollama.embeddings(model="nomic-embed-text:v1.5", prompt=text)["embedding"]

def ingest_file(filepath: str) -> int:
    content = Path(filepath).read_text(encoding="utf-8", errors="ignore")
    chunks = splitter.split_text(content)
    source = Path(filepath).name

    rows = []
    for i, chunk in enumerate(chunks):
        vector = embed(chunk)
        rows.append((source, i, chunk, str(vector)))

    with conn.cursor() as cur:
        execute_values(cur,
            "INSERT INTO rag_documents (source, chunk_index, content, embedding) VALUES %s",
            rows,
            template="(%s, %s, %s, %s::vector)"
        )
        conn.commit()

    print(f"  Ingested: {source} → {len(chunks)} chunks")
    return len(chunks)

# Ingest documents
docs = list(Path("./docs").glob("*.md")) + list(Path("./docs").glob("*.txt"))
total = sum(ingest_file(str(d)) for d in docs)
print(f"\nTotal chunks ingested: {total}")

conn.close()

Expected output:

Table and HNSW index created
  Ingested: ubuntu-setup.md → 47 chunks
  Ingested: postgresql-guide.md → 63 chunks
  Ingested: docker-tutorial.md → 38 chunks

Total chunks ingested: 148

Part 3: Retrieval and Generation

# rag_query.py
import ollama
import psycopg2

conn = psycopg2.connect("postgresql://appuser:password@localhost:5432/myapp")

def retrieve(query: str, k: int = 5) -> list[dict]:
    """Find the K most semantically similar chunks to the query."""
    query_vec = ollama.embeddings(model="nomic-embed-text:v1.5", prompt=query)["embedding"]

    with conn.cursor() as cur:
        cur.execute("""
            SELECT source, content,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM rag_documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (str(query_vec), str(query_vec), k))
        rows = cur.fetchall()

    return [{"source": r[0], "content": r[1], "similarity": r[2]} for r in rows]

def answer(question: str, k: int = 5) -> dict:
    """Retrieve relevant chunks and generate a grounded answer."""
    chunks = retrieve(question, k=k)

    # Build context from retrieved chunks
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']} | Similarity: {c['similarity']:.3f}]\n{c['content']}"
        for c in chunks
    )

    response = ollama.chat(
        model="qwen3:14b",
        messages=[
            {
                "role": "system",
                "content": f"""You are a helpful technical assistant.
Answer the question using ONLY the information in the provided context.
If the context doesn't contain enough information to answer, say so clearly.
Do not use knowledge from outside the provided context.

CONTEXT:
{context}"""
            },
            {"role": "user", "content": question}
        ]
    )

    return {
        "question": question,
        "answer": response["message"]["content"],
        "sources": [c["source"] for c in chunks],
        "top_similarity": chunks[0]["similarity"] if chunks else 0
    }

# Test queries (guarded so that importing this module from another script does not run them)
if __name__ == "__main__":
    questions = [
        "How do I configure PostgreSQL shared_buffers for 8GB RAM?",
        "What UFW commands do I need to allow HTTPS traffic?",
        "How do I check if a Docker container is healthy?",
    ]

    for q in questions:
        result = answer(q)
        print(f"\nQ: {result['question']}")
        print(f"A: {result['answer'][:200]}...")
        print(f"   Sources: {result['sources'][:2]} | Top similarity: {result['top_similarity']:.3f}")

Expected output:

Q: How do I configure PostgreSQL shared_buffers for 8GB RAM?
A: Based on the provided context, set shared_buffers = 2GB (25% of 8GB RAM) in 
/etc/postgresql/17/main/conf.d/performance.conf. Also set effective_cache_size = 6GB...
   Sources: ['postgresql-guide.md', 'ubuntu-setup.md'] | Top similarity: 0.891

Q: What UFW commands do I need to allow HTTPS traffic?
A: From the context: sudo ufw allow https (allows port 443/tcp) and sudo ufw allow http 
(port 80/tcp). Always run sudo ufw allow ssh first before enabling the firewall...
   Sources: ['ubuntu-setup.md', 'docker-tutorial.md'] | Top similarity: 0.847

Part 4: Evaluation with RAGAS

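pip install ragas langchain-ollama --break-system-packages

# evaluate.py
from datasets import Dataset
from langchain_ollama import ChatOllama, OllamaEmbeddings
from ragas import evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Reuse the pipeline from Part 3. Importing rag_query runs its top-level code,
# so keep its test loop behind an `if __name__ == "__main__":` guard.
from rag_query import answer, retrieve

# RAGAS uses OpenAI models as the judge by default. To keep evaluation local,
# pass Ollama-backed models explicitly. Wrapper names can differ between
# RAGAS releases, so check them against your installed version.
judge_llm = LangchainLLMWrapper(ChatOllama(model="qwen3:14b"))
judge_embeddings = LangchainEmbeddingsWrapper(
    OllamaEmbeddings(model="nomic-embed-text:v1.5")
)

# Build evaluation dataset
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}

test_cases = [
    {
        "question": "What is the recommended shared_buffers for 8GB RAM?",
        "ground_truth": "2GB (25% of total RAM)"
    },
    {
        "question": "How do I allow HTTPS in UFW?",
        "ground_truth": "sudo ufw allow https"
    }
]

for tc in test_cases:
    result = answer(tc["question"])
    chunks = retrieve(tc["question"])

    eval_data["question"].append(tc["question"])
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append([c["content"] for c in chunks])
    eval_data["ground_truth"].append(tc["ground_truth"])

dataset = Dataset.from_dict(eval_data)
scores = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
    llm=judge_llm,
    embeddings=judge_embeddings,
)

# Average the per-question scores for each metric
means = scores.to_pandas()[
    ["faithfulness", "answer_relevancy", "context_precision"]
].mean()
print("\nRAGAS Evaluation:")
print(f"  Faithfulness:      {means['faithfulness']:.3f}   (answer supported by context?)")
print(f"  Answer Relevancy:  {means['answer_relevancy']:.3f}   (on-topic answer?)")
print(f"  Context Precision: {means['context_precision']:.3f}   (chunks actually relevant?)")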

Expected output:

RAGAS Evaluation:
  Faithfulness:      0.912   (answer supported by context?)
  Answer Relevancy:  0.884   (on-topic answer?)
  Context Precision: 0.856   (chunks actually relevant?)

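As a rule of thumb, scores above roughly 0.8 suggest the pipeline is working well, though the right threshold depends on your corpus and test set. If Faithfulness is low, the LLM is making claims that the retrieved context does not support: tighten the system prompt. If Context Precision is low, the retrieved chunks are not relevant enough: improve chunking or increase chunk overlap.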

Troubleshooting

Low similarity scores (< 0.5) for relevant queries

Cause: Chunking too coarse (chunks too large) or wrong embedding model. Fix: Reduce chunk_size to 300 tokens. Ensure you’re using the same embedding model for both ingestion and querying.

LLM answers with information not in context

Cause: System prompt not firm enough about using only retrieved context. Fix: Add: "If the answer is not in the context, respond with: 'I don't have that information in the provided documents.'" to the system prompt.

`pgvector` extension not found

Fix: sudo apt-get install postgresql-17-pgvector then CREATE EXTENSION vector; in the database.

Conclusion

A sovereign RAG pipeline: documents chunked and embedded locally with nomic-embed-text:v1.5, stored in pgvector, retrieved via cosine similarity, and answered by Qwen3 14B — all on your hardware with zero external API calls. RAGAS evaluation gives you objective metrics to tune chunk size, overlap, and retrieval depth.

See pgvector vs Qdrant vs ChromaDB 2026 for a deeper comparison of vector store options, and Advanced RAG Techniques for hybrid search, reranking, and multi-query retrieval.

Part 11: RAG System Design Patterns

A robust RAG system starts with clear architectural patterns.

11.1 Modular retrieval and generation

Keep retrieval and generation separate. The retrieval component should return source documents, while the generation component should compose the final answer from those documents.

This separation makes it easier to test and swap parts of the stack.

11.2 Incremental document updates

Design your ingestion pipeline to handle incremental updates. Recompute embeddings only for changed documents, and refresh the index without rebuilding the entire corpus when possible.
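
One way to decide which documents need re-embedding is to compare a content hash against the hash stored at the last ingestion. The sketch below is a minimal illustration under that assumption; `content_hash`, `plan_updates`, and the `stored_hashes` mapping are hypothetical names, and in practice the hashes would live in a column next to each document's chunks.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_updates(current_docs: dict[str, str],
                 stored_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents against the hashes recorded at last ingestion."""
    # New or modified documents: hash is missing or differs
    changed = [name for name, text in current_docs.items()
               if stored_hashes.get(name) != content_hash(text)]
    # Documents that were ingested before but no longer exist
    deleted = [name for name in stored_hashes if name not in current_docs]
    return {"reembed": sorted(changed), "delete": sorted(deleted)}


stored = {"a.md": content_hash("alpha"), "b.md": content_hash("beta")}
current = {"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}
plan = plan_updates(current, stored)
print(plan)
```

Only the documents listed under `reembed` need their old chunks deleted and new chunks embedded; unchanged documents are skipped entirely.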

11.3 Freshness and recency

If your content changes frequently, separate recent documents into a “hot” index. Query both the hot index and the larger archive, then merge results with a recency-aware ranking.
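
A recency-aware merge could look like the following sketch. It assumes each hit carries a `published` timestamp, and it blends similarity with an exponential freshness decay; the `half_life_days` and `weight` values are illustrative assumptions, not tuned recommendations.

```python
from datetime import datetime, timezone


def recency_score(similarity: float, published: datetime, now: datetime,
                  half_life_days: float = 30.0, weight: float = 0.2) -> float:
    """Blend cosine similarity with a freshness term that halves every half_life_days."""
    age_days = max((now - published).total_seconds() / 86400.0, 0.0)
    freshness = 0.5 ** (age_days / half_life_days)
    return (1 - weight) * similarity + weight * freshness


def merge_results(hot: list[dict], archive: list[dict],
                  now: datetime, k: int = 5) -> list[dict]:
    """Merge hits from both indexes, de-duplicate, and rank by the blended score."""
    best = {}
    for hit in hot + archive:
        score = recency_score(hit["similarity"], hit["published"], now)
        key = (hit["source"], hit["chunk_index"])
        if key not in best or score > best[key]["score"]:
            best[key] = {**hit, "score": score}
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:k]


now = datetime(2026, 3, 4, tzinfo=timezone.utc)
hot_hits = [{"source": "new.md", "chunk_index": 0, "similarity": 0.80,
             "published": now}]
archive_hits = [{"source": "old.md", "chunk_index": 0, "similarity": 0.82,
                 "published": datetime(2024, 3, 4, tzinfo=timezone.utc)}]
merged = merge_results(hot_hits, archive_hits, now)
```

In this example the fresh chunk outranks the two-year-old one despite a slightly lower similarity, which is the intended effect of the freshness term.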

Part 12: Chunking and Document Quality

How you split documents matters more than most people realise.

12.1 Semantic chunking

Chunk by semantic boundaries: paragraphs, sections, or logical units. Avoid arbitrary fixed-size blocks that break meaning.
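
As a minimal sketch of boundary-aware chunking, the function below packs whole paragraphs into chunks up to a character limit and never splits inside a paragraph. The name `chunk_by_paragraph` and the `max_chars` default are assumptions for illustration; a paragraph longer than the limit is kept intact as its own chunk.

```python
def chunk_by_paragraph(text: str, max_chars: int = 2000) -> list[str]:
    """Group consecutive paragraphs into chunks without cutting a paragraph in half."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Fits, or this is an oversized paragraph starting a fresh chunk
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


example = chunk_by_paragraph("aaaa\n\nbbbb\n\ncccc", max_chars=10)
```

Here the first two paragraphs fit together within 10 characters, so `example` holds two chunks: the joined pair and the third paragraph on its own.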

12.2 Chunk metadata

Store metadata with every chunk: source document, section title, author, publication date, and trust level. This metadata is crucial for filtering and provenance.

12.3 Chunk ranking and diversity

When returning multiple chunks, prefer diverse sections from different sources over many similar chunks from one document. Diversity reduces redundancy and improves answer quality.
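
A simple way to enforce diversity is to cap how many chunks any single source may contribute. This is a sketch under the assumption that hits are dictionaries with `source` and `similarity` keys, as returned by the retrieve function earlier in this tutorial; `diversify` and `max_per_source` are hypothetical names.

```python
def diversify(hits: list[dict], k: int = 5, max_per_source: int = 2) -> list[dict]:
    """Pick the top-k hits by similarity while limiting chunks per source document."""
    picked: list[dict] = []
    per_source: dict[str, int] = {}
    for hit in sorted(hits, key=lambda h: h["similarity"], reverse=True):
        count = per_source.get(hit["source"], 0)
        if count < max_per_source:
            picked.append(hit)
            per_source[hit["source"]] = count + 1
        if len(picked) == k:
            break
    return picked


hits = [
    {"source": "a.md", "similarity": 0.90},
    {"source": "a.md", "similarity": 0.89},
    {"source": "a.md", "similarity": 0.88},
    {"source": "b.md", "similarity": 0.70},
]
top = diversify(hits, k=3, max_per_source=2)
```

To leave room for the cap to work, retrieve more candidates than you intend to keep (for example, fetch 20 and keep 5).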

Part 13: Retrieval Chain and Scoring

The retrieval chain is the heart of RAG.

13.1 Candidate generation

Use semantic search to generate candidate chunks. If you have a large corpus, add a lightweight keyword filter before the semantic step to narrow the search space.

13.2 Re-ranking

Re-rank candidates using a second-stage model or a relevance heuristic. Consider both similarity score and document quality metadata.

13.3 Token budget management

Limit the total number of tokens sent to the generator. This budget should include prompt text, retrieved chunks, and the expected answer. If a query is very broad, use a smaller number of higher-quality chunks.
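
The budget arithmetic can be sketched as follows. The token estimate of four characters per token is a rough assumption for English text, and the `context_window`, `prompt_tokens`, and `answer_tokens` defaults are placeholders to replace with your model's real limits.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: list[str], context_window: int = 8192,
                  prompt_tokens: int = 200, answer_tokens: int = 512) -> list[str]:
    """Keep the best-ranked chunks that fit after reserving prompt and answer space."""
    budget = context_window - prompt_tokens - answer_tokens
    kept: list[str] = []
    used = 0
    for chunk in chunks:  # chunks are assumed to be sorted best-first
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


ranked = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
kept = fit_to_budget(ranked, context_window=1000)
```

With a 1000-token window, 200 tokens reserved for the prompt and 512 for the answer, 288 tokens remain, so only the first two chunks are kept.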

Part 14: Prompt Engineering for RAG

Effective prompts are the final step.

14.1 Grounding instructions

Tell the model to rely only on retrieved sources.

Use only the information provided in the Sources section. If the answer cannot be found, say "I don't know."

14.2 Structured answer templates

Use answer templates to constrain the output.

Question: {question}
Sources:
{sources}
Answer:
1. Summary:
2. Supporting sources:

This helps with consistency and makes validation easier.
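
Filling the template above from retrieved chunks might look like this sketch. The `build_prompt` helper and the numbered `[n] (source)` layout are illustrative assumptions; chunks are assumed to be dictionaries with `source` and `content` keys.

```python
TEMPLATE = """Question: {question}
Sources:
{sources}
Answer:
1. Summary:
2. Supporting sources:"""


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Render the answer template with numbered, attributed sources."""
    sources = "\n".join(
        f"[{i}] ({c['source']}) {c['content']}"
        for i, c in enumerate(chunks, start=1)
    )
    return TEMPLATE.format(question=question, sources=sources)


prompt = build_prompt(
    "How do I allow HTTPS in UFW?",
    [{"source": "ubuntu-setup.md", "content": "sudo ufw allow https"}],
)
print(prompt)
```

Numbering the sources gives the model a compact way to refer back to them in the "Supporting sources" line.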

14.3 Error recovery

Include instructions for uncertain cases.

If the source data is incomplete or conflicting, be transparent about the uncertainty and list the relevant sources.

Part 15: Evaluation and Feedback

Measure RAG quality with both automated and human reviews.

15.1 Ground-truth datasets

Build a test set of questions with expected answers and source provenance. Use it to validate retrieval recall and generation accuracy.
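
Retrieval recall against such a test set can be computed with a few lines. This sketch assumes each test case lists the source files that should be retrieved; `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(retrieved_sources: list[str], expected_sources: list[str],
                k: int = 5) -> float:
    """Fraction of expected sources that appear among the top-k retrieved sources."""
    expected = set(expected_sources)
    if not expected:
        return 0.0
    top_k = set(retrieved_sources[:k])
    return len(top_k & expected) / len(expected)


score = recall_at_k(
    ["postgresql-guide.md", "ubuntu-setup.md", "docker-tutorial.md"],
    ["postgresql-guide.md", "networking.md"],
    k=2,
)
```

Here one of the two expected sources appears in the top 2, so `score` is 0.5. Averaging this value over all test questions gives a single retrieval-quality number to track across changes.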

15.2 Human review

Have reviewers verify that model answers are supported by the cited sources. Flag hallucinations and wrong source attributions.

15.3 Continuous feedback loops

Collect user feedback and feed it back into the system. If a query frequently results in wrong answers, improve the retrieval data, prompt, or source corpus.

Part 16: Deployment and Service Patterns

Deploy RAG as a stable service with clear boundaries.

16.1 Local inference vs remote

A self-hosted RAG system can run entirely locally or use a local retrieval stage with a remote generator. For sovereignty, keep both retrieval and generation on-premises whenever possible.

16.2 API contract

Define a contract for RAG API responses.

{
  "answer": "...",
  "sources": ["doc1","doc2"],
  "confidence": 0.82
}

Include source provenance and an optional confidence score.
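
On the Python side, the contract could be expressed as a small dataclass. This is a sketch, not a required schema; the class name `RagResponse` and the range check on `confidence` are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RagResponse:
    answer: str
    sources: list[str] = field(default_factory=list)
    confidence: Optional[float] = None  # optional, in the range [0, 1] when present

    def __post_init__(self) -> None:
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")

    def to_json(self) -> str:
        """Serialize to the JSON shape shown above."""
        return json.dumps(asdict(self))


resp = RagResponse(
    answer="Set shared_buffers to 2GB.",
    sources=["postgresql-guide.md"],
    confidence=0.82,
)
payload = resp.to_json()
```

Validating the response in one place keeps malformed payloads from reaching API clients.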

16.3 Rate limiting and quotas

Protect your local service with rate limits. RAG generation can be expensive, and unbounded usage can overwhelm CPU/GPU resources.
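
A token bucket is one common way to implement such a limit. The sketch below is a single-process, non-thread-safe illustration; the class name and the injectable `clock` parameter (used here to make the behaviour easy to test) are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


fake_time = [0.0]
bucket = TokenBucket(rate_per_sec=1.0, capacity=2, clock=lambda: fake_time[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_time[0] = 1.0  # one second later, one token has been refilled
after_wait = bucket.allow()
```

Call `allow()` before each generation request and return an HTTP 429 or a queued response when it returns False.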

Part 17: Observability and Debugging

Visibility into the retrieval and generation pipeline is essential.

17.1 Retrieval metrics

Track query volume, retrieval latency, number of chunks returned, and recall rates. Use these metrics to detect index degradation or stale data.

17.2 Generation metrics

Track generation latency, prompt lengths, and token usage. Monitor for slow queries and unexpected bursts of long answers.

17.3 Error tracing

Log errors at every stage: ingestion failures, embedding service errors, index timeouts, and generation failures. Correlate logs with request IDs.

Part 18: Security and Access Control

A RAG system can expose sensitive documents if not constrained.

18.1 Document-level access control

Protect sensitive documents with metadata-based filtering. Only retrieve them when the user is authorised.
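
A deny-by-default filter over chunk metadata could be sketched like this. The `allowed_groups` metadata field and the `authorised_chunks` helper are hypothetical; in a pgvector setup the same condition is better pushed into the SQL WHERE clause so that unauthorised chunks are never fetched at all.

```python
def authorised_chunks(hits: list[dict], user_groups: set[str]) -> list[dict]:
    """Keep only chunks whose allowed_groups overlap the user's groups.

    Chunks with missing or empty allowed_groups are dropped (deny by default).
    """
    return [
        hit for hit in hits
        if set(hit.get("allowed_groups", [])) & user_groups
    ]


hits = [
    {"source": "handbook.md", "allowed_groups": ["staff", "public"]},
    {"source": "salaries.md", "allowed_groups": ["hr"]},
    {"source": "untagged.md"},
]
visible = authorised_chunks(hits, user_groups={"staff"})
```

Denying untagged chunks means a forgotten metadata tag causes a missing result rather than a leak.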

18.2 Prompt sanitization

Sanitize user queries before using them in prompt templates. Remove or encode control characters and dangerous payloads.
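
A basic character-level cleanup might look like the sketch below. It strips control and invisible format characters, collapses whitespace, and caps the length; `sanitize_query` and `max_len` are assumed names. Note that this only handles character-level debris and does not by itself stop prompt injection written in plain language.

```python
import unicodedata


def sanitize_query(query: str, max_len: int = 1000) -> str:
    """Remove control/format characters, collapse whitespace, and cap the length."""
    cleaned = "".join(
        ch for ch in query
        # Keep newlines and tabs for now; drop every other "C*" category character
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    cleaned = " ".join(cleaned.split())  # collapse all whitespace runs to single spaces
    return cleaned[:max_len]


clean = sanitize_query("How do I\x00 allow\u200b HTTPS?\n\n")
```

In this example the NUL byte and the zero-width space are removed and the trailing newlines are collapsed, leaving a single-line query.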

18.3 Auditability

Keep an audit log of queries, retrieved sources, and generated answers. This is indispensable for compliance and incident response.

Part 19: Scaling the Vector Store

Vector stores can grow quickly. Plan for scale from the start.

19.1 Partitioning

Partition large corpora by domain, date, or source. Query the most relevant partitions first to improve latency.

19.2 Index maintenance

Rebuild or compress your index periodically to remove stale vectors and improve search quality. Monitoring index size and query latency helps determine the right cadence.

19.3 Hybrid indexes

Combine dense vectors with sparse keyword search for a hybrid retrieval strategy. This can improve recall on long-tail queries.
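
One widely used way to combine a dense ranking with a keyword ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. The sketch below assumes each ranking is a list of chunk IDs ordered best-first; the constant `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings by summing 1 / (k + rank) per item."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense = ["a", "b", "c"]    # e.g. from vector similarity search
sparse = ["b", "d", "a"]   # e.g. from keyword / full-text search
fused = reciprocal_rank_fusion([dense, sparse])
```

Items that appear near the top of both lists rise to the front: here "b" and "a" lead because each is found by both retrievers.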

Part 20: Final RAG Operations Checklist

retrieval and generation are decoupled
chunking uses semantic boundaries and metadata
prompts force grounding in sources
evaluation includes both automated and human review
security controls prevent unauthorized access to sensitive content
metrics capture retrieval, generation, and infrastructure health
index updates are incremental and repeatable
audit logs preserve provenance and query history

A production RAG deployment is not simply about retrieval. It is about making search, generation, and governance work together in a way that is reliable, explainable, and maintainable.

Part 21: Vector Store Selection and Tradeoffs

Choosing the right vector store is critical for performance and cost.

21.1 Local vs remote indexes

Local vector stores give you sovereignty and low latency. Remote vector databases can simplify scaling but introduce external dependencies.

21.2 Approximate nearest neighbour settings

Tune ef_search, M, and other HNSW parameters for your workload. Higher values increase recall at the cost of latency and memory.
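
With pgvector, the query-time HNSW parameter is set per session through `hnsw.ef_search` (its documented default is 40), while `m` and `ef_construction` are fixed when the index is built. The sketch below assumes a psycopg2 cursor and the rag_documents table from Part 2; `search_with_ef` is a hypothetical helper.

```python
def search_with_ef(cur, query_vec: list[float], k: int = 5,
                   ef_search: int = 100) -> list[str]:
    """Run a cosine-distance search with a custom hnsw.ef_search for this session."""
    # Larger ef_search explores more of the graph: better recall, slower queries
    cur.execute("SET hnsw.ef_search = %s", (int(ef_search),))
    cur.execute(
        "SELECT content FROM rag_documents "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_vec), k),
    )
    return [row[0] for row in cur.fetchall()]
```

A reasonable tuning loop is to measure recall on your ground-truth questions at a few ef_search values and pick the smallest one that meets your recall target.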

21.3 Storage compression

Compress or quantize vector data when RAM is limited, accepting a small loss in recall or a small increase in retrieval latency in exchange for lower memory usage.

Part 22: Query and Prompt Caching

Cache results to reduce repeated computation.

22.1 Query result caches

Cache retrieval results for repeated queries. Use a time-based expiration so stale data is refreshed.

22.2 Prompt template caches

Cache compiled prompt templates and injected metadata. This reduces overhead in high-throughput systems.

22.3 Answer cache invalidation

Invalidate caches when underlying documents change. Record document version or timestamp with the cache key.
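
Time-based expiry and version-based invalidation can be combined in one small cache. This sketch keys entries on the query plus a corpus version string, so bumping the version after an ingestion run makes old entries unreachable; the class name, the `corpus_version` argument, and the injectable `clock` are assumptions for illustration.

```python
import time
from typing import Any, Callable, Optional


class VersionedTTLCache:
    """In-memory cache keyed by (query, corpus_version) with time-based expiry."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[tuple[str, str], tuple[Any, float]] = {}

    def put(self, query: str, corpus_version: str, value: Any) -> None:
        self._store[(query, corpus_version)] = (value, self._clock() + self._ttl)

    def get(self, query: str, corpus_version: str) -> Optional[Any]:
        key = (query, corpus_version)
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop and report a miss
            return None
        return value


fake_time = [0.0]
cache = VersionedTTLCache(ttl_seconds=10, clock=lambda: fake_time[0])
cache.put("allow https ufw", "v1", ["chunk-1"])
hit = cache.get("allow https ufw", "v1")
miss_new_version = cache.get("allow https ufw", "v2")
fake_time[0] = 11.0
miss_expired = cache.get("allow https ufw", "v1")
```

The same lookup misses once the corpus version changes or the entry outlives its TTL, so stale answers are not served after documents are updated.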

Part 23: Handling Unstructured and Multimodal Data

RAG systems often need to work beyond plain text.

23.1 OCR and scanned documents

Run OCR on scanned files, then chunk and embed the extracted text. Store source coordinates for provenance.

23.2 Image and audio embeddings

For multimodal retrieval, use embeddings that support images or audio. Keep the modality metadata alongside the text chunks.

23.3 Composite retrieval

Combine text and image matches in the retrieval stage. Use a weighted scoring strategy to balance modalities.

Part 24: User Experience and Answer Quality

The quality of the result matters as much as the correctness.

24.1 Concise answer generation

Generate concise answers with source summaries. Lengthy, verbose responses are harder to verify and less useful in practice.

24.2 Answer confidence and transparency

Return confidence indicators and clearly cite the sources used. This helps users trust and evaluate the answer.

24.3 Handling unknowns

If the system cannot answer confidently, say so. A safe response is better than a convincing hallucination.

Part 25: Governance and Documentation

Keep your RAG system understandable and auditable.

25.1 System documentation

Document the retrieval pipeline, prompt templates, index refresh process, and access controls. Include a glossary of terms for the team.

25.2 Review cycles

Review RAG pipeline components periodically. Validate that embeddings, index quality, and prompt templates still match your use cases.

25.3 Change logs

Keep change logs for corpus updates, retrieval tuning, model changes, and prompt modifications. This is essential for debugging and accountability.

Part 26: Observability and Debugging

A RAG system is only maintainable if its behavior is observable.

26.1 Retrieval transparency

Log which chunks were retrieved for each query and their similarity scores. This helps diagnose why the model produced a particular answer.

26.2 Prompt and answer tracing

Capture the final prompt, the retrieved sources, and the generated answer for debugging. Store these traces separately from production logs to avoid exposing sensitive content unnecessarily.

26.3 Query classification

Track query types, intents, and failure modes. Use this data to identify whether your system is underperforming on a particular class of questions.

Part 27: Iterative Improvement Workflows

Improve RAG through repeatable processes.

27.1 User feedback loops

Collect explicit user feedback on answer quality. Feed low-confidence or incorrect answers back into model tuning, retrieval adjustments, or prompt revisions.

27.2 Progressive corpus expansion

Add new documents incrementally and validate retrieval quality after each update. Avoid large batch refreshes unless necessary.

27.3 Guardrail evolution

When you change prompt templates or answer policies, keep the previous version as a fallback. Record the change rationale and evaluation results.

Part 28: Security Boundaries and Sensitive Content

Protect sensitive information at every layer.

28.1 Context filtering

Pre-filter user queries and documents to avoid retrieving or exposing private data. Use metadata tags and access controls in the retrieval stage.

28.2 Response sanitisation

Sanitise generated answers to remove or redact sensitive terms when required. This is particularly important for documents with personally identifiable information.

28.3 Audit logs for source access

Log which source documents were used for each answer, without storing the full response if it contains regulated data. This provides traceability for audits.

Part 29: Performance Engineering

Tune your RAG pipeline for latency and throughput.

29.1 Retrieval cache warmup

Pre-warm the vector store and embedding cache before peak traffic. For local deployments, keep the index in memory for faster search.

29.2 Token budget optimization

Trim retrieved content to fit the generator’s input budget. Remove redundant text and prefer concise, high-value chunks.

29.3 Batch retrieval and generation

On high-throughput systems, batch multiple requests through the retrieval and generation stages. This can improve hardware utilization while keeping latency within targets.

Part 30: Final Product and Service Considerations

A production RAG service must feel solid to users.

30.1 Consistent answer format

Keep answer structure consistent across similar queries. This makes responses easier to consume and more reliable.

30.2 Source attribution UX

Present sources transparently, but not verbosely. Provide enough provenance for trust without overwhelming the user.

30.3 Service-level expectations

Define how the RAG service should behave under load, during updates, and on failure. Document and enforce those expectations in the operational runbook.

Part 31: Practical Prompt Templates

Well-structured templates can make your RAG answers more consistent and reliable.

31.1 Citations-first template

You are a knowledgeable assistant. Use only the sources listed below.
Cite the source number for each statement.

Sources:
{sources}

Question: {question}

Answer:
1. Summary:
2. Citations:

31.2 Safety-aware template

Answer the question directly. If the information is not present in the sources, say "I don't know." Do not fabricate details.

Sources:
{sources}

Question: {question}

Answer:

31.3 Evaluation template

Keep a small set of evaluation prompts with expected output structure. This makes it easier to detect prompt or retrieval regressions.
