RAG Architecture Guide: Building Knowledge-Grounded AI
Retrieval Augmented Generation (RAG) is the dominant approach for building AI systems that answer questions from enterprise knowledge bases — company documents, product manuals, policies, research reports, and regulatory guidelines. It solves the two most critical problems with bare LLMs in enterprise contexts: hallucination and knowledge cutoff.
Why RAG?
A raw LLM has three problems for enterprise knowledge applications:
Knowledge cutoff: The model was trained on data up to a specific date. It doesn't know about your latest product releases, policy updates, or this quarter's regulatory guidance.
Hallucination: LLMs confidently generate plausible-sounding answers that are factually wrong. In sensitive domains (legal, medical, financial), wrong confident answers are worse than no answers.
No proprietary knowledge: The model has no access to your internal documents, procedures, or data.
RAG addresses all three by grounding each response in documents retrieved from a knowledge base — retrieved at query time, not baked into training.
How RAG Works
The Two-Phase Process
Phase 1: Indexing (done once or on a schedule)
- Collect documents: Gather all documents you want the system to know about
- Chunk documents: Split into smaller passages (typically 200–1000 tokens each)
- Embed: Generate vector embeddings for each chunk using an embedding model
- Store: Save embeddings and original text in a vector database
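To make the phase concrete, here is a minimal indexing sketch in Python, assuming the OpenAI embeddings API. The in-memory list and paragraph splitting are deliberate simplifications, not a production setup:

```python
# Minimal indexing sketch: chunk -> embed -> store.
# Assumes OPENAI_API_KEY is set. An in-memory list stands in for the
# vector database, and paragraph splitting stands in for real chunking
# (see Chunking Strategies below for better options).
from openai import OpenAI

client = OpenAI()
index: list[dict] = []

for doc in ["... document one ...", "... document two ..."]:
    for passage in doc.split("\n\n"):
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=passage
        ).data[0].embedding
        index.append({"text": passage, "embedding": emb})
```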
Phase 2: Query
- Receive user query
- Embed query: Generate a vector embedding for the question
- Retrieve: Search the vector database for the most semantically similar chunks
- Augment: Combine retrieved chunks with the user query into an LLM prompt
- Generate: LLM generates a response grounded in the retrieved context
- Return: Response (with optional citations) returned to user
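And the query phase, continuing that sketch (`client` and `index` carry over from the indexing code; the chat model name is illustrative):

```python
# Query sketch: embed the question, rank chunks by cosine similarity,
# and generate an answer grounded in the retrieved context.
import numpy as np

def retrieve(query: str, k: int = 4) -> list[str]:
    q = np.array(client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding)

    def score(entry: dict) -> float:
        v = np.array(entry["embedding"])
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    return [e["text"] for e in sorted(index, key=score, reverse=True)[:k]]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you deploy
        messages=[
            {"role": "system", "content": "Answer ONLY from the provided "
             "context. If the context is insufficient, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```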
Architecture Patterns
Naive RAG (Starting Point)
Single-stage retrieval feeding directly into generation.
Query → Embed → Vector Search → Retrieved Chunks + Query → LLM → Response
- Pros: Simple to implement; low latency; easy to debug
- Cons: Retrieval quality directly limits response quality; no query understanding
- Best for: Internal knowledge bases with well-structured documents; the starting point for all RAG builds
Advanced RAG (Production Standard)
Adds query transformation, re-ranking, and post-retrieval processing.
Query → [Query Rewriter] → Embed → Vector Search → [Re-ranker] →
Top-K Chunks → [Context Compressor] → LLM → [Citation Adder] → Response
Key additions:
- Query rewriting: Transform the user query into better search queries (often the query is too colloquial for good retrieval)
- Hybrid search: Combine semantic search (vectors) with keyword search (BM25) for better recall
- Re-ranking: Use a cross-encoder model to re-rank retrieved chunks by relevance
- Context compression: Remove irrelevant portions of retrieved chunks to fit more in context
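A sketch of two of the additions above, assuming the rank_bm25 and sentence-transformers packages; `vector_hits` stands in for the output of your vector search:

```python
# Hybrid retrieval sketch: fuse BM25 and vector rankings with reciprocal
# rank fusion, then re-rank the fused candidates with a cross-encoder.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = ["chunk one ...", "chunk two ...", "chunk three ..."]
bm25 = BM25Okapi([c.split() for c in corpus])

def hybrid_search(query: str, vector_hits: list[str], k: int = 10) -> list[str]:
    """Reciprocal rank fusion: each chunk scores 1/(60 + rank) per ranking."""
    bm25_hits = bm25.get_top_n(query.split(), corpus, n=k)
    fused: dict[str, float] = {}
    for ranking in (vector_hits, bm25_hits):
        for rank, chunk in enumerate(ranking):
            fused[chunk] = fused.get(chunk, 0.0) + 1.0 / (60 + rank)
    return sorted(fused, key=fused.get, reverse=True)[:k]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], k: int = 4) -> list[str]:
    scores = reranker.predict([(query, c) for c in chunks])
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:k]]
```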
Agentic RAG (Complex Knowledge Tasks)
The retrieval step becomes iterative — the agent decides whether to retrieve more and what to search for next based on partial answers.
Query → [Agent] → Search → Review → Need more? → Search again →
Synthesize → Response
Best for: Research tasks requiring synthesis across many documents; situations where the answer requires combining information from multiple sources.
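A schematic of the loop, where `search` and `llm` are hypothetical stand-ins for your retriever and model client:

```python
# Agentic RAG sketch: the model decides whether it has enough evidence
# or needs another, more specific search. `search` and `llm` are
# hypothetical stand-ins for your retriever and LLM client.
def agentic_answer(question: str, max_rounds: int = 4) -> str:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence.extend(search(query))
        decision = llm(
            f"Question: {question}\nEvidence so far:\n{evidence}\n"
            "Reply DONE if the evidence is sufficient; otherwise reply "
            "with the next search query."
        )
        if decision.strip() == "DONE":
            break
        query = decision  # refine the search based on partial findings
    return llm(f"Synthesize an answer to '{question}' from:\n{evidence}")
```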
Vector Databases Compared
| Database | Deployment | Scale (vectors) | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Millions | Fastest to production; simple management |
| Weaviate | Managed or self-hosted | Millions | GraphQL API; good for complex filtering |
| Qdrant | Self-hosted | Large | Performance; European data residency |
| Chroma | Self-hosted | Tens of thousands | Development/testing; easy local setup |
| pgvector | PostgreSQL extension | Moderate | Teams already on PostgreSQL |
| Redis Vector | Self-hosted | Moderate | Existing Redis users; low latency |
For most enterprises: Start with Pinecone (fastest to production) or pgvector (if you're on PostgreSQL). Move to Qdrant or Weaviate if you need on-premises hosting or advanced filtering.
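For the pgvector route, a minimal similarity-search sketch, assuming psycopg2 and an illustrative `chunks` table:

```python
# pgvector similarity search sketch (psycopg2). Table and column names
# are illustrative; setup requires:
#   CREATE EXTENSION vector;
#   CREATE TABLE chunks (id serial PRIMARY KEY, text text, embedding vector(1536));
import psycopg2

conn = psycopg2.connect("dbname=rag user=rag")

def top_k(query_embedding: list[float], k: int = 4) -> list[str]:
    vec = "[" + ",".join(map(str, query_embedding)) + "]"
    cur = conn.cursor()
    cur.execute(
        # <=> is pgvector's cosine distance operator
        "SELECT text FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, k),
    )
    return [row[0] for row in cur.fetchall()]
```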
Embedding Models
The embedding model determines how well semantic similarity is captured. Comparison of common options:
| Model | Dimensions | Context | Speed | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | Moderate | Excellent |
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | Fast | Very good |
| Cohere embed-english-v3 | 1024 | 512 tokens | Fast | Very good |
| BGE-large-en-v1.5 (open source) | 1024 | 512 tokens | Fast (self-hosted) | Very good |
| Jina Embeddings v2 | 768 | 8192 tokens | Fast | Good |
Recommendation: For most enterprise RAG, text-embedding-3-small provides excellent quality with reasonable cost. For multi-lingual applications, use a multi-lingual model like Cohere's embed-multilingual-v3.
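Whichever model you pick, it is worth sanity-checking it on your own domain text. A quick cosine-similarity probe, assuming the OpenAI client:

```python
# Sanity-check an embedding model on domain text: semantically related
# pairs should score clearly higher than unrelated ones.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embed("How do I reset my password?"),
             embed("Password recovery steps for locked accounts")))  # high
print(cosine(embed("How do I reset my password?"),
             embed("Quarterly revenue grew 12%")))                   # low
```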
Chunking Strategies
How you split documents dramatically impacts retrieval quality.
Fixed-Size Chunking
Split documents into fixed-length chunks (e.g., 512 tokens, 100-token overlap).
- Pros: Simple; consistent; predictable
- Cons: Splits sentences and paragraphs mid-thought; poor for structured documents
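A minimal sketch, using whitespace tokens as a rough proxy for model tokens:

```python
# Fixed-size chunking with overlap. Whitespace "tokens" approximate model
# tokens here; a production chunker would count real tokens (e.g. tiktoken).
def chunk_fixed(text: str, size: int = 512, overlap: int = 100) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```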
Semantic Chunking
Split at natural semantic boundaries (paragraph breaks, section headers).
- Pros: Preserves semantic coherence; better retrieval for most text
- Cons: Variable chunk sizes; more complex to implement
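One simple flavor: split on paragraph boundaries, then merge consecutive paragraphs up to a word budget:

```python
# Paragraph-boundary chunking: never splits mid-paragraph, so chunk
# sizes vary but semantic units stay intact.
def chunk_semantic(text: str, max_words: int = 512) -> list[str]:
    chunks, current = [], []
    for para in text.split("\n\n"):
        if current and len(" ".join(current + [para]).split()) > max_words:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```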
Recursive Splitting (LangChain default)
Tries progressively smaller split units (paragraph → sentence → token) until each chunk fits within the target size.
- Pros: Good balance; works well for most text types
- Best for: General-purpose RAG systems
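A minimal usage sketch, assuming the langchain-text-splitters package:

```python
# Recursive splitting with LangChain: tries "\n\n", then "\n", then " ",
# then characters, until each chunk fits the target size. Sizes are in
# characters by default; use .from_tiktoken_encoder() for token-based limits.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # document_text: your raw string
```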
Structure-Aware Splitting
For structured documents (tables, PDFs with headers, code), use structure-aware splitting:
- Tables stay together as complete tables
- Code blocks are not split mid-function
- Headers create natural section boundaries
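A sketch using LangChain's Markdown splitter (the header labels are illustrative):

```python
# Structure-aware splitting for Markdown: headers become section
# boundaries and travel with each chunk as metadata.
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)
sections = splitter.split_text(markdown_document)  # list of chunks + metadata
```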
Parent-Child Chunking (Recommended for Production)
Match the query against small child chunks (better precision), but pass each match's larger parent chunk to the LLM (better context).
- Small child chunks (128 tokens) → better semantic matching
- Parent chunk (512 tokens) → better context for generation
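A sketch of the pattern, reusing `chunk_fixed`, `embed`, and `cosine` from the earlier sketches; `document` is your raw text:

```python
# Parent-child chunking: index small chunks for precise matching,
# but return the enclosing parent chunk for generation context.
parents = chunk_fixed(document, size=512, overlap=0)

index = []
for parent_id, parent in enumerate(parents):
    for child in chunk_fixed(parent, size=128, overlap=32):
        index.append({
            "embedding": embed(child),   # match on the child...
            "parent_id": parent_id,      # ...but generate from the parent
        })

def retrieve_parents(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e["embedding"]), reverse=True)
    seen, out = set(), []
    for e in ranked:
        if e["parent_id"] not in seen:
            seen.add(e["parent_id"])
            out.append(parents[e["parent_id"]])
        if len(out) == k:
            break
    return out
```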
Evaluating RAG Systems
You need objective measurement to improve RAG quality. Key metrics:
Retrieval metrics:
- Hit rate: Is the correct document in the top-K retrieved results?
- MRR (Mean Reciprocal Rank): How highly is the correct document ranked?
- NDCG: Normalized ranking quality metric
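Hit rate and MRR are straightforward to compute from a labeled set of (query, correct document) pairs; a minimal sketch:

```python
# Hit rate and MRR over a labeled evaluation set. results[i] is the
# ranked list of doc IDs retrieved for query i; gold[i] is the correct ID.
def hit_rate(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    return sum(g in r[:k] for r, g in zip(results, gold)) / len(gold)

def mrr(results: list[list[str]], gold: list[str]) -> float:
    total = 0.0
    for r, g in zip(results, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)  # reciprocal of 1-based rank
    return total / len(gold)
```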
Generation metrics:
- Faithfulness: Does the answer contain only information present in the retrieved chunks?
- Answer relevancy: Does the answer address the user's actual question?
- Context relevancy: Are the retrieved chunks actually relevant to the question?
Frameworks: RAGAs (Retrieval Augmented Generation Assessment), TruLens, and ARES provide automated evaluation pipelines for all of these metrics.
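A hedged sketch of a RAGAs run, assuming the 0.1-series API (the interface has changed between releases, so check the current docs):

```python
# RAGAs evaluation sketch (0.1-series API; uses an OpenAI key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_set = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer":   ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
})

scores = evaluate(eval_set, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores between 0 and 1
```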
Common RAG Failure Modes
Retrieval miss: The right document is not retrieved. Causes: poor chunking (key information split across chunks), embedding model mismatch with domain, missing synonym handling.
Context overflow: Too much context stuffed into the prompt dilutes the signal. Fix: re-ranking and context compression.
Hallucination despite retrieval: LLM ignores retrieved context and generates from training knowledge. Fix: stronger system prompt instructions; evaluate faithfulness and iterate on the prompt.
Stale index: Documents updated but index not refreshed. Fix: event-driven index updates on document change.
Poor query routing: Every query hits the same pipeline. Simple knowledge lookups belong in RAG, but conversational queries need a different handler. Fix: query classification before routing.
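A minimal routing sketch; `llm` and `small_talk` are hypothetical stand-ins, and `answer` is the grounded-generation function from the query sketch above:

```python
# Query routing sketch: classify before retrieval so conversational
# queries skip the RAG pipeline entirely. `llm` and `small_talk` are
# hypothetical stand-ins; the labels are illustrative.
def route(query: str) -> str:
    label = llm(
        "Classify this query as KNOWLEDGE (needs document lookup) or "
        f"CHAT (conversational). Reply with one word.\n\nQuery: {query}"
    ).strip().upper()
    return answer(query) if label == "KNOWLEDGE" else small_talk(query)
```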
Production Recommendations
- Monitor faithfulness continuously — unfaithful responses erode trust fast
- Keep chunks and index fresh — set up automated re-indexing on document updates
- Log queries and retrieved chunks — this is your improvement dataset
- A/B test retrieval strategies — small improvements in retrieval compound over millions of queries
- Human review for high-stakes domains — RAG reduces hallucination; it doesn't eliminate it