RAG Architecture Guide: Building Knowledge-Grounded AI
Retrieval Augmented Generation (RAG) is the dominant approach for building AI systems that answer questions from enterprise knowledge bases — company documents, product manuals, policies, research reports, and regulatory guidelines. It solves the two most critical problems with bare LLMs in enterprise contexts: hallucination and knowledge cutoff.
Why RAG?
A raw LLM has three problems for enterprise knowledge applications:
Knowledge cutoff: The model was trained on data up to a specific date. It doesn't know about your latest product releases, policy updates, or this quarter's regulatory guidance.
Hallucination: LLMs confidently generate plausible-sounding answers that are factually wrong. In sensitive domains (legal, medical, financial), wrong confident answers are worse than no answers.
No proprietary knowledge: The model has no access to your internal documents, procedures, or data.
RAG addresses all three by grounding each response in documents retrieved from a knowledge base — retrieved at query time, not baked into training.
How RAG Works
The Two-Phase Process
Phase 1: Indexing (done once or on a schedule)
- Collect documents: Gather all documents you want the system to know about
- Chunk documents: Split into smaller passages (typically 200–1000 tokens each)
- Embed: Generate vector embeddings for each chunk using an embedding model
- Store: Save embeddings and original text in a vector database
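To make the phase concrete, here is a minimal indexing sketch in Python, assuming the OpenAI embeddings API. The in-memory list and paragraph splitting are deliberate simplifications, not a production setup:

```python
# Minimal indexing sketch: chunk -> embed -> store.
# Assumes OPENAI_API_KEY is set. An in-memory list stands in for the
# vector database, and paragraph splitting stands in for real chunking
# (see Chunking Strategies below for better options).
from openai import OpenAI

client = OpenAI()
index: list[dict] = []

for doc in ["... document one ...", "... document two ..."]:
    for passage in doc.split("\n\n"):
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=passage
        ).data[0].embedding
        index.append({"text": passage, "embedding": emb})
```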
Phase 2: Query
- Receive user query
- Embed query: Generate a vector embedding for the question
- Retrieve: Search the vector database for the most semantically similar chunks
- Augment: Combine retrieved chunks with the user query into an LLM prompt
- Generate: LLM generates a response grounded in the retrieved context
- Return: Response (with optional citations) returned to user
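And the query phase, continuing that sketch (`client` and `index` carry over from the indexing code; the chat model name is illustrative):

```python
# Query sketch: embed the question, rank chunks by cosine similarity,
# and generate an answer grounded in the retrieved context.
import numpy as np

def retrieve(query: str, k: int = 4) -> list[str]:
    q = np.array(client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding)

    def score(entry: dict) -> float:
        v = np.array(entry["embedding"])
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))

    return [e["text"] for e in sorted(index, key=score, reverse=True)[:k]]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you deploy
        messages=[
            {"role": "system", "content": "Answer ONLY from the provided "
             "context. If the context is insufficient, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```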
Architecture Patterns
Naive RAG (Starting Point)
Single-stage retrieval feeding directly into generation.
Query → Embed → Vector Search → Retrieved Chunks + Query → LLM → Response
- Pros: Simple to implement; low latency; easy to debug
- Cons: Retrieval quality directly limits response quality; no query understanding
- Best for: Internal knowledge bases with well-structured documents; the starting point for all RAG builds
Advanced RAG (Production Standard)
Adds query transformation, re-ranking, and post-retrieval processing.
Query → [Query Rewriter] → Embed → Vector Search → [Re-ranker] →
Top-K Chunks → [Context Compressor] → LLM → [Citation Adder] → Response
Key additions:
- Query rewriting: Transform the user query into better search queries (often the query is too colloquial for good retrieval)
- Hybrid search: Combine semantic search (vectors) with keyword search (BM25) for better recall
- Re-ranking: Use a cross-encoder model to re-rank retrieved chunks by relevance
- Context compression: Remove irrelevant portions of retrieved chunks to fit more in context
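A sketch of two of the additions above, assuming the rank_bm25 and sentence-transformers packages; `vector_hits` stands in for the output of your vector search:

```python
# Hybrid retrieval sketch: fuse BM25 and vector rankings with reciprocal
# rank fusion, then re-rank the fused candidates with a cross-encoder.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

corpus = ["chunk one ...", "chunk two ...", "chunk three ..."]
bm25 = BM25Okapi([c.split() for c in corpus])

def hybrid_search(query: str, vector_hits: list[str], k: int = 10) -> list[str]:
    """Reciprocal rank fusion: each chunk scores 1/(60 + rank) per ranking."""
    bm25_hits = bm25.get_top_n(query.split(), corpus, n=k)
    fused: dict[str, float] = {}
    for ranking in (vector_hits, bm25_hits):
        for rank, chunk in enumerate(ranking):
            fused[chunk] = fused.get(chunk, 0.0) + 1.0 / (60 + rank)
    return sorted(fused, key=fused.get, reverse=True)[:k]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], k: int = 4) -> list[str]:
    scores = reranker.predict([(query, c) for c in chunks])
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:k]]
```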
Agentic RAG (Complex Knowledge Tasks)
The retrieval step becomes iterative — the agent decides whether to retrieve more and what to search for next based on partial answers.
Query → [Agent] → Search → Review → Need more? → Search again →
Synthesize → Response
Best for: Research tasks requiring synthesis across many documents; situations where the answer requires combining information from multiple sources.
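A schematic of the loop, where `search` and `llm` are hypothetical stand-ins for your retriever and model client:

```python
# Agentic RAG sketch: the model decides whether it has enough evidence
# or needs another, more specific search. `search` and `llm` are
# hypothetical stand-ins for your retriever and LLM client.
def agentic_answer(question: str, max_rounds: int = 4) -> str:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence.extend(search(query))
        decision = llm(
            f"Question: {question}\nEvidence so far:\n{evidence}\n"
            "Reply DONE if the evidence is sufficient; otherwise reply "
            "with the next search query."
        )
        if decision.strip() == "DONE":
            break
        query = decision  # refine the search based on partial findings
    return llm(f"Synthesize an answer to '{question}' from:\n{evidence}")
```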
Vector Databases Compared
| Database | Deployment | Scale (vectors) | Best For |
|---|---|---|---|
| Pinecone | Managed cloud | Millions | Fastest to production; simple management |
| Weaviate | Managed or self-hosted | Millions | GraphQL API; good for complex filtering |
| Qdrant | Self-hosted | Large | Performance; European data residency |
| Chroma | Self-hosted | Tens of thousands | Development/testing; easy local setup |
| pgvector | PostgreSQL extension | Moderate | Teams already on PostgreSQL |
| Redis Vector | Self-hosted | Moderate | Existing Redis users; low latency |
For most enterprises: Start with Pinecone (fastest to production) or pgvector (if you're on PostgreSQL). Move to Qdrant or Weaviate if you need on-premises hosting or advanced filtering.
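For the pgvector route, a minimal similarity-search sketch, assuming psycopg2 and an illustrative `chunks` table:

```python
# pgvector similarity search sketch (psycopg2). Table and column names
# are illustrative; setup requires:
#   CREATE EXTENSION vector;
#   CREATE TABLE chunks (id serial PRIMARY KEY, text text, embedding vector(1536));
import psycopg2

conn = psycopg2.connect("dbname=rag user=rag")

def top_k(query_embedding: list[float], k: int = 4) -> list[str]:
    vec = "[" + ",".join(map(str, query_embedding)) + "]"
    cur = conn.cursor()
    cur.execute(
        # <=> is pgvector's cosine distance operator
        "SELECT text FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, k),
    )
    return [row[0] for row in cur.fetchall()]
```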
Embedding Models
The embedding model determines how well semantic similarity is captured. Comparison of common options:
| Model | Dimensions | Context | Speed | Quality |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 tokens | Moderate | Excellent |
| OpenAI text-embedding-3-small | 1536 | 8191 tokens | Fast | Very good |
| Cohere embed-english-v3 | 1024 | 512 tokens | Fast | Very good |
| BGE-large-en-v1.5 (open source) | 1024 | 512 tokens | Fast (self-hosted) | Very good |
| Jina Embeddings v2 | 768 | 8192 tokens | Fast | Good |
Recommendation: For most enterprise RAG, text-embedding-3-small provides excellent quality with reasonable cost. For multi-lingual applications, use a multi-lingual model like Cohere's embed-multilingual-v3.
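Whichever model you pick, it is worth sanity-checking it on your own domain text. A quick cosine-similarity probe, assuming the OpenAI client:

```python
# Sanity-check an embedding model on domain text: semantically related
# pairs should score clearly higher than unrelated ones.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embed("How do I reset my password?"),
             embed("Password recovery steps for locked accounts")))  # high
print(cosine(embed("How do I reset my password?"),
             embed("Quarterly revenue grew 12%")))                   # low
```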
Chunking Strategies
How you split documents dramatically impacts retrieval quality.
Fixed-Size Chunking
Split documents into fixed-length chunks (e.g., 512 tokens, 100-token overlap).
- Pros: Simple; consistent; predictable
- Cons: Splits sentences and paragraphs mid-thought; poor for structured documents
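A minimal sketch, using whitespace tokens as a rough proxy for model tokens:

```python
# Fixed-size chunking with overlap. Whitespace "tokens" approximate model
# tokens here; a production chunker would count real tokens (e.g. tiktoken).
def chunk_fixed(text: str, size: int = 512, overlap: int = 100) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```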
Semantic Chunking
Split at natural semantic boundaries (paragraph breaks, section headers).
- Pros: Preserves semantic coherence; better retrieval for most text
- Cons: Variable chunk sizes; more complex to implement
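One simple flavor: split on paragraph boundaries, then merge consecutive paragraphs up to a word budget:

```python
# Paragraph-boundary chunking: never splits mid-paragraph, so chunk
# sizes vary but semantic units stay intact.
def chunk_semantic(text: str, max_words: int = 512) -> list[str]:
    chunks, current = [], []
    for para in text.split("\n\n"):
        if current and len(" ".join(current + [para]).split()) > max_words:
            chunks.append("\n\n".join(current))
            current = []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```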
Recursive Splitting (LangChain default)
Tries progressively smaller split units (paragraph → sentence → token) until each chunk fits within the target size.
- Pros: Good balance; works well for most text types
- Best for: General-purpose RAG systems
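A minimal usage sketch, assuming the langchain-text-splitters package:

```python
# Recursive splitting with LangChain: tries "\n\n", then "\n", then " ",
# then characters, until each chunk fits the target size. Sizes are in
# characters by default; use .from_tiktoken_encoder() for token-based limits.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # document_text: your raw string
```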
Structure-Aware Splitting
For structured documents (tables, PDFs with headers, code), use structure-aware splitting:
- Tables stay together as complete tables
- Code blocks are not split mid-function
- Headers create natural section boundaries
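A sketch using LangChain's Markdown splitter (the header labels are illustrative):

```python
# Structure-aware splitting for Markdown: headers become section
# boundaries and travel with each chunk as metadata.
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)
sections = splitter.split_text(markdown_document)  # list of chunks + metadata
```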
Parent-Child Chunking (Recommended for Production)
Match the query against small child chunks (better precision), but pass each match's larger parent chunk to the LLM (better context).
- Small child chunks (128 tokens) → better semantic matching
- Parent chunk (512 tokens) → better context for generation
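A sketch of the pattern, reusing `chunk_fixed`, `embed`, and `cosine` from the earlier sketches; `document` is your raw text:

```python
# Parent-child chunking: index small chunks for precise matching,
# but return the enclosing parent chunk for generation context.
parents = chunk_fixed(document, size=512, overlap=0)

index = []
for parent_id, parent in enumerate(parents):
    for child in chunk_fixed(parent, size=128, overlap=32):
        index.append({
            "embedding": embed(child),   # match on the child...
            "parent_id": parent_id,      # ...but generate from the parent
        })

def retrieve_parents(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda e: cosine(q, e["embedding"]), reverse=True)
    seen, out = set(), []
    for e in ranked:
        if e["parent_id"] not in seen:
            seen.add(e["parent_id"])
            out.append(parents[e["parent_id"]])
        if len(out) == k:
            break
    return out
```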
Evaluating RAG Systems
You need objective measurement to improve RAG quality. Key metrics:
Retrieval metrics:
- Hit rate: Is the correct document in the top-K retrieved results?
- MRR (Mean Reciprocal Rank): How highly is the correct document ranked?
- NDCG: Normalized ranking quality metric
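Hit rate and MRR are straightforward to compute from a labeled set of (query, correct document) pairs; a minimal sketch:

```python
# Hit rate and MRR over a labeled evaluation set. results[i] is the
# ranked list of doc IDs retrieved for query i; gold[i] is the correct ID.
def hit_rate(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    return sum(g in r[:k] for r, g in zip(results, gold)) / len(gold)

def mrr(results: list[list[str]], gold: list[str]) -> float:
    total = 0.0
    for r, g in zip(results, gold):
        if g in r:
            total += 1.0 / (r.index(g) + 1)  # reciprocal of 1-based rank
    return total / len(gold)
```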
Generation metrics:
- Faithfulness: Does the answer contain only information present in the retrieved chunks?
- Answer relevancy: Does the answer address the user's actual question?
- Context relevancy: Are the retrieved chunks actually relevant to the question?
Frameworks: RAGAs (Retrieval Augmented Generation Assessment), TruLens, and ARES provide automated evaluation pipelines for all of these metrics.
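A hedged sketch of a RAGAs run, assuming the 0.1-series API (the interface has changed between releases, so check the current docs):

```python
# RAGAs evaluation sketch (0.1-series API; uses an OpenAI key by default).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_set = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer":   ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
})

scores = evaluate(eval_set, metrics=[faithfulness, answer_relevancy])
print(scores)  # per-metric scores between 0 and 1
```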
Common RAG Failure Modes
Retrieval miss: The right document is not retrieved. Causes: poor chunking (key information split across chunks), embedding model mismatch with domain, missing synonym handling.
Context overflow: Too much context stuffed into the prompt dilutes the signal. Fix: re-ranking and context compression.
Hallucination despite retrieval: LLM ignores retrieved context and generates from training knowledge. Fix: stronger system prompt instructions; evaluate faithfulness and iterate on the prompt.
Stale index: Documents updated but index not refreshed. Fix: event-driven index updates on document change.
Poor query routing: Every query hits the same pipeline. Simple knowledge lookups belong in RAG, but conversational queries need a different handler. Fix: query classification before routing.
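A minimal routing sketch; `llm` and `small_talk` are hypothetical stand-ins, and `answer` is the grounded-generation function from the query sketch above:

```python
# Query routing sketch: classify before retrieval so conversational
# queries skip the RAG pipeline entirely. `llm` and `small_talk` are
# hypothetical stand-ins; the labels are illustrative.
def route(query: str) -> str:
    label = llm(
        "Classify this query as KNOWLEDGE (needs document lookup) or "
        f"CHAT (conversational). Reply with one word.\n\nQuery: {query}"
    ).strip().upper()
    return answer(query) if label == "KNOWLEDGE" else small_talk(query)
```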
Production Recommendations
- Monitor faithfulness continuously — unfaithful responses erode trust fast
- Keep chunks and index fresh — set up automated re-indexing on document updates
- Log queries and retrieved chunks — this is your improvement dataset
- A/B test retrieval strategies — small improvements in retrieval compound over millions of queries
- Human review for high-stakes domains — RAG reduces hallucination; it doesn't eliminate it