How RAG Actually Works

Retrieval-Augmented Generation is how you make AI answer questions from your data — with citations, without hallucination. Here's the full pipeline, step by step.

The Problem RAG Solves

LLMs like ChatGPT are trained on public internet data. They don't know about your SOPs, contracts, internal docs, or proprietary data. When you ask about your stuff, they either hallucinate confidently or say "I don't have that information."

❌ Without RAG

  • AI makes up answers from training data
  • No source attribution — you can't verify
  • Can't access private or recent documents
  • Confidently wrong (hallucination)

✓ With RAG

  • AI answers only from your documents
  • Every answer cites its source
  • Works with any document you ingest
  • Says "I don't know" when it doesn't

5 Stages from Document to Answer

Stage 1: Ingest

Documents → Chunks

Raw documents (PDFs, markdown, Word docs, web pages) are split into overlapping chunks of ~500-1000 characters. Chunks are split on paragraph boundaries so you don't cut mid-sentence. Overlap ensures context isn't lost at boundaries.

Before: Raw Document
SOP-002: CRAC Unit Failure Response

Defines the immediate response procedure when a Computer Room Air Conditioning unit fails or alarms. Severity classification: SEV-1 (multiple CRAC units failed, ambient >85°F), SEV-2 (single unit failed, redundancy covers load), SEV-3 (unit alarming but operational). Step 1: Assess severity. Step 2: Notify shift supervisor. Step 3: Increase adjacent CRAC fan speed (max 80%)...
↓ split into overlapping chunks ↓
After: Indexed Chunks
#1 SOP-002: CRAC Unit Failure Response. Defines the immediate response procedure when a Computer Room Air...
#2 ...SEV-1 (multiple CRAC units failed, ambient >85°F), SEV-2 (single unit failed, redundancy covers load)...
#3 ...Step 1: Assess severity. Step 2: Notify shift supervisor. Step 3: Increase adjacent CRAC fan speed...

Why chunking matters: LLMs have limited context windows. You can't feed them 10,000 pages. Chunking lets you find and pass only the 3-5 most relevant pieces.
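The chunking step can be sketched in a few lines of Python. This is a minimal sketch matching the description above — the function name, the 1000-character limit, and the overlap size are illustrative assumptions, not the exact production code:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks on paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Close the current chunk when the next paragraph would overflow it
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk forward so context
            # isn't lost at the boundary
            current = current[-overlap:] + "\n\n" + para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than `max_chars` stays whole in this sketch; production pipelines usually fall back to sentence-level splitting for that case.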

Stage 2: Embed

Text → Vector Embeddings

Each chunk is converted into a high-dimensional vector (a list of numbers) using an embedding model. These vectors capture semantic meaning — "CRAC failure" and "cooling unit broke" produce similar vectors even though they share zero words.

Embedding: Text → 384-dimension vectors
"CRAC failure"       → [0.23, -0.41, 0.67, ...]
"cooling unit broke" → [0.21, -0.38, 0.64, ...]
  ↑ 92% similar — same concept, different words

"load bank test"     → [-0.15, 0.72, -0.33, ...]
  ↑ 18% similar — different concept entirely

Key insight: This is what makes RAG semantic, not keyword-based. Traditional search requires exact word matches. Embedding-based search understands meaning.

Stage 3: Store

Vectors → Database

Vectors are stored in a vector database alongside their original text and metadata. We use pgvector — a PostgreSQL extension — because it keeps vectors in the same database as your business data. No extra infrastructure.

Vector Database (pgvector)
id | document        | chunk | embedding
1  | sop-002-crac.md | #1    | [0.23, -0.41, 0.67, ... 384 dims]
2  | sop-002-crac.md | #2    | [0.19, -0.55, 0.42, ... 384 dims]
3  | incident-dh3.md | #1    | [-0.31, 0.28, 0.51, ... 384 dims]

Why pgvector over Pinecone/Weaviate? For most businesses, your documents already live near a PostgreSQL database. pgvector keeps everything in one place — simpler ops, lower cost, no vendor lock-in.
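A sketch of the underlying schema — the table and column names mirror the illustration above but are assumptions, and the HNSW index line requires pgvector 0.5 or newer:

```sql
-- Enable the extension once per database
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        serial PRIMARY KEY,
    document  text NOT NULL,
    chunk     text NOT NULL,
    embedding vector(384)  -- must match the embedding model's output size
);

-- Approximate nearest-neighbor index for fast cosine search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```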

Stage 4: Retrieve

Question → Nearest Vectors

When a user asks a question, their query is also embedded into a vector. The database performs a cosine similarity search — finding the stored chunks whose vectors are closest to the query vector. The top 3-5 chunks become the context.

Vector Similarity Search
Q: "What's the CRAC failure response procedure?"
67% — sop-002-crac.md, chunk #1 (purpose & severity)
46% — sop-002-crac.md, chunk #4 (notifications)
41% — incident-dh3.md, chunk #2 (CRAC mentioned)
15% — reference-power.md, chunk #3 (unrelated)

This is the "Retrieval" in RAG. The AI never sees your full document library — only the most relevant chunks for this specific question. This makes answers focused and accurate.
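With pgvector this retrieval is a single query (`SELECT chunk FROM chunks ORDER BY embedding <=> %(q)s LIMIT 5`, where `<=>` is the cosine-distance operator). The same top-k logic in plain Python, as a sketch with illustrative helper names:

```python
import math

def _cos(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 5):
    """rows: (chunk_text, embedding) pairs; return the k most similar chunks."""
    scored = sorted(((_cos(query_vec, emb), text) for text, emb in rows), reverse=True)
    return [(round(score, 2), text) for score, text in scored[:k]]
```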

Stage 5: Generate

Context + Question → Grounded Answer

The retrieved chunks are injected into a prompt alongside the user's question: "Answer using ONLY these sources. Cite with [Source N]. If the information isn't in the sources, say so." The LLM generates an answer grounded in your actual documents.

Generation with Source Attribution

📄 Context (retrieved chunks)

[Source 1] SOP-002: CRAC Unit Failure Response. Defines the immediate response procedure... SEV-1: Multiple CRAC units failed, ambient >85°F — Immediate response...

[Source 2] ...Step 2: Page on-call HVAC technician. Notify Shift Supervisor. If SEV-1: Activate thermal emergency protocol...

✓ Generated Answer

The CRAC failure response procedure follows SOP-002:

1. Classify severity — SEV-1 is multiple units failed with ambient >85°F [Source 1]

2. Notify — Page on-call HVAC tech, notify Shift Supervisor [Source 2]

3. Mitigate — Increase adjacent CRAC fan speed (max 80%) [Source 2]

No hallucination. Every claim traces back to a source document. If the answer isn't in the retrieved chunks, the AI says "I don't have that information" instead of making something up.
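The prompt assembly in this stage can be sketched as follows — the instruction wording comes from the description above, while the function name is illustrative:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into a grounded prompt with numbered sources."""
    sources = "\n\n".join(
        f"[Source {i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY these sources. Cite with [Source N]. "
        "If the information isn't in the sources, say so.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

The returned string goes to the LLM as-is; because every chunk carries a [Source N] tag, the model can cite exactly what it used.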

  • ~$0.002 cost per query
  • <2s response time
  • 384 vector dimensions
  • Scales to 100K+ documents

Production Stack

What we use to build RAG systems for clients

  • 🐍 Python + FastAPI — application layer
  • 🐘 PostgreSQL + pgvector — vector storage
  • 🤖 Claude / GPT-4 — answer generation
  • 📦 Docker — deployment
  • 🧠 sentence-transformers — local embeddings
  • 🔒 Row-level security — access controls

See It In Action

We built a live RAG demo with 9 real documents — data center SOPs, incident reports, real estate analysis. Ask it anything in plain English.

Frequently Asked Questions

How is RAG different from fine-tuning?

Fine-tuning modifies the AI model itself with your data — expensive, requires ML expertise, and the model can still hallucinate. RAG keeps the model unchanged and retrieves relevant data at query time — cheaper, updatable without retraining, and answers are grounded in source documents. For most business use cases, RAG is the right choice. Fine-tuning is for changing the model's behavior or style, not for teaching it your data.

What types of documents can RAG ingest?

Anything text-based: PDFs, Word documents, markdown, HTML, plain text, emails, Confluence/Notion exports, database records, CSV files, Slack transcripts. For images and scanned documents, we add an OCR (optical character recognition) step first.

How much does a RAG system cost to run?

For a typical business knowledge base: embedding costs are near-zero (we use local models), vector storage is pennies (PostgreSQL), and LLM generation costs ~$0.002 per query with Claude Sonnet. A 50-person team making 100 queries/day between them costs roughly $6/month in API fees (100 queries × 30 days × $0.002 ≈ $6). The main cost is building it — not running it.

Can RAG handle sensitive or regulated data?

Yes. Because RAG retrieves from your own database, your data never leaves your infrastructure (except the query + retrieved chunks sent to the LLM). For maximum security, we can deploy with local LLMs (Llama, Mistral) so no data leaves your network at all. Row-level security ensures users only search documents they're authorized to see.

What's the difference between RAG and ChatGPT?

ChatGPT answers from its training data (public internet, up to its cutoff date). RAG answers from your specific documents. Think of it this way: ChatGPT is an encyclopedia. RAG is a research assistant who reads your company's files and answers questions about them — with page numbers.

How long does it take to build?

A production RAG system typically takes 4-8 weeks: 1 week for data ingestion pipeline, 1-2 weeks for retrieval tuning, 1-2 weeks for the interface (web, Slack, Teams), and 1-2 weeks for security, monitoring, and deployment. We built the demo on this site in a single day — production systems need more rigor.
