How RAG Actually Works

Retrieval-Augmented Generation is how you make AI answer questions from your data — with citations, without hallucination. Here's the full pipeline, step by step.

The Problem RAG Solves

LLMs like ChatGPT are trained on public internet data. They don't know about your SOPs, contracts, internal docs, or proprietary data. When you ask about your stuff, they either hallucinate confidently or say "I don't have that information."

❌ Without RAG

  • AI makes up answers from training data
  • No source attribution — you can't verify
  • Can't access private or recent documents
  • Confidently wrong (hallucination)

✓ With RAG

  • AI answers only from your documents
  • Every answer cites its source
  • Works with any document you ingest
  • Says "I don't know" when it doesn't

5 Stages from Document to Answer

Stage 1: Ingest

Documents → Chunks

Raw documents (PDFs, markdown, Word docs, web pages) are split into overlapping chunks of ~500-1000 characters. Chunks are split on paragraph boundaries so you don't cut mid-sentence. Overlap ensures context isn't lost at boundaries.

Before: Raw Document
SOP-002: CRAC Unit Failure Response

Defines the immediate response procedure when a Computer Room Air Conditioning unit fails or alarms. Severity classification: SEV-1 (multiple CRAC units failed, ambient >85°F), SEV-2 (single unit failed, redundancy covers load), SEV-3 (unit alarming but operational). Step 1: Assess severity. Step 2: Notify shift supervisor. Step 3: Increase adjacent CRAC fan speed (max 80%)...
↓ split into overlapping chunks ↓
After: Indexed Chunks
#1 SOP-002: CRAC Unit Failure Response. Defines the immediate response procedure when a Computer Room Air...
#2 ...SEV-1 (multiple CRAC units failed, ambient >85°F), SEV-2 (single unit failed, redundancy covers load)...
#3 ...Step 1: Assess severity. Step 2: Notify shift supervisor. Step 3: Increase adjacent CRAC fan speed...

Why chunking matters: LLMs have limited context windows. You can't feed them 10,000 pages. Chunking lets you find and pass only the 3-5 most relevant pieces.
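The chunking step can be sketched in a few lines of Python. This is a minimal sketch matching the description above — the function name, the 1000-character limit, and the overlap size are illustrative assumptions, not the exact production code:

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks on paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Close the current chunk when the next paragraph would overflow it
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk forward so context
            # isn't lost at the boundary
            current = current[-overlap:] + "\n\n" + para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than `max_chars` stays whole in this sketch; production pipelines usually fall back to sentence-level splitting for that case.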

Stage 2: Embed

Text → Vector Embeddings

Each chunk is converted into a high-dimensional vector (a list of numbers) using an embedding model. These vectors capture semantic meaning — "CRAC failure" and "cooling unit broke" produce similar vectors even though they share zero words.

Embedding: Text → 384-dimension vectors
"CRAC failure"       → [0.23, -0.41, 0.67, ...]
"cooling unit broke" → [0.21, -0.38, 0.64, ...]
  ↑ 92% similar — same concept, different words

"load bank test"     → [-0.15, 0.72, -0.33, ...]
  ↑ 18% similar — different concept entirely

Key insight: This is what makes RAG semantic, not keyword-based. Traditional search requires exact word matches. Embedding-based search understands meaning.

Stage 3: Store

Vectors → Database

Vectors are stored in a vector database alongside their original text and metadata. We use pgvector — a PostgreSQL extension — because it keeps vectors in the same database as your business data. No extra infrastructure.

Vector Database (pgvector)
id | document        | chunk | embedding
1  | sop-002-crac.md | #1    | [0.23, -0.41, 0.67, ... 384 dims]
2  | sop-002-crac.md | #2    | [0.19, -0.55, 0.42, ... 384 dims]
3  | incident-dh3.md | #1    | [-0.31, 0.28, 0.51, ... 384 dims]

Why pgvector over Pinecone/Weaviate? For most businesses, your documents already live near a PostgreSQL database. pgvector keeps everything in one place — simpler ops, lower cost, no vendor lock-in.
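A sketch of the underlying schema — the table and column names mirror the illustration above but are assumptions, and the HNSW index line requires pgvector 0.5 or newer:

```sql
-- Enable the extension once per database
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        serial PRIMARY KEY,
    document  text NOT NULL,
    chunk     text NOT NULL,
    embedding vector(384)  -- must match the embedding model's output size
);

-- Approximate nearest-neighbor index for fast cosine search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```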

Stage 4: Retrieve

Question → Nearest Vectors

When a user asks a question, their query is also embedded into a vector. The database performs a cosine similarity search — finding the stored chunks whose vectors are closest to the query vector. The top 3-5 chunks become the context.

Vector Similarity Search
Q: "What's the CRAC failure response procedure?"
67% — sop-002-crac.md, chunk #1 (purpose & severity)
46% — sop-002-crac.md, chunk #4 (notifications)
41% — incident-dh3.md, chunk #2 (CRAC mentioned)
15% — reference-power.md, chunk #3 (unrelated)

This is the "Retrieval" in RAG. The AI never sees your full document library — only the most relevant chunks for this specific question. This makes answers focused and accurate.
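With pgvector this retrieval is a single query (`SELECT chunk FROM chunks ORDER BY embedding <=> %(q)s LIMIT 5`, where `<=>` is the cosine-distance operator). The same top-k logic in plain Python, as a sketch with illustrative helper names:

```python
import math

def _cos(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec: list[float], rows: list[tuple[str, list[float]]], k: int = 5):
    """rows: (chunk_text, embedding) pairs; return the k most similar chunks."""
    scored = sorted(((_cos(query_vec, emb), text) for text, emb in rows), reverse=True)
    return [(round(score, 2), text) for score, text in scored[:k]]
```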

Stage 5: Generate

Context + Question → Grounded Answer

The retrieved chunks are injected into a prompt alongside the user's question: "Answer using ONLY these sources. Cite with [Source N]. If the information isn't in the sources, say so." The LLM generates an answer grounded in your actual documents.

Generation with Source Attribution

📄 Context (retrieved chunks)

[Source 1] SOP-002: CRAC Unit Failure Response. Defines the immediate response procedure... SEV-1: Multiple CRAC units failed, ambient >85°F — Immediate response...

[Source 2] ...Step 2: Page on-call HVAC technician. Notify Shift Supervisor. If SEV-1: Activate thermal emergency protocol...

✓ Generated Answer

The CRAC failure response procedure follows SOP-002:

1. Classify severity — SEV-1 is multiple units failed with ambient >85°F [Source 1]

2. Notify — Page on-call HVAC tech, notify Shift Supervisor [Source 2]

3. Mitigate — Increase adjacent CRAC fan speed (max 80%) [Source 2]

No hallucination. Every claim traces back to a source document. If the answer isn't in the retrieved chunks, the AI says "I don't have that information" instead of making something up.
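The prompt assembly in this stage can be sketched as follows — the instruction wording comes from the description above, while the function name is illustrative:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into a grounded prompt with numbered sources."""
    sources = "\n\n".join(
        f"[Source {i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY these sources. Cite with [Source N]. "
        "If the information isn't in the sources, say so.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

The returned string goes to the LLM as-is; because every chunk carries a [Source N] tag, the model can cite exactly what it used.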

  • ~$0.002 cost per query
  • <2s response time
  • 384 vector dimensions
  • Scales to 100K+ documents

Production Stack

What we use to build RAG systems for clients

  • 🐍 Python + FastAPI — application layer
  • 🐘 PostgreSQL + pgvector — vector storage
  • 🤖 Claude / GPT-4 — answer generation
  • 📦 Docker — deployment
  • 🧠 sentence-transformers — local embeddings
  • 🔒 Row-level security — access controls

See It In Action

We built a live RAG demo with 9 real documents — data center SOPs, incident reports, real estate analysis. Ask it anything in plain English.

Frequently Asked Questions

How is RAG different from fine-tuning?

Fine-tuning modifies the AI model itself with your data — expensive, requires ML expertise, and the model can still hallucinate. RAG keeps the model unchanged and retrieves relevant data at query time — cheaper, updatable without retraining, and answers are grounded in source documents. For most business use cases, RAG is the right choice. Fine-tuning is for changing the model's behavior or style, not for teaching it your data.

What types of documents can RAG ingest?

Anything text-based: PDFs, Word documents, markdown, HTML, plain text, emails, Confluence/Notion exports, database records, CSV files, Slack transcripts. For images and scanned documents, we add an OCR (optical character recognition) step first.

How much does a RAG system cost to run?

For a typical business knowledge base: embedding costs are near-zero (we use local models), vector storage is pennies (PostgreSQL), and LLM generation costs ~$0.002 per query with Claude Sonnet. A 50-person team making 100 queries/day between them costs roughly $6/month in API fees (100 queries × 30 days × $0.002 ≈ $6). The main cost is building it — not running it.

Can RAG handle sensitive or regulated data?

Yes. Because RAG retrieves from your own database, your data never leaves your infrastructure (except the query + retrieved chunks sent to the LLM). For maximum security, we can deploy with local LLMs (Llama, Mistral) so no data leaves your network at all. Row-level security ensures users only search documents they're authorized to see.

What's the difference between RAG and ChatGPT?

ChatGPT answers from its training data (public internet, up to its cutoff date). RAG answers from your specific documents. Think of it this way: ChatGPT is an encyclopedia. RAG is a research assistant who reads your company's files and answers questions about them — with page numbers.

How long does it take to build?

A production RAG system typically takes 4-8 weeks: 1 week for data ingestion pipeline, 1-2 weeks for retrieval tuning, 1-2 weeks for the interface (web, Slack, Teams), and 1-2 weeks for security, monitoring, and deployment. We built the demo on this site in a single day — production systems need more rigor.
