Retrieval-Augmented Generation is how you make AI answer questions from your data — with citations, without hallucination. Here's the full pipeline, step by step.
LLMs like ChatGPT are trained on public internet data. They don't know about your SOPs, contracts, internal docs, or proprietary data. When you ask about your stuff, they either hallucinate confidently or say "I don't have that information."
Raw documents (PDFs, markdown, Word docs, web pages) are split into overlapping chunks of ~500-1000 characters. Chunks are split on paragraph boundaries so you don't cut mid-sentence. Overlap ensures context isn't lost at boundaries.
Why chunking matters: LLMs have limited context windows. You can't feed them 10,000 pages. Chunking lets you find and pass only the 3-5 most relevant pieces.
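The chunking step can be sketched in a few lines of Python. This is a simplified version (function name and sizes are illustrative; a production chunker would also handle single paragraphs longer than the limit):

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping chunks on paragraph boundaries."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Close the current chunk when adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            # Carry the tail of the previous chunk forward so context
            # isn't lost at the boundary
            current = current[-overlap:] + "\n\n" + para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because each new chunk starts with the tail of the previous one, a sentence that matters for context never disappears into a boundary.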
Each chunk is converted into a high-dimensional vector (a list of numbers) using an embedding model. These vectors capture semantic meaning — "CRAC failure" and "cooling unit broke" produce similar vectors even though they share zero words.
Key insight: This is what makes RAG semantic, not keyword-based. Traditional search requires exact word matches. Embedding-based search understands meaning.
Vectors are stored in a vector database alongside their original text and metadata. We use pgvector — a PostgreSQL extension — because it keeps vectors in the same database as your business data. No extra infrastructure.
| id | document        | chunk | embedding                         |
|----|-----------------|-------|-----------------------------------|
| 1  | sop-002-crac.md | #1    | [0.23, -0.41, 0.67, ... 384 dims] |
| 2  | sop-002-crac.md | #2    | [0.19, -0.55, 0.42, ... 384 dims] |
| 3  | incident-dh3.md | #1    | [-0.31, 0.28, 0.51, ... 384 dims] |
Why pgvector over Pinecone/Weaviate? For most businesses, your documents already live near a PostgreSQL database. pgvector keeps everything in one place — simpler ops, lower cost, no vendor lock-in.
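A minimal pgvector schema for the table above might look like this (a sketch; the table and column names are assumptions, and the 384 dimensions match the example embeddings):

```sql
-- Enable the pgvector extension (installed separately from PostgreSQL)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    document  text NOT NULL,          -- source file, e.g. an SOP markdown doc
    chunk_no  int  NOT NULL,          -- position within the document
    content   text NOT NULL,          -- original chunk text
    embedding vector(384) NOT NULL    -- one vector per chunk
);

-- Approximate nearest-neighbour index for fast cosine search
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
```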
When a user asks a question, their query is also embedded into a vector. The database performs a cosine similarity search — finding the stored chunks whose vectors are closest to the query vector. The top 3-5 chunks become the context.
This is the "Retrieval" in RAG. The AI never sees your full document library — only the most relevant chunks for this specific question. This makes answers focused and accurate.
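Cosine similarity itself is simple enough to sketch in plain Python (in production, pgvector's `<=>` cosine-distance operator does this inside the database, backed by an index; the function names here are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], chunks: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k stored chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda cid: cosine_similarity(query, chunks[cid]),
                    reverse=True)
    return ranked[:k]
```

The equivalent pgvector query is a one-liner: `SELECT content FROM chunks ORDER BY embedding <=> $1 LIMIT 5;` where `$1` is the embedded query.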
The retrieved chunks are injected into a prompt alongside the user's question: "Answer using ONLY these sources. Cite with [Source N]. If the information isn't in the sources, say so." The LLM generates an answer grounded in your actual documents.
Retrieved context:

[Source 1] SOP-002: CRAC Unit Failure Response. Defines the immediate response procedure... SEV-1: Multiple CRAC units failed, ambient >85°F — Immediate response...

[Source 2] ...Step 2: Page on-call HVAC technician. Notify Shift Supervisor. If SEV-1: Activate thermal emergency protocol...

Generated answer:

The CRAC failure response procedure follows SOP-002:
1. Classify severity — SEV-1 is multiple units failed with ambient >85°F [Source 1]
2. Notify — Page on-call HVAC tech, notify Shift Supervisor [Source 2]
3. Mitigate — Increase adjacent CRAC fan speed (max 80%) [Source 2]
No hallucination. Every claim traces back to a source document. If the answer isn't in the retrieved chunks, the AI says "I don't have that information" instead of making something up.
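The prompt-assembly step above can be sketched as a small function (the instruction wording comes from the text; the function name and layout are illustrative):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved chunks plus the user's question."""
    sources = "\n\n".join(
        f"[Source {i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY these sources. Cite with [Source N]. "
        "If the information isn't in the sources, say so.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```

The numbered `[Source N]` labels are what let the model's citations trace back to specific chunks, and from there to specific documents.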
We built a live RAG demo with 9 real documents — data center SOPs, incident reports, real estate analysis. Ask it anything in plain English.
Fine-tuning modifies the AI model itself with your data — expensive, requires ML expertise, and the model can still hallucinate. RAG keeps the model unchanged and retrieves relevant data at query time — cheaper, updatable without retraining, and answers are grounded in source documents. For most business use cases, RAG is the right choice. Fine-tuning is for changing the model's behavior or style, not for teaching it your data.
Anything text-based: PDFs, Word documents, markdown, HTML, plain text, emails, Confluence/Notion exports, database records, CSV files, Slack transcripts. For images and scanned documents, we add an OCR (optical character recognition) step first.
For a typical business knowledge base: embedding costs are near-zero (we use local models), vector storage is pennies (PostgreSQL), and LLM generation costs ~$0.002 per query with Claude Sonnet. A team of 50 people making a combined 100 queries/day costs roughly $6/month in API fees. The main cost is building it — not running it.
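The arithmetic behind that estimate (the per-query price is the figure quoted above; days per month is rounded):

```python
cost_per_query = 0.002   # ~LLM generation cost per query, USD
queries_per_day = 100    # total across the whole team
days_per_month = 30

monthly_cost = cost_per_query * queries_per_day * days_per_month
print(f"${monthly_cost:.2f}/month")  # $6.00/month
```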
Yes. Because RAG retrieves from your own database, your data never leaves your infrastructure (except the query + retrieved chunks sent to the LLM). For maximum security, we can deploy with local LLMs (Llama, Mistral) so no data leaves your network at all. Row-level security ensures users only search documents they're authorized to see.
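Row-level security is standard PostgreSQL. A sketch of how it might be wired up (the table, column, and setting names are assumptions for illustration):

```sql
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;

-- Assume each chunk row carries an allowed_group column, and the app sets
-- app.user_group on each connection; the second argument makes
-- current_setting return NULL (no rows) instead of erroring when unset.
CREATE POLICY chunk_access ON chunks
    USING (allowed_group = current_setting('app.user_group', true));
```

With a policy like this in place, the similarity search only ever ranks chunks the current user is allowed to read.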
ChatGPT answers from its training data (public internet, up to its cutoff date). RAG answers from your specific documents. Think of it this way: ChatGPT is an encyclopedia. RAG is a research assistant who reads your company's files and answers questions about them — with page numbers.
A production RAG system typically takes 4-7 weeks: 1 week for the data ingestion pipeline, 1-2 weeks for retrieval tuning, 1-2 weeks for the interface (web, Slack, Teams), and 1-2 weeks for security, monitoring, and deployment. We built the demo on this site in a single day — production systems need more rigor.