RAG Architecture Explained: How Private Document Search Actually Works

Retrieval Augmented Generation (RAG) is the architecture pattern that makes private AI useful. A language model on its own knows the internet but knows nothing about your firm. RAG bridges the gap by retrieving relevant passages from your documents and feeding them to the model as context. The result: AI that answers questions about your contracts, policies, and records — with citations — without those documents ever leaving your environment. This guide explains how it actually works.

Why You Need RAG

A standalone language model — Llama 3, Gemma 4, Mistral, or any other — has been trained on a vast but generic corpus of internet text. It can write a passable cover letter, summarise a generic news article, or explain a general concept.

What it cannot do:

Tell you what clauses are in your employment contract template
Find a precedent in your firm's prior matters
Summarise your internal policy on data handling
Quote from your clinical guidelines
Identify decisions in your board meeting minutes

For business use, generic capability is rarely enough. You need AI that knows your documents.

There are two paths to this:

Approach	Description	Practicality
Fine-tuning	Train the model further on your documents so they become part of its parametric memory	Expensive, slow, requires re-training when documents change, hard to update
RAG	Keep the model as-is; retrieve relevant documents at query time and feed them as context	Cheap, fast, updates instantly, transparent (model cites sources)

For 95% of enterprise use cases, RAG is the right answer. Fine-tuning is only required for narrow specialised tasks where you need the model itself to absorb domain-specific reasoning patterns, not just facts.

The RAG Architecture in Plain English

Here's how a typical RAG query flows:

User asks a question. "What's our policy on data retention for medical records?"
The system converts the question into a vector. A numerical representation of the question's meaning.
The system searches the vector database. It finds the document chunks whose meanings are closest to the question's meaning.
The retrieved chunks become "context." Typically 5-10 passages, each a few sentences to a few paragraphs.
The system constructs a prompt. "Based on the following passages from internal policy documents, answer the user's question and cite the source documents: [passages] [user question]"
The LLM generates an answer. Grounded in the retrieved passages, with citations.
The user sees the answer with citations. Can click through to read the underlying documents.

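The whole process typically takes 1-5 seconds. Because the prompt tells the model to answer only from the retrieved passages and to cite them, its claims about your documents are grounded in that text and can be checked against the sources. This greatly reduces invented facts, although it does not eliminate them.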

The Four Components

A working RAG system has four moving parts. Let's look at each.

Component 1: Document Ingestion Pipeline

This is where your documents enter the system.

What happens:

Files are read from their source (file shares, SharePoint, document management system)
Text is extracted (handling PDFs, Word docs, emails, scanned images via OCR)
Documents are split into "chunks" — typically 200-800 words each
Metadata is captured (filename, document type, date, author, permissions)

Key decisions:

Chunk size: smaller chunks (200 words) are precise but lose context. Larger chunks (800 words) preserve context but reduce retrieval precision. 400-500 words is a common middle ground.
Overlap: chunks often overlap by 10-20% so that information spanning chunk boundaries isn't lost
Metadata extraction: good metadata enables filtering ("only search 2024 contracts", "only search HR policies")
Permission handling: if some documents are restricted, ingestion must capture access control information so retrieval respects it

Common tools: LangChain, LlamaIndex, custom Python pipelines.
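
The chunk size and overlap decisions above can be sketched in a few lines of Python. This is a minimal illustration, not how LangChain or LlamaIndex split documents: the function name chunk_words and its defaults (450 words per chunk, 60 words of overlap, roughly 13%) are assumptions chosen to sit inside the ranges mentioned above.

```python
def chunk_words(text: str, chunk_size: int = 450, overlap: int = 60) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be between 0 and chunk_size - 1")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny demonstration: 10 "words", chunks of 4 words, 1 word of overlap.
demo = chunk_words("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", chunk_size=4, overlap=1)
print(demo)  # ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

A fixed word window like this can still cut a sentence in half. Splitting on paragraph or section boundaries first, then applying the window only to oversized sections, usually gives cleaner chunks.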

Component 2: Embedding Model

The embedding model converts each chunk (and later, each query) into a vector — a list of typically 384-1536 numbers that represents the chunk's meaning.

Why this works: Chunks with similar meaning produce similar vectors. The phrase "data retention policy" produces a vector close to "how long do we keep records," even though they share few words.
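
"Similar vectors" is usually measured with cosine similarity. The sketch below shows the calculation on made-up 4-dimensional vectors. The numbers are invented for illustration and do not come from any real embedding model, which would produce hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (NOT real model output), hand-picked so that the
# two retention-related phrases point in a similar direction.
toy_vectors = {
    "data retention policy": [0.9, 0.8, 0.1, 0.0],
    "how long do we keep records": [0.8, 0.9, 0.2, 0.1],
    "annual leave entitlements": [0.1, 0.0, 0.9, 0.8],
}

related = cosine_similarity(
    toy_vectors["data retention policy"],
    toy_vectors["how long do we keep records"],
)
unrelated = cosine_similarity(
    toy_vectors["data retention policy"],
    toy_vectors["annual leave entitlements"],
)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```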

Choices:

Embedding Model	Vector Size	Strengths	Trade-offs
bge-small-en-v1.5	384	Fast, efficient, good for English	Smaller capacity for nuance
bge-base-en-v1.5	768	Better quality, still fast	More memory
bge-large-en-v1.5	1024	High quality	Slower, more memory
all-MiniLM-L6-v2	384	Very fast, light	Older, less accurate than bge
mxbai-embed-large	1024	Top-tier quality	Larger compute requirement

For private deployments, the embedding model also runs locally — no data leaves the environment for embedding generation.

Component 3: Vector Database

The vector database stores all those embeddings and provides fast search.

What it does: Given a query vector, return the K most similar chunk vectors (where K is typically 5-20).
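
Conceptually, that search looks like the brute-force sketch below: score every stored vector against the query and keep the best K. Real vector databases use approximate indexes to avoid scanning everything, so treat this as an illustration of the result, not of how ChromaDB or Qdrant work internally. The chunk IDs and 3-dimensional vectors are made up.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector, index, k=5):
    """Return the k (chunk_id, score) pairs most similar to the query.

    `index` is a list of (chunk_id, vector) pairs held in memory.
    """
    scored = (
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in index
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy index (invented IDs and vectors).
index = [
    ("retention-1", [0.9, 0.8, 0.1]),
    ("leave-1", [0.1, 0.0, 0.9]),
    ("retention-2", [0.7, 0.9, 0.2]),
    ("expenses-1", [0.0, 0.2, 0.8]),
]
query = [0.8, 0.9, 0.1]

results = top_k(query, index, k=2)
print(results)
```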

Choices for private deployment:

Vector DB	Type	Strengths	Notes
ChromaDB	Lightweight Python-native	Easy to deploy, good for small-mid scale	Most common choice for private AI in 2026
Qdrant	Production-grade	Fast, scalable, rich filtering	Good for larger deployments
Weaviate	Feature-rich	Built-in modules, hybrid search	More complex to deploy
pgvector	PostgreSQL extension	Use existing Postgres infrastructure	Slower than dedicated vector DBs at scale
Milvus	Distributed	Massive scale	Overkill for most business deployments

For most private deployments under 10 million chunks, ChromaDB running locally on the same server as the LLM is sufficient and simple to maintain.
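
A quick back-of-envelope check helps when sizing the server. The sketch below estimates the raw size of the vectors alone, assuming 768-dimensional embeddings (as with bge-base-en-v1.5) stored as 32-bit floats. Both are assumptions, and the figure excludes index structures, metadata, and the chunk text itself.

```python
def raw_vector_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the embedding vectors only (no index, metadata, or text)."""
    return num_chunks * dimensions * bytes_per_value


ten_million = raw_vector_bytes(10_000_000, 768)
print(f"{ten_million / 1e9:.2f} GB")  # 30.72 GB

smaller_corpus = raw_vector_bytes(180_000, 768)
print(f"{smaller_corpus / 1e9:.2f} GB")  # 0.55 GB
```

A smaller embedding model (384 dimensions) halves these numbers, which is one reason vector size matters for larger corpora.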

Component 4: The LLM

Finally, the language model itself generates the answer. With RAG, the LLM's job changes from "answer from memory" to "synthesise an answer from the provided passages."

The prompt structure (simplified):

You are an AI assistant for [organisation]. Answer the user's
question based on the following passages from internal documents.
Cite the source for each fact you state. If the passages do not
contain the answer, say so clearly.

Passages:
[1] (filename: hr-policy-2024.pdf, page 12): "Employee personal
data is retained for 7 years after termination..."

[2] (filename: data-retention-procedure.docx, page 3): "Medical
records are retained for the longer of: (a) 10 years from the date
of last treatment, or (b) until the patient reaches age 25..."

[3] ...

User question: What's our policy on data retention for medical
records?

Answer:

The model generates an answer based on these passages, citing each fact's source. The user sees both the answer and the citations.
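
Assembling that prompt is plain string formatting. Here is a minimal sketch that follows the simplified template above. The Passage class, the build_prompt function, and the organisation name are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    filename: str
    page: int
    text: str


def build_prompt(organisation: str, passages: list[Passage], question: str) -> str:
    """Fill the simplified RAG prompt template with numbered, sourced passages."""
    lines = [
        f"You are an AI assistant for {organisation}. Answer the user's "
        "question based on the following passages from internal documents. "
        "Cite the source for each fact you state. If the passages do not "
        "contain the answer, say so clearly.",
        "",
        "Passages:",
    ]
    for number, passage in enumerate(passages, start=1):
        lines.append(
            f'[{number}] (filename: {passage.filename}, page {passage.page}): '
            f'"{passage.text}"'
        )
        lines.append("")
    lines.append(f"User question: {question}")
    lines.append("")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_prompt(
    "Example Organisation",  # placeholder name
    [
        Passage("hr-policy-2024.pdf", 12,
                "Employee personal data is retained for 7 years after termination..."),
        Passage("data-retention-procedure.docx", 3,
                "Medical records are retained for the longer of: (a) 10 years "
                "from the date of last treatment, or (b) until the patient "
                "reaches age 25..."),
    ],
    "What's our policy on data retention for medical records?",
)
print(prompt)
```

Numbering the passages gives the model a short, unambiguous handle ([1], [2]) to cite, which also makes the citations easy to check afterwards.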

LLM choices: any modern model works — Llama 3, Gemma 4, Mistral. See our detailed model comparison guide.

Putting It All Together

A complete private RAG deployment looks like:

┌─────────────────────────────────────────────────────────────┐
│                    Your Organisation                         │
│                                                              │
│  ┌────────────┐    ┌────────────┐    ┌────────────────┐    │
│  │  Documents │ ─▶ │  Ingestion │ ─▶ │   Embedding    │    │
│  │ (PDFs,     │    │  Pipeline  │    │    Model       │    │
│  │  Word,     │    │            │    │  (bge-base)    │    │
│  │  emails)   │    │            │    │                │    │
│  └────────────┘    └────────────┘    └───────┬────────┘    │
│                                                │             │
│                                                ▼             │
│                                       ┌────────────────┐    │
│                                       │   ChromaDB     │    │
│                                       │ (vector store) │    │
│                                       └────────┬───────┘    │
│                                                │             │
│  ┌────────────┐    ┌────────────┐    ┌───────▼────────┐    │
│  │   User     │ ─▶ │   Query    │ ─▶ │   Retrieval    │    │
│  │  asks      │    │  Embedder  │    │   (find top    │    │
│  │ question   │    │            │    │   K matches)   │    │
│  └────────────┘    └────────────┘    └───────┬────────┘    │
│                                                │             │
│                                                ▼             │
│  ┌────────────┐    ┌────────────┐    ┌────────────────┐    │
│  │   Answer   │ ◀─ │    LLM     │ ◀─ │   Construct    │    │
│  │   with     │    │ (Llama 3,  │    │     Prompt     │    │
│  │ citations  │    │  Gemma 4)  │    │ (context + Q)  │    │
│  └────────────┘    └────────────┘    └────────────────┘    │
│                                                              │
│              All on YOUR infrastructure.                     │
│              No data leaves your environment.                │
└─────────────────────────────────────────────────────────────┘

Every component runs locally. The flow from question to answer never touches the internet.

Advanced RAG Patterns

The basic flow above is the foundation. Production systems add layers:

Hybrid Search

Combining vector search (semantic similarity) with keyword search (exact match). Vector search finds "data retention policy" when you ask about "how long we keep records." Keyword search finds the document that specifically mentions a clause number or proper noun. Best results often come from combining both, then re-ranking.
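
One common way to combine the two result lists is reciprocal rank fusion (RRF), which only needs each list's ranking, not comparable scores. The sketch below is one option among several, not the only way to build hybrid search. The document IDs are made up, and the smoothing constant of 60 is the value conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], smoothing: int = 60) -> list[str]:
    """Merge several ranked lists of IDs into one, best first.

    Each ID earns 1 / (smoothing + rank) from every list it appears in,
    so items ranked well by both searches rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (smoothing + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from the two searches (invented IDs).
vector_hits = ["policy-retention", "procedure-archiving", "policy-privacy"]
keyword_hits = ["contract-clause-14", "policy-retention", "procedure-archiving"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
# ['policy-retention', 'procedure-archiving', 'contract-clause-14', 'policy-privacy']
```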

Re-Ranking

After initial retrieval (which is fast but imprecise), a more powerful re-ranking model evaluates the top candidates more carefully. Common pattern: retrieve top 50 with vector search, re-rank to top 10 with a cross-encoder model, feed those 10 to the LLM.
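
The two-stage pattern can be sketched as follows. The scoring function here is a deliberately crude word-overlap stand-in so the example runs without any model. In a real system, score_pair would call a cross-encoder that reads the question and passage together. The names rerank and toy_overlap_score are hypothetical.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    keep: int = 10,
) -> list[str]:
    """Re-score each candidate against the question and keep the best `keep`."""
    scored = [(score_pair(question, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]


def toy_overlap_score(question: str, passage: str) -> float:
    """Stand-in scorer: fraction of question words that appear in the passage."""
    question_words = set(question.lower().split())
    passage_words = set(passage.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & passage_words) / len(question_words)


# Pretend these came back from the fast first-stage vector search.
candidates = [
    "Annual leave accrues monthly",
    "Medical records are retained for 10 years",
    "The retention period for medical records is 10 years",
]
best = rerank("medical records retention period", candidates, toy_overlap_score, keep=2)
print(best)
```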

Query Expansion

The original question may be ambiguous or use different terminology than the documents. Query expansion generates several reformulations of the question and searches with each, combining results.
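
Combining the results can be as simple as keeping each chunk's best score across all reformulations. In the sketch below the reformulations and search results are canned so that it runs on its own. In practice the reformulations would typically come from the LLM and the search function would query the vector database. All names and scores are invented.

```python
from typing import Callable


def search_with_expansion(
    queries: list[str],
    search: Callable[[str], list[tuple[str, float]]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Run every reformulation, keep each chunk's best score, return the top k."""
    best: dict[str, float] = {}
    for query in queries:
        for chunk_id, score in search(query):
            if score > best.get(chunk_id, float("-inf")):
                best[chunk_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Canned, hypothetical search results keyed by query text.
CANNED_RESULTS = {
    "how long do we keep records": [("chunk-a", 0.71), ("chunk-b", 0.55)],
    "data retention period": [("chunk-a", 0.83), ("chunk-c", 0.60)],
}


def toy_search(query: str) -> list[tuple[str, float]]:
    return CANNED_RESULTS.get(query, [])


merged = search_with_expansion(list(CANNED_RESULTS), toy_search, k=3)
print(merged)  # [('chunk-a', 0.83), ('chunk-c', 0.6), ('chunk-b', 0.55)]
```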

Document Filtering

Using metadata to constrain the search: "only search legal precedents from 2020 onwards" or "only show me documents I have permission to read." This is essential for organisations with permission boundaries.
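
A minimal sketch of the idea: filter the candidate chunks by permissions and metadata before similarity ranking ever sees them, so restricted content cannot reach the prompt. The Chunk fields and group names are hypothetical. Most vector databases offer their own metadata filter syntax for the same purpose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    doc_type: str
    year: int
    allowed_groups: frozenset  # groups permitted to read the source document


def allowed_chunks(
    chunks: list[Chunk],
    user_groups: frozenset,
    doc_type: Optional[str] = None,
    min_year: Optional[int] = None,
) -> list[Chunk]:
    """Keep only chunks the user may read and that match the metadata filters."""
    result = []
    for chunk in chunks:
        if not (chunk.allowed_groups & user_groups):
            continue  # user shares no group with this document
        if doc_type is not None and chunk.doc_type != doc_type:
            continue
        if min_year is not None and chunk.year < min_year:
            continue
        result.append(chunk)
    return result


# Hypothetical corpus metadata.
corpus = [
    Chunk("c1", "contract", 2024, frozenset({"legal"})),
    Chunk("c2", "contract", 2019, frozenset({"legal"})),
    Chunk("c3", "hr-policy", 2024, frozenset({"hr"})),
    Chunk("c4", "contract", 2024, frozenset({"partners"})),
]

visible = allowed_chunks(corpus, frozenset({"legal"}), doc_type="contract", min_year=2020)
print([chunk.chunk_id for chunk in visible])  # ['c1']
```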

Multi-Hop Retrieval

For complex questions, retrieve, generate a partial answer, identify what's still missing, retrieve again. This handles questions like "How does our data retention policy compare to APRA's requirements?" which needs information from both internal documents and reference materials.
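
The retrieve-then-check loop can be sketched as below. The two callables are placeholders: retrieve would normally query the vector database, and next_query would normally ask the LLM what is still missing. Here both are toy stubs over placeholder text so the loop itself can be run and inspected.

```python
from typing import Callable, Optional


def multi_hop_retrieve(
    question: str,
    retrieve: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_hops: int = 3,
) -> list[str]:
    """Retrieve, ask what is still missing, and retrieve again, up to max_hops."""
    gathered: list[str] = []
    query: Optional[str] = question
    for _ in range(max_hops):
        for passage in retrieve(query):
            if passage not in gathered:
                gathered.append(passage)
        query = next_query(question, gathered)
        if query is None:
            break  # nothing further is needed
    return gathered


# Toy stand-ins (hypothetical): a two-entry corpus with placeholder text.
TOY_CORPUS = {
    "policy": "Internal policy passage (placeholder text)",
    "reference": "Reference material passage (placeholder text)",
}


def toy_retrieve(query: str) -> list[str]:
    return [text for key, text in TOY_CORPUS.items() if key in query.lower()]


def toy_next_query(question: str, gathered: list[str]) -> Optional[str]:
    if TOY_CORPUS["reference"] not in gathered:
        return "reference requirements on data retention"
    return None


passages = multi_hop_retrieve(
    "How does our data retention policy compare to external requirements?",
    toy_retrieve,
    toy_next_query,
)
print(passages)
```

The max_hops cap matters: without it, a model that keeps asking for more context would loop indefinitely.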

Quality and Tuning

A working RAG system is not the same as a good RAG system. Quality depends on:

Document Quality

Garbage in, garbage out. Common issues:

Outdated documents — old policies, superseded contracts, retired procedures
Inconsistent versions — multiple drafts of the same document, unclear which is authoritative
Poor OCR — scanned documents with extraction errors
Permission ambiguity — unclear who should see what

The fix: treat document quality as part of the deployment. Sometimes the most valuable thing AIRGAP LLM does in a deployment is help the organisation recognise and clean up its document chaos.

Chunk Quality

Bad chunking produces unhelpful context:

Chunks that split sentences mid-thought
Chunks too large (irrelevant content drowns the relevant)
Chunks too small (lacking context to make sense)
No metadata (can't filter, hard to cite)

The fix: chunk by semantic unit (paragraph, section) where possible; preserve heading hierarchy; capture rich metadata.

Retrieval Quality

Retrieval can return irrelevant chunks or miss relevant ones:

Embedding model not strong enough for the domain
Vector dimensions too small (information loss)
Query phrasing very different from document phrasing

The fix: evaluate retrieval quality with test queries; use re-ranking; consider hybrid search; tune K (the number of chunks retrieved).
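
Evaluating with test queries can start very simply: for each question, record which chunks a human considers relevant, then measure how often at least one of them appears in the top K. The sketch below computes that hit rate. The questions, chunk IDs, and canned search results are invented for illustration.

```python
from typing import Callable


def hit_rate_at_k(
    test_cases: list[tuple[str, set[str]]],
    search: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of test questions with at least one relevant chunk in the top k."""
    if not test_cases:
        return 0.0
    hits = 0
    for question, relevant_ids in test_cases:
        retrieved = search(question)[:k]
        if any(chunk_id in relevant_ids for chunk_id in retrieved):
            hits += 1
    return hits / len(test_cases)


# Hypothetical labelled test set and canned ranked results.
test_cases = [
    ("retention period for medical records", {"chunk-retention"}),
    ("annual leave carry-over", {"chunk-leave"}),
]
CANNED_RANKINGS = {
    "retention period for medical records":
        ["chunk-privacy", "chunk-retention", "chunk-archive"],
    "annual leave carry-over":
        ["chunk-expenses", "chunk-travel", "chunk-leave"],
}


def toy_search(question: str) -> list[str]:
    return CANNED_RANKINGS.get(question, [])


print(hit_rate_at_k(test_cases, toy_search, k=2))  # 0.5
print(hit_rate_at_k(test_cases, toy_search, k=3))  # 1.0
```

Running the same test set before and after a change to chunking, the embedding model, or K shows whether the change actually helped.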

Generation Quality

The LLM may produce plausible-but-wrong answers:

Hallucinating beyond what's in the context
Misinterpreting the passages
Missing critical details

The fix: prompt engineering (clear instructions, citation requirements); model selection (stronger models for complex tasks); answer validation (does every claim cite a source?).
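
The "does every claim cite a source?" check can be partly automated. The sketch below flags sentences that carry no [n] marker or that cite a passage number that was never supplied. It assumes citations appear inside the sentence, before the closing punctuation, and it cannot tell whether the cited passage actually supports the claim. The function name and sample answer are hypothetical.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def uncited_sentences(answer: str, num_passages: int) -> list[str]:
    """Return sentences with no citation or with an out-of-range citation number."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    problems = []
    for sentence in sentences:
        cited = [int(number) for number in CITATION.findall(sentence)]
        if not cited or any(n < 1 or n > num_passages for n in cited):
            problems.append(sentence)
    return problems


# Toy answer: two properly cited sentences, one uncited, one citing passage 7 of 3.
sample_answer = (
    "Medical records are retained for at least 10 years [2]. "
    "Employee personal data is retained for 7 years [1]. "
    "This is standard practice. "
    "Backups are kept indefinitely [7]."
)
flagged = uncited_sentences(sample_answer, num_passages=3)
print(flagged)
# ['This is standard practice.', 'Backups are kept indefinitely [7].']
```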

RAG and Privacy

The privacy story for private RAG is straightforward: everything runs locally.

Component	Data Location	External Communication
Documents	Your file system	None
Ingestion pipeline	Your server	None
Embedding model	Your server	None
Vector database	Your server	None
LLM	Your server	None
Query and response	Your network	None

No part of the pipeline transmits data externally. The system can run completely air-gapped if required.

This is in stark contrast to cloud RAG offerings (some major cloud providers offer "managed RAG" services), where:

Documents are uploaded to the provider's infrastructure
Embeddings are generated on the provider's servers
Vector storage is on the provider's infrastructure
Queries flow through the provider's systems
The LLM is the provider's hosted model

Cloud RAG inherits all the compliance issues we covered in our private LLM vs public LLM and sovereign AI Australia guides.

Implementation Path

For organisations deploying private RAG:

Phase 1: Use Case Definition (Week 1)

What questions do you want AI to answer?
What documents contain the answers?
Who should have access to what?
What are the success criteria?

Phase 2: Document Audit (Week 2)

What documents exist?
Where are they stored?
What format are they in?
How current are they?
What needs cleanup before ingestion?

Phase 3: Pilot Deployment (Weeks 3-5)

Set up infrastructure (server, LLM, vector DB)
Ingest a representative subset of documents
Test with real users on real queries
Tune chunking, retrieval, prompting

Phase 4: Production Rollout (Weeks 6-8)

Full document ingestion
User access controls
Monitoring and logging
Training for staff

Phase 5: Ongoing Operation

New documents ingested as they're created
Periodic re-tuning based on usage patterns
Model upgrades as better open-source models release
Compliance reviews

What This Looks Like in Practice

A 50-person law firm we worked with deployed private RAG over their precedent library, policy documents, and prior matter files. After two months:

Average query time: 3 seconds
Useful answer rate: 89% (vs ~40% for the firm's previous SharePoint search)
Document corpus: 12,000 documents, 180,000 chunks
Hardware: Single Mac Mini M4 Pro 48GB
Total deployment cost: $42,000 all-in (hardware + setup + first 6 months support)

The lawyers report that finding precedents now takes seconds instead of hours. Junior staff can ask questions about firm policies and get accurate, cited answers without interrupting senior colleagues.

The AIRGAP LLM Perspective

AIRGAP LLM specialises in private RAG deployments for Australian organisations. Every deployment includes:

Document ingestion pipeline customised to your file types and structures
Embedding model selection based on your content (English-only, mixed-language, technical jargon)
ChromaDB or Qdrant vector database based on scale
LLM selection (typically Llama 3, Gemma 4, or Mistral)
Retrieval tuning based on real user queries
Comprehensive citation requirements so every answer is traceable

For organisations evaluating private RAG, contact AIRGAP LLM for a tailored deployment assessment.

Frequently Asked Questions

What is RAG (Retrieval Augmented Generation)?

RAG is an architecture that combines a language model with a document retrieval system. Instead of asking the model to answer from its own training, the system first retrieves the most relevant passages from your documents, then feeds those passages to the model as context. The model's answer is grounded in your specific documents and can cite them. This is how private AI systems answer questions about your firm's contracts, policies, or records without those documents being part of the model's training data.

Why does RAG matter for private AI deployment?

RAG is what makes a private AI system useful in a business context. Without RAG, a private LLM only knows what's in its training data — general knowledge from the internet, no awareness of your firm's documents. With RAG, the same model can answer detailed questions about your contracts, precedents, policies, or records by retrieving and citing actual documents. RAG turns a generic AI model into an AI that knows your business.

What components make up a RAG system?

A complete RAG system has four main components: (1) a document ingestion pipeline that processes your documents into searchable chunks; (2) an embedding model that converts text chunks into numerical vectors; (3) a vector database (such as ChromaDB) that stores and searches those vectors; (4) the LLM itself that generates the final answer using retrieved context. In private deployments, all four run on your infrastructure with no external dependencies.

Does RAG hallucinate like regular LLMs?

RAG dramatically reduces hallucination but does not eliminate it entirely. Because the LLM is given specific document passages as context, it tends to answer based on that retrieved content rather than inventing information. However, the LLM can still make mistakes if the retrieved content is incomplete or misleading, or if the question is ambiguous. Well-designed RAG systems include citation requirements (every claim must reference a source), which make any remaining hallucinations easier to spot.

How long does it take to set up RAG for an organisation?

For most organisations, RAG setup is part of the overall private AI deployment, taking 4-8 weeks total. The RAG-specific work — document ingestion, embedding generation, vector database setup, retrieval tuning — typically takes 2-4 weeks. Document preparation (cleaning up file structures, identifying what to ingest, handling permissions) often takes longer than the technical work. Once running, the RAG index updates incrementally as new documents are added.