Learn Architecture

Source · Lewis et al. "Retrieval-Augmented Generation" (2020); LangChain/LlamaIndex and vector-database documentation (2024–2025)

Why this matters

Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020)

An LLM knows only what was in its training data, and when asked about anything else it tends to produce confident but fabricated answers. Retrieval-augmented generation (RAG) addresses this by fetching relevant, up-to-date text at query time and adding it to the prompt, so the model can answer from real sources instead of guessing. It is one of the most common techniques for grounding an agent in your data and reducing hallucination.

Memory is the sibling idea: how an agent remembers within a conversation and across sessions. Both are about putting the right context in front of the model at the right moment.

The concept

Vector database and embedding documentation; LangChain/LlamaIndex RAG guides (2024–2025)

The RAG pipeline has two phases.

Indexing (offline):

1. Chunk documents into passages small enough to embed and retrieve precisely.
2. Embed each chunk into a vector with an embedding model, so semantically similar text lands near it in vector space.
3. Store the vectors in a vector store (e.g. an index supporting nearest-neighbor search).
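The three indexing steps can be sketched in a few lines. This is a minimal illustration under loud assumptions: `toy_embed` is a hypothetical stand-in (a hashed bag of words) that captures word overlap only, not meaning, and the "vector store" is a plain Python list. A real system would call an embedding model and a vector database instead.

```python
import hashlib
import math

DIM = 64  # toy vector width; an assumption, real models use far more dimensions

def chunk(text: str, size: int = 40) -> list[str]:
    """Step 1: split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def toy_embed(text: str) -> list[float]:
    """Step 2 (stand-in): hash each word into a bucket, then L2-normalize."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(documents: list[str], size: int = 40) -> list[tuple[list[float], str]]:
    """Step 3: store (vector, chunk text) pairs; the list plays the vector store."""
    return [(toy_embed(c), c) for doc in documents for c in chunk(doc, size)]

index = build_index(["word " * 100], size=40)
print(len(index))  # 3 chunks: 40 + 40 + 20 words
```

Keeping the chunk text next to its vector matters: the vector is used only for search, while the text is what eventually goes into the prompt.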

Retrieval + generation (at query time):

4. Embed the query, find the nearest chunks (semantic search), optionally re-rank them.
5. Augment the prompt with the retrieved chunks and generate an answer grounded in them, ideally with citations.
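The query-time steps can be sketched the same way. Everything named here is an assumption for illustration: `toy_embed` is a sparse word-count vector rather than a real embedding model, `retrieve` does a brute-force cosine ranking instead of an approximate nearest-neighbor search, and `build_prompt` is one plausible prompt layout, not a prescribed one. Re-ranking and the actual model call are left out.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, index: list[tuple[Counter, str]], k: int = 2) -> list[str]:
    """Step 4: embed the query and return the k most similar chunk texts."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 5: number the retrieved chunks so the answer can cite them."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

chunks = [
    "Hold the reset button for ten seconds to restore factory settings.",
    "The warranty covers defects for two years.",
    "Charge the device with the supplied cable.",
]
index = [(toy_embed(c), c) for c in chunks]
question = "How do I reset the device to factory settings?"
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

The resulting prompt string is what would be sent to the model; numbering the sources is one simple way to make citations possible.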

Memory layers on top: short-term memory is the conversation transcript in context; long-term memory persists facts or summaries across sessions, often stored and retrieved with the same vector-search machinery. Chunking well matters: chunks that are too large dilute relevance and waste context, while chunks that are too small lose meaning.
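The two memory layers can be illustrated with a small class. This is a sketch under stated assumptions: `AgentMemory` and its method names are hypothetical, and `overlap` uses word-set (Jaccard) overlap as a crude stand-in for the vector similarity a real long-term store would use.

```python
def tokens(text: str) -> set[str]:
    """Lowercased word set for a piece of text."""
    return set(text.lower().split())

def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap; a crude stand-in for vector similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class AgentMemory:
    def __init__(self) -> None:
        self.transcript: list[tuple[str, str]] = []  # short-term: turns in context
        self.long_term: list[str] = []               # long-term: persisted facts

    def add_turn(self, role: str, text: str) -> None:
        """Short-term memory: append a turn to the current conversation."""
        self.transcript.append((role, text))

    def remember(self, fact: str) -> None:
        """Long-term memory: persist a fact or summary beyond this session."""
        self.long_term.append(fact)

    def recall(self, query: str, k: int = 1) -> list[str]:
        """Return the k stored facts most similar to the query."""
        q = tokens(query)
        ranked = sorted(self.long_term, key=lambda f: overlap(q, tokens(f)), reverse=True)
        return ranked[:k]

    def end_session(self) -> None:
        """The transcript is dropped; long-term facts survive."""
        self.transcript = []

memory = AgentMemory()
memory.add_turn("user", "Please use metric units from now on.")
memory.remember("User prefers metric units")
memory.remember("User bought the device in March")
memory.end_session()
print(memory.recall("metric or imperial units"))  # ['User prefers metric units']
```

The design point is the split in lifetimes: the transcript is cleared at the end of a session, while remembered facts remain available for retrieval in the next one.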

Worked example

LlamaIndex and vector-search retrieval documentation (2024–2025)

Support agent over a product manual.

- Offline: split the manual into ~500-token chunks, embed each, store in a vector index.
- Query: user asks "How do I reset the device to factory settings?"
- Retrieve: embed the question, pull the top 4 nearest chunks; a re-ranker promotes the actual reset procedure above a loosely related warranty passage.
- Generate: the prompt now contains those chunks; the model answers with the real steps and cites the manual section.

Without RAG the model might invent a plausible-but-wrong procedure. With it, the answer is grounded in the retrieved text — and if nothing relevant is retrieved, a well-built system says "I do not have that" instead of hallucinating.
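One way to get the "I do not have that" behavior is to gate generation on retrieval scores. The sketch below assumes retrieval has already produced `(score, text)` pairs; the `min_score` threshold of 0.3 is an arbitrary illustrative value that would need tuning, and `generate` is a hypothetical callable standing in for the model call.

```python
from typing import Callable

NO_ANSWER = "I do not have that information in the manual."

def select_context(
    scored_chunks: list[tuple[float, str]], min_score: float = 0.3, k: int = 4
) -> list[str]:
    """Keep at most k chunks whose similarity clears the threshold, best first."""
    relevant = [(s, t) for s, t in scored_chunks if s >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in relevant[:k]]

def answer_or_abstain(
    question: str,
    scored_chunks: list[tuple[float, str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Call the model only when something relevant was retrieved."""
    context = select_context(scored_chunks)
    if not context:
        return NO_ANSWER  # abstain instead of letting the model guess
    return generate(question, context)

# Usage with a fake generator in place of a real model call.
def fake_generate(question: str, context: list[str]) -> str:
    return f"Answer based on {len(context)} chunk(s)"

print(answer_or_abstain("How do I reset?", [(0.8, "reset steps"), (0.1, "warranty")], fake_generate))
print(answer_or_abstain("What is the resale value?", [(0.1, "warranty")], fake_generate))
```

A score threshold is only one possible gate; instructing the model to refuse when the sources do not contain the answer is a complementary approach.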

How it connects

Anthropic and OpenAI RAG guidance; Lewis et al. (2020); retrieval-evaluation literature

RAG is the primary defense against hallucination and the main way to give an agent private or fresh knowledge without retraining. In an agent (AIS-01), retrieval is usually just another tool the model calls (AIS-03) — the agent decides when to search. Long-term memory reuses the same embed-and-retrieve pipeline to recall past interactions.

Grounding is not automatic: retrieval quality is everything. Bad chunking, weak embeddings, or no re-ranking pull in irrelevant text and the model still errs. That is exactly why evaluation (AIS-04) matters — you measure retrieval and answer quality, not assume RAG made things better.
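Measuring retrieval can start with something as simple as recall@k over a small labeled set. The sketch below assumes each test query comes with the chunk IDs a human judged relevant; the IDs and the two-query data set are made up for illustration, and answer quality would need a separate check.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(got, want, k) for got, want in runs) / len(runs)

# Hypothetical results: ranked chunk IDs returned per query, plus labeled answers.
runs = [
    (["c3", "c1", "c9"], {"c1"}),  # relevant chunk is ranked second
    (["c4", "c5", "c6"], {"c2"}),  # relevant chunk was never retrieved
]
print(mean_recall_at_k(runs, k=2))  # 0.5
```

Tracking a number like this before and after changing chunk size, embedding model, or re-ranker shows whether the change actually helped retrieval.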

Common traps

Assuming RAG eliminates hallucination outright. It reduces it only when retrieval surfaces the right chunks; poor retrieval still yields confident wrong answers.
Chunking carelessly. Chunks that are too large dilute relevance and waste context; chunks that are too small lose meaning. Chunk size and overlap are real tuning knobs.
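Confusing keyword search with embedding search. Keyword search matches exact words; the RAG retrieval described here is semantic (vector nearest-neighbor) and matches meaning, so a query can retrieve a passage that shares none of its terms.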
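The chunk size and overlap knobs from the chunking trap can be made concrete with a sliding window. This is a minimal sketch, assuming chunks are measured in items of a pre-tokenized list (words here, tokens in practice); the function name and parameters are illustrative rather than any library's API.

```python
def chunk_with_overlap(items: list, size: int, overlap: int) -> list[list]:
    """Slide a window of `size` items, repeating `overlap` items between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(items), step):
        chunks.append(items[start:start + size])
        if start + size >= len(items):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks

words = "press and hold the reset button until the light blinks twice".split()
for piece in chunk_with_overlap(words, size=5, overlap=2):
    print(" ".join(piece))
```

Overlap trades storage and some redundancy for a lower chance that a sentence or procedure is cut in half at a chunk boundary.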

Key takeaways

RAG = index (chunk -> embed -> store in a vector store) then retrieve nearest chunks and augment the prompt, grounding answers in real sources.
Short-term memory is the in-context transcript; long-term memory persists facts across sessions, typically via the same embed-and-retrieve machinery.
Retrieval quality is everything: chunking, embeddings, and re-ranking determine whether RAG actually reduces hallucination — so you must evaluate it.