Building RAG Systems That Don't Hallucinate

Retrieval-augmented generation (RAG) has become the default pattern for putting a language model to work on your own data. The pitch is irresistible: instead of fine-tuning a model on your documents, you retrieve the relevant passages at query time and hand them to the model as context. A weekend project gets you a chatbot that answers questions about your knowledge base.

Then you ship it, and the cracks appear. The model cites a policy that was retired two years ago. It confidently invents an API endpoint that never existed. It answers a question the documents never addressed. The demo was real; the trust was not.

The retrieval half is where systems quietly fail

Most teams obsess over the model and ignore retrieval, which is backwards. If the right passage never makes it into the context window, no model on earth can answer correctly — it can only guess convincingly. Three failures dominate:

Chunking that severs meaning. Splitting documents into fixed 500-token windows cuts tables in half and orphans the sentence that gave a paragraph its meaning. Chunk on structure (headings, sections, list items), not on fixed token counts.
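Embeddings that miss the question. A user asks “how do I cancel,” your document says “termination of service.” A general-purpose embedding model can rank that passage too low, and it is weaker still on exact strings such as error codes and product names, which keyword search handles well. Hybrid search, combining dense embeddings with old-fashioned keyword (BM25) matching, recovers the cases each method misses alone.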
No re-ranking. Vector search returns the top 50 candidates fast but imprecisely. A cross-encoder re-ranker scores each candidate against the query directly and reorders them before the top handful reach the model. This single step is often the largest quality jump available.
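As a rough illustration of the first two fixes, here is a minimal sketch in Python. It assumes documents use Markdown-style `#` headings, and it merges the dense and keyword result lists with reciprocal rank fusion, one common way to combine rankings (the article does not prescribe a fusion method). The function names and the constant `k = 60` are illustrative choices, not part of any particular library.

```python
import re
from collections import defaultdict


def chunk_by_headings(text: str) -> list[str]:
    """Split a document into one chunk per heading-delimited section.

    Assumption: headings are Markdown-style lines starting with '#'.
    """
    chunks, current = [], []
    for line in text.splitlines():
        # A new heading closes the section collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k = 60 is a commonly used default, not a tuned value.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    doc = "# Billing\nPay monthly.\n# Cancel\nEmail us.\n| a | b |"
    print(chunk_by_headings(doc))

    dense_hits = ["d1", "d2", "d3"]   # hypothetical vector-search results
    bm25_hits = ["d3", "d1", "d4"]    # hypothetical keyword-search results
    print(reciprocal_rank_fusion([dense_hits, bm25_hits]))
```

The fused list would then be the candidate set handed to a re-ranker. Rank-based fusion sidesteps the problem that dense similarity scores and BM25 scores are on different scales.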

Make the model admit what it doesn’t know

Even with good retrieval, a model will fall back on its parametric memory when the context is thin. The fix is partly prompt design and partly architecture. Instruct the model to answer only from the provided context and to say “I don’t have information on that” when the context is silent, then actually test that it does, using questions your documents deliberately do not cover. Attach the source passages to every answer so a human can verify the claim in one click. An answer without a citation is a liability, not a feature.
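A minimal sketch of that prompt-plus-check pattern, in Python. The prompt wording, the `[n]` citation convention, and the passage dictionary keys (`source`, `text`) are assumptions made for illustration. The call to the model itself is left out, so the functions only build the prompt and validate whatever answer comes back.

```python
import re

REFUSAL = "I don't have information on that"


def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Assemble a prompt that restricts the model to numbered passages."""
    context = "\n\n".join(
        f"[{i}] (source: {p['source']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the context below. Cite passages as [n] after each claim.\n"
        f'If the context does not answer the question, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context or '(no passages retrieved)'}\n\n"
        f"Question: {question}\nAnswer:"
    )


def cited_ids(answer: str) -> set[int]:
    """Extract the passage numbers an answer cites, e.g. {1, 3} for '[1] ... [3]'."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}


def is_acceptable(answer: str, num_passages: int) -> bool:
    """Pass an answer only if it refuses, or cites passages that actually exist."""
    normalized = answer.replace("\u2019", "'").lower()  # tolerate curly apostrophes
    if REFUSAL.lower() in normalized:
        return True
    ids = cited_ids(answer)
    return bool(ids) and all(1 <= i <= num_passages for i in ids)


if __name__ == "__main__":
    passages = [{"source": "policy.md", "text": "Refunds are accepted within 30 days."}]
    print(build_grounded_prompt("What is the refund window?", passages))
    print(is_acceptable("Refunds are accepted within 30 days [1].", len(passages)))
    print(is_acceptable("Refunds are accepted within 90 days.", len(passages)))
```

This check only confirms that a citation is present and points at a real passage; whether the cited passage supports the claim is a separate faithfulness question.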

Evaluate retrieval and generation separately

When an answer is wrong, you need to know which half failed. Measure retrieval with metrics like recall@k (did the correct passage appear in the top k results at all?) and measure generation with faithfulness checks that ask whether each claim in the answer is supported by the retrieved text. Conflating them hides the actual bug and sends you tuning the wrong component for a week.
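The two measurements can be sketched separately in Python. This assumes a small labeled set in which each query has at least one known-relevant passage ID. The `is_supported` judge is a hypothetical pluggable function: in practice it would likely be an entailment model or an LLM grader, and the substring check in the demo is only a stand-in to keep the example runnable.

```python
from typing import Callable


def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Average, over queries, of the share of relevant passages found in the top k.

    With exactly one relevant passage per query this reduces to
    "did the correct passage appear in the top k at all?".
    Assumes every query has at least one relevant passage ID.
    """
    total = 0.0
    for ranked, gold in zip(retrieved, relevant):
        top_k = set(ranked[:k])
        total += len(top_k & gold) / len(gold)
    return total / len(relevant)


def faithfulness(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims that the judge finds supported by the context."""
    if not claims:
        return 1.0  # nothing asserted, nothing unsupported
    return sum(is_supported(claim, context) for claim in claims) / len(claims)


if __name__ == "__main__":
    retrieved = [["a", "b", "c"], ["x", "y", "z"]]  # ranked IDs per query
    relevant = [{"b"}, {"q"}]                       # gold IDs per query
    print(recall_at_k(retrieved, relevant, k=2))

    # Stand-in judge for the demo only: case-insensitive substring match.
    def naive_judge(claim: str, ctx: str) -> bool:
        return claim.lower() in ctx.lower()

    claims = ["refunds take 30 days", "shipping is free"]
    print(faithfulness(claims, "Refunds take 30 days.", naive_judge))
```

Tracked side by side, the two numbers point at the failing half: low recall@k means the passage never reached the model, while high recall@k with low faithfulness means the model strayed from what it was given.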

The unglamorous work is the work

A production RAG system is mostly plumbing: a pipeline that re-indexes when documents change, monitoring that flags when retrieval scores drop, a feedback loop that captures the questions users asked that returned nothing useful. None of it demos well. All of it is the difference between a clever prototype and a system your colleagues will actually rely on. Build evaluation before you build features, and the rest follows.

