Production RAG: lessons from shipping it
MK Tech Monk · 2026-04-02 · 11 min read
Retrieval-augmented generation is the most over-demoed, under-engineered pattern in applied AI. A convincing prototype takes an afternoon: embed some documents, stuff the top matches into a prompt, get fluent answers. Then it meets real users and real content, and the gap between demo and production turns out to be most of the work. Here is what that gap actually contains.
Lesson 1: Retrieval is the product, generation is the garnish
Teams obsess over which model writes the answer and barely think about what gets retrieved. This is backwards. If the right chunk is not in the retrieved set, no model — however capable — can produce a correct answer. It can only produce a fluent wrong one, which is worse.
Spend your effort on retrieval quality: chunking strategy, embedding choice, re-ranking, and metadata filtering. In our projects, the biggest quality jumps came from better retrieval, not better generation. The model is rarely the bottleneck.
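The shape that has served us best is retrieve wide, then re-rank narrow: a cheap vector search casts a broad net, and a more expensive scorer re-orders the shortlist. A minimal sketch, assuming a numpy embedding matrix and a cross-encoder callable as stand-ins for whatever your stack provides:

```python
import numpy as np

def retrieve(query_vec, chunk_vecs, chunks, k=50):
    """Stage 1: cheap vector recall. Cast a wide net; precision comes later."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(-sims)[:k]
    return [(chunks[i], float(sims[i])) for i in top]

def rerank(query, candidates, cross_encoder, k=5):
    """Stage 2: an expensive scorer re-orders the shortlist.
    cross_encoder is an assumed callable: (query, text) -> relevance score."""
    scored = [(text, cross_encoder(query, text)) for text, _ in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```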
Lesson 2: Your corpus is dirtier and staler than you think
Demo corpora are clean. Real corpora are twelve years of overlapping, contradictory, half-deprecated content. A RAG system that faithfully retrieves a document that was correct in 2019 will confidently tell users something that is now false, and they will believe it because it has a citation.
Content curation is not a preprocessing step you do once; it is an ongoing discipline. Classify sources by authority and recency. Exclude superseded material explicitly rather than hoping the embeddings sort it out. Tag by product area and version so retrieval can filter, not just rank. The unglamorous content pipeline is often the highest-leverage part of the system.
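To make "filter, not just rank" concrete: give every chunk explicit curation metadata and apply hard filters before similarity ever enters the picture. A sketch with an illustrative schema; the field names and values are assumptions, not a prescription:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    product: str
    version: tuple      # e.g. (2, 1); tuples compare correctly, unlike "10.0" < "2.0" as strings
    authority: str      # e.g. "official-docs", "wiki", "forum"
    superseded: bool    # set deliberately by curation, not inferred from embeddings

def eligible(chunk, query_product, min_version):
    """Hard filters run before any similarity ranking: superseded or
    off-product content never reaches the model at all."""
    return (
        not chunk.superseded
        and chunk.product == query_product
        and chunk.version >= min_version
    )

corpus = [
    Chunk("Rotate API keys via the console...", "billing-api", (2, 1), "official-docs", False),
    Chunk("Old key rotation steps...", "billing-api", (1, 4), "official-docs", True),
]
candidates = [c for c in corpus if eligible(c, "billing-api", (2, 0))]  # keeps only the first
```

The point of the superseded flag is that a human or a pipeline rule set it deliberately, which is exactly the curation work the embeddings cannot do for you.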
Lesson 3: Chunking is a real decision
Fixed-size chunking is the default and it is usually wrong. Splitting every 500 tokens cuts tables in half, separates a heading from its content, and orphans the sentence that contained the actual answer. Chunk along semantic boundaries — sections, list items, logical units — and preserve enough surrounding context that a retrieved chunk is interpretable on its own. Test this directly: look at what actually gets retrieved for real questions. You will be unpleasantly surprised, and then you will fix it.
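As an illustration, here is a minimal heading-aware chunker for markdown-style documents. The regex and the context-prefix format are one plausible choice, not the answer:

```python
import re

def chunk_by_section(doc_text, doc_title, max_chars=2000):
    """Split on markdown headings so each chunk is one logical unit,
    and prefix title + heading so the chunk is interpretable on its own."""
    parts = re.split(r"(?m)^(#{1,3} .+)$", doc_text)
    # re.split with a capturing group interleaves: [preamble, heading, body, ...]
    chunks = []
    for heading, body in zip(parts[1::2], parts[2::2]):
        context = f"{doc_title} > {heading.lstrip('# ')}"
        text = body.strip()
        # Crude length fallback for oversized sections; a real pipeline would
        # fall back to paragraph boundaries rather than raw character offsets.
        for i in range(0, len(text), max_chars):
            chunks.append(f"[{context}]\n{text[i:i + max_chars]}")
    return chunks
```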
Lesson 4: "I don't know" is a feature, not a failure
The instinct is to make the system always answer. Resist it. A RAG system that answers everything will, by construction, answer confidently when retrieval found nothing relevant — that is a hallucination with a citation attached, the most dangerous output of all.
Set a retrieval confidence threshold. If nothing clears it, the honest response is "I don't have a reliable answer for that." Counter-intuitively, this is the behaviour that earns user trust. On one project, the abstention path was the single change that took the assistant from "switched off after a bad incident" to "default tool for the support team."
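The abstention path itself is a few lines; the hard part is choosing the threshold. A sketch, assuming a retriever that returns scored hits; the 0.35 is illustrative and must be tuned on your own evaluation set:

```python
def answer(query, retriever, generate, min_score=0.35):
    """Abstain when retrieval is weak instead of letting the model improvise.
    retriever and generate are assumed callables; hits carry a .score field."""
    hits = retriever(query)
    confident = [h for h in hits if h.score >= min_score]
    if not confident:
        # The honest response; log these so the eval set grows from real gaps.
        return "I don't have a reliable answer for that."
    return generate(query, context=confident)
```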
Lesson 5: You cannot improve what you do not measure
The most common reason RAG systems stagnate is that nobody can tell whether a change helped. Build an evaluation set early: real questions, known-good answers, and a way to score retrieval (was the right source retrieved?) separately from generation (was the answer correct and grounded?). Separating those two scores tells you where the problem is — retrieval or synthesis — which is the difference between fixing it in an hour and guessing for a week.
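A minimal harness can be this small. The case schema and the judge callable are assumptions standing in for however you check answers, whether exact match, rubric, or an LLM judge:

```python
def evaluate(eval_set, retriever, generate, judge):
    """Score retrieval and generation separately so a failure points at the
    right component. Each case: {"question", "gold_source_id", "gold_answer"}."""
    retrieval_hits = 0
    answer_correct = 0
    for case in eval_set:
        retrieved = retriever(case["question"])
        # Retrieval score: did the known-good source make it into the set?
        retrieval_hits += any(c.source_id == case["gold_source_id"] for c in retrieved)
        # Generation score: judged against the known-good answer.
        answer = generate(case["question"], context=retrieved)
        answer_correct += bool(judge(answer, case["gold_answer"]))
    n = len(eval_set)
    return {"retrieval_recall": retrieval_hits / n, "answer_accuracy": answer_correct / n}
```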
Run the harness on every change. Prompt tweaks that "obviously" help routinely regress something else; you only find out if you measure.
Lesson 6: Citations change the trust equation
Answers without provenance are unverifiable, and unverifiable answers are untrusted by exactly the expert users you most want to serve. Inline citations that link to the precise source let a sceptical user verify in one click. That verifiability is often what makes the system adoptable in professional contexts at all — it converts the AI from an oracle you must believe into an accelerator you can check.
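One straightforward way to wire this in: number the sources in the prompt, instruct the model to cite by number, and keep a number-to-URL map for the UI. A sketch, with assumed text and url fields on retrieved chunks:

```python
def build_cited_prompt(question, chunks):
    """Number each retrieved chunk and instruct the model to cite by number,
    so every claim in the answer traces back to a linkable source."""
    sources = "\n\n".join(
        f"[{i}] {chunk.text}" for i, chunk in enumerate(chunks, start=1)
    )
    prompt = (
        "Answer using only the sources below, citing each claim as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    # Map citation numbers back to URLs so the UI can render one-click links.
    citation_links = {i: chunk.url for i, chunk in enumerate(chunks, start=1)}
    return prompt, citation_links
```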
Lesson 7: Observability or it didn't happen
In production you will get the report "the assistant gave a bad answer." Without logging the retrieved context, the prompt, and the response for that interaction, you cannot diagnose whether retrieval missed, the content was wrong, or generation drifted. Log all three from day one. RAG without observability is not debuggable, and undebuggable systems do not improve — they get switched off.
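The minimum viable version is one JSON line per interaction capturing all three diagnosis points. A sketch; the field names and the open file handle are assumptions:

```python
import json
import time
import uuid

def log_interaction(logfile, query, retrieved, prompt, response):
    """One JSON line per answer: what was retrieved, what the model
    was shown, and what it said. Greppable when the bad-answer report arrives."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [{"source": c.source_id, "score": c.score} for c in retrieved],
        "prompt": prompt,
        "response": response,
    }
    logfile.write(json.dumps(record) + "\n")
```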
The honest summary
Production RAG is a retrieval-quality problem, a content-curation problem, and an evaluation problem wearing a language-model costume. The model matters least of the four. Teams that internalise that ordering ship RAG systems people trust. Teams that treat it as "embed and prompt" ship the demo, watch it fail in contact with reality, and conclude the technology is not ready — when it was the engineering that was not done.