RAG, Explained by What Actually Breaks in Production

By Novacademy

Everyone's first RAG system works on the first try. You embed some docs, drop them in a vector database, retrieve the top few chunks for each question, stuff them into the prompt, and the model answers like it read your whole knowledge base. It feels like magic.

Then you point it at real documents and real users, and it starts confidently citing the wrong paragraph, missing answers that are literally in the corpus, and burning tokens to get dumber.

I've seen this happen enough times — to myself and to people I teach — that I now think the only useful way to explain RAG is through its failure modes. So here's the naive version everyone ships first, the three things that break it, and the one mental shift that fixes most of it.

The naive RAG everyone builds first

The starter recipe looks like this:

# 1. Index (once)
chunks = split_every_500_chars(documents)  # naive fixed-size split
embeddings = embed(chunks)
vector_db.add(chunks, embeddings)

# 2. Answer (per question)
query_vec = embed(user_question)
top_k = vector_db.search(query_vec, k=3)  # assumed to return chunk texts
context = "\n\n".join(top_k)  # join the texts instead of interpolating a list
prompt = f"Answer using this context:\n{context}\n\nQuestion: {user_question}"
answer = llm(prompt)

Four moving parts: chunk, embed, retrieve, generate. The demo works because your test questions happen to be phrased like your documents and the answer happens to live in one clean paragraph. Production is where those two coincidences stop holding.

Break #1: Chunking — you sliced the answer in half

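Fixed-size chunking (every N characters or tokens) is the default, and it's the first thing to bite you. Splitting blindly does three bad things: it cuts sentences and ideas off at arbitrary boundaries, it separates a statement from the qualifier that changes its meaning, and it breaks tables and code into fragments that mean nothing on their own.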

A concrete example: a policy doc says "Enterprise customers are exempt from this limit" two paragraphs after stating the limit. Fixed chunking separates them. Your bot now tells an enterprise customer they're subject to a limit they're explicitly exempt from. That's not a hallucination — the model faithfully answered from the broken context you handed it.

The fix is to chunk along meaning, not character count: split on structural boundaries (headings, sections, list items), add a sentence or two of overlap so boundary context survives, and keep tables and code intact as single units. Chunking is retrieval's foundation — get it wrong and nothing downstream can recover.
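Here's a minimal sketch of that idea, not a production chunker. It assumes blank lines mark the structural boundaries (a real pipeline would also look at headings, list items, tables, and code), and the name `chunk_by_structure` and its parameters are made up for illustration. It packs whole paragraphs into chunks and carries the last sentence of each closed chunk into the next one as overlap.

```python
import re


def chunk_by_structure(text: str, max_chars: int = 500,
                       overlap_sentences: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    A paragraph is never cut in half, so a chunk can exceed max_chars
    when a single paragraph (plus overlap) is longer than the limit.
    """
    # Structural boundary (assumption): blank lines between paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = "\n\n".join(current + [para])
        if current and len(candidate) > max_chars:
            # Close the current chunk at a paragraph boundary.
            chunks.append("\n\n".join(current))
            # Carry the last sentence(s) forward so boundary context survives.
            sentences = re.split(r"(?<=[.!?])\s+", current[-1])
            tail = sentences[-overlap_sentences:] if overlap_sentences > 0 else []
            current = [" ".join(tail), para] if tail else [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = (
    "Free accounts are limited to 100 requests per day.\n\n"
    "Enterprise customers are exempt from this limit."
)
chunks = chunk_by_structure(doc, max_chars=60)
```

With the overlap, the second chunk carries both the limit and the exemption, so a question about enterprise limits retrieves them together. Set `overlap_sentences=0` and the exemption ends up alone again, which is the failure described above.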

Break #2: Retrieval — similarity is not relevance

This is the big one, and the most counterintuitive. Vector search returns the chunks whose embeddings are closest to the query's embedding. People quietly assume "closest" means "most useful." It doesn't.

The fixes, in order of leverage:

  1. Hybrid search — run keyword/BM25 retrieval and vector retrieval, then merge. Keyword catches the exact terms embeddings smear; vectors catch the paraphrases keyword misses. This single change fixes a startling share of "it can't find the obvious answer" bugs.
  2. Reranking — over-retrieve (say, top 30) with cheap search, then use a cross-encoder reranker to score each candidate against the query directly and keep the best 3–5. Far more accurate than raw embedding distance, because the reranker actually reads the query and the chunk together instead of comparing two pre-computed vectors.
  3. Query rewriting — rephrase the user's question into something shaped like your documents (or generate a hypothetical answer and search with that) before retrieving.
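The first two fixes can be sketched in a few lines. The post only says "merge"; the sketch below uses reciprocal rank fusion (RRF) as one common way to do that, which is my choice and not the only option. The `score_fn` argument is a stand-in for a real cross-encoder, and the toy `word_overlap` scorer exists only so the example runs without a model. All names here are hypothetical.

```python
from collections import defaultdict
from typing import Callable


def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) for every id it contains,
    so ids ranked well by both retrievers float to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], keep: int = 3) -> list[str]:
    """Score every candidate against the query and keep the best few.

    score_fn stands in for a cross-encoder that reads query and chunk together.
    """
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def word_overlap(query: str, text: str) -> float:
    """Toy scorer (assumption): number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Hybrid: ids returned by a keyword search and by a vector search.
keyword_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
merged = rrf_merge([keyword_hits, vector_hits])

# Rerank: over-retrieve, then keep only the best-scoring chunks.
candidates = ["how to reset your password", "billing cycle", "password rules"]
best = rerank("reset password", candidates, word_overlap, keep=2)
```

RRF only looks at ranks, so you don't have to normalize BM25 scores against cosine distances before merging. Query rewriting isn't shown because it needs an LLM call.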

Break #3: Context bloat — more chunks, worse answers

The instinct after Break #2 is "just retrieve more — top 20 instead of top 3." That makes it worse, and this surprises everyone.

The fix is precision over recall at the generation step: retrieve broadly, then filter hard. Rerank down to the few genuinely relevant chunks, drop anything below a relevance threshold (returning "I don't have that" is a feature, not a failure), and use metadata to pre-filter the search space — restrict to the right product, the current doc version, the right customer tier — so you're not relying on the embedding to encode all of that.
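As a rough sketch of "retrieve broadly, then filter hard": the `Chunk` class, the field names, and the 0.5 threshold below are all assumptions for illustration, and the scores are assumed to come from a reranker where higher means more relevant. In a real system the metadata filter would usually run inside the vector store's search call; here it runs in memory to keep the example self-contained.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a reranker (assumed: higher is better)
    metadata: dict = field(default_factory=dict)


def prefilter(chunks: list[Chunk], required: dict) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required.items())
    ]


def select_context(chunks: list[Chunk], min_score: float = 0.5,
                   max_chunks: int = 3) -> list[Chunk]:
    """Drop chunks below the threshold, then keep the top few by score."""
    kept = [c for c in chunks if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:max_chunks]


def build_prompt(question: str, chunks: list[Chunk]) -> Optional[str]:
    """Return a prompt, or None when nothing relevant survived filtering."""
    if not chunks:
        return None  # caller should answer "I don't have that"
    context = "\n\n".join(c.text for c in chunks)
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"


retrieved = [
    Chunk("Enterprise customers are exempt from this limit.", 0.91, {"tier": "enterprise"}),
    Chunk("Free accounts are limited to 100 requests per day.", 0.88, {"tier": "free"}),
    Chunk("Enterprise onboarding checklist.", 0.20, {"tier": "enterprise"}),
]
in_scope = prefilter(retrieved, {"tier": "enterprise"})
context_chunks = select_context(in_scope)
prompt = build_prompt("Does the limit apply to us?", context_chunks)
```

Returning `None` instead of an empty-context prompt is deliberate: it forces the caller to handle the "nothing relevant" case explicitly rather than letting the model improvise.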

The mental model that fixes most of this

Here's the shift that made RAG click for me:

RAG is a search problem wearing an LLM costume.

The generation step is the easy, reliable part. Modern models write a great answer if you hand them the right context. Almost every RAG failure in production is a retrieval failure — the model was asked to answer from context that was incomplete, irrelevant, or buried. When your RAG bot is wrong, your instinct will be to blame the model or tweak the prompt. It's almost always the pipeline that fed it.

Which leads to the single most useful debugging habit: evaluate retrieval separately from generation. Before you judge an answer, look at what got retrieved. Ask: was the correct chunk even in the set? If no, it's a retrieval bug: fix chunking, search, or ranking, because no amount of prompt engineering will save you. If yes but the answer is still wrong, now it's a generation or context-bloat problem. Most teams skip this split, stare at bad answers, and tune the wrong half of the system for weeks.

Build a tiny eval set — 20–50 real questions paired with the chunk that should be retrieved — and measure retrieval hit rate as a number. The moment retrieval becomes something you measure instead of something you eyeball, RAG stops being magic and starts being engineering.
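Measuring this takes very little code. The sketch below assumes your retriever can be wrapped as a function that takes a question and `k` and returns chunk ids; `hit_rate`, `misses`, and the fake retriever are hypothetical names for illustration, and the two-question eval set is a toy, not real data.

```python
from typing import Callable

# (question, id of the chunk that should be retrieved)
EvalSet = list[tuple[str, str]]
Retriever = Callable[[str, int], list[str]]


def hit_rate(eval_set: EvalSet, retrieve: Retriever, k: int = 3) -> float:
    """Fraction of questions whose expected chunk appears in the top k."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(1 for question, expected in eval_set
               if expected in retrieve(question, k))
    return hits / len(eval_set)


def misses(eval_set: EvalSet, retrieve: Retriever, k: int = 3) -> list[str]:
    """Questions where the expected chunk was not retrieved: the retrieval bugs."""
    return [question for question, expected in eval_set
            if expected not in retrieve(question, k)]


# Fake retriever standing in for a real search call (assumption).
fake_index = {
    "Does the limit apply to enterprise?": ["chunk-12", "chunk-07"],
    "How do I reset my password?": ["chunk-03"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return fake_index.get(question, [])[:k]


eval_set = [
    ("Does the limit apply to enterprise?", "chunk-07"),
    ("How do I reset my password?", "chunk-44"),
]
rate = hit_rate(eval_set, fake_retrieve, k=3)
failed = misses(eval_set, fake_retrieve, k=3)
```

Anything in `failed` is a retrieval bug by definition, so those are the questions to debug in chunking, search, or ranking before touching the prompt.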


We go deep on this (chunking strategies, hybrid search, rerankers, and evaluating RAG like a real information-retrieval system) in our courses. New to all this? Start with "what RAG is" for the plain-English version.
