Creative Codes
AI/ML · April 15, 2026 · 9 min read

RAG Pipelines in Production: 5 Lessons from Real Deployments

Everyone's building RAG. Most of it breaks in production. Here's what we learned deploying retrieval-augmented generation for enterprise clients.

By Muhammad Hassan

Why RAG breaks in production

Retrieval-augmented generation is the right architecture for most enterprise AI applications. It grounds LLM responses in real data, allows you to update the knowledge base without retraining, and makes answers auditable. In a notebook, it works beautifully.

In production, most RAG systems fail quietly. The queries that work in demos fail on real user inputs. Retrieval returns irrelevant chunks. The LLM hallucinates details that weren't in the retrieved context. Performance degrades as the document corpus grows.

After building RAG pipelines for enterprise clients, including a 10,000+ document financial services knowledge base, here are the five lessons that made the difference between a demo and a production system.

Lesson 1: Chunking strategy is everything

The single biggest determinant of RAG quality is how you split documents into chunks. Most tutorials tell you to chunk on token count (e.g., 512 tokens with a 50-token overlap). This works for some documents. For most enterprise content, it's the wrong default.

Consider a compliance policy document: it has a header section, numbered clauses, sub-clauses, tables, and footnotes. Splitting it naively at token boundaries will shatter clause boundaries, put related content in different chunks, and split the table mid-row. The retrieval system then returns fragments that individually make no sense.

Better approaches:

Document-aware chunking: use the document's own structure to define chunk boundaries. Headers, sections, and paragraphs are natural breaks. For PDFs with known structure (financial reports, legal documents), parse the structure explicitly.
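A minimal sketch of document-aware chunking, assuming plain-text input where headings are either Markdown-style (`# Title`) or numbered clauses (`2.1 Scope`). The `HEADING` pattern and the `split_by_structure` name are illustrative assumptions, not the parser used in the deployments described here; a real PDF pipeline would rely on a layout-aware extractor instead of a regex.

```python
import re

# Assumed heading pattern: Markdown headers ("# Title") or numbered clauses ("2.1 Scope").
# Caveat: a body line that starts with a number ("30 days notice...") would also match.
HEADING = re.compile(r"^(#{1,6}\s+\S.*|\d+(\.\d+)*\.?\s+\S.*)$")

def split_by_structure(text: str) -> list[dict]:
    """Group lines into chunks, starting a new chunk at every heading."""
    chunks: list[dict] = []
    current = {"heading": None, "lines": []}
    for raw in text.splitlines():
        line = raw.strip()
        if HEADING.match(line):
            # Close the previous chunk before opening a new one.
            if current["heading"] is not None or current["lines"]:
                chunks.append(current)
            current = {"heading": line, "lines": []}
        elif line:
            current["lines"].append(line)
    if current["heading"] is not None or current["lines"]:
        chunks.append(current)
    return [{"heading": c["heading"], "body": " ".join(c["lines"])} for c in chunks]
```

Keeping the heading alongside the body means each chunk carries its own clause label, which is useful later when the answer has to cite where it came from.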

Semantic chunking: split where semantic similarity drops below a threshold, using embeddings to detect topic shifts. More expensive at index time, meaningfully better at retrieval.
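To make the idea concrete, here is a small sketch of semantic chunking. It assumes the text is already split into sentences, and it uses a bag-of-words vector (`toy_embed`) purely as a stand-in for a real embedding model; the function names and the `threshold` value are hypothetical and would need tuning on real data.

```python
import math
import re
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2,
                    embed=toy_embed) -> list[list[str]]:
    """Start a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine(prev, vec) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
        prev = vec
    return chunks
```

Swapping `toy_embed` for a real model only requires passing a different `embed` callable and a matching similarity function; the boundary logic stays the same.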

Chunk size calibration: the right chunk size depends on your queries. Short factual queries (what is the penalty for X?) need small precise chunks. Summarization queries (explain the approach to Y) need larger chunks. For most production systems, we use a hierarchical approach: both sentence-level and paragraph-level chunks indexed together.
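One way to picture the hierarchical approach: emit a chunk per paragraph and a chunk per sentence, with each sentence pointing back at its parent paragraph. The sketch below is an assumption about how that could look, not the production schema; `IndexedChunk`, the ID format, and the naive sentence splitter are all illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IndexedChunk:
    chunk_id: str
    level: str                      # "paragraph" or "sentence"
    text: str
    parent_id: Optional[str] = None  # set on sentence chunks only

def hierarchical_chunks(doc_id: str, text: str) -> list[IndexedChunk]:
    """Emit paragraph-level and sentence-level chunks for the same document."""
    chunks: list[IndexedChunk] = []
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        parent_id = f"{doc_id}:p{p_idx}"
        chunks.append(IndexedChunk(parent_id, "paragraph", paragraph))
        # Naive split on terminal punctuation; abbreviations like "e.g." will trip it.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for s_idx, sentence in enumerate(sentences):
            chunks.append(
                IndexedChunk(f"{parent_id}:s{s_idx}", "sentence", sentence, parent_id)
            )
    return chunks
```

With the parent link in place, a precise sentence-level hit can be expanded to its surrounding paragraph before it is handed to the LLM.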

Lesson 2: Embed once, retrieve smart

People spend enormous energy debating embedding model choice. The model matters, but less than people think. What matters significantly more is how you search.

Pure vector similarity has a known failure mode: it retrieves semantically similar chunks, not necessarily relevant chunks. The embedding space conflates related but incorrect answers with correct ones.

Hybrid search (combining vector similarity with BM25 keyword matching) consistently outperforms either approach alone. The vector search finds semantically relevant content; keyword search anchors on exact terms and named entities that vector similarity can miss.

```python
async def retrieve_context(query: str, filters: dict) -> list[Chunk]:
    # Hybrid search: vector similarity + keyword matching
    vector_results = await vector_db.search(
        embedding=embed(query),
        top_k=20,
        filters=filters
    )
    keyword_results = await full_text_search(query, top_k=10)

    # Re-rank combined results
    combined = deduplicate(vector_results + keyword_results)
    reranked = cross_encoder.rerank(query, combined, top_k=5)

    return reranked
```

The cross-encoder reranking step is often skipped in tutorials. It's not optional in production. A bi-encoder retrieves candidates; a cross-encoder rescores them by encoding the query and each candidate jointly. The precision improvement is significant, especially on ambiguous queries.
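Before the reranker sees them, the vector and keyword result lists have to be merged into one candidate set. The pipeline described in this post concatenates and deduplicates; a common alternative for that merge step is reciprocal rank fusion (RRF), sketched below as a swapped-in technique rather than what the post's code does. It assumes each result is identified by a chunk ID string, and `k=60` is a commonly used default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk earns more credit the higher it appears in each list.
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))
```

RRF only needs ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. Chunks that appear in both lists are deduplicated as a side effect.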

Metadata filtering: before scoring relevance, filter by document type, date, author, or department. This constrains the search space and prevents an outdated policy document from outranking a current one.
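As an illustration of metadata filtering, the sketch below drops candidates before any relevance scoring happens. In practice the filter is usually pushed down into the vector database query; this in-memory version only shows the logic. The `Candidate` fields, including the `superseded` flag, are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Candidate:
    chunk_id: str
    doc_type: str
    department: str
    effective_from: date
    superseded: bool = False  # assumed flag marking retired documents

def prefilter(candidates: list[Candidate],
              doc_type: Optional[str] = None,
              department: Optional[str] = None,
              as_of: Optional[date] = None) -> list[Candidate]:
    """Keep only candidates that pass every metadata constraint that was supplied."""
    kept = []
    for c in candidates:
        if c.superseded:
            continue  # never let a retired document compete on relevance
        if doc_type is not None and c.doc_type != doc_type:
            continue
        if department is not None and c.department != department:
            continue
        if as_of is not None and c.effective_from > as_of:
            continue  # not yet in force on the date being asked about
        kept.append(c)
    return kept
```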

Lesson 3: Ground everything, cite sources

The most common failure in deployed RAG systems isn't technical; it's that the LLM generates plausible-sounding answers that aren't grounded in the retrieved context. This is catastrophic in enterprise settings where users may act on the answer.

Our approach:

Constrained generation: the prompt explicitly instructs the model to answer only from the provided context. "If the answer is not contained in the following passages, say 'I don't have information on this' and do not attempt to answer."

Mandatory citations: every factual claim in the response must cite a source passage. The system formats responses as answer + footnoted citations. Users can verify any claim by checking the source.

Confidence scoring: if the retriever returns nothing above a relevance threshold, the system declines to answer rather than hallucinating. A "don't know" response is far less damaging than a confident wrong answer.
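The three controls above can be combined in the step that assembles the prompt. The following is a sketch under stated assumptions: passages arrive as `(text, relevance_score)` pairs, `min_score=0.5` is a placeholder threshold, and the instruction wording is adapted from the prompt quoted above. The function name is hypothetical.

```python
from typing import Optional

REFUSAL = "I don't have information on this"

def build_grounded_prompt(question: str,
                          passages: list[tuple[str, float]],
                          min_score: float = 0.5) -> Optional[str]:
    """Return a citation-enforcing prompt, or None if nothing clears the threshold."""
    usable = [text for text, score in passages if score >= min_score]
    if not usable:
        # Caller should return REFUSAL directly instead of calling the LLM.
        return None
    # Number the passages so the model can cite them as [1], [2], ...
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(usable, start=1))
    return (
        "Answer using only the passages below. Cite the passage number, "
        "e.g. [1], after every factual claim. If the answer is not contained "
        f"in the passages, say '{REFUSAL}' and do not attempt to answer.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}"
    )
```

Returning `None` when retrieval comes back empty keeps the decline-to-answer decision in code, where it is deterministic, rather than relying on the model to follow the instruction.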

Warning

The most dangerous RAG failure isn't a wrong answer; it's a confident-sounding wrong answer with no citation. Always force source attribution and teach users to check it.
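Forcing attribution is easier to enforce if the response is checked after generation. Below is a rough sketch of such a check, assuming citations appear as bracketed passage numbers like `[1]` placed before the sentence's closing punctuation. It only verifies that citations exist and point at real passages; it does not verify that the cited passage actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")

def check_citations(answer: str, num_passages: int) -> dict:
    """Flag sentences without a citation and citations that point at no passage."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    uncited = [s for s in sentences if not CITATION.search(s)]
    cited_ids = {int(n) for n in CITATION.findall(answer)}
    invalid = sorted(i for i in cited_ids if not 1 <= i <= num_passages)
    return {
        "uncited_sentences": uncited,
        "invalid_citations": invalid,
        "ok": not uncited and not invalid,
    }
```

A response that fails this check can be regenerated, or replaced with a refusal, before the user ever sees it.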

Lesson 4: Monitor retrieval quality, not just generation

Most teams instrument their RAG systems by monitoring LLM outputs (latency, costs, user ratings). They miss the more important layer: retrieval quality.

The retriever is the bottleneck. If retrieval returns low-relevance chunks, the LLM can't save the answer. Garbage in, garbage out. And retrieval quality degrades silently as the document corpus evolves.

Metrics worth tracking:

  • Retrieval precision: of the top-k chunks returned, what fraction are actually relevant? Requires sampling + human labeling, but worth doing periodically.
  • Mean reciprocal rank (MRR): how high does the first relevant chunk rank, averaged over queries? Degrades as the vocabulary of documents and queries drifts apart.
  • Groundedness rate: what percentage of LLM responses are fully grounded in retrieved context vs. containing uncited claims? Automatable with a scoring model.
  • Null retrieval rate: how often does the retriever return nothing above threshold? A rising rate means the corpus doesn't cover queries users are actually asking.

These metrics let you catch quality regressions before users complain about them.
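The retrieval-side metrics are simple to compute once you have a labeled sample. The sketch below assumes each evaluated query comes with the list of retrieved chunk IDs and a human-labeled set of relevant IDs; the function names and the input shapes are assumptions made for illustration.

```python
from typing import Optional

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k returned chunks that were labeled relevant."""
    top = retrieved[:k]
    # Divide by the number actually returned, which may be fewer than k.
    return sum(1 for c in top if c in relevant) / len(top) if top else 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant chunk; 0 for queries with no hit."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0

def null_retrieval_rate(best_scores: list[Optional[float]], threshold: float) -> float:
    """Share of queries whose best score is missing or below the threshold."""
    if not best_scores:
        return 0.0
    nulls = sum(1 for s in best_scores if s is None or s < threshold)
    return nulls / len(best_scores)
```

Groundedness rate is left out here because it needs a scoring model; the three functions above only need logs and labels.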

Lesson 5: Automated reindexing isn't optional

A RAG system deployed against a static corpus is a solved problem. A RAG system deployed against a living corpus, where documents are added, updated, and retired, is a maintenance challenge that most teams underinvest in.

When a source document changes and its embeddings aren't updated, the retriever returns stale content. The LLM then generates answers based on information that no longer reflects reality. In a compliance or financial context, this is a material risk.

We build change detection into every RAG pipeline from day one.

Detect doc changes → Re-chunk & embed → Update vector index → Validate retrieval

File watchers for local/SharePoint sources trigger reindexing on modification. Webhook triggers from document management systems (Notion, Confluence, Google Drive) push updates in near-real-time. Scheduled full reindexes run weekly as a catch-all for sources where incremental detection is unreliable.
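For sources without reliable change events, one option is to compare content hashes during the scheduled run. This is a sketch of that idea, assuming the index stores a SHA-256 hash per document ID; `diff_corpus` and the dictionary shapes are hypothetical, and the post does not specify how its pipelines detect changes internally.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def diff_corpus(indexed: dict[str, str],
                current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (doc_id -> hash) with current content (doc_id -> text)."""
    current = {doc_id: content_hash(text) for doc_id, text in current_docs.items()}
    return {
        # New documents need chunking and embedding.
        "added": sorted(d for d in current if d not in indexed),
        # Changed documents need their old chunks replaced.
        "changed": sorted(d for d in current if d in indexed and indexed[d] != current[d]),
        # Retired documents need their chunks deleted from the index.
        "removed": sorted(d for d in indexed if d not in current),
    }
```

Only the documents in `added` and `changed` need re-embedding, which keeps the weekly catch-all run far cheaper than a blind full rebuild.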

The validation step after reindexing runs a suite of golden queries against the updated index to confirm retrieval quality hasn't regressed. If it has, the update is flagged for review before going live.
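A golden-query check can be as small as the sketch below. It assumes each golden query maps to a set of chunk IDs that should appear in the top results, and that `retrieve` is whatever callable queries the freshly built index; `k` and `min_hit_rate` are placeholder values, not the thresholds used in the deployments described here.

```python
from typing import Callable

def validate_index(golden: dict[str, set[str]],
                   retrieve: Callable[[str], list[str]],
                   k: int = 5,
                   min_hit_rate: float = 0.9) -> tuple[bool, list[str]]:
    """Return (passed, failing_queries) for a suite of golden queries."""
    failures = []
    for query, expected in golden.items():
        top = set(retrieve(query)[:k])
        # A query passes if at least one expected chunk is in the top k.
        if not (expected & top):
            failures.append(query)
    hit_rate = 1 - len(failures) / len(golden) if golden else 1.0
    return hit_rate >= min_hit_rate, failures
```

If `passed` is false, the list of failing queries gives reviewers a concrete starting point before the updated index goes live.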


The gap between a working RAG demo and a production system is mostly engineering discipline, not technical novelty. Chunking strategy, hybrid retrieval, citation enforcement, quality monitoring, and automated reindexing aren't exciting, but they're what separates a system your users can trust from one they learn to distrust after the first bad answer.

The full case study for our enterprise deployment is here.
