What's the difference between RAG and fine-tuning?

RAG retrieves relevant documents at query time and injects them into the prompt. Fine-tuning bakes knowledge into model weights at training time. RAG is better for dynamic, updatable, or private knowledge bases. Fine-tuning is better for style and behavior changes. Most production systems use RAG.

How do you measure RAG accuracy?

We use a held-out evaluation set of query-answer pairs, automated with an LLM judge comparing retrieved context against expected answers. We track retrieval precision, answer faithfulness, and answer relevance as three separate metrics.

What vector database do you recommend?

Qdrant for production systems that need filtering and hybrid search. ChromaDB for local development or small-scale embedded use cases. Both are significantly better than FAISS for production workloads that need persistence and filtering.

AI/ML · April 15, 2026 · 9 min read

RAG Pipelines in Production: 5 Lessons from Real Deployments

Everyone's building RAG. Most of it breaks in production. Here's what we learned deploying retrieval-augmented generation for enterprise clients.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.

RAG pipelines fail in production for predictable reasons. Building one that achieves 94% accuracy on a 10,000-document corpus — and keeps that accuracy as the corpus grows — comes down to five engineering decisions most tutorials skip.

Why do RAG pipelines fail in production?

Retrieval-augmented generation is the right architecture for most enterprise AI applications. It grounds LLM responses in real data, allows you to update the knowledge base without retraining, and makes answers auditable. In a notebook, it works beautifully.

In production, most RAG systems fail quietly. The queries that work in demos fail on real user inputs. Retrieval returns irrelevant chunks. The LLM hallucinates details that weren't in the retrieved context. Performance degrades as the document corpus grows.

After building RAG pipelines for enterprise clients at Creative Codes, including a 10,000+ document financial services knowledge base, here are the five lessons that made the difference between a demo and a production system.

Lesson 1: Chunking strategy is everything

The single biggest determinant of RAG quality is how you split documents into chunks. Most tutorials tell you to chunk on token count (e.g., 512 tokens with 50-token overlap). This works for some documents. For most enterprise content, it's the wrong default.

Consider a compliance policy document: it has a header section, numbered clauses, sub-clauses, tables, and footnotes. Splitting it naively at token boundaries will shatter clause boundaries, put related content in different chunks, and split the table mid-row. The retrieval system then returns fragments that individually make no sense.

Better approaches:

Document-aware chunking: use the document's own structure to define chunk boundaries. Headers, sections, and paragraphs are natural breaks. For PDFs with known structure (financial reports, legal documents), parse the structure explicitly.
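A minimal sketch of document-aware chunking, assuming the source has already been converted to plain text with Markdown-style `#` headers (an assumption; PDFs would need a structure-aware parser first). The names `SectionChunk` and `chunk_by_headers` are hypothetical:

```python
import re
from dataclasses import dataclass

# Assumption: sections are introduced by Markdown-style "#" header lines.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class SectionChunk:
    heading: str  # nearest header above the text ("" for a preamble)
    text: str     # body text belonging to that header


def chunk_by_headers(document: str) -> list[SectionChunk]:
    """Split on header lines so each chunk is one complete section."""
    chunks: list[SectionChunk] = []
    heading, body = "", []
    for line in document.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the previous section before starting a new one.
            if any(part.strip() for part in body):
                chunks.append(SectionChunk(heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if any(part.strip() for part in body):
        chunks.append(SectionChunk(heading, "\n".join(body).strip()))
    return chunks
```

Keeping the heading alongside the body means a clause is never separated from the title that gives it meaning.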

Semantic chunking: split where semantic similarity drops below a threshold, using embeddings to detect topic shifts. More expensive at index time, meaningfully better at retrieval.
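One way to sketch semantic chunking, assuming sentences are already split and that `embed` is whatever embedding function you use (it is passed in as a parameter here; the 0.5 threshold is an illustrative value, not a recommendation):

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Vector],  # hypothetical: any text -> vector function
    threshold: float = 0.5,          # assumption: tune on your own corpus
) -> list[list[str]]:
    """Start a new chunk wherever adjacent sentences fall below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, curr) < threshold:
            chunks.append([sentence])    # topic shift: open a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend the current chunk
    return chunks
```

The extra index-time cost comes from embedding every sentence once before any chunk is stored.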

Chunk size calibration: the right chunk size depends on your queries. Short factual queries (what is the penalty for X?) need small precise chunks. Summarization queries (explain the approach to Y) need larger chunks. For most production systems, we use a hierarchical approach: both sentence-level and paragraph-level chunks indexed together.
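The hierarchical idea can be sketched roughly as follows. The `IndexedChunk` type, the id scheme, and the naive regex sentence splitter are all assumptions for illustration; a real pipeline would likely use a proper sentence tokenizer:

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexedChunk:
    chunk_id: str
    level: str                # "paragraph" or "sentence"
    text: str
    parent_id: Optional[str]  # sentence chunks point at their paragraph


def hierarchical_chunks(document: str) -> list[IndexedChunk]:
    """Emit one chunk per paragraph plus one chunk per sentence inside it."""
    chunks: list[IndexedChunk] = []
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        parent_id = f"p{p_idx}"
        chunks.append(IndexedChunk(parent_id, "paragraph", paragraph, None))
        # Naive split: sentence-ending punctuation followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for s_idx, sentence in enumerate(sentences):
            chunks.append(
                IndexedChunk(f"{parent_id}.s{s_idx}", "sentence", sentence, parent_id)
            )
    return chunks
```

With a `parent_id` on each sentence chunk, a precise sentence-level hit can be expanded to its surrounding paragraph when the query needs more context.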

Lesson 2: Embed once, retrieve smart

People spend enormous energy debating embedding model choice. The model matters, but less than people think. What matters significantly more is how you search.

Pure vector similarity has a known failure mode: it retrieves semantically similar chunks, not necessarily relevant chunks. The embedding space conflates related but incorrect answers with correct ones.

Hybrid search (combining vector similarity with BM25 keyword matching) consistently outperforms either approach alone. The vector search finds semantically relevant content; keyword search anchors on exact terms and named entities that vector similarity can miss.

```python
async def retrieve_context(query: str, filters: dict) -> list[Chunk]:
    # Hybrid search: vector similarity + keyword matching.
    # vector_db, embed, full_text_search, deduplicate and cross_encoder
    # are application-level helpers defined elsewhere.
    vector_results = await vector_db.search(
        embedding=embed(query),
        top_k=20,
        filters=filters,
    )
    keyword_results = await full_text_search(query, top_k=10)

    # Merge both candidate lists, then re-rank the combined set
    combined = deduplicate(vector_results + keyword_results)
    reranked = cross_encoder.rerank(query, combined, top_k=5)

    return reranked
```

The cross-encoder reranking step is often skipped in tutorials. It's not optional in production. A bi-encoder retrieves candidates; a cross-encoder (we use sentence-transformers) rescores them by jointly encoding the query and each candidate together. The precision improvement is significant, especially on ambiguous queries.
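To make the merge and rerank steps concrete, here is a minimal sketch under stated assumptions: chunks are plain dicts with `id` and `text` keys, and `score_pair` is a stand-in for the cross-encoder (with sentence-transformers, a `CrossEncoder` model scoring query/passage pairs would typically fill that role). None of these names come from our production code:

```python
from typing import Callable


def deduplicate(chunks: list[dict]) -> list[dict]:
    """Keep the first occurrence of each chunk id, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk["id"] not in seen:
            seen.add(chunk["id"])
            unique.append(chunk)
    return unique


def rerank(
    query: str,
    candidates: list[dict],
    score_pair: Callable[[str, str], float],  # hypothetical cross-encoder wrapper
    top_k: int = 5,
) -> list[dict]:
    """Score every (query, chunk text) pair jointly and keep the best top_k."""
    scored = [(score_pair(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

Because the scorer sees the query and the candidate together, it is slower than the bi-encoder lookup, which is why it only runs on the short candidate list.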

Metadata filtering: before scoring relevance, filter by document type, date, author, or department. This constrains the search space and prevents an outdated policy document from outranking a current one.
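A rough sketch of pre-filtering on metadata. In practice the vector database usually applies these filters itself; this in-memory version only illustrates the logic. The `DocMeta` fields, including the `superseded` flag, are assumptions:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DocMeta:
    doc_id: str
    doc_type: str
    department: str
    effective: date
    superseded: bool = False  # assumption: set when a newer version replaces the doc


def prefilter(
    docs: list[DocMeta],
    doc_type: Optional[str] = None,
    department: Optional[str] = None,
    as_of: Optional[date] = None,
) -> list[DocMeta]:
    """Return only the documents that are eligible for relevance scoring."""
    eligible = []
    for doc in docs:
        if doc.superseded:
            continue  # outdated versions never reach the ranker
        if doc_type and doc.doc_type != doc_type:
            continue
        if department and doc.department != department:
            continue
        if as_of and doc.effective > as_of:
            continue  # not yet in force on the requested date
        eligible.append(doc)
    return eligible
```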

Lesson 3: Ground everything, cite sources

The most common failure in deployed RAG systems isn't technical: the LLM generates plausible-sounding answers that aren't grounded in the retrieved context. This is catastrophic in enterprise settings where users may act on the answer.

Our approach:

Constrained generation: the prompt explicitly instructs the model to answer only from the provided context. "If the answer is not contained in the following passages, say 'I don't have information on this' and do not attempt to answer."
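As an illustration, a prompt builder along these lines wraps the retrieved passages with that instruction. The function name and the exact layout are assumptions, not a fixed template:

```python
REFUSAL = "I don't have information on this"


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Number the passages and tell the model to answer only from them."""
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer using only the passages below.\n"
        f"If the answer is not contained in the following passages, say '{REFUSAL}' "
        "and do not attempt to answer.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the passages in the prompt also gives the model stable labels to cite.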

Mandatory citations: every factual claim in the response must cite a source passage. The system formats responses as answer + footnoted citations. Users can verify any claim by checking the source.
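A simple post-generation check can enforce this. The sketch below assumes citations appear as bracketed passage numbers such as `[2]` inside each sentence, and it treats every sentence as a factual claim, which is a deliberate simplification:

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def uncited_sentences(answer: str, num_passages: int) -> list[str]:
    """Return sentences with no citation or with a citation to a missing passage."""
    problems = []
    # Naive split: sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = [int(n) for n in CITATION_RE.findall(sentence)]
        if not cited or any(n < 1 or n > num_passages for n in cited):
            problems.append(sentence)
    return problems
```

An empty result means every sentence points at a passage that was actually retrieved; anything else can be blocked or regenerated.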

Confidence scoring: if the retriever returns nothing above a relevance threshold, the system declines to answer rather than hallucinating. A "don't know" response is far less damaging than a confident wrong answer.
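The gating logic might look like the following sketch. The 0.35 threshold is a placeholder to calibrate against your own reranker scores, and `generate` stands in for whatever function calls the LLM:

```python
from typing import Callable

DECLINE_MESSAGE = "I don't have information on this"


def select_context(
    scored_chunks: list[tuple[float, str]], min_score: float = 0.35
) -> list[str]:
    """Keep chunk texts scoring at or above min_score, best first."""
    kept = sorted((pair for pair in scored_chunks if pair[0] >= min_score), reverse=True)
    return [text for _, text in kept]


def answer_or_decline(
    question: str,
    scored_chunks: list[tuple[float, str]],
    generate: Callable[[str, list[str]], str],  # hypothetical LLM call
    min_score: float = 0.35,                    # assumption: calibrate per system
) -> str:
    context = select_context(scored_chunks, min_score)
    if not context:
        return DECLINE_MESSAGE  # never call the LLM without grounding
    return generate(question, context)
```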

Warning

The most dangerous RAG failure isn't a wrong answer; it's a confident-sounding wrong answer with no citation. Always force source attribution and teach users to check it.

Lesson 4: Monitor retrieval quality, not just generation

Most teams instrument their RAG systems by monitoring LLM outputs (latency, costs, user ratings). They miss the more important layer: retrieval quality.

The retriever is the bottleneck. If retrieval returns low-relevance chunks, the LLM can't save the answer. Garbage in, garbage out. And retrieval quality degrades silently as the document corpus evolves.

Metrics worth tracking:

Retrieval precision: of the top-k chunks returned, what fraction are actually relevant? Requires sampling + human labeling, but worth doing periodically.
Mean reciprocal rank (MRR): how close to the top does the first relevant chunk rank, averaged across queries? Degrades as vocabulary drifts between documents and queries.
Groundedness rate: what percentage of LLM responses are fully grounded in retrieved context vs. containing non-cited claims? Automatable with a scoring model.
Null retrieval rate: how often does the retriever return nothing above threshold? A rising rate means the corpus doesn't cover queries users are actually asking.

These metrics let you catch quality regressions before users complain about them.
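Three of these metrics are simple enough to compute directly from logged retrievals and a labeled sample. The sketch below assumes chunk ids are hashable and that relevance labels come from a human-labeled set; the function names are illustrative:

```python
from typing import Optional


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk ids that are labeled relevant."""
    top = retrieved[:k]
    return sum(1 for cid in top if cid in relevant) / len(top) if top else 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none is found)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, cid in enumerate(retrieved, start=1):
            if cid in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0


def null_retrieval_rate(top_scores: list[Optional[float]], threshold: float) -> float:
    """Share of queries whose best score is missing or below the threshold."""
    if not top_scores:
        return 0.0
    nulls = sum(1 for s in top_scores if s is None or s < threshold)
    return nulls / len(top_scores)
```

Groundedness is left out here because it needs a scoring model rather than arithmetic over ids.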

Lesson 5: Automated reindexing isn't optional

A RAG system deployed against a static corpus is a solved problem. A RAG system deployed against a living corpus, where documents are added, updated, and retired, is a maintenance challenge that most teams underinvest in.

When a source document changes and its embeddings aren't updated, the retriever returns stale content. The LLM then generates answers based on information that no longer reflects reality. In a compliance or financial context, this is a material risk.

We build change detection into every RAG pipeline from day one.

Detect doc changes → Re-chunk & embed → Update vector index → Validate retrieval

File watchers for local/SharePoint sources trigger reindexing on modification. Webhook triggers from document management systems (Notion, Confluence, Google Drive) push updates in near-real-time. Scheduled full reindexes run weekly as a catch-all for sources where incremental detection is unreliable.
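For the scheduled catch-all pass, one common way to decide what needs re-embedding is to compare content hashes. This is a sketch of that idea, not a description of our watcher or webhook code; `fingerprint` and `diff_corpus` are hypothetical names:

```python
import hashlib


def fingerprint(content: str) -> str:
    """Stable content hash used to detect edits."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_corpus(indexed: dict[str, str], current_docs: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (doc_id -> hash) with current text (doc_id -> text)."""
    current = {doc_id: fingerprint(text) for doc_id, text in current_docs.items()}
    added = sorted(current.keys() - indexed.keys())
    removed = sorted(indexed.keys() - current.keys())
    changed = sorted(
        doc_id for doc_id in current.keys() & indexed.keys()
        if current[doc_id] != indexed[doc_id]
    )
    return {"added": added, "changed": changed, "removed": removed}
```

Only documents in `added` and `changed` need re-chunking and re-embedding; those in `removed` should have their vectors deleted so retired content stops surfacing.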

The validation step after reindexing runs a suite of golden queries against the updated index to confirm retrieval quality hasn't regressed. If it has, the update is flagged for review before going live.
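A golden-query check can be as small as the sketch below. It assumes each golden query names one chunk id that must appear in the top k, and that `search` wraps your retriever; the 0.95 pass rate is an illustrative threshold:

```python
from typing import Callable


def validate_index(
    golden: list[tuple[str, str]],             # (query, expected chunk id)
    search: Callable[[str, int], list[str]],   # hypothetical: returns ranked chunk ids
    k: int = 5,
    min_pass_rate: float = 0.95,               # assumption: set per deployment
) -> dict:
    """Run every golden query and report which ones lost their expected chunk."""
    failures = [query for query, expected in golden if expected not in search(query, k)]
    pass_rate = 1 - len(failures) / len(golden) if golden else 1.0
    return {"pass_rate": pass_rate, "failures": failures, "ok": pass_rate >= min_pass_rate}
```

When `ok` is false, the `failures` list tells the reviewer which queries regressed.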

The gap between a working RAG demo and a production system is mostly engineering discipline, not technical novelty. Chunking strategy, hybrid retrieval, citation enforcement, quality monitoring, and automated reindexing aren't exciting, but they're what separates a system your users can trust from one they learn to distrust after the first bad answer.

The full write-up of our enterprise deployment is in the RAG knowledge system case study. For the specific metrics and harnesses we use to measure RAG accuracy in production, see LLM Evaluation: How to Measure Production Accuracy.

If you're building a RAG system and running into the same issues, our RAG pipeline development service covers the full stack from chunking strategy to reranking to production monitoring. We've shipped these in compliance-sensitive environments where retrieval accuracy isn't negotiable. For a broader look at how we approach AI and ML (custom models, LLM integrations, predictive analytics), see our AI and Machine Learning services.

If you're still deciding between RAG and fine-tuning, Fine-Tuning vs RAG: How to Choose for Your Use Case walks through the decision framework we use on every production AI project.
