What metrics should I use to evaluate a RAG pipeline?

At Creative Codes, we track three RAGAS metrics in production: retrieval precision (does the retrieved context contain the answer?), answer faithfulness (is the answer grounded in retrieved context, not hallucinated?), and answer relevance (does the answer address the question?). We automate scoring with an LLM judge on a held-out evaluation set of 50-100 query-answer pairs.

How do you measure LLM hallucination rate in production?

We compare model outputs against retrieved source chunks using an automated faithfulness scorer: a Claude or GPT-4o call asking whether the claim is supported by the provided context, run on a sample of production queries. We flag anything below 95% faithfulness for human review. For high-stakes systems, we add a citation enforcement layer that refuses to answer without a grounded source.

What is RAGAS and how does it work?

RAGAS is an open-source evaluation framework for RAG pipelines. It measures four dimensions automatically using an LLM judge: context precision, context recall, faithfulness, and answer relevance. You provide your question set, the retrieved contexts, and the generated answers. Faithfulness and answer relevance are scored without ground-truth labels; context recall needs a reference answer for each query it scores. We use it as the baseline evaluation harness on every RAG project we ship.

AI/ML · June 1, 2026 · 9 min read

LLM Evaluation: How to Measure Production Accuracy

Vibes-based testing gets you to a demo. Systematic evaluation gets you to production. Here are the metrics and harnesses we use to measure LLM accuracy on real tasks.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


Every LLM system that looked good in testing has failed at least once in production for reasons that testing didn't surface. Every one of the systems we've shipped went through structured evaluation before deployment. This post covers the framework we use, and why vibe-checking your prompts isn't enough.

Why "it looks good" isn't an evaluation strategy

An LLM can look correct on 10 test cases and fail on the 11th in a way that damages your business. A RAG pipeline that answers 95% of questions well but hallucinates on the other 5% is a support ticket waiting to happen. A classification system that performs well on your dev sample but drifts after deployment is a silent failure. You only discover it when a user files a complaint or a downstream metric starts moving in the wrong direction.

Systematic evaluation gives you:

A number you can track over time (is accuracy improving or degrading?)
A threshold you can enforce before deployment
A way to detect regression when you change the model, prompt, or retrieval setup

The three types of LLM evaluation

1. Retrieval evaluation (for RAG systems)

If you're using RAG, the retrieval step must be evaluated separately from the generation step. Bad retrieval is the most common source of RAG failures, and you'll never find it if you're only evaluating end-to-end answers.

Retrieval recall: for a given query, what percentage of the relevant chunks were retrieved?

Retrieval precision: of the chunks retrieved, what percentage were actually relevant?

To measure this, you need a test set of query-chunk pairs: given this query, these are the chunks that should be retrieved. This requires labeling effort, but it's the foundation.

python

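def evaluate_retrieval(query: str, expected_chunks: list[str], retrieved_chunks: list[str]) -> dict:
    """Score retrieval for one query. `query` is not used in the scoring itself."""
    # Compare as sets so duplicates and ordering don't affect the score.
    expected_set = set(expected_chunks)
    retrieved_set = set(retrieved_chunks)

    # Chunks that were both expected and retrieved.
    true_positives = expected_set & retrieved_set
    # Guard against empty sets to avoid division by zero.
    recall = len(true_positives) / len(expected_set) if expected_set else 0.0
    precision = len(true_positives) / len(retrieved_set) if retrieved_set else 0.0

    return {"recall": recall, "precision": precision}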
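
A single query's score is noisy, so in practice you average over the whole labeled set. The following is a minimal sketch of that aggregation, assuming each test case is a dict with hypothetical `query` and `expected_chunks` keys and that your retriever can be called as a function returning chunk IDs. The `fake_index` lookup stands in for a real retriever.

```python
from statistics import mean
from typing import Callable

def score_retrieval(expected: list[str], retrieved: list[str]) -> dict:
    """Recall and precision for one query, comparing chunk IDs as sets."""
    expected_set, retrieved_set = set(expected), set(retrieved)
    hits = expected_set & retrieved_set
    return {
        "recall": len(hits) / len(expected_set) if expected_set else 0.0,
        "precision": len(hits) / len(retrieved_set) if retrieved_set else 0.0,
    }

def evaluate_retrieval_set(
    test_set: list[dict], retrieve_fn: Callable[[str], list[str]]
) -> dict:
    """Average the per-query scores (a macro average over queries)."""
    scores = [
        score_retrieval(case["expected_chunks"], retrieve_fn(case["query"]))
        for case in test_set
    ]
    return {
        "mean_recall": mean(s["recall"] for s in scores),
        "mean_precision": mean(s["precision"] for s in scores),
    }

# Hypothetical labeled pairs and a stand-in retriever, for illustration only.
test_set = [
    {"query": "refund policy", "expected_chunks": ["doc1#3", "doc1#4"]},
    {"query": "shipping times", "expected_chunks": ["doc2#1"]},
]
fake_index = {
    "refund policy": ["doc1#3", "doc9#2"],
    "shipping times": ["doc2#1", "doc2#5"],
}

summary = evaluate_retrieval_set(test_set, lambda q: fake_index[q])
print(summary)  # {'mean_recall': 0.75, 'mean_precision': 0.5}
```

Averaging per query keeps one query with many expected chunks from dominating the score.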

2. Generation evaluation (answer quality)

Once you have retrieval working, you need to evaluate the generated answer. Three dimensions matter:

Answer faithfulness: is the answer grounded in the retrieved context, or is the model adding information not present in the context (hallucination)?

Answer relevance: does the answer actually address the question that was asked?

Answer completeness: if the question has multiple parts, did the answer address all of them?

The standard tool for automated evaluation of these dimensions is RAGAS. It uses an LLM-as-judge approach: a secondary LLM evaluates whether the primary LLM's answer is faithful to the context and relevant to the question.

python

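# Written against the ragas 0.1-style API and column names. Newer ragas
# releases changed the dataset schema, so check the docs for your version.
# The metrics call an LLM judge, so API credentials must be configured.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

test_data = {
    "question": ["What was the Q3 revenue?", "Who is the CEO?"],
    "answer": ["Revenue was $42M in Q3 2025.", "Sarah Chen has been CEO since 2023."],
    # One list of retrieved chunks per question.
    "contexts": [
        ["Q3 2025 results: Revenue $42M, up 18% YoY..."],
        ["Sarah Chen joined as CEO in January 2023..."],
    ],
    # Reference answers; context_recall needs these.
    "ground_truth": ["$42M", "Sarah Chen"],
}

dataset = Dataset.from_dict(test_data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)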
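
To make the LLM-as-judge idea concrete, here is a minimal sketch of a faithfulness score computed outside any framework. It is not how RAGAS is implemented. It assumes the answer has already been split into individual claims, and it takes the judge as a plain function so the sketch runs offline. In production the judge would wrap an LLM call using a prompt like the one built by `build_judge_prompt`. The prompt wording, function names, and the substring judge are all illustrative assumptions.

```python
from typing import Callable

def build_judge_prompt(claim: str, context: str) -> str:
    """Illustrative prompt a real LLM judge might receive."""
    return (
        f"Context:\n{context}\n\n"
        f"Claim: {claim}\n\n"
        "Is the claim fully supported by the context? Answer YES or NO."
    )

def faithfulness_score(
    claims: list[str], context: str, judge: Callable[[str, str], bool]
) -> float:
    """Fraction of claims the judge marks as supported by the context."""
    if not claims:
        return 1.0  # assumption: an answer with no claims has nothing to hallucinate
    supported = sum(1 for claim in claims if judge(claim, context))
    return supported / len(claims)

def needs_review(score: float, threshold: float = 0.95) -> bool:
    """Flag answers that fall below the faithfulness threshold."""
    return score < threshold

def substring_judge(claim: str, context: str) -> bool:
    """Offline stand-in for an LLM judge: a crude substring check."""
    return claim.lower() in context.lower()

context = "Q3 2025 results: Revenue $42M, up 18% YoY."
claims = ["Revenue $42M", "up 25% YoY"]  # the second claim is not in the context

score = faithfulness_score(claims, context, substring_judge)
print(score, needs_review(score))  # 0.5 True
```

Passing the judge in as a function also makes it easy to swap models or replay cached verdicts in tests.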

3. Task-specific evaluation

For non-RAG LLM tasks (classification, extraction, summarization), define the metric that matches the task:

Classification: precision, recall, F1 per class. Don't use accuracy when classes are imbalanced.
Extraction: field-level accuracy. For each field you're extracting (name, date, amount), what percentage are extracted correctly?
Summarization: ROUGE scores if you have reference summaries; LLM-as-judge if you don't.
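
The classification and extraction metrics above can be sketched in plain Python. This is an illustrative example with made-up data, not a replacement for a library such as scikit-learn. The function names and toy labels are assumptions. The toy classifier always predicts the majority class, which shows why accuracy misleads on imbalanced data.

```python
from collections import Counter

def per_class_metrics(y_true: list[str], y_pred: list[str]) -> dict[str, dict[str, float]]:
    """Precision, recall, and F1 for each class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

def field_accuracy(expected: list[dict], extracted: list[dict]) -> dict[str, float]:
    """Exact-match accuracy per field across documents."""
    return {
        name: sum(e[name] == x.get(name) for e, x in zip(expected, extracted)) / len(expected)
        for name in expected[0]
    }

# Imbalanced toy data: 9 "ok" items, 1 "fraud", and a model that always says "ok".
y_true = ["ok"] * 9 + ["fraud"]
y_pred = ["ok"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
print(accuracy)                    # 0.9, which looks fine
print(metrics["fraud"]["recall"])  # 0.0, every fraud case was missed

expected = [{"name": "Acme", "amount": "100"}, {"name": "Beta", "amount": "250"}]
extracted = [{"name": "Acme", "amount": "100"}, {"name": "Beta", "amount": "205"}]
fields = field_accuracy(expected, extracted)
print(fields)  # {'name': 1.0, 'amount': 0.5}
```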

Building a test set

The quality of your evaluation is capped by the quality of your test set. A few rules:

Cover failure modes, not just the happy path. Include edge cases: empty documents, ambiguous questions, adversarial inputs.
Stratify by difficulty. Easy, medium, and hard examples. If your model only handles the easy cases, you need to know.
Keep the test set static. Don't add examples to the test set after seeing failures — that's overfitting your evaluation.
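Minimum size. Aim for at least 100 examples. With 100 examples and a 90% pass rate, the 95% confidence interval is roughly ±6 points; with 50 examples it widens to about ±8 points, which is too coarse to detect small regressions.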
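
The effect of test set size can be checked with a quick calculation. The sketch below uses the normal approximation for a proportion, which is rough for small sets or pass rates near 0% or 100%, so treat the numbers as a guide rather than an exact bound. The function name and the 90% pass rate are assumptions chosen for illustration.

```python
import math

def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of the approximate 95% confidence interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

for n in (50, 100, 400):
    moe = margin_of_error(0.90, n)
    print(f"n={n}: 90% ± {moe * 100:.1f} points")

# n=50: 90% ± 8.3 points
# n=100: 90% ± 5.9 points
# n=400: 90% ± 2.9 points
```

Halving the margin takes roughly four times as many examples, which is why small test sets cannot distinguish a 2-point regression from noise.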

Automated evaluation in CI

Once you have a test set, run evaluation automatically:

python

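import json
from pathlib import Path

# `evaluate_response` (returns a score between 0 and 1) and `my_model` are
# placeholders for your own scoring function and model wrapper.

def run_eval_suite(model_fn, test_set_path: str, threshold: float = 0.85) -> bool:
    """Return True when the mean score over the test set meets the threshold."""
    test_cases = json.loads(Path(test_set_path).read_text())
    if not test_cases:
        raise ValueError("Test set is empty; nothing to evaluate.")

    results = []
    for case in test_cases:
        response = model_fn(case["input"])
        score = evaluate_response(response, case["expected_output"])
        results.append(score)

    mean_score = sum(results) / len(results)
    passed = mean_score >= threshold

    print(f"Eval score: {mean_score:.3f} (threshold: {threshold})")
    print(f"Result: {'PASS' if passed else 'FAIL'}")
    return passed

# A non-zero exit code fails the CI job and blocks the deployment.
if not run_eval_suite(my_model, "test_cases.json", threshold=0.90):
    raise SystemExit("Model did not meet accuracy threshold. Blocking deployment.")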

We run this as a GitHub Actions step before merging prompt or retrieval changes. If the score drops below threshold, the merge is blocked.

Tracking accuracy over time

A one-time evaluation before deployment is necessary but not sufficient. Models drift as usage patterns change, documents get updated, and new query types emerge.

In production, we log:

Query text (or a hash if sensitive)
Retrieved chunks
Generated answer
User feedback where available (thumbs up/down, follow-up questions)
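
One possible shape for such a log entry is sketched below. The field names, the `sha256:` prefix, and the JSONL output are assumptions for illustration, not a description of our production schema. Note that a hashed query can be counted and deduplicated but cannot be re-scored later, so only hash what you must.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class RagLogRecord:
    query: str                      # raw text, or a hash when the query is sensitive
    retrieved_chunk_ids: list[str]  # IDs of the chunks passed to the model
    answer: str
    feedback: Optional[str] = None  # e.g. "up" or "down"; None when the user gave none
    logged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def make_record(
    query: str,
    chunk_ids: list[str],
    answer: str,
    feedback: Optional[str] = None,
    sensitive: bool = False,
) -> RagLogRecord:
    """Build a log record, replacing sensitive query text with a SHA-256 hash."""
    if sensitive:
        query = "sha256:" + hashlib.sha256(query.encode("utf-8")).hexdigest()
    return RagLogRecord(query, chunk_ids, answer, feedback)

record = make_record(
    "What is my account balance?",
    ["kb#12", "kb#40"],
    "Your balance is shown under Billing.",
    feedback="up",
    sensitive=True,
)
line = json.dumps(asdict(record))  # one JSON object per line (JSONL)
print(line)
```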

Every week, we run evaluation on a sample of recent production queries. If the faithfulness score drops more than 3 percentage points from baseline, we investigate.

What we track on client RAG deployments

For our enterprise RAG deployments (like the Enterprise RAG Knowledge System that achieved 94% query accuracy), the evaluation metrics we committed to before deployment were:

Retrieval recall: ≥ 0.90 on held-out query-chunk pairs
Answer faithfulness: ≥ 0.92 via RAGAS
Answer relevance: ≥ 0.88 via RAGAS
Hallucination rate: < 1% (defined as any claim in the answer not present in retrieved context)

These numbers weren't chosen arbitrarily. They came from a conversation with the client about what failure modes were acceptable and what the downstream cost of a wrong answer was.

Define the bar before you start training or tuning. Adjusting the bar after seeing results is how you end up shipping a system that doesn't actually work.
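
Writing the agreed thresholds down as data makes it harder to move the bar quietly. The sketch below encodes the four thresholds listed above and checks a set of measured scores against them. The metric key names and the measured values are hypothetical.

```python
# Thresholds agreed before deployment: (comparison, bound).
THRESHOLDS = {
    "retrieval_recall": (">=", 0.90),
    "faithfulness": (">=", 0.92),
    "answer_relevance": (">=", 0.88),
    "hallucination_rate": ("<", 0.01),
}

def check_release_gate(measured: dict[str, float]) -> list[str]:
    """Return a description of every threshold that is missed or unmeasured."""
    failures = []
    for metric, (op, bound) in THRESHOLDS.items():
        if metric not in measured:
            failures.append(f"{metric}: not measured")
            continue
        value = measured[metric]
        ok = value >= bound if op == ">=" else value < bound
        if not ok:
            failures.append(f"{metric}: {value:.3f} (required {op} {bound})")
    return failures

# Illustrative scores only; faithfulness misses its bar.
measured = {
    "retrieval_recall": 0.93,
    "faithfulness": 0.91,
    "answer_relevance": 0.90,
    "hallucination_rate": 0.005,
}
failures = check_release_gate(measured)
print(failures)  # ['faithfulness: 0.910 (required >= 0.92)']
```

Treating an unmeasured metric as a failure prevents a gate from passing simply because a check was skipped.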

Catching accuracy drift between deployments

Pre-deploy evaluation tells you whether the system was good enough to ship. Scheduled evaluation tells you whether it's still good enough to run.

RAG system accuracy drifts for predictable reasons: source documents get updated, user query patterns shift, the embedding model falls behind new vocabulary in the domain. None of these trigger a deployment — and none of them will be caught by a pre-deploy eval suite you only run on code changes.

We run evaluation on a production sample every week using a script that compares current scores against a stored baseline:

python

import json
from pathlib import Path

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall

BASELINE_PATH = "eval_baselines/latest.json"
DRIFT_THRESHOLD = 0.05  # flag if any metric drops by more than 0.05 (5 percentage points)

# `sample_production_queries` and `send_slack_alert` are placeholders for your
# own logging-database query and alerting helper.

def run_weekly_eval(production_sample: list[dict]) -> dict:
    # context_recall needs a reference answer, so each sample must carry a
    # "ground_truth" field (for example, one added during human review).
    dataset = Dataset.from_dict({
        "question":   [s["question"] for s in production_sample],
        "answer":     [s["answer"] for s in production_sample],
        "contexts":   [s["contexts"] for s in production_sample],
        "ground_truth": [s["ground_truth"] for s in production_sample],
    })
    scores = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
    # Average only the numeric metric columns; the frame also holds text columns.
    return scores.to_pandas().mean(numeric_only=True).to_dict()

def check_drift(current: dict, baseline: dict) -> list[str]:
    regressions = []
    for metric, score in current.items():
        # Metrics missing from the baseline are skipped rather than flagged.
        if baseline.get(metric, 0) - score > DRIFT_THRESHOLD:
            regressions.append(
                f"{metric}: {baseline[metric]:.3f} → {score:.3f} (Δ{score - baseline[metric]:.3f})"
            )
    return regressions

# Run weekly via cron
current_scores = run_weekly_eval(sample_production_queries(n=200))
baseline_scores = json.loads(Path(BASELINE_PATH).read_text())
regressions = check_drift(current_scores, baseline_scores)

if regressions:
    send_slack_alert(
        ":warning: RAG accuracy regression detected:\n"
        + "\n".join(f"• {r}" for r in regressions)
    )

The key part is sample_production_queries(n=200) — this pulls recent real queries from your logging database, not the static test set. Real queries catch vocabulary drift and new use patterns that your original test set doesn't cover.

When a regression alert fires, the investigation follows a standard path: first check whether source documents changed (retrieval recall drop), then check whether user queries changed (relevance drop), then check whether the LLM provider updated the model (faithfulness drop). In practice, document changes account for roughly 60% of the regressions we've seen. Keeping the reindex pipeline fast and correct is the cheapest way to prevent eval score drift.
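
The investigation order can be captured as a small lookup so the alert itself tells the on-call engineer where to start. This is a sketch under assumptions: it maps the RAGAS metric names used in the weekly script to the checks described above, and treating `context_recall` as the signal for a retrieval recall drop is our simplification here.

```python
# Ordered as in the investigation path: documents, then queries, then model.
TRIAGE_ORDER = [
    ("context_recall", "Check whether source documents changed and whether the reindex ran."),
    ("answer_relevancy", "Check whether user query patterns shifted."),
    ("faithfulness", "Check whether the LLM provider updated the model."),
]

def triage(regressed_metrics: set[str]) -> list[str]:
    """Return the checks to run, in investigation order, for the regressed metrics."""
    return [step for metric, step in TRIAGE_ORDER if metric in regressed_metrics]

steps = triage({"faithfulness", "context_recall"})
for number, step in enumerate(steps, start=1):
    print(f"{number}. {step}")
```

Metrics that are not in the lookup produce no step, so an unfamiliar regression still needs a manual look.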

One-time pre-deploy evaluation and weekly production sampling are not alternatives — they serve different purposes. Pre-deploy evaluation catches regressions introduced by your changes. Weekly production sampling catches regressions introduced by the world changing around you. Both are necessary. The teams that skip scheduled eval are the ones who discover their system quietly degraded over three months when a client finally escalates a pattern of bad answers.

If you're building an LLM-powered system and need systematic evaluation set up from the start, let's talk.
