Creative Codes
AI/ML · June 1, 2026 · 9 min read

LLM Evaluation: How to Measure Production Accuracy

Vibes-based testing gets you to demo. Systematic evaluation gets you to production. Here are the metrics and harnesses we use to measure LLM accuracy on real tasks.

By Muhammad Hassan

At Creative Codes, we've deployed LLM-powered systems into production: RAG pipelines, classification systems, extraction pipelines, and AI agents. Every one of them went through structured evaluation before deployment. This post covers the framework we use, and why vibe-checking your prompts isn't enough.

Why "it looks good" isn't an evaluation strategy

An LLM can look correct on 10 test cases and fail on the 11th in a way that damages your business. A RAG pipeline that answers 95% of questions well but hallucinates on the other 5% is a support ticket waiting to happen. A classification system that performs well on your dev sample but drifts after deployment is a silent failure.

Systematic evaluation gives you:

  • A number you can track over time (is accuracy improving or degrading?)
  • A threshold you can enforce before deployment
  • A way to detect regression when you change the model, prompt, or retrieval setup

The three types of LLM evaluation

1. Retrieval evaluation (for RAG systems)

If you're using RAG, the retrieval step must be evaluated separately from the generation step. In our experience, bad retrieval is the most common source of RAG failures, and it is very hard to pinpoint if you're only evaluating end-to-end answers.

Retrieval recall: for a given query, what percentage of the relevant chunks were retrieved?

Retrieval precision: of the chunks retrieved, what percentage were actually relevant?

To measure this, you need a test set of query-chunk pairs: given this query, these are the chunks that should be retrieved. This requires labeling effort, but it's the foundation.

python
def evaluate_retrieval(query: str, expected_chunks: list[str], retrieved_chunks: list[str]) -> dict:
    # Compare chunks as sets, so duplicates and ranking order are ignored.
    # `query` is not used in the computation itself.
    expected_set = set(expected_chunks)
    retrieved_set = set(retrieved_chunks)

    # Chunks that were both expected and actually retrieved.
    true_positives = expected_set & retrieved_set
    recall = len(true_positives) / len(expected_set) if expected_set else 0.0
    precision = len(true_positives) / len(retrieved_set) if retrieved_set else 0.0

    return {"recall": recall, "precision": precision}
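To turn per-query numbers into one score for the whole test set, a minimal sketch might average recall and precision over the top-k results. The `retrieve` callable, the `k` cutoff, and the test-set shape (a list of `(query, expected_chunk_ids)` pairs) are assumptions for illustration, not a description of our pipeline.

```python
from typing import Callable, Sequence


def mean_retrieval_scores(
    test_set: Sequence[tuple[str, list[str]]],
    retrieve: Callable[[str], list[str]],  # hypothetical: query -> ranked chunk IDs
    k: int = 5,
) -> dict:
    """Average recall@k and precision@k over labeled (query, expected_ids) pairs."""
    recalls, precisions = [], []
    for query, expected_ids in test_set:
        expected = set(expected_ids)
        retrieved = set(retrieve(query)[:k])  # only the top-k results count
        hits = expected & retrieved
        recalls.append(len(hits) / len(expected) if expected else 0.0)
        precisions.append(len(hits) / len(retrieved) if retrieved else 0.0)
    n = len(test_set)
    return {
        "recall@k": sum(recalls) / n if n else 0.0,
        "precision@k": sum(precisions) / n if n else 0.0,
    }


# Toy retriever backed by a fixed lookup table, for demonstration only.
fake_index = {
    "refund policy": ["c1", "c7", "c3"],
    "shipping times": ["c9", "c2"],
}
scores = mean_retrieval_scores(
    [("refund policy", ["c1", "c3"]), ("shipping times", ["c2", "c4"])],
    retrieve=lambda q: fake_index[q],
    k=3,
)
print(scores)
```

Reporting both numbers at a fixed `k` makes the trade-off visible: raising `k` tends to help recall while hurting precision.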

2. Generation evaluation (answer quality)

Once you have retrieval working, you need to evaluate the generated answer. Three dimensions matter:

Answer faithfulness: is the answer grounded in the retrieved context, or is the model adding information not present in the context (hallucination)?

Answer relevance: does the answer actually address the question that was asked?

Answer completeness: if the question has multiple parts, did the answer address all of them?

A widely used tool for automated evaluation of these dimensions is RAGAS. It uses an LLM-as-judge approach: a secondary LLM evaluates whether the primary LLM's answer is faithful to the context and relevant to the question.

python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

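# Column names follow the schema used by earlier ragas releases;
# check the documentation for the version you have installed.
test_data = {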
    "question": ["What was the Q3 revenue?", "Who is the CEO?"],
    "answer": ["Revenue was $42M in Q3 2025.", "Sarah Chen has been CEO since 2023."],
    "contexts": [
        ["Q3 2025 results: Revenue $42M, up 18% YoY..."],
        ["Sarah Chen joined as CEO in January 2023..."],
    ],
    "ground_truth": ["$42M", "Sarah Chen"],
}

dataset = Dataset.from_dict(test_data)
# These metrics call a judge LLM, so model credentials must be configured first.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall])
print(result)
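To make the LLM-as-judge idea concrete, here is a deliberately simplified sketch of a faithfulness check. It is not how RAGAS implements its metrics. The prompt wording, the `judge` callable (any function that sends a prompt to a model and returns its text), and the YES/NO protocol are all assumptions for illustration.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading an answer for faithfulness.

Context:
{context}

Answer:
{answer}

Reply with exactly one word: YES if every claim in the answer is supported
by the context, otherwise NO."""


def judge_faithfulness(answer: str, contexts: list[str], judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the answer is fully grounded in the contexts."""
    prompt = JUDGE_PROMPT.format(context="\n\n".join(contexts), answer=answer)
    verdict = judge(prompt).strip().upper()
    return verdict.startswith("YES")


def faithfulness_rate(cases: list[dict], judge: Callable[[str], str]) -> float:
    """Share of cases the judge marks as faithful (0.0 for an empty list)."""
    if not cases:
        return 0.0
    passed = sum(judge_faithfulness(c["answer"], c["contexts"], judge) for c in cases)
    return passed / len(cases)


# Stub judge standing in for a real model call, for demonstration only.
def stub_judge(prompt: str) -> str:
    return "NO" if "unsupported" in prompt else "Yes."


cases = [
    {"answer": "Revenue was $42M in Q3 2025.", "contexts": ["Q3 2025 results: Revenue $42M"]},
    {"answer": "An unsupported claim.", "contexts": ["Q3 2025 results: Revenue $42M"]},
]
rate = faithfulness_rate(cases, stub_judge)
print(rate)
```

Keeping the judge behind a plain callable means the same harness can run against a stub in unit tests and a real model in evaluation runs.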

3. Task-specific evaluation

For non-RAG LLM tasks (classification, extraction, summarization), define the metric that matches the task:

  • Classification: precision, recall, F1 per class. Don't use accuracy when classes are imbalanced.
  • Extraction: field-level accuracy. For each field you're extracting (name, date, amount), what percentage are extracted correctly?
  • Summarization: ROUGE scores if you have reference summaries; LLM-as-judge if you don't.
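The first two metrics in the list above can be sketched in plain Python as follows. The function names, label values, and record shapes are hypothetical; in practice a library such as scikit-learn would cover the classification part.

```python
from collections import Counter


def per_class_metrics(y_true: list[str], y_pred: list[str]) -> dict:
    """Precision, recall, and F1 for each class label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics[label] = {"precision": p, "recall": r, "f1": f1}
    return metrics


def field_accuracy(expected: list[dict], extracted: list[dict]) -> dict:
    """Share of records where each expected field matches exactly."""
    fields = sorted({f for record in expected for f in record})
    return {
        f: sum(e.get(f) == x.get(f) for e, x in zip(expected, extracted)) / len(expected)
        for f in fields
    }


# Imbalanced toy data: 9 "ok" items and 1 "fraud" item. Predicting "ok"
# everywhere is 90% accurate, yet recall on the fraud class is zero.
y_true = ["ok"] * 9 + ["fraud"]
y_pred = ["ok"] * 10
report = per_class_metrics(y_true, y_pred)
print(report["fraud"])

fields = field_accuracy(
    [{"name": "A", "amount": 10}, {"name": "B", "amount": 20}],
    [{"name": "A", "amount": 10}, {"name": "B", "amount": 25}],
)
print(fields)
```

Exact match is the strictest choice for extraction. Dates and amounts often need normalization before comparison, which this sketch leaves out.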

Building a test set

The quality of your evaluation is capped by the quality of your test set. A few rules:

  1. Cover failure modes, not just the happy path. Include edge cases: empty documents, ambiguous questions, adversarial inputs.
  2. Stratify by difficulty. Easy, medium, and hard examples. If your model only handles the easy cases, you need to know.
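  3. Keep the test set static. Don't add examples to the test set after seeing failures; that's overfitting your evaluation.
  4. Minimum size. Treat 100 examples as a floor, not a target. At around 90% accuracy, 100 examples still leave a 95% confidence interval of roughly ±6 percentage points; with 50 it widens to roughly ±8, which is too coarse to detect most real differences.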
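One way to sanity-check a test set size is to compute a confidence interval around the measured accuracy. The sketch below uses the Wilson score interval for a binomial proportion. Treating each test case as an independent pass/fail trial is an assumption, and the function name is ours.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# How wide is the interval around 90% observed accuracy at different sizes?
for n in (50, 100, 400):
    low, high = wilson_interval(round(0.9 * n), n)
    print(f"n={n}: 95% interval {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

If two prompt variants score within each other's intervals, the test set is too small to tell them apart.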

Automated evaluation in CI

Once you have a test set, run evaluation automatically:

python
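import json
from pathlib import Path
from typing import Callable


def evaluate_response(response: str, expected: str) -> float:
    # Placeholder scorer (an assumption): normalized exact match.
    # Swap in the metric that fits your task.
    return float(response.strip().lower() == expected.strip().lower())


def run_eval_suite(model_fn: Callable[[str], str], test_set_path: str, threshold: float = 0.85) -> bool:
    # The test set is a JSON list of {"input": ..., "expected_output": ...} objects.
    test_cases = json.loads(Path(test_set_path).read_text())
    if not test_cases:
        raise ValueError("Test set is empty; refusing to report a score.")

    results = []
    for case in test_cases:
        response = model_fn(case["input"])
        score = evaluate_response(response, case["expected_output"])
        results.append(score)

    mean_score = sum(results) / len(results)
    passed = mean_score >= threshold

    print(f"Eval score: {mean_score:.3f} (threshold: {threshold})")
    print(f"Result: {'PASS' if passed else 'FAIL'}")
    return passed


# `my_model` is your own callable that maps an input string to the model's output.
if not run_eval_suite(my_model, "test_cases.json", threshold=0.90):
    raise SystemExit("Model did not meet accuracy threshold. Blocking deployment.")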

We run this as a GitHub Actions step before merging prompt or retrieval changes. If the score drops below threshold, the merge is blocked.

Tracking accuracy over time

A one-time evaluation before deployment is necessary but not sufficient. Accuracy drifts as usage patterns change, documents get updated, and new query types emerge, even when the model itself stays the same.

In production, we log:

  • Query text (or a hash if sensitive)
  • Retrieved chunks
  • Generated answer
  • User feedback where available (thumbs up/down, follow-up questions)
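A log record covering these four items might look like the sketch below. The field names, the SHA-256 choice for sensitive queries, and the JSON serialization are assumptions for illustration, not our production schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EvalLogRecord:
    query: str                      # raw text, or a hash when the query is sensitive
    retrieved_chunk_ids: list[str]
    answer: str
    feedback: Optional[str] = None  # e.g. "up", "down", or None if the user gave none
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def make_record(query, chunk_ids, answer, feedback=None, sensitive=False) -> EvalLogRecord:
    """Build a record, replacing the query with its SHA-256 hash if sensitive."""
    if sensitive:
        query = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return EvalLogRecord(query, list(chunk_ids), answer, feedback)


record = make_record(
    "What is my account balance?",
    ["c12", "c40"],
    "Your balance is shown on the statements page.",
    feedback="up",
    sensitive=True,
)
print(json.dumps(asdict(record)))
```

A hashed query can still be counted and deduplicated, but it cannot be re-scored later, so hash only where the text really is sensitive.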

Every week, we run evaluation on a sample of recent production queries. If the faithfulness score drops more than 3 percentage points from baseline, we investigate.
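The weekly check can be sketched as a sample step plus a comparison against the baseline. The function names, the sample size, and the assumption that scores are fractions between 0 and 1 are ours; the 3-point trigger comes from the rule above.

```python
import random
from statistics import mean


def sample_queries(records: list[dict], sample_size: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of logged production records."""
    rng = random.Random(seed)
    return rng.sample(records, min(sample_size, len(records)))


def check_drift(baseline: float, weekly_scores: list[float], max_drop_points: float = 3.0) -> dict:
    """Flag a drop of more than `max_drop_points` percentage points below baseline."""
    current = mean(weekly_scores)
    drop_points = (baseline - current) * 100  # scores are fractions in [0, 1]
    return {
        "current": current,
        "drop_points": drop_points,
        "investigate": drop_points > max_drop_points,
    }


# Made-up faithfulness scores for illustration only.
logged = [{"id": i} for i in range(1000)]
sample = sample_queries(logged, sample_size=200)
print(len(sample))

status = check_drift(baseline=0.92, weekly_scores=[0.90, 0.86, 0.88])
print(status["investigate"])
```

A fixed seed keeps the sample reproducible, which helps when someone needs to rerun the same week's evaluation during an investigation.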

What we track on client RAG deployments

For our enterprise RAG deployments (like the Enterprise RAG Knowledge System that achieved 94% query accuracy), the evaluation metrics we committed to before deployment were:

  • Retrieval recall: ≥ 0.90 on held-out query-chunk pairs
  • Answer faithfulness: ≥ 0.92 via RAGAS
  • Answer relevance: ≥ 0.88 via RAGAS
  • Hallucination rate: < 1% (a hallucination being any claim in the answer that is not present in the retrieved context)

These numbers weren't chosen arbitrarily. They came from a conversation with the client about what failure modes were acceptable and what the downstream cost of a wrong answer was.
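Writing the agreed targets down as data makes them hard to move quietly later. A minimal sketch, assuming metric names of our own choosing and made-up measurements:

```python
# Targets from the list above: comparison operator and bound per metric.
TARGETS = {
    "retrieval_recall": (">=", 0.90),
    "faithfulness": (">=", 0.92),
    "answer_relevance": (">=", 0.88),
    "hallucination_rate": ("<", 0.01),
}


def failed_targets(measured: dict[str, float]) -> list[str]:
    """Return a human-readable line for every target that is missed or unmeasured."""
    failures = []
    for name, (op, bound) in TARGETS.items():
        if name not in measured:
            failures.append(f"{name}: not measured")
            continue
        value = measured[name]
        ok = value >= bound if op == ">=" else value < bound
        if not ok:
            failures.append(f"{name}: {value:.3f} (target {op} {bound})")
    return failures


# Made-up measurements for illustration only.
measured = {
    "retrieval_recall": 0.93,
    "faithfulness": 0.91,
    "answer_relevance": 0.90,
    "hallucination_rate": 0.004,
}
failures = failed_targets(measured)
print(failures or "all targets met")
```

Treating an unmeasured metric as a failure keeps a missing evaluation step from passing silently.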

Define the bar before you start training or tuning. Adjusting the bar after seeing results is how you end up shipping a system that doesn't actually work.


If you're building an LLM-powered system and need systematic evaluation set up from the start, let's talk.
