How to Write LLM Evals: Testing AI Apps with Real Data

June 23, 2026 · How Claude Actually Works (part 23)

Running a few prompts and deciding “yeah, looks good” is not a test strategy; it is a vibe check. And vibes regress silently. You tweak one line in your system prompt, one of the 50 cases that used to pass quietly flips to red, and nobody notices until a user files a bug. This video (and article) is about converting “good” from a feeling into a number that a CI pipeline can defend.

The core loop is simple: a dataset, one or more graders, a score, and a gate. That is the whole stack.

The one-sentence version: Replace manual spot-checking with a dataset + grader pipeline that outputs a pass-rate your CI system can block on.

Why “vibe checks” fail at scale

When you have three prompts and a pair of eyes, a qualitative review is fine. The moment you have a non-trivial prompt, a changing model, and a team making edits, that approach collapses. Two problems emerge:

Silent regression. A change that fixes case 12 can break case 47 without you knowing because you only looked at a sample.
No baseline. Without a recorded number, you cannot tell whether this week’s build is better or worse than last week’s.

Vibes do not version-control. Numbers do.

Step 1 — Build a dataset

Every eval starts with a dataset. Think of it as a spreadsheet with two essential columns:

Input	Expected output / criteria
“Summarise this article in one bullet”	The bullet must mention the main claim and be under 25 words
“What is the capital of France?”	“Paris”
“Extract all dates from this text…”	Valid JSON array of ISO-8601 strings

The input column holds real prompts — pulled from production logs, edge cases users actually hit, or cases you know have tripped up the model before. The expected column holds what “good” looks like: an exact answer, an acceptable shape, or a rule.

Start with 20 real, hard cases. Twenty genuine cases with known tricky behavior beat a thousand synthetic happy-path examples. As your system ships, every production failure becomes a new row. The eval gets sharper the more it runs in the wild.
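A minimal sketch of such a dataset in Python, assuming rows are stored as JSON Lines. The field names (`id`, `input`, `grader`, `expected`) and the helper `add_failure` are illustrative assumptions, not a required format.

```python
import json

# Hypothetical rows mirroring the table above. The summary row encodes only
# the word limit; "mentions the main claim" would need a different grader.
DATASET = [
    {"id": "capital-fr", "input": "What is the capital of France?",
     "grader": "exact_match", "expected": "Paris"},
    {"id": "summary-1", "input": "Summarise this article in one bullet",
     "grader": "max_words", "expected": 25},
]

def to_jsonl(rows):
    """Serialise rows as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def add_failure(rows, row):
    """Return a new dataset with a production failure appended.

    Duplicate ids are rejected so each case stays individually trackable.
    """
    if any(existing["id"] == row["id"] for existing in rows):
        raise ValueError(f"duplicate id: {row['id']}")
    return rows + [row]
```

Keeping a stable `id` per row is what later lets you say which specific case flipped between two runs.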

Step 2 — Choose a grader per criterion

Not every output can be judged the same way. Match the grader to the criterion:

Grader types by output kind
────────────────────────────
Exact match      → factual recall, deterministic answers
Schema valid     → JSON/XML structure checks
Rule-based       → regex, numeric range, word-count limits
LLM as judge     → fluency, tone, helpfulness, open-ended quality

Use one grader per criterion. A single output can have multiple criteria (and therefore multiple graders), but mixing graders for a single criterion just makes debugging harder.
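The three deterministic grader types could be sketched like this. The function names are hypothetical, and the schema check assumes that "ISO-8601 strings" means plain calendar dates (`YYYY-MM-DD`); a real check may need to accept timestamps too.

```python
import json
import re
from datetime import date

def exact_match(output: str, expected: str) -> bool:
    """Exact match, ignoring case and surrounding whitespace (an assumption)."""
    return output.strip().lower() == expected.strip().lower()

def under_word_limit(output: str, limit: int) -> bool:
    """Rule-based check: pass if the output has fewer than `limit` words."""
    return len(output.split()) < limit

def is_iso_date_array(output: str) -> bool:
    """Schema check: pass if the output is a JSON array of YYYY-MM-DD dates."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list):
        return False
    for item in data:
        if not isinstance(item, str) or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", item):
            return False
        try:
            date.fromisoformat(item)  # rejects impossible dates such as Feb 30
        except ValueError:
            return False
    return True
```

Each function answers one yes/no question, so a failing row points at exactly one criterion.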

Step 3 — Run and score

Execute your app against every row in the dataset. Each row resolves to a pass or a fail. Tally them up:

46 / 50 rows passed  →  92 %

Now “good” is a number. You can track it over time, diff it between commits, and set a threshold.
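A minimal harness for this step might look like the sketch below. The row fields, the `run_eval` name, and the stub `fake_app` are assumptions for illustration; in practice `app` would call your real application.

```python
from typing import Callable, Dict, List

Grader = Callable[[str, object], bool]

def run_eval(rows: List[dict], app: Callable[[str], str],
             graders: Dict[str, Grader]) -> dict:
    """Run `app` on every row, grade it, and tally the pass rate."""
    results = {}
    for row in rows:
        output = app(row["input"])
        grade = graders[row["grader"]]
        results[row["id"]] = bool(grade(output, row["expected"]))
    passed = sum(results.values())
    total = len(rows)
    return {
        "results": results,  # per-row pass/fail, keyed by row id
        "passed": passed,
        "total": total,
        "pass_rate": passed / total if total else 0.0,
    }

def fake_app(prompt: str) -> str:
    """Stand-in for the real app: always answers 'Paris'."""
    return "Paris"

rows = [
    {"id": "fr", "input": "Capital of France?", "grader": "exact", "expected": "Paris"},
    {"id": "de", "input": "Capital of Germany?", "grader": "exact", "expected": "Berlin"},
]
report = run_eval(rows, fake_app, {"exact": lambda out, exp: out.strip() == exp})
print(f"{report['passed']} / {report['total']} rows passed")
```

Returning the per-row results alongside the aggregate keeps the number explainable: when it drops, you can see which rows caused it.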

Step 4 — Wire the score into CI

This is the step that turns an eval from a script you run manually into a safety net that actually catches regressions.

CI pipeline
───────────────────────────────────────────
Run eval harness against new build
         │
         ▼
   Score ≥ 90% ?
 ┌─ yes ─┴─ no ──┐
 ▼               ▼
PASS ✅        FAIL ❌
merge          block merge

Set a regression gate: a minimum pass-rate below which the pipeline blocks the merge. In the example above, 90% is the threshold. Drop to 88% and the gate turns red. The change does not reach users until the score recovers.

The threshold is a product decision, not a technical one. A customer-facing summarisation feature might need 95%. An internal debug tool might be fine at 80%.
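The gate itself can be a few lines. This sketch assumes the CI system blocks a merge when a step exits with a non-zero code, which is the common convention; the `gate` function and its default threshold are illustrative.

```python
def gate(passed: int, total: int, threshold: float = 0.90) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    if total <= 0:
        return 1  # an empty eval should never pass silently
    rate = passed / total
    print(f"{passed}/{total} passed ({rate:.0%}), threshold {threshold:.0%}")
    return 0 if rate >= threshold else 1
```

A CI step would end with something like `sys.exit(gate(passed, total))`, so 46 of 50 lets the merge through and 44 of 50 (88%) blocks it.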

The LLM-as-judge caveat

An LLM judge lets you scale grading for fuzzy, open-ended outputs that rule-based checks cannot handle — tone, coherence, factual nuance. But the judge is itself a model, which means:

It can be inconsistent across runs.
It can disagree with human raters in systematic ways.
Prompt changes to the judge shift the scale, so scores recorded before the change are no longer comparable unless you re-grade them.

Spot-check the judge’s verdicts against human ratings on a sample of cases. If the judge and your team disagree on 20% of samples, you are measuring with a broken ruler. Calibrate the judge prompt or swap to a stronger model before trusting its output in CI.
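One way to run that spot-check, sketched under the assumption that both the judge and the human raters produce a pass/fail verdict for the same sample of cases. The function name and the returned keys are hypothetical.

```python
from typing import Dict, List

def judge_agreement(judge: List[bool], human: List[bool]) -> Dict[str, float]:
    """Compare judge verdicts with human verdicts on the same cases.

    Splitting the disagreements by direction shows whether the judge is
    systematically too lenient or too strict, not just how often it differs.
    """
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    lenient = sum(j and not h for j, h in zip(judge, human))  # judge passed, human failed
    strict = sum(h and not j for j, h in zip(judge, human))   # judge failed, human passed
    return {
        "agreement": (n - lenient - strict) / n,
        "judge_lenient": lenient / n,
        "judge_strict": strict / n,
    }
```

If almost all of the disagreement sits on one side, that is a hint about how to adjust the judge prompt.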

Common misconceptions

“I need hundreds of examples before evals are useful.” You do not. Twenty well-chosen, representative, hard cases catch more real regressions than 500 synthetic easy ones. Start small, grow from production failures.
“LLM-as-judge is cheating.” It is not cheating — it is a pragmatic tool for criteria that are genuinely hard to rule-check. The risk is calibration, not the approach itself.
“A high eval score means the model is good.” A high score means the model is good on your dataset. If your dataset does not represent production traffic, the score is telling you about the dataset, not the product.
“Evals are a one-time setup.” They are a living artifact. Every production incident that reveals a gap in coverage is a bug in your eval dataset, not just a bug in the model.

Frequently asked questions

How do I decide what threshold to set for the CI gate? Start by running your eval on a known-good baseline — ideally the version of the app that is currently in production. Record that score. Set your gate a few percentage points below it so you catch regressions without blocking every minor variation. Raise the threshold deliberately as your dataset matures and your model quality improves.
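As a rough sketch of that arithmetic: on a small dataset it can be easier to express the margin in whole cases than in percentage points, since each case in a 50-row eval is worth two points. The `allowed_flips` parameter is an assumption introduced here, not something the article prescribes.

```python
def gate_from_baseline(baseline_passed: int, total: int, allowed_flips: int = 1) -> float:
    """Derive a gate threshold from a known-good baseline run.

    The gate sits `allowed_flips` cases below the baseline pass count.
    """
    if total <= 0 or not 0 <= baseline_passed <= total:
        raise ValueError("baseline_passed must be between 0 and total, total > 0")
    return max(baseline_passed - allowed_flips, 0) / total
```

With a baseline of 46 out of 50 and one allowed flip, this gives a gate of 45 out of 50, or 90%.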

What if my LLM output is non-deterministic? Can I still eval it? Yes. Non-determinism is why you need enough cases, and ideally repeated runs of each case, for the score to be statistically stable. A single case flipping moves a 50-case eval by only two percentage points. If you need tighter reproducibility, set temperature to 0 for the eval run (assuming your prod config allows it) and note it in your eval metadata; this reduces run-to-run variation but does not always remove it.
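One hedged way to handle flaky cases is to run each case several times and look at its individual pass rate. This is a sketch of repeated trials, an addition to the approach described above rather than part of it; `run_once` stands for a hypothetical callable that runs the app on one case and grades the result.

```python
from typing import Callable

def case_pass_rate(run_once: Callable[[], bool], trials: int = 5) -> float:
    """Run one case `trials` times and return the fraction of passing runs."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return sum(bool(run_once()) for _ in range(trials)) / trials

def is_flaky(rate: float) -> bool:
    """A case is flaky if it neither always passes nor always fails."""
    return 0.0 < rate < 1.0
```

Flaky cases are worth listing separately in the eval report, since they are the ones most likely to flip the gate for reasons unrelated to the change under test.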

What counts as a “production failure” worth adding to the dataset? Any case where a user complained, where you found the output surprising in retrospect, or where a manual review flagged a problem. The signal is “this case exposed a gap in what the model can do on real data” — not just “the output was wrong once.”

Do I need a separate dataset for every model I test against? Not necessarily a separate dataset, but you should track scores per model. The same dataset run against two different models (or two different prompt versions) gives you a direct comparison. That is the whole point — the dataset is the constant, the model or prompt is the variable.
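A per-case comparison makes that concrete. This sketch assumes each run is stored as a mapping from case id to pass/fail; the `diff_runs` name is hypothetical, and only ids present in both runs are compared.

```python
from typing import Dict, List

def diff_runs(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> Dict[str, List[str]]:
    """List the cases whose verdict changed between two runs on the same dataset."""
    shared = sorted(set(baseline) & set(candidate))
    return {
        "regressions": [k for k in shared if baseline[k] and not candidate[k]],
        "fixes": [k for k in shared if not baseline[k] and candidate[k]],
    }

old = {"case-3": True, "case-12": False, "case-47": True}
new = {"case-3": True, "case-12": True, "case-47": False}
changes = diff_runs(old, new)
```

Here both runs pass two of three cases, so the aggregate score is identical, yet `changes` shows that `case-12` was fixed and `case-47` regressed.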

Where this fits in the series

This tutorial is part of How Claude Actually Works, a course that walks through the internals of building production-grade LLM applications — from how Claude processes context, to prompt design, to reliability and safety. Evals sit at the intersection of engineering rigor and LLM uncertainty: they are how you bring normal software-quality discipline to a non-deterministic system.

The next topic in the series covers prompt injection — what happens when user-supplied content tries to hijack your model’s behavior, and how to defend against it.

