How to Write Acceptance Criteria for LLM Output (Not Just 'Be Accurate')
“Extract the invoice accurately” sounds like a complete instruction. It is not. Run that prompt three times against the same document and you may get a total returned as a string in one response, a hallucinated PO number in another, and a dropped currency field in the third. The model did not break — your prompt had gaps, and the model filled them silently, differently every time.
This tutorial is about closing those gaps before the model ever runs. The goal is to replace adjectives like accurate or good with statements a tester could actually check.
The one-sentence version: Vague quality words in prompts are wishes — real acceptance criteria specify output format, edge-case handling, missing-field behavior, and tie-breaking rules so unambiguously that two engineers would write identical test graders from them.
The Core Problem: Silent Delegation
Every unanswered question in a prompt becomes a decision you have delegated to the model. The model will make that decision, and it may make it differently on different runs, with different documents, or after a model update. Some common silent delegations:
- Output format: should total be a number or a string like "$1,200.00"?
- Missing fields: what happens when the PO number is not on the document?
- Date ambiguity: is 04/05/2024 April 5th or May 4th?
- Currency: is the number 1200 USD, EUR, or whatever the model infers from context?
Each of these produces variance. Variance makes your system untestable.
The Four Rules for Turning Wishes into Criteria
Rule 1 — Format: Shape and Types Are a Contract
Specify the exact output schema including field names, value types, and (for strings) enumerated or constrained formats. Do not use language like “return the invoice data as JSON” — that still leaves dozens of decisions open.
// BAD — shape is unspecified
"Return the invoice data as JSON."
// GOOD — shape and types are the contract
{
"invoice_no": "string",
"total": "number",
"currency": "ISO 4217 three-letter code, e.g. USD",
"line_items": "array of objects with keys: description (string), amount (number)"
}
If total must be a number, say so. If currency must be an ISO 4217 code, say so. The model can infer plenty from context — your schema should remove the inferences you cannot afford to be wrong about.
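A schema written this way translates almost line for line into a check. The sketch below is one possible shape check for the four fields listed above, not a definitive validator. The function name `check_invoice_shape` and the small `ISO_4217_SAMPLE` set are assumptions made for illustration; a real check would use the full ISO 4217 list, and would need adjusting if fields are allowed to be null.

```python
from numbers import Real

# Illustrative subset only; the real ISO 4217 list is much longer.
ISO_4217_SAMPLE = {"USD", "EUR", "GBP"}


def is_number(value) -> bool:
    """True for ints and floats, but not bools (bool is a subclass of int)."""
    return isinstance(value, Real) and not isinstance(value, bool)


def check_invoice_shape(invoice: dict) -> list[str]:
    """Return contract violations; an empty list means the shape passes."""
    errors = []

    if not isinstance(invoice.get("invoice_no"), str):
        errors.append("invoice_no must be a string")

    if not is_number(invoice.get("total")):
        errors.append("total must be a number")

    currency = invoice.get("currency")
    if not isinstance(currency, str) or currency not in ISO_4217_SAMPLE:
        errors.append("currency must be an ISO 4217 code")

    items = invoice.get("line_items")
    if not isinstance(items, list):
        errors.append("line_items must be an array")
    else:
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                errors.append(f"line_items[{i}] must be an object")
                continue
            if not isinstance(item.get("description"), str):
                errors.append(f"line_items[{i}].description must be a string")
            if not is_number(item.get("amount")):
                errors.append(f"line_items[{i}].amount must be a number")

    return errors
```

A response that returns the total as "$1,200.00" fails with exactly one message, which is the kind of pass/fail signal the adjective "accurate" cannot give.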
Rule 2 — Edge Cases: Name the Weird Stuff Upfront
Your business rules are not obvious to a language model. If a negative total means a credit note, state that. If a multi-page invoice requires summing line items rather than trusting a printed total, state that. The model is not guessing randomly — it is making plausible assumptions. Your job is to replace its assumptions with your rules.
If total < 0, set is_credit_note: true.
If the invoice spans multiple pages, sum all line_items[*].amount
and use that as total; ignore any printed "Total" figure.
These are testable. “Handle edge cases appropriately” is not.
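To show what "testable" means here, the two rules above can be sketched as assertions over a parsed response. This is a hedged illustration: `check_edge_case_rules` is a hypothetical name, `page_count` is assumed to come from the test fixture rather than from the model, and the half-cent tolerance is an arbitrary choice.

```python
import math


def check_edge_case_rules(invoice: dict, page_count: int) -> list[str]:
    """Check the credit-note rule and the multi-page total rule."""
    errors = []
    total = invoice["total"]

    # Rule: a negative total marks a credit note.
    if total < 0 and invoice.get("is_credit_note") is not True:
        errors.append("negative total requires is_credit_note: true")

    # Rule: on multi-page invoices, total is the sum of line item amounts.
    if page_count > 1:
        expected = sum(item["amount"] for item in invoice["line_items"])
        # Assumed tolerance of half a cent to absorb float rounding.
        if not math.isclose(total, expected, abs_tol=0.005):
            errors.append(f"total {total} does not match line item sum {expected}")

    return errors
```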
Rule 3 — Missing Data: Give Every Absent Field an Explicit Fate
The hallucinated PO number in the opening example happened because the prompt was silent on what to do when a field was not present. Without instructions, the model’s default behavior is often to invent something plausible. One rule addresses this entire class of error:
Any field not present in the source document must be set to null.
Do not infer, guess, or generate values for absent fields.
This is not a workaround for a model weakness. It is a specification. Every system that reads untrusted documents needs it.
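Because the rule is explicit, a grader for it is short. The sketch below assumes a test fixture that records which fields are known to be absent from the source document. The name `check_missing_fields` is hypothetical, and treating an omitted key as a violation is one reading of "must be set to null".

```python
def check_missing_fields(output: dict, absent_fields: set[str]) -> list[str]:
    """Every field known to be absent from the source must be present and null."""
    errors = []
    for field in sorted(absent_fields):
        if field not in output:
            # Assumed reading: the key must exist with a null value.
            errors.append(f"{field}: key omitted, expected null")
        elif output[field] is not None:
            errors.append(f"{field}: expected null, got {output[field]!r}")
    return errors
```

For a document with no PO number, a response containing an invented value such as "PO-4471" fails this check, while a response with `"po_number": null` (parsed to `None` by `json.loads`) passes.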
Rule 4 — Ambiguity: Every Tie Gets an Explicit Breaker or an Honest Escape Hatch
Some ambiguities cannot be resolved from the document alone. A date written as 04/05 is genuinely ambiguous without knowing the source locale. The right response is not to pick one silently — it is to commit to a rule or surface the uncertainty.
Date format: prefer DD/MM since source documents are European.
If a date cannot be parsed unambiguously after applying this rule,
set the field value to null and set needs_review: true.
The needs_review flag is the escape hatch. It is an honest acknowledgment that a downstream human (or a second model pass) needs to resolve this case. Routing ambiguous cases to review is a feature, not a failure.
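The same rule can be written as a reference implementation for graders to compare against. This is a minimal sketch under stated assumptions: dates arrive as DD/MM/YYYY strings, the output field is named `date`, and the resolved value is serialized as an ISO 8601 string. None of those details come from the prompt above.

```python
from datetime import datetime


def resolve_date(raw: str) -> dict:
    """Apply the DD/MM rule; fall back to null plus needs_review."""
    try:
        # Day first, then month, per the European-source rule.
        parsed = datetime.strptime(raw.strip(), "%d/%m/%Y")
    except ValueError:
        # Missing year, month out of range, or any other unparseable input.
        return {"date": None, "needs_review": True}
    return {"date": parsed.date().isoformat(), "needs_review": False}


print(resolve_date("04/05/2024"))  # {'date': '2024-05-04', 'needs_review': False}
print(resolve_date("04/05"))       # {'date': None, 'needs_review': True}
```

The first call commits to May 4th because the rule says so. The second has no year, so it takes the escape hatch instead of guessing.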
The Before / After
| | Before (wish) | After (criteria) |
|---|---|---|
| Format | “return JSON” | Exact schema with field names and types specified |
| Edge cases | “handle correctly” | Explicit rule per business edge case |
| Missing data | (silent) | null + no invention |
| Ambiguity | (silent) | Explicit breaker or needs_review: true |
| Testability | Two engineers would write different graders | Two engineers would write the same grader |
The video demonstrates this directly: with the same model and the same source documents, the prompt rewritten to include these four rules produces three identical, stable outputs where the original prompt produced three different ones. Nothing about the model changed. Only the gaps in the prompt did.
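"Identical" can itself be made checkable. One way to compare repeated runs, sketched here with hypothetical helper names, is to parse each raw response and re-serialize it canonically so that key order and whitespace do not count as differences.

```python
import json


def canonical(raw_json: str) -> str:
    """Parse and re-serialize so key order and whitespace are ignored."""
    return json.dumps(json.loads(raw_json), sort_keys=True, separators=(",", ":"))


def outputs_are_stable(raw_outputs: list[str]) -> bool:
    """True when every run produced the same JSON value (needs at least one run)."""
    return len({canonical(raw) for raw in raw_outputs}) == 1


run_1 = '{"total": 1200, "currency": "USD"}'
run_2 = '{"currency":"USD","total":1200}'
run_3 = '{"total": "$1,200.00", "currency": "USD"}'

print(outputs_are_stable([run_1, run_2, run_1]))  # True
print(outputs_are_stable([run_1, run_2, run_3]))  # False
```

The third run differs only in how the total is typed, which is exactly the kind of drift the format rule is meant to remove.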
The “Two Engineers” Test
Here is a practical heuristic for evaluating your own prompts before you ship them:
If two engineers reading your prompt could build different graders (different test assertions, different pass/fail logic), it is still a wish.
Acceptance criteria are statements a tester could check. They are not adjectives. Accurate, good, appropriate, reasonable — none of these are checkable. They describe a desired feeling, not an observable output property.
Run the two-engineers test on every instruction in your prompt. Any instruction that could be interpreted two ways is an instruction you need to rewrite.
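To make the test concrete, here is a sketch of two graders that could both plausibly be written from the instruction "extract the total accurately". Both functions are hypothetical and stand in for two engineers' readings of the same wish.

```python
def grader_a(output: dict, expected_total: float) -> bool:
    """Engineer A: 'accurate' means the right value, typed as a number."""
    total = output.get("total")
    is_number = isinstance(total, (int, float)) and not isinstance(total, bool)
    return is_number and total == expected_total


def grader_b(output: dict, expected_total: float) -> bool:
    """Engineer B: 'accurate' means the right amount appears, in any format."""
    digits = str(output.get("total")).replace("$", "").replace(",", "")
    try:
        return float(digits) == expected_total
    except ValueError:
        return False


output = {"total": "$1,200.00"}
print(grader_a(output, 1200.0))  # False
print(grader_b(output, 1200.0))  # True
```

The same output passes one grader and fails the other. Once the prompt states that total must be a number, only one of these graders is still a defensible reading.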
Common Misconceptions
- “Specifying the schema will constrain the model too much.” A schema constrains the shape of the output, which is exactly what you want to constrain. The model still applies judgment to fill the fields — you are just removing ambiguity about what the fields should look like.
- “I can just post-process the output to fix format issues.” Post-processing that silently discards or mutates fields is a test you are not writing and a failure mode you are not surfacing. Specify upfront; validate on output.
- “The model is too unpredictable, so criteria will not help.” Much of the variability you observe in LLM output comes from underspecification rather than from sampling randomness alone. Closing the specification gaps removes that share of the variance.
- “‘Do not hallucinate’ is a valid instruction.” It is not. The model does not experience its own hallucinations as such. Telling it not to hallucinate does not give it a mechanism to know when it is doing so. The null + no-invention rule for missing fields and the needs_review flag for ambiguity are the actual mechanisms.
Frequently Asked Questions
How detailed does the schema specification need to be? Detailed enough that two engineers would write identical test assertions from it. In practice, that usually means: field names, value types, allowed formats for string fields (especially dates and codes), and explicit behavior for null/missing cases. If you are returning nested structures, specify one level of nesting at a time.
What if the source document format varies a lot between documents? Your criteria still need to specify the output shape — that does not change based on input variance. What changes is your edge-case section: enumerate the known input variations and state how each should be handled. If you discover a new variation at runtime, that is feedback to add a new edge-case rule, not a reason to make the criteria vaguer.
Is needs_review: true reliable? What if the model sets it incorrectly? No automated flag is perfectly reliable. The value of the escape hatch is that it surfaces cases the model itself identifies as ambiguous, which are usually the ones that would have produced the most variance or hallucination in a less-specified prompt. False negatives (missed ambiguities that should have been flagged) are reduced by explicit criteria; false positives (over-flagging) are usually harmless, routing a case to human review that could have been handled automatically.
Can these criteria be applied to non-extraction tasks — summarization, classification, generation? Yes. The four rules generalize: specify the output format (length, structure, tone constraints), name the edge cases (what to do with extremely short inputs, conflicting signals), define missing-data behavior (what to output when there is not enough information to respond), and provide tie-breakers for classification boundaries. The labels change; the structure does not.
Where This Fits in the Series
This tutorial is part of How Claude Actually Works, a course that builds a working mental model of how large language models behave — from tokenization and context windows through to prompt design and production reliability. The next topic picks up directly from the escape hatch introduced here: making the model surface its own uncertainty through confidence fields and routing, so your system can handle the cases it cannot fully resolve rather than silently committing to a wrong answer.