How to Pin Model Output Format Using Few-Shot Examples

June 23, 2026 · How Claude Actually Works (part 11)

▶ Watch on YouTube & subscribe to The Stack Underflow

When the model returns the wrong number format or flips a date’s day and month, the natural reflex is to drag the temperature slider to zero and rerun. That reflex is wrong — and this episode explains why, then gives you the actual fix.

The real lever is the messages array. By injecting a small set of worked examples directly into conversation history before your real request, you show the model exactly what correct input-output pairs look like. No fine-tuning, no infrastructure changes, no waiting.

The one-sentence version: Two or three concrete input/output examples pinned in the messages array teach the model your conventions more reliably than any combination of prose instructions or temperature adjustments.

The Problem: Silent, Expensive Format Failures

The video opens with a concrete scenario: you’re parsing a European invoice. The input is 1.234,56 € dated 7/4 (seventh of April). The model returns 123000 and a date in July.

Nothing crashed. No exception was thrown. The numbers just lie — and that’s the expensive kind of failure.

Two locale traps fired at once:

Field               Input convention        Model misread as
─────────────────   ─────────────────────   ───────────────────────────────────────
Decimal separator   , (European comma)      Thousands separator
Date order          DD/MM (day-month)       MM/DD (month-day), turning April → July

Both are silent. Both would pass a basic type check. You’d only catch them downstream — maybe in an audit, maybe never.
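To make the two misreadings concrete, here is a minimal Python sketch that parses the same strings under both conventions. It is only an illustration of the ambiguity, not a reconstruction of what the model does internally, and the year 2024 is an assumption added so the date parses cleanly (the invoice text carries no year).

```python
from datetime import datetime

raw_amount = "1.234,56"
raw_date = "7/4"
ASSUMED_YEAR = 2024  # assumption: the invoice text carries no year

# US-style reading: "," is a thousands separator, "." is the decimal point.
us_amount = float(raw_amount.replace(",", ""))
# European reading: "." is a thousands separator, "," is the decimal point.
eu_amount = float(raw_amount.replace(".", "").replace(",", "."))

# "7/4" is a valid date under both orders, so neither parse raises.
us_date = datetime.strptime(f"{raw_date}/{ASSUMED_YEAR}", "%m/%d/%Y")
eu_date = datetime.strptime(f"{raw_date}/{ASSUMED_YEAR}", "%d/%m/%Y")

print(us_amount, eu_amount)            # 1.23456 1234.56
print(us_date.date(), eu_date.date())  # 2024-07-04 2024-04-07
```

Both readings produce a well-typed float and a well-typed date, which is why a type check alone cannot tell the right one from the wrong one.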

Why Temperature Zero Doesn’t Help

The instinct: set temperature=0 and rerun for deterministic output. Same wrong output, now consistently wrong.

Temperature controls randomness, not knowledge. Lowering it makes the model commit to whatever answer it already ranked highest, including the wrong one. It does not transfer any understanding of European number formatting or day-first date order. The model’s weights are frozen after training; you cannot reach in and patch them with a slider.
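A toy sketch of standard temperature scaling shows why. The scores below are hypothetical, chosen so that the wrong reading already ranks first; nothing here reflects the model’s real internals.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: the model already prefers the wrong reading.
candidates = ["MM/DD (wrong)", "DD/MM (right)"]
logits = [2.0, 1.0]

for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    winner = candidates[probs.index(max(probs))]
    print(temperature, [round(p, 3) for p in probs], winner)
```

The winner is the wrong reading at every temperature; lowering the value only concentrates more probability on it. Temperature zero is the limiting case, where the top-ranked choice is always taken (the function above would divide by zero, so it is not called with 0).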

The Actual Fix: Few-Shot Examples in the Messages Array

The fix lives in the messages array. Before your real request, insert two or three synthetic turns that demonstrate correct behavior:

[
  {
    "role": "user",
    "content": "Parse this amount: 2.000,00 €"
  },
  {
    "role": "assistant",
    "content": "{\"amount\": 2000.00, \"currency\": \"EUR\"}"
  },
  {
    "role": "user",
    "content": "Parse this date: 31/12"
  },
  {
    "role": "assistant",
    "content": "{\"date\": \"2024-12-31\"}"
  },
  {
    "role": "user",
    "content": "Parse this invoice: Total 1.234,56 €, dated 7/4"
  }
]

Run that through the same model, same temperature. Output: 1234.56, date 2024-04-07. Two green checks.

The model generalized from your examples — it did not memorize them. The pattern clicked from the demonstrated pairs, and it applied the same logic to the new input.
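If you assemble requests in code, the messages array above can be built from a list of example pairs. The following is a minimal sketch; `build_few_shot_messages` is a hypothetical helper name, not part of any SDK.

```python
import json

def build_few_shot_messages(examples, request):
    """Return a messages list: example turns first, the real request last."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": request})
    return messages

# The same two worked examples shown in the JSON above.
examples = [
    ("Parse this amount: 2.000,00 €", '{"amount": 2000.00, "currency": "EUR"}'),
    ("Parse this date: 31/12", '{"date": "2024-12-31"}'),
]

messages = build_few_shot_messages(
    examples, "Parse this invoice: Total 1.234,56 €, dated 7/4"
)
print(json.dumps(messages, indent=2, ensure_ascii=False))
```

The resulting list is what you would pass as the messages field of a chat-style API request. Keeping the examples in one place also makes it easy to keep them identical across requests.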

Why This Works: Prompt as Runtime Context

Your examples live in the prompt, injected at runtime. The training data is frozen and unreachable, but the context window is yours to populate at inference time. No fine-tuning pipeline, no GPU hours, no waiting for a model update. The knowledge you inject is instant and costs only tokens.

This is the “show versus tell” principle made concrete:

TELL (prose rules):          SHOW (few-shot examples):
─────────────────────        ──────────────────────────
"Use European decimal        user: 2.000,00 €
 conventions. Commas         assistant: {"amount": 2000.00}
 are thousands separators
 in some locales but         user: 1.234,56 €
 decimal in others.          assistant: {"amount": 1234.56}
 Always output ISO dates."

The left side, a block of prose rules, can still miss corner cases. The right side, just two examples, handles those cases by demonstration. Prose describes; examples demonstrate.

Where the Examples Live in the Request

┌─────────────────────────────────────────────────────┐
│  messages array (your runtime context)              │
│                                                     │
│  ┌─────────────────────────────────────────────┐   │
│  │  [0] user:      "2.000,00 €"  ← example 1  │   │
│  │  [1] assistant: {"amount": 2000.00}         │   │
│  │  [2] user:      "31/12"       ← example 2  │   │
│  │  [3] assistant: {"date": "2024-12-31"}      │   │
│  │  [4] user:      REAL REQUEST               │   │
│  └─────────────────────────────────────────────┘   │
│                                                     │
│  Training data  ──────  FROZEN (cannot touch)       │
└─────────────────────────────────────────────────────┘

How Many Examples? The Dosage Curve

Accuracy rises steeply with one, two, and three examples, then flattens. Token cost climbs linearly the whole time.

Two or three examples is the sweet spot. Beyond that, you’re paying in tokens for diminishing accuracy returns. The video is explicit: don’t pile on ten examples thinking more is always better.

One practical optimization: cache those example turns. Because the prefix is identical across all your requests, a prompt cache hit means you pay only a fraction of the normal input cost for the few-shot context after the first call. (Covered in detail in episode 6 of the playlist.)
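As a rough sketch of what marking the example prefix for caching could look like, the snippet below attaches a cache marker to the last example turn. The `cache_control` block follows the shape described in Anthropic’s prompt caching documentation as I understand it; treat the exact field names as an assumption and verify them against the current API reference. `mark_cache_breakpoint` is a hypothetical helper.

```python
import copy

def mark_cache_breakpoint(example_messages):
    """Return a copy of the example turns with a cache marker on the last one."""
    marked = copy.deepcopy(example_messages)
    last = marked[-1]
    last["content"] = [
        {
            "type": "text",
            "text": last["content"],
            # Assumed marker format; verify against the API documentation.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    return marked

example_turns = [
    {"role": "user", "content": "Parse this amount: 2.000,00 €"},
    {"role": "assistant", "content": '{"amount": 2000.00, "currency": "EUR"}'},
    {"role": "user", "content": "Parse this date: 31/12"},
    {"role": "assistant", "content": '{"date": "2024-12-31"}'},
]

cached_prefix = mark_cache_breakpoint(example_turns)
request_messages = cached_prefix + [
    {"role": "user", "content": "Parse this invoice: Total 1.234,56 €, dated 7/4"}
]
```

Everything up to the marker must stay identical between requests for the cache to hit. Caching may also only apply above a minimum prefix length, so a very short example block might not be cached on its own; check the provider documentation before relying on the savings.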

Common Misconceptions

  • “Temperature zero gives me deterministic, correct output.” It gives you deterministic output — correct only if the model already knew the right answer. Zero temperature amplifies confidence, not correctness. It will confidently reproduce the same mistake every time.

  • “I need to fine-tune to teach the model my format conventions.” Fine-tuning is for persistent behavioral shifts across a whole model, not for per-request format pinning. Few-shot examples in the prompt are faster, cheaper, and reversible.

  • “More examples are always better.” Accuracy plateaus after two or three examples. Beyond that you’re burning tokens and potentially pushing useful context out of the window.

  • “Prose instructions are equivalent to examples.” They are not. Prose describes rules; examples demonstrate them. For edge cases and locale-specific conventions, demonstration reliably outperforms description.

Frequently Asked Questions

Does the model actually “learn” from few-shot examples? Not in a training sense — no weights are updated. The examples become part of the input context. The model pattern-matches against them at inference time, generalizing to the new input. It’s context-window reasoning, not learning.

Can I mix few-shot examples with a system prompt? Yes, and you should. The system prompt sets high-level behavior; the few-shot examples in the messages array pin the exact format. They complement each other. Put your worked examples in the messages array as synthetic user/assistant turns, not in the system prompt.

What if my examples are too similar to each other? Diversity matters. Pick examples that cover different surface forms of the same pattern — different amounts, different edge cases, different date values. If all your examples look identical, the model may overfit to superficial features of those specific strings rather than generalizing the underlying rule.
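One way to keep an example set both diverse and correct is to check it against a small reference parser before it ever reaches the prompt. The sketch below is hypothetical: the three amounts are invented to cover different surface forms, and `parse_european_amount` is an illustrative helper, not a library function.

```python
import json

def parse_european_amount(text):
    """Reference parser: '.' groups thousands, ',' marks the decimal point."""
    return float(text.replace(".", "").replace(",", "."))

# Hypothetical example set covering different surface forms of the same rule.
examples = [
    ("2.000,00", '{"amount": 2000.00}'),         # one thousands separator
    ("15,90", '{"amount": 15.90}'),              # no thousands separator
    ("1.250.300,05", '{"amount": 1250300.05}'),  # two thousands separators
]

# A wrong example teaches the wrong convention, so fail loudly if one slips in.
for raw, expected_json in examples:
    expected = json.loads(expected_json)["amount"]
    assert parse_european_amount(raw) == expected, f"bad example: {raw}"
```

The check costs nothing at request time and guards against the worst case: an example that demonstrates the very mistake you are trying to prevent.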

How does caching interact with few-shot examples? If the same few-shot prefix appears at the start of many requests (same examples, same order), the prompt cache can serve subsequent requests at a fraction of the token cost. Keep your examples stable and at the top of the messages array to maximize cache hits. Episode 6 of the series covers this.

Where This Fits in the Series

This tutorial is part of How Claude Actually Works, a course that builds up a working mental model of Claude from first principles. This episode (11) pairs with episode 401 on forcing tool calls and episode 404 on acceptance criteria: together they form a practical toolkit for getting reliable, structured output from the model. Browse all tutorials to follow the full series in order.
