How to Build a Structured Data Extraction Pipeline with Claude

June 23, 2026 · How Claude Actually Works (part 30)

Document extraction in production is never just “send the PDF, get JSON back.” Real invoices carry European decimal formats, ambiguous dates, and the occasional field that simply isn’t there. A pipeline that works on your ten test samples will embarrass you on invoice 847. This tutorial walks through the exact five-technique stack the video demonstrates — forced schema, few-shot examples, validate-and-retry, confidence routing, and prompt caching — assembled into one end-to-end flow.

The architecture sits on two planes: a prompts plane that carries policy, examples, and the output schema; and a reliability plane that extracts, validates, and routes every document. Understanding which technique belongs to which plane makes the whole thing much easier to debug and extend.

The one-sentence version: Force the model into a typed schema with tool_choice, lock conventions with few-shot examples, retry once on validation failure, divert low-confidence rows to humans, and cache the stable prompt prefix so the cached portion of the input costs roughly 90% less on repeat calls.

The Four-Stage Pipeline

A raw invoice enters on the left and traverses four stages before it either lands in the database or goes to a human review tray.

raw invoice
    │
    ▼
┌─────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│ EXTRACT │────▶│ VALIDATE │────▶│  ROUTE   │────▶│ DATABASE │
│ (tool)  │     │ (pydantic│     │ (≥70%    │     │          │
│         │◀────│  model)  │     │  conf.)  │     │          │
│  retry  │     └──────────┘     └─────┬────┘     └──────────┘
└─────────┘                            │
                                       ▼ (<70%)
                                  human review tray

Each stage has a clear responsibility and a clear failure mode. Keeping them separate means you can swap out the validation model or the routing threshold without touching the extraction call.

Technique 1 — Force the Schema with Tool Use

The extraction stage uses tool_choice to force a single tool call. Claude is not asked to “write JSON” — it is required to call a named tool whose input is defined by a Pydantic model compiled into a JSON Schema.

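from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Pydantic model → JSON Schema → tool definition
# (InvoiceModel is the Pydantic v2 model for the invoice fields, defined elsewhere)
invoice_tool = {
    "name": "extract_invoice",
    "description": "Extract structured fields from an invoice document.",
    "input_schema": InvoiceModel.model_json_schema(),
}

# messages: the few-shot turns followed by the document to extract
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=[invoice_tool],
    tool_choice={"type": "tool", "name": "extract_invoice"},  # forced
    messages=messages,
)

# Pick out the tool_use block rather than assuming it is the first block
tool_block = next(block for block in response.content if block.type == "tool_use")
extracted = tool_block.input  # a dict parsed from the tool call, not yet validated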

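Setting tool_choice to a specific tool name forces the model to call that tool on every request, so you read a dict instead of parsing prose. The dict is not guaranteed to satisfy every constraint in the schema, and a response cut off by max_tokens can leave it incomplete, which is why the validation stage exists.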
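For readers who do not have the model definition at hand, the tool definition can also be written by hand as a plain JSON Schema. The sketch below is an illustration only: `invoice_number` and `invoice_date` are assumed field names, and the article itself only shows `total_amount`, `currency`, and a 0–100 `confidence` field. The `missing_required` helper is likewise hypothetical.

```python
import json

# Hypothetical schema: invoice_number and invoice_date are illustrative
# assumptions, not the article's actual InvoiceModel.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
        "confidence": {"type": "integer", "minimum": 0, "maximum": 100},
    },
    "required": ["total_amount", "currency", "confidence"],
}

invoice_tool = {
    "name": "extract_invoice",
    "description": "Extract structured fields from an invoice document.",
    "input_schema": INVOICE_SCHEMA,
}


def missing_required(tool_input: dict) -> list:
    """Return the required field names that are absent from a tool input dict."""
    return [name for name in INVOICE_SCHEMA["required"] if name not in tool_input]


# Show the definition exactly as it would be sent in the tools list.
print(json.dumps(invoice_tool, indent=2))
```

A quick presence check like `missing_required` is no substitute for full validation, but it makes the point that the schema describes the expected shape rather than enforcing it.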

Technique 2 — Lock Conventions with Few-Shot Examples

Two prior example exchanges are pinned above the extraction call. One demonstrates the European decimal convention (1.234,56 means one thousand two hundred thirty-four point five six, not one point two three four). The other fixes the date format as day-month-year.

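FEW_SHOT_TURNS = [
    {
        "role": "user",
        "content": "Invoice shows total: 1.234,56",
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_use",
                "id": "ex_001",
                "name": "extract_invoice",
                "input": {"total_amount": 1234.56, "currency": "EUR"},
            }
        ],
    },
    {
        # every tool_use block must be answered by a tool_result in the next user turn
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": "ex_001", "content": "Recorded."}
        ],
    },
    # ... date convention example ...
]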

Few-shot examples are the cheapest way to enforce domain conventions without fine-tuning. They also serve as living documentation: a new engineer reading the prompt immediately sees why 1.234,56 maps to 1234.56.
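Hand-writing these turns gets repetitive as conventions accumulate. Because the Messages API expects each `tool_use` block to be answered by a `tool_result` block in the following user turn, one few-shot example is really three messages. A minimal sketch of a builder follows; the helper name `few_shot_example` and the "Recorded." acknowledgement text are assumptions, not part of the original pipeline.

```python
def few_shot_example(example_id: str, document_text: str, expected_input: dict) -> list:
    """Build the three messages that make up one tool-use few-shot example."""
    return [
        # 1. The example document, as the user would send it.
        {"role": "user", "content": document_text},
        # 2. The extraction we want the model to imitate.
        {
            "role": "assistant",
            "content": [
                {
                    "type": "tool_use",
                    "id": example_id,
                    "name": "extract_invoice",
                    "input": expected_input,
                }
            ],
        },
        # 3. The tool_result that closes the example's tool call.
        {
            "role": "user",
            "content": [
                {"type": "tool_result", "tool_use_id": example_id, "content": "Recorded."}
            ],
        },
    ]


FEW_SHOT_TURNS = [
    *few_shot_example(
        "ex_001",
        "Invoice shows total: 1.234,56",
        {"total_amount": 1234.56, "currency": "EUR"},
    ),
]
```

Each new convention then becomes one more `few_shot_example(...)` call in the list, with its own unique id.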

Technique 3 — Validate and Retry (Once)

After extraction, the output is run through the same Pydantic model. If validation fails, the error is appended as a new user turn and the model retries — once.

from pydantic import ValidationError

def extract_with_retry(document: str) -> InvoiceModel:
    messages = [*FEW_SHOT_TURNS, {"role": "user", "content": document}]

    for attempt in range(2):  # max one retry
        raw = call_extract(messages)  # returns the tool_use input dict
        try:
            return InvoiceModel(**raw)
        except ValidationError as e:
            if attempt == 1:
                raise  # second failure: let the caller escalate to a human
            # replay the failed call, then return the error as its tool_result
            messages.append({"role": "assistant", "content": [{"type": "tool_use", "id": "retry_01", "name": "extract_invoice", "input": raw}]})
            messages.append({"role": "user", "content": [{"type": "tool_result", "tool_use_id": "retry_01", "is_error": True, "content": f"Validation failed: {e}. Please correct and retry."}]})

The one-retry cap is deliberate. A model that fails twice on the same document is almost certainly facing genuine ambiguity — something a human should resolve, not a third API call.
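What "a human should resolve" looks like in code is left open above. One possible shape, sketched under assumptions: a wrapper catches the final failure and queues the document for review instead of aborting the batch. The names `extract_or_escalate` and `review_tray` are hypothetical, and the extractor is passed in as a callable so the sketch runs without an API key. It catches `ValueError`, which Pydantic's `ValidationError` subclasses.

```python
from typing import Callable, Optional


def extract_or_escalate(
    document: str,
    extract: Callable[[str], dict],
    review_tray: list,
) -> Optional[dict]:
    """Return the extracted record, or queue the document for human review."""
    try:
        return extract(document)
    except ValueError as error:
        # The extractor has already used its one retry; do not call it again.
        review_tray.append({"document": document, "reason": str(error)})
        return None


# Stub extractors standing in for the real extract-and-retry call.
def always_fails(document: str) -> dict:
    raise ValueError("total_amount: field required")


def always_succeeds(document: str) -> dict:
    return {"total_amount": 1234.56, "currency": "EUR", "confidence": 91}


tray = []
failed = extract_or_escalate("scan-0847", always_fails, tray)
passed = extract_or_escalate("scan-0848", always_succeeds, tray)
print(failed, passed, tray)
```

Keeping the escalation outside the retry loop means the loop stays a pure "extract or raise" function, and the batch runner decides what a failure costs.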

Technique 4 — Confidence Routing

Every output row carries a confidence field (0–100). A threshold at 70 splits the stream:

Confidence    Destination               Volume (example run)
≥ 70          Database (auto-accept)    ~92% of invoices
< 70          Human review tray         ~8% of invoices

This is the key insight of the pipeline: you do not need the model to be right on every document. You need it to know when it is uncertain and route accordingly. Humans see only the ambiguous slice, which is a much better use of their time than reviewing every extracted record.
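The routing branch itself is only a few lines. The sketch below assumes the record is a plain dict with a `confidence` key; the destination labels `"database"` and `"human_review"` and the decision to send missing or out-of-range scores to review are illustrative choices, not something the pipeline above specifies.

```python
REVIEW_THRESHOLD = 70  # the threshold used in this article; tune it for your data


def route(record: dict, threshold: int = REVIEW_THRESHOLD) -> str:
    """Return 'database' for confident rows and 'human_review' for the rest."""
    confidence = record.get("confidence")
    # A missing or non-numeric score is treated as "not confident".
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return "human_review"
    # Scores outside 0-100 suggest a malformed row, so a person should look.
    if not 0 <= confidence <= 100:
        return "human_review"
    return "database" if confidence >= threshold else "human_review"


for row in ({"confidence": 93}, {"confidence": 70}, {"confidence": 41}, {}):
    print(row, "->", route(row))
```

Failing closed (to review) on anything unexpected keeps a malformed score from slipping into the database as an auto-accept.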

Technique 5 — Cache the Stable Prefix

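The vendor policy block (decimal rules, date rules, the “do not invent values” instruction, and the few-shot turns) does not change between invoices. Marking the end of that prefix with a cache_control breakpoint lets repeat calls within the cache lifetime read it back at roughly 10% of the normal input price, a saving of about 90% on those tokens. The first call pays a modest premium to write the cache, and the invoice text after the breakpoint is billed at the normal rate.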

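# VENDOR_POLICY and FEW_SHOT_TEXT are the stable strings shared by every invoice
system_prompt = [
    {
        "type": "text",
        "text": VENDOR_POLICY + FEW_SHOT_TEXT,
        "cache_control": {"type": "ephemeral"},  # cache everything up to here
    }
]
# pass it as system=system_prompt in client.messages.create(...)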

On a thousand-invoice batch, that discount on the shared prefix is the difference between a pipeline that is economical and one that is a line item someone questions in a budget meeting.
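A little arithmetic shows how the prefix discount translates into a whole-batch saving. Everything numeric here is an assumption for illustration: a 5,000-token prefix, 800 tokens of invoice text per call, a 1.25× multiplier for the first (cache-writing) call and a 0.10× multiplier for cache reads. It also assumes every call lands inside the cache lifetime and that the prefix is long enough to be cacheable on the chosen model. Check current pricing before relying on the multipliers.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # assumed premium for the call that writes the cache
CACHE_READ_MULTIPLIER = 0.10   # assumed price of a cached read vs. normal input


def batch_input_tokens(n_docs: int, prefix_tokens: int, doc_tokens: int, cached: bool) -> float:
    """Billable input for a batch, in base-price token equivalents."""
    if n_docs <= 0:
        return 0.0
    if not cached:
        return float(n_docs * (prefix_tokens + doc_tokens))
    first_call = CACHE_WRITE_MULTIPLIER * prefix_tokens
    later_calls = (n_docs - 1) * CACHE_READ_MULTIPLIER * prefix_tokens
    # The per-invoice text is never cached, so it is billed in full every time.
    return first_call + later_calls + n_docs * doc_tokens


uncached = batch_input_tokens(1000, 5000, 800, cached=False)
cached = batch_input_tokens(1000, 5000, 800, cached=True)
print(f"uncached: {uncached:,.0f}  cached: {cached:,.0f}")
print(f"overall input saving: {1 - cached / uncached:.1%}")
```

Under these assumptions the overall input saving is lower than 90%, because the invoice text itself is always billed at the full rate. The longer the shared prefix is relative to each document, the closer the batch gets to the headline figure.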

At Scale

Pour a thousand invoices through this pipeline and the numbers shake out to roughly 92% auto-accepted into the database and 8% diverted to human review. The clean majority flows without human touch. The ambiguous slice — the genuinely hard cases — gets human attention. That is the goal.

Common Misconceptions

“Forcing a schema is the same as asking for JSON in the prompt.” It is not. tool_choice makes the model call a typed function; a prompt instruction to “output JSON” still produces prose that happens to look like JSON. The model can — and occasionally will — add commentary outside the JSON block when not constrained by tool use.
“Few-shot examples are just for teaching the model what the output looks like.” They also encode domain conventions that are invisible to a model trained on generic text. European number formatting, fiscal-year date conventions, and company-specific field names are all candidates for few-shot pins.
“One retry is not enough — you should keep retrying until it passes.” Unbounded retries turn rare failures into runaway costs. If validation fails twice, you have hit the ceiling of what the model can self-correct; escalate to a human or a fallback.
“Confidence routing requires a separate classification step.” No — the confidence score is a field in the same tool_use output. The model produces it in a single call; your code just reads it and branches.

Frequently Asked Questions

What Pydantic version works with model_json_schema()? Pydantic v2. In v1 the equivalent is .schema(). If you are on a mixed codebase, model_json_schema() will raise an AttributeError on v1 models — a fast way to find the mismatch.

How do I choose the confidence threshold? Start by measuring the false-accept rate (confident rows that were actually wrong) and false-divert rate (unconfident rows that were actually correct) on a labeled sample. 70 is a reasonable starting point, but the right number depends on the cost of a bad auto-accept versus the cost of unnecessary human review in your domain.
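A small sketch of that measurement follows. The labeled sample is made-up toy data, and the rate definitions are one reasonable reading of the terms above: false-accept rate as the share of auto-accepted rows that were wrong, and false-divert rate as the share of diverted rows that were correct. The function name `threshold_rates` is hypothetical.

```python
def threshold_rates(samples: list, threshold: int) -> tuple:
    """Return (false_accept_rate, false_divert_rate) for a labeled sample.

    samples is a list of (confidence, is_correct) pairs.
    """
    accepted = [is_correct for confidence, is_correct in samples if confidence >= threshold]
    diverted = [is_correct for confidence, is_correct in samples if confidence < threshold]
    # Share of auto-accepted rows that were actually wrong.
    false_accept = accepted.count(False) / len(accepted) if accepted else 0.0
    # Share of diverted rows that were actually correct.
    false_divert = diverted.count(True) / len(diverted) if diverted else 0.0
    return false_accept, false_divert


# Toy labeled sample (invented for illustration): (confidence, was the extraction correct?)
labeled = [(95, True), (88, True), (81, True), (72, False), (65, True), (40, False)]

for candidate in (60, 70, 80):
    fa, fd = threshold_rates(labeled, candidate)
    print(f"threshold {candidate}: false-accept {fa:.2f}, false-divert {fd:.2f}")
```

Sweeping a few candidate thresholds over a real labeled sample makes the trade-off visible: raising the threshold lowers false accepts and sends more correct rows to review.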

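Does cache_control: ephemeral persist across sessions or API keys? It is not tied to a session, but it is short-lived. Cache entries are isolated between organizations, and the default time-to-live is five minutes, refreshed each time the cached prefix is read. A longer one-hour lifetime is available at a higher write cost. Either way, it is designed for high-throughput batches, not for caching across days or between unrelated callers.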

Can I use this pattern for documents other than invoices? Absolutely. The same five techniques apply to any document type where you need typed fields, domain conventions, and a human-in-the-loop fallback: receipts, contracts, medical forms, shipping manifests. The schema and few-shot examples change; the pipeline shape stays the same.

Where This Fits in the Series

This tutorial is part of How Claude Actually Works, a course that builds from first principles — tokens, context windows, tool use — up through production-grade patterns like the one shown here. The extraction pipeline is scenario 6 of the course’s applied scenarios; it draws on tool use, prompt caching, and reliability patterns covered in earlier episodes. Once you have this pipeline working, the series finale ties all the techniques together into a view of the full stack. Browse all tutorials to find the episodes that fill any gaps.
