How Prompt Caching Cuts Your AI Bill ~90% (and the Floor Trap)

June 23, 2026 · How Claude Actually Works (part 21)

▶ Watch on YouTube & subscribe to The Stack Underflow

If you send the same big chunk of context on every API call — a long system prompt, a knowledge base, a set of tool definitions — you’re paying full input price to re-process identical text over and over. Prompt caching fixes exactly that: the provider stores the processed prefix and lets you reuse it cheaply.

Used right, it cuts the input cost of repeated context by roughly 90%. Used carelessly, the “floor trap” and the short cache lifetime quietly eat the savings. This tutorial covers both.

The one-sentence version: prompt caching stores a stable prompt prefix so subsequent calls pay ~10% of input price for it instead of 100% — but reads have a 10% floor, there’s a minimum size, and the cache expires fast, so it only pays off when you reuse the same prefix soon and often.

The problem it solves

Every call re-sends and re-processes your whole input — system prompt, tool definitions, retrieved docs, history. If 90% of that is identical call to call, you’re repaying for the same work constantly. (If that surprises you, see the tokens tutorial earlier in this series — you pay per input token, every call.)

Prompt caching says: process that stable part once, store it, and on the next call charge me a small fraction to reuse it.

How it works

You mark a cache breakpoint at the end of the stable portion of your prompt. The rule that makes it work: caching applies to the prefix. Everything before the breakpoint must be byte-for-byte identical to hit the cache; everything after it is the part that changes per call.

┌─────────── cached prefix (stable) ───────────┐ ┌── varies ──┐
[ system prompt ][ tool defs ][ knowledge base ] [ user's question ]
                                               ↑
                                        cache breakpoint

So you put the unchanging stuff first (system prompt, tools, big reference context) and the changing stuff last (the user’s actual query). Order matters — a single different character early in the prompt busts the entire cache.

The real numbers (and why ~90%)

On Anthropic’s API (current rates — always verify against the pricing docs):

Operation	Cost vs. base input	Meaning
Cache read (hit)	0.1× (10%)	Reusing cached prefix is 90% cheaper
5-minute cache write	1.25×	Writing the cache costs a 25% premium once
1-hour cache write	2×	Longer retention costs more to write

Because a read is 0.1× and a 5-minute write is 1.25×, caching pays for itself after a single reuse within the window. Cache a 20,000-token knowledge base once, reuse it across many questions, and that 20K drops from full price to a tenth on every subsequent call. That’s the ~90% headline.

The floor trap (read this part)

Here’s what quietly costs people money:

The 10% floor. Cache reads cost 0.1× — not zero. There is a floor: cached context still costs 10% of input price on every call. If your prefix is gigantic, 10% of gigantic is still real money. Caching makes big context cheaper, not free — don’t treat it as a license to stuff unlimited context.
The expiry trap. The default cache lifetime is 5 minutes (it was quietly shortened from 1 hour in early 2026). If your traffic is sporadic — a call every 10 minutes — the cache expires between calls, so you keep paying the write premium (1.25×) to re-create it and never collect the cheap reads. Caching helps bursty, frequent reuse; it can actively hurt slow, intermittent traffic.
The minimum-size floor. There’s a minimum cacheable prefix (1,024 tokens on current Opus/Sonnet models). Below that, caching does nothing — small prompts can’t be cached at all.

When caching pays off vs. when it doesn’t

Situation	Caching helps?
Same 15K system prompt across many fast calls	✅ Big win
Long doc Q&A, many questions in a session	✅ Big win
One-off call with unique context	❌ You only pay the write premium
A call every 15 minutes (cache expires)	❌ Repaying writes, no reads
Prompt under ~1K tokens	❌ Below the minimum
Context changes every call	❌ Nothing stable to cache

Common misconceptions

“Cached tokens are free.” No — reads cost 10% of input. That’s the floor.
“Just turn caching on everywhere.” Only helps when you reuse a stable prefix soon and often. Sporadic traffic pays write premiums for nothing.
“Order doesn’t matter.” It’s everything. Stable content first, variable content last, or you never hit the cache.
“The cache lasts a while.” Default is ~5 minutes. Assume it’s gone fast unless you pay for longer retention.

Frequently asked questions

What exactly can I cache? Any stable prompt prefix — system prompt, tool definitions, large reference documents — as long as it’s identical call-to-call and meets the minimum length.

Why did my caching savings disappear? Likely the 5-minute expiry: if calls are spaced further apart than the TTL, the cache dies between them and you repay the write each time.

Does anything after the breakpoint get cached? No — only the prefix up to the breakpoint. The variable tail is always full-price input.

How do I know it’s working? The API response reports cache read vs. write token counts. Watch those: lots of writes and few reads means your cache isn’t being reused in time.

Where this fits in the series

This is part of the production/cost stretch of How Claude Actually Works — the episodes about running Claude affordably and reliably at scale. Continue with the navigation below, or browse all tutorials.

Sources: Anthropic prompt caching docs · Anthropic pricing

Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.

Subscribe on YouTube →