How Prompt Caching Cuts Your AI Bill ~90% (and the Floor Trap)
▶ Watch on YouTube & subscribe to The Stack Underflow
If you send the same big chunk of context on every API call — a long system prompt, a knowledge base, a set of tool definitions — you’re paying full input price to re-process identical text over and over. Prompt caching fixes exactly that: the provider stores the processed prefix and lets you reuse it cheaply.
Used right, it cuts the input cost of repeated context by roughly 90%. Used carelessly, the “floor trap” and the short cache lifetime quietly eat the savings. This tutorial covers both.
The one-sentence version: prompt caching stores a stable prompt prefix so subsequent calls pay ~10% of input price for it instead of 100% — but reads have a 10% floor, there’s a minimum size, and the cache expires fast, so it only pays off when you reuse the same prefix soon and often.
The problem it solves
Every call re-sends and re-processes your whole input — system prompt, tool definitions, retrieved docs, history. If 90% of that is identical call to call, you’re repaying for the same work constantly. (If that surprises you, see the tokens tutorial earlier in this series — you pay per input token, every call.)
Prompt caching says: process that stable part once, store it, and on the next call charge me a small fraction to reuse it.
How it works
You mark a cache breakpoint at the end of the stable portion of your prompt. The rule that makes it work: caching applies to the prefix. Everything before the breakpoint must be byte-for-byte identical to hit the cache; everything after it is the part that changes per call.
┌─────────── cached prefix (stable) ───────────┐ ┌── varies ──┐
[ system prompt ][ tool defs ][ knowledge base ] [ user's question ]
↑
cache breakpoint
So you put the unchanging stuff first (system prompt, tools, big reference context) and the changing stuff last (the user’s actual query). Order matters — a single different character early in the prompt busts the entire cache.
The real numbers (and why ~90%)
On Anthropic’s API (current rates — always verify against the pricing docs):
| Operation | Cost vs. base input | Meaning |
|---|---|---|
| Cache read (hit) | 0.1× (10%) | Reusing cached prefix is 90% cheaper |
| 5-minute cache write | 1.25× | Writing the cache costs a 25% premium once |
| 1-hour cache write | 2× | Longer retention costs more to write |
Because a read is 0.1× and a 5-minute write is 1.25×, caching pays for itself after a single reuse within the window. Cache a 20,000-token knowledge base once, reuse it across many questions, and that 20K drops from full price to a tenth on every subsequent call. That’s the ~90% headline.
The floor trap (read this part)
Here’s what quietly costs people money:
-
The 10% floor. Cache reads cost 0.1× — not zero. There is a floor: cached context still costs 10% of input price on every call. If your prefix is gigantic, 10% of gigantic is still real money. Caching makes big context cheaper, not free — don’t treat it as a license to stuff unlimited context.
-
The expiry trap. The default cache lifetime is 5 minutes (it was quietly shortened from 1 hour in early 2026). If your traffic is sporadic — a call every 10 minutes — the cache expires between calls, so you keep paying the write premium (1.25×) to re-create it and never collect the cheap reads. Caching helps bursty, frequent reuse; it can actively hurt slow, intermittent traffic.
-
The minimum-size floor. There’s a minimum cacheable prefix (1,024 tokens on current Opus/Sonnet models). Below that, caching does nothing — small prompts can’t be cached at all.
When caching pays off vs. when it doesn’t
| Situation | Caching helps? |
|---|---|
| Same 15K system prompt across many fast calls | ✅ Big win |
| Long doc Q&A, many questions in a session | ✅ Big win |
| One-off call with unique context | ❌ You only pay the write premium |
| A call every 15 minutes (cache expires) | ❌ Repaying writes, no reads |
| Prompt under ~1K tokens | ❌ Below the minimum |
| Context changes every call | ❌ Nothing stable to cache |
Common misconceptions
- “Cached tokens are free.” No — reads cost 10% of input. That’s the floor.
- “Just turn caching on everywhere.” Only helps when you reuse a stable prefix soon and often. Sporadic traffic pays write premiums for nothing.
- “Order doesn’t matter.” It’s everything. Stable content first, variable content last, or you never hit the cache.
- “The cache lasts a while.” Default is ~5 minutes. Assume it’s gone fast unless you pay for longer retention.
Frequently asked questions
What exactly can I cache? Any stable prompt prefix — system prompt, tool definitions, large reference documents — as long as it’s identical call-to-call and meets the minimum length.
Why did my caching savings disappear? Likely the 5-minute expiry: if calls are spaced further apart than the TTL, the cache dies between them and you repay the write each time.
Does anything after the breakpoint get cached? No — only the prefix up to the breakpoint. The variable tail is always full-price input.
How do I know it’s working? The API response reports cache read vs. write token counts. Watch those: lots of writes and few reads means your cache isn’t being reused in time.
Where this fits in the series
This is part of the production/cost stretch of How Claude Actually Works — the episodes about running Claude affordably and reliably at scale. Continue with the navigation below, or browse all tutorials.
Sources: Anthropic prompt caching docs · Anthropic pricing
Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.
Subscribe on YouTube →