How Prompt Caching Cuts Your AI Bill ~90% (and the Floor Trap)

June 23, 2026 · How Claude Actually Works (part 21)

▶ Watch on YouTube & subscribe to The Stack Underflow

If you send the same big chunk of context on every API call — a long system prompt, a knowledge base, a set of tool definitions — you’re paying full input price to re-process identical text over and over. Prompt caching fixes exactly that: the provider stores the processed prefix and lets you reuse it cheaply.

Used right, it cuts the input cost of repeated context by roughly 90%. Used carelessly, the “floor trap” and the short cache lifetime quietly eat the savings. This tutorial covers both.

The one-sentence version: prompt caching stores a stable prompt prefix so subsequent calls pay ~10% of input price for it instead of 100% — but reads have a 10% floor, there’s a minimum size, and the cache expires fast, so it only pays off when you reuse the same prefix soon and often.

The problem it solves

Every call re-sends and re-processes your whole input — system prompt, tool definitions, retrieved docs, history. If 90% of that is identical call to call, you’re repaying for the same work constantly. (If that surprises you, see the tokens tutorial earlier in this series — you pay per input token, every call.)

Prompt caching says: process that stable part once, store it, and on the next call charge me a small fraction to reuse it.

How it works

You mark a cache breakpoint at the end of the stable portion of your prompt. The rule that makes it work: caching applies to the prefix. Everything before the breakpoint must be byte-for-byte identical to hit the cache; everything after it is the part that changes per call.

┌─────────── cached prefix (stable) ───────────┐ ┌── varies ──┐
[ system prompt ][ tool defs ][ knowledge base ] [ user's question ]

                                        cache breakpoint

So you put the unchanging stuff first (system prompt, tools, big reference context) and the changing stuff last (the user’s actual query). Order matters — a single different character early in the prompt busts the entire cache.

The real numbers (and why ~90%)

On Anthropic’s API (current rates — always verify against the pricing docs):

OperationCost vs. base inputMeaning
Cache read (hit)0.1× (10%)Reusing cached prefix is 90% cheaper
5-minute cache write1.25×Writing the cache costs a 25% premium once
1-hour cache writeLonger retention costs more to write

Because a read is 0.1× and a 5-minute write is 1.25×, caching pays for itself after a single reuse within the window. Cache a 20,000-token knowledge base once, reuse it across many questions, and that 20K drops from full price to a tenth on every subsequent call. That’s the ~90% headline.

The floor trap (read this part)

Here’s what quietly costs people money:

  1. The 10% floor. Cache reads cost 0.1× — not zero. There is a floor: cached context still costs 10% of input price on every call. If your prefix is gigantic, 10% of gigantic is still real money. Caching makes big context cheaper, not free — don’t treat it as a license to stuff unlimited context.

  2. The expiry trap. The default cache lifetime is 5 minutes (it was quietly shortened from 1 hour in early 2026). If your traffic is sporadic — a call every 10 minutes — the cache expires between calls, so you keep paying the write premium (1.25×) to re-create it and never collect the cheap reads. Caching helps bursty, frequent reuse; it can actively hurt slow, intermittent traffic.

  3. The minimum-size floor. There’s a minimum cacheable prefix (1,024 tokens on current Opus/Sonnet models). Below that, caching does nothing — small prompts can’t be cached at all.

When caching pays off vs. when it doesn’t

SituationCaching helps?
Same 15K system prompt across many fast calls✅ Big win
Long doc Q&A, many questions in a session✅ Big win
One-off call with unique context❌ You only pay the write premium
A call every 15 minutes (cache expires)❌ Repaying writes, no reads
Prompt under ~1K tokens❌ Below the minimum
Context changes every call❌ Nothing stable to cache

Common misconceptions

  • “Cached tokens are free.” No — reads cost 10% of input. That’s the floor.
  • “Just turn caching on everywhere.” Only helps when you reuse a stable prefix soon and often. Sporadic traffic pays write premiums for nothing.
  • “Order doesn’t matter.” It’s everything. Stable content first, variable content last, or you never hit the cache.
  • “The cache lasts a while.” Default is ~5 minutes. Assume it’s gone fast unless you pay for longer retention.

Frequently asked questions

What exactly can I cache? Any stable prompt prefix — system prompt, tool definitions, large reference documents — as long as it’s identical call-to-call and meets the minimum length.

Why did my caching savings disappear? Likely the 5-minute expiry: if calls are spaced further apart than the TTL, the cache dies between them and you repay the write each time.

Does anything after the breakpoint get cached? No — only the prefix up to the breakpoint. The variable tail is always full-price input.

How do I know it’s working? The API response reports cache read vs. write token counts. Watch those: lots of writes and few reads means your cache isn’t being reused in time.

Where this fits in the series

This is part of the production/cost stretch of How Claude Actually Works — the episodes about running Claude affordably and reliably at scale. Continue with the navigation below, or browse all tutorials.


Sources: Anthropic prompt caching docs · Anthropic pricing

Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.

Subscribe on YouTube →