How Claude Token Billing Works: Input, Output, and Cache Costs

June 23, 2026 · How Claude Actually Works (part 26)

▶ Watch on YouTube & subscribe to The Stack Underflow

Every Claude API invoice is built from three and only three line items: input tokens, output tokens, and cached tokens. Understanding what lands in each bucket — and the multipliers stacked on top — is the difference between a demo that costs pennies and a production agent that costs a small fortune.

This lesson walks through the full cost map. First the buckets and multipliers, then the three most common “leaks” developers hit, and finally the concrete fixes for each one.

The one-sentence version: Output tokens cost roughly 5x input tokens, agent loops repay your entire prefix on every iteration, and three targeted fixes (cache control, tool-output pruning, and model tier routing) will eliminate most of your runaway bills.

The Three Cost Buckets

Every provider bill — Anthropic, OpenAI, or anyone else running a frontier model — reduces to the same three categories:

Bucket	What it contains	Typical relative cost
Input tokens	System prompt + conversation history + tool results	1x (baseline)
Output tokens	The model’s generated response	~5x input rate
Cached tokens	Input tokens served from the prompt cache	~0.1x input rate

Nothing else goes on the bill. Understanding which bucket a given design decision fills is the core skill for cost-aware LLM engineering.

The 5x Output Multiplier

Output tokens are expensive. On most frontier models today, generating a token in the response costs roughly five times what it costs to feed a token in the prompt. That ratio is not arbitrary — inference compute scales with the number of tokens the model has to materialize, and the providers price accordingly.

The practical implication: verbose model outputs are not just slow, they are disproportionately expensive. A model that generates a 2,000-token explanation where 400 tokens would suffice is burning roughly 8,000 token-equivalents of budget on the output side alone. Conciseness is a performance optimization, not just a style preference.

The Prefix Repayment Problem in Agent Loops

Single-turn API calls are cheap. Agent loops are where costs get surprising.

Here is the key thing most developers miss: the model has no memory across API calls. Each call is stateless. So when your agent loops — checking a tool result, deciding the next step, calling another tool — it must resend the full system prompt plus the entire conversation history as a prefix on every single iteration.

Loop iteration 1:  [system prompt] + [tool call 1]             → response
Loop iteration 2:  [system prompt] + [tool call 1] + [result 1] + [tool call 2] → response
Loop iteration 3:  [system prompt] + [tool call 1] + [result 1] + [tool call 2] + [result 2] ... → response

If your loop runs n times, you pay for your system prompt n times. You pay for the first tool result n-1 times. The prefix is rebilled on every lap. This is why agent demos are cheap (two or three loop iterations, small outputs) and agent products are not (ten to fifty iterations, growing context on every turn).

The Three Cost Leaks — and How to Fix Them

Leak 1: Uncached Prefix Repaid Every Loop

The problem: The system prompt and stable conversation history are resent as raw input tokens on every agent iteration.

The fix: Cache control. Mark the stable prefix as cacheable, and every subsequent call that hits the cache pays roughly 1/10 of the standard input token rate. On a ten-iteration agent loop, that turns nine full-price prefix bills into nine discounted cache reads — often a 70–90% reduction in input costs for that prefix alone.

This is covered in depth in lesson 06-01 of this series.

Leak 2: Large Tool Outputs Accumulating in Context

The problem: A tool returns 8 KB of JSON. You dump it straight into the context window. Now it becomes part of the prefix for every subsequent turn, resent in full, at input token prices, forever.

The fix: Prune, summarize, slice, or drop tool output before it graduates to permanent prefix. If the downstream steps only need three fields from an 8 KB response, extract those three fields and discard the rest before appending to the conversation. If the raw output was needed once but not again, drop it from context after it has been processed.

Lesson 06-02 covers tool-output management strategies.

Leak 3: Using a Top-Tier Model for a Bottom-Tier Task

The problem: You have one API client pointed at Opus (or whatever the top-tier model is). Every call goes there — including the classifier that returns "positive" or "negative", the reformatter that just cleans up punctuation, and the yes/no gating step.

The fix: Tier routing. Map task complexity to model tier before the call goes out. A yes/no classifier belongs on the smallest, fastest, cheapest model available. A reformatter does too. Reserve the top-tier model for the calls that actually require it — complex reasoning, multi-step planning, nuanced generation. The cost difference between tiers is typically 10x to 20x; routing even half your calls to a smaller model can cut your bill in half.

def route_to_model(task_type: str) -> str:
    simple_tasks = {"classify", "reformat", "yes_no", "extract_field"}
    if task_type in simple_tasks:
        return "claude-haiku-4-5"   # fast, cheap
    return "claude-opus-4-5"        # reserved for tasks that need it

The Full Cost Map at a Glance

A single request
├── Input tokens         (1x)
├── Output tokens        (5x)  ← verbose answers burn budget fast
└── Cached tokens        (0.1x) ← cache your stable prefix

An agent loop (n iterations)
├── Prefix repaid n times       ← Leak 1: fix with cache control
├── Tool output grows prefix    ← Leak 2: fix with pruning/summarization
└── All calls hit top-tier model ← Leak 3: fix with tier routing

Common Misconceptions

“Output is cheap because prompts are longer.” Output tokens cost roughly 5x input tokens on frontier models. A short prompt producing a long answer is more expensive than a long prompt producing a short answer, token-for-token.
“My agent only runs 10 iterations so costs are bounded.” The prefix grows on every iteration. Iteration 10 sends the system prompt plus nine rounds of tool calls and results. The cost per iteration rises as the loop progresses, not stays flat.
“Caching is for static content, not agents.” The system prompt and initial few-shot examples in an agent are exactly the kind of stable prefix that prompt caching was designed for. Caching is most valuable in loops, not least.
“I should just use the best model for everything to get the best results.” Quality and cost are not always correlated for simple tasks. A classifier asked to output positive or negative produces the same quality result on Haiku as on Opus — at 20x lower cost. Blindly using the top tier introduces cost without introducing quality.

Frequently Asked Questions

How do I know which model tier to use for a given task? Start by characterizing the task: does it require multi-step reasoning, nuanced judgment, or creative synthesis? If yes, use a top-tier model. If the task can be described as a lookup, extraction, classification, or simple reformatting, start with the smallest model and only escalate if output quality is insufficient.

Does prompt caching work automatically, or do I need to enable it? It depends on the provider and SDK version. With the Anthropic API, you currently need to add cache_control breakpoints to mark which portions of your prompt are eligible for caching. It does not cache everything by default. Lesson 06-01 of this series walks through the exact implementation.

What counts as a “tool output” that I should worry about pruning? Any tool result that returns more data than the next reasoning step actually needs. Web search results, database query responses, API payloads — all of these can balloon to several kilobytes and get silently appended to the context on every turn. The rule of thumb: if a tool result is larger than 500 tokens, ask whether all of it needs to survive in context beyond the current turn.

If cached tokens are 10x cheaper, why not cache everything? Cache hits only occur when the exact token sequence matches what is stored. Only the stable, unchanging prefix of your prompt can realistically be cached — the parts that vary per request (user query, dynamic data) cannot. Also, there is typically a modest cache write cost on the first call. Caching pays off when the same prefix is reused across many calls, which is exactly the agent loop pattern.

Where This Fits in the Series

This lesson is part of How Claude Actually Works, a developer-focused course that tracks three parallel lanes — context, reliability, and cost — through every major concept in Claude’s architecture. This video opens the cost lane: after understanding what tokens are and how the context window fills up, you now have the dollar meter that makes those abstract counts concrete. The next lesson in the cost lane (Part B) builds a price ladder from cents to dollars per call, so you can reason about costs at the system design level before writing a line of code.

Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.

Subscribe on YouTube →