How Claude's Context Window Works: Limits, Costs, and Overflow
▶ Watch on YouTube & subscribe to The Stack Underflow
The context window is the single most misunderstood concept when developers start building with Claude. Everyone knows the number — “1 million tokens!” — but very few understand what actually fills that space, which models actually have it, where the pricing cliff lives, and what happens when you hit the wall.
This tutorial walks through the mechanics, model-by-model limits, and the cost implications that make the 1M window a deliberate architectural choice rather than a free upgrade.
The one-sentence version: The context window is Claude’s entire working memory — system prompt, tool definitions, conversation history, retrieved documents, your message, and the response all share one finite bar of tokens, and the 1M version is not available on every model, in every region, or without an opt-in flag.
What actually fills the context window
Think of the context window as a single horizontal bar that fills left to right. Every request consumes that bar in this order:
- System prompt — your instructions to the model. Written once per session, but counted on every turn.
- Tool definitions — every function schema you expose adds tokens before the conversation even starts. Three tools with verbose JSON schemas can easily cost you thousands of tokens on every call.
- Conversation history — every prior user turn and assistant reply accumulates. In a long multi-turn chat, this is usually the biggest consumer.
- Retrieved documents / search results / file contents — RAG results, web search snippets, uploaded files — all land in the same bar.
- Your new message — typically the smallest slice.
- Output space — a strip at the right edge is reserved for the model’s response. That space comes out of your total budget; you do not get the full window for input.
|--- system prompt ---|-- tools --|---- history ----|-- docs --|-- msg --|- output -|
^ ^ ^
left edge 200K mark hard limit
Nothing gets a separate lane. It is all one finite space.
What happens when you overflow
When the bar hits the hard right edge, extra tokens do not queue. They either:
- Fall off the left — older turns are silently dropped (depending on how your client manages history), or
- Get compacted — you or the SDK truncates/summarizes history to make room, or
- Cause an API error — you see
model context window exceededand the request is rejected outright.
There is no grace period. Overflow is not a warning, it is a failure mode. Plan for it.
The 1M context window is not universal
This is where most developers get burned. “Claude has a 1M context window” is a headline that hides model-level and region-level variation:
| Model | Context Window | Notes |
|---|---|---|
| Claude Opus 4.8 | 1,000,000 tokens | Default; GA on Claude API, Bedrock, Vertex AI. Capped at 200K on Microsoft Foundry. |
| Claude Sonnet 4.6 | 1,000,000 tokens (beta) | 200K in production by default. Requires an explicit beta header opt-in. |
| Claude Haiku 4.5 | 200,000 tokens | No 1M option exists. |
The honest summary: some Claude models support 1M tokens, in some environments, sometimes requiring an explicit flag your code must set. Pick your model first, then verify its actual limit for your target deployment environment.
Opting into extended context on Sonnet 4.6
On Claude Sonnet 4.6, the 200K default is a deliberate production choice. To access 1M tokens you pass a beta header:
anthropic-beta: extended-context-window-2025-05
Without that header, Sonnet 4.6 behaves as a 200K model regardless of how much content you try to send.
The 200K pricing boundary
Even when your model supports 1M tokens, crossing 200K input tokens is a cost decision, not just a context decision.
Anthropic applies long context pricing to input tokens above the 200K mark. Tokens below the threshold are billed at the standard input rate; tokens above it are billed at a higher rate. The pricing tier does not apply to output tokens.
0 ────────────────────── 200K ──────────────────────── 1M
│ standard input rate │ long-context input rate │
This means a request with 300K input tokens is billed at two different rates for the same field. Budget accordingly before assuming “it fits, so it’s fine.”
Even when it fits, the middle is dangerous
Fitting inside the window does not mean every token is equally useful. Research on large language models (and mentioned as “that’s next” in the video) points to a well-documented phenomenon: retrieval accuracy degrades for information placed in the middle of a very long context. Tokens near the beginning and end of the window tend to be recalled more reliably.
For practical applications: if you have critical instructions or high-priority documents, put them at the start of the system prompt or close to the user’s question — not buried in the middle of a 500K retrieval dump.
Common misconceptions
-
“Claude has a 1M context window” — full stop. Not quite. Haiku 4.5 caps at 200K with no upgrade path, Sonnet 4.6 needs a beta header, and Opus 4.8 is capped at 200K on Microsoft Foundry. Always check model + environment.
-
“The context window is only for user messages.” The system prompt, every tool schema, every prior assistant turn, and every retrieved document all consume the same shared space. A verbose system prompt + 10 tool definitions can silently eat 10-20K tokens before a user says hello.
-
“Tokens that don’t fit are automatically summarized.” The API does not auto-compact. Overflow is either an error or silent truncation on the client side depending on your SDK configuration. You are responsible for managing history length.
-
“Using 1M context costs the same as using 200K.” No — there is a pricing step-change at 200K input tokens. Sending 250K tokens costs more per input token (for the tokens above 200K) than sending 150K.
Frequently asked questions
How do I know how many tokens my request is actually using?
Use the usage field in the API response, which returns input_tokens and output_tokens for every call. You can also call the token-counting endpoint before sending a request to estimate cost without consuming output budget.
If I use Sonnet 4.6 without the beta header and send 300K tokens, what happens? The API will reject the request with a context-window-exceeded error. The 200K limit is enforced server-side; the beta header is not cosmetic.
Does output length affect how much input I can send? Yes. Output tokens are drawn from the same total context budget on many models. If a model has a 200K context window and you request a maximum output of 4K tokens, your effective input budget is 200K minus 4K, not the full 200K. Check the model card for whether input and output share the same pool or have separate limits.
Is the “lost in the middle” problem real or theoretical? It is well-documented in published research across multiple model families. In practice it matters most for needle-in-a-haystack retrieval across hundreds of thousands of tokens. For typical RAG workloads under 50K tokens, positional effects are less dramatic — but good retrieval design (placing the most relevant chunks close to the question) remains a best practice regardless of window size.
Where this fits in the series
This tutorial is part of How Claude Actually Works, a course that builds a ground-up mental model of Claude — from tokenization and attention to tool use, pricing, and production deployment patterns. Understanding the context window is foundational: nearly every other concept in the series (RAG, multi-turn conversation management, tool calling costs) makes more sense once you know exactly what shares that one finite bar of space.
Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.
Subscribe on YouTube →