Context Engineering: Pin, Summarize, Prune, and Compact

June 23, 2026 · How Claude Actually Works (part 22)

Long agentic sessions have a predictable failure mode: the things the model needs most (account IDs, environment flags, the current goal) gradually sink into the middle of the context window. The model still reads them, but attention is not uniform across the window, and facts buried in the middle tend to be treated as weak signal. The result is a model that has the information and still fumbles it.

Context engineering is the practice of actively managing what sits in that window, where it sits, and how much space it consumes. This episode covers the four core techniques: pin, summarize, prune, and compact.

The one-sentence version: Actively shape your context window — keep stable facts pinned at the top, collapse resolved turns into summaries, strip bloated tool output, and compact the middle only when you must — and you get better answers at lower cost simultaneously.

Why the middle of the window is dangerous

Transformer attention is not uniformly strong across the full context. Empirically, models perform better on information placed near the beginning or end of a long context than on information buried in the middle. This is sometimes called the “lost in the middle” effect.

In a live session, that middle zone fills up fast: early setup messages, intermediate reasoning turns, tool call outputs, and abandoned lines of thinking all accumulate there. The longer the session, the more the genuinely important facts drift away from the edges and into that weak zone.

The four techniques below are a direct answer to this problem.

Pin: lock stable facts at the top

Anything that does not change during a session belongs at the very top of the context, above whatever the user and model are actively working through. The episode calls this the “pin zone,” and it is deliberately placed where the model's attention is strongest.

What belongs pinned:

  • Account or user IDs
  • Product or service context (“we are building a B2B invoicing tool”)
  • Environment (“staging cluster, Postgres 15, Node 20”)
  • Hard constraints (“never modify the payments table directly”)

In practice this means your system prompt or your first injected context block should be structured, stable, and treated as read-only. If you are building an agentic loop, the pin block should be re-injected at the top on every turn rather than left to drift.

[PINNED CONTEXT — do not summarize or prune]
Customer: Acme Corp (ID: 8821)
Product: Billing API v3
Environment: production
Active incident: INV-4492 — duplicate charge on 2026-06-20
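
The re-injection pattern can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `PinnedContext`, `build_prompt`, and the `PIN_HEADER` label are hypothetical names, and the history is assumed to be a plain list of strings.

```python
from dataclasses import dataclass

PIN_HEADER = "[PINNED CONTEXT]"  # hypothetical label for the pin block


@dataclass(frozen=True)
class PinnedContext:
    """Stable session facts; frozen so the agent loop cannot mutate them."""
    facts: tuple[tuple[str, str], ...]

    def render(self) -> str:
        lines = [PIN_HEADER]
        lines += [f"{key}: {value}" for key, value in self.facts]
        return "\n".join(lines)


def build_prompt(pinned: PinnedContext, history: list[str]) -> str:
    """Rebuild the prompt each turn with the pin block first, then the history."""
    return "\n\n".join([pinned.render(), *history])
```

Because the prompt is rebuilt from `pinned` on every turn, the block stays first however long `history` grows.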

Summarize: collapse resolved turns

Every resolved sub-problem is a candidate for compression. Eight back-and-forth messages about a refund policy question, once settled, do not need to sit verbatim in the window. The episode gives a clean example: those eight turns collapse into one line, “Resolved: Refund #23 approved per policy 4.2.”

The rule is simple: keep verbatim only the active issue. Anything resolved becomes a one-liner summary and the original turns are dropped.

This is especially important in long tool-use sessions where the model is iterating toward an answer. Each intermediate step that got you here does not need to be fully represented — only the conclusion does.
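
A rough sketch of that rule, assuming each turn carries a label for the sub-problem it belongs to and that the one-line outcome is written elsewhere (by a person or a separate model call). `Turn`, `collapse_resolved`, and the thread labels are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    thread: str  # hypothetical label for the sub-problem this turn belongs to
    text: str


def collapse_resolved(turns: list[Turn], resolutions: dict[str, str]) -> list[str]:
    """Replace each resolved thread with a single 'Resolved: ...' line.

    `resolutions` maps a thread label to its one-line outcome. Threads that
    are not in the mapping are treated as active and kept verbatim.
    """
    out: list[str] = []
    emitted: set[str] = set()
    for turn in turns:
        if turn.thread in resolutions:
            # Emit the summary once, where the thread first appeared.
            if turn.thread not in emitted:
                out.append(f"Resolved: {resolutions[turn.thread]}")
                emitted.add(turn.thread)
        else:
            out.append(turn.text)
    return out
```

Eight refund turns plus one active turn come back as two entries: the one-line resolution and the active turn, untouched.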

Prune: strip bloated tool output

Tools return far more than you actually need. A database query returns 8 KB of JSON, and you use three fields. Nearly all of the rest sits in your context, spending tokens and potentially confusing the model with irrelevant data.

Prune aggressively, before the output ever enters the window:

Tool output      | What you used  | Action
8 KB JSON blob   | 3 fields       | Extract fields, drop blob
200-line log     | 2 error lines  | Grep for errors, drop rest
Full file read   | 1 function     | Slice relevant lines only

The token cost drop is immediate and the signal-to-noise ratio in the context improves at the same time.
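
The three pruning actions can be sketched with the standard library alone. The function names, the `"ERROR"` marker, and the 1-indexed line range are assumptions for illustration; real tool output may need nested field access or a proper log parser.

```python
import json


def extract_fields(raw_json: str, wanted: list[str]) -> dict:
    """Keep only the named top-level fields of a JSON object."""
    data = json.loads(raw_json)
    return {key: data[key] for key in wanted if key in data}


def error_lines(log_text: str, markers: tuple[str, ...] = ("ERROR",)) -> list[str]:
    """Keep only the log lines that contain one of the markers."""
    return [line for line in log_text.splitlines() if any(m in line for m in markers)]


def slice_lines(file_text: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive) of a file's text."""
    return "\n".join(file_text.splitlines()[start - 1:end])
```

Each helper runs on the raw tool result, so only the reduced form is ever appended to the context.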

Compact: restructure under pressure

Compaction is what you do when the window is genuinely full and you need to keep going. The middle of the conversation collapses into a structured summary card while the pinned facts at the top remain verbatim.

[PINNED CONTEXT — verbatim, unchanged]
...

[COMPACTED SESSION SUMMARY — turns 1-34]
- Goal: debug duplicate charge for Acme Corp (INV-4492)
- Confirmed: charge fired twice at 14:02 and 14:03 UTC
- Root cause hypothesis: race condition in webhook handler
- Ruled out: network retry, idempotency key mismatch
- Next step: inspect handler lock logic

After compaction the window holds three zones: the pinned facts verbatim at the top, the old context collapsed to a summary card in the middle, and the active thread still live at the bottom. The model can continue as if it has the full history, while window usage drops to something sustainable.

The key distinction the episode makes explicit: compaction is a fallback, not a strategy. Reaching for compaction before you have pinned, summarized, and pruned is using a sledgehammer where a scalpel would work. Compact last.
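
One way to encode "compact last" is to gate compaction behind a usage threshold. The sketch below makes several assumptions: `estimate_tokens` is a crude stand-in of roughly four characters per token (swap in a real tokenizer), the summary card is produced elsewhere, and the 0.7 default borrows the upper end of the 60-70% signal mentioned in the FAQ. All names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in: roughly 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def compact_if_needed(
    pinned: str,
    middle: list[str],
    active: list[str],
    summary_card: str,
    window_limit: int,
    threshold: float = 0.7,
) -> list[str]:
    """Return context blocks in order: pinned, middle (or its card), active.

    The middle is swapped for `summary_card` only when estimated usage
    exceeds `threshold * window_limit`. Pinned and active blocks are
    never altered.
    """
    blocks = [pinned, *middle, *active]
    used = sum(estimate_tokens(block) for block in blocks)
    if used <= threshold * window_limit:
        return blocks  # under budget: leave the history intact
    return [pinned, summary_card, *active]
```

Keeping the check as a separate step after pruning and summarizing means compaction fires only when the cheaper techniques were not enough.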

The two metrics that move together

The episode makes a satisfying observation: pin, summarize, prune, and compact do not trade quality against cost. They move two metrics in your favor simultaneously:

  1. Answer coherence rises — because the model sees the facts it needs in positions where it pays strong attention to them.
  2. Token cost falls — because you are not spending tokens on resolved history and bloated tool output.

Most optimization decisions force a trade-off. This one does not.

Common misconceptions

  • “Compaction is the main technique.” No. It is the last resort. If you are compacting every few turns, you have not done the upstream work of pruning tool output and summarizing resolved turns.
  • “The model reads everything equally.” It does not. Long-context models still show meaningful attention degradation in the middle of the window. Where a fact sits matters, not just whether it is present.
  • “Summarizing loses important detail.” Summarizing resolved turns loses detail you no longer need. The discipline is in knowing what is still active. Active issues stay verbatim; resolved issues become one-liners.
  • “Coherent output means correct output.” The episode explicitly flags this. A well-managed context produces coherent responses, but coherence is not correctness. The next episode covers evals, the mechanism for actually verifying correctness.

Frequently asked questions

What is the difference between summarizing and compacting? Summarizing is a scalpel: you collapse one resolved sub-thread into a one-liner while the rest of the conversation stays intact. Compacting is wholesale restructuring of a large middle chunk into a summary card, typically because the window is under genuine pressure. Summarize proactively; compact reactively.

How do I know when to compact? When you have already pinned stable facts, summarized resolved turns, and pruned tool output — and the window is still filling up — that is when compaction earns its place. A practical signal: if your context is regularly hitting 60-70% of the model’s window limit mid-session, you need either better pruning upstream or compaction as a release valve.

Should I pin the system prompt or inject pinned context separately? Both work, but the pattern that survives the longest sessions is a dedicated pinned block at the very top of the context that you re-inject on every agentic turn. This helps keep those facts from being diluted or reinterpreted as the conversation grows. Treat the block as immutable infrastructure.

Does this apply to single-turn prompts or only multi-turn sessions? The pin and prune techniques apply even to single-turn prompts — you still want stable facts first and lean tool output. Summarize and compact are inherently multi-turn concerns, since they act on accumulated conversation history.

Where this fits in the series

This tutorial is part of How Claude Actually Works, a developer-focused course on building reliably with Claude. This episode sits near the end of the context management arc — after the mechanics of tokens and windows, we get to the active engineering decisions that separate a session that stays on-task from one that drifts. The next episode closes the loop: coherent is not correct, so we cover evals — how to systematically verify that your well-managed context is actually producing right answers.
