How to Structure a Production Claude Agent: All Layers Explained
▶ Watch on YouTube & subscribe to The Stack Underflow
Building a Claude agent that survives contact with production is less about clever prompting and more about assembling the right stack. There are distinct layers — each with a clear job — and the failure modes that plague most early-stage agents can be traced back to skipping or collapsing one of them.
This tutorial walks through every layer in the order they matter, from the edge of the system all the way to escalation. Think of it as a reference map for the “Claude stack.”
The one-sentence version: A production Claude agent is a disciplined stack of layers — gateway, model router, agent loop, scoped tools, policy hooks, structured boundaries, cached prefix, eval harness, and clean escalation — each one guarding a specific failure mode.
The Gateway: Before Any Model Call
The first thing a request hits is the gateway — and the gateway lives entirely outside the model. Its primary job is rate limiting, which means it protects both your cost budget and your upstream dependencies before a single token is spent. Anything that can be blocked or throttled at this layer is cheaper to block here than inside the loop.
L0: The Model Router
Not every step in an agent’s reasoning is equally hard. Routing model selection per step is one of the highest-leverage cost and latency levers available:
| Step complexity | Model choice |
|---|---|
| Fast, routine steps | Haiku 4.5 — fast and cheap |
| Mid-tier reasoning | Sonnet — balanced |
| Hard steps that earn it | Opus 4.8 — reserved for genuine complexity |
The rule is simple: smallest model per step, always. Routing is a first-class architectural concern, not an afterthought.
L1 / L3 Core: The Agent Loop
The agent loop is the heartbeat of the system. Everything else hangs off it. The loop itself is just four steps:
while not done:
response = call_model(messages)
stop_reason = response.stop_reason
if stop_reason == "tool_use":
result = run_tool(response.tool_call)
messages.append(tool_result(result))
elif stop_reason == "end_turn":
break
# policy hook blocks also exit here
Call the model. Read the stop reason. Run a tool (if requested). Append the result. Repeat. The triad of signals that tells you everything about what happened in any loop iteration is: stop reasons, tool calls, and hook blocks. If your logging captures all three, you can reconstruct any failure.
L2: Reach — Scoped Tools and MCP Servers
The agent touches the world through tools and MCP servers. The key word in a production system is scoped: four or five tools, not forty. A large, undifferentiated tool list creates ambiguity for the model, inflates the context window on every call, and makes the policy surface harder to reason about.
Typical scoped reach in a well-designed agent might look like:
tools:
- database_query # read-only access, specific schema
- web_search # external lookup
- internal_api_call # one internal service, typed inputs
- submit_refund # gated by policy hook (see L3)
- escalate_to_human # explicit exit path
MCP servers are a natural fit for providing clean, version-controlled interfaces to external systems without bloating the tool list with bespoke wrappers.
L3: Control — Policy Hooks and Sub-Agents
Two mechanisms live at this layer:
Pre-tool-use policy hooks. Before the agent is allowed to call a sensitive tool (like a database write or a refund path), a hook gates the call. This is policy enforced in code, not in prompts. Prompt-based guardrails can be reasoned around or eroded by context drift. A code-level hook cannot.
Sub-agent isolation. For noisy, high-context work — think a research pass, a long document analysis, or anything that would pollute the main context window — a sub-agent runs the task in its own context window and returns only a clean, summarized result. The parent loop stays lean.
Structured Output Boundaries
At every point where the agent crosses a boundary (calling a tool, returning a result, handing off to another system), output must be schema-forced. The mechanism is a forced tool schema: the model is required to produce a specific structure, not free-form text that gets parsed into JSON downstream.
// Bad: free-form JSON in a string (fragile)
"result": "{\"status\": \"approved\", \"amount\": 42.00}"
// Good: structured output via forced tool schema
{
"status": "approved",
"amount": 42.00
}
Never free-form JSON in a string. Classify, extract, and stamp at every boundary — schema enforced.
Cost Plane: Prompt Caching
The system prompt and the tool definitions almost never change between calls in an agent loop. Mark them with cache control. The stable prefix gets cached, and you pay roughly 90% less on every subsequent call that hits the cache.
[SYSTEM PROMPT] <-- mark for caching (stable)
[TOOL DEFINITIONS] <-- mark for caching (stable)
[CONVERSATION HISTORY] <-- not cached (changes each turn)
[CURRENT USER MESSAGE] <-- not cached
This single optimization often makes the difference between a cost structure that scales and one that doesn’t.
Reliability Plane: Eval Harness and Logging
An offline eval harness gates deploys. Nothing ships if it regresses against a known production set. This is not optional; it’s the only way to make incremental changes to prompts, model versions, or tool definitions without flying blind.
Logging strategy follows the triad: every loop iteration captures the stop reason, any tool calls made, and any hook blocks triggered. That triad is sufficient to diagnose nearly any failure mode in production.
Escalation: Clean Exit to a Human
When the agent cannot proceed — due to policy, complexity, ambiguous risk, or an explicit user request — it exits cleanly to a human. The trigger is never sentiment (the model “feeling” uncertain). It is a deterministic condition: policy block, complexity threshold, risk flag, or explicit request.
What gets handed off is not a raw transcript. It is a structured summary card: a clean, pre-formatted context object that lets the human start informed, not baffled. The human reads a summary, not a 40-turn conversation log.
The Full Stack at a Glance
[GATEWAY] rate limits — edge of system
|
[MODEL ROUTER] smallest model per step (Haiku / Sonnet / Opus)
|
[AGENT LOOP] call → stop_reason → tool → append → repeat
|
[SCOPED REACH] 4-5 tools + MCP servers, not 40
|
[POLICY HOOKS] pre-tool-use gates, code-enforced
[SUB-AGENTS] isolated context for noisy work
|
[BOUNDARIES] schema-forced structured output everywhere
|
[CACHED PREFIX] system prompt + tools cached, ~90% cost reduction
|
[EVAL HARNESS] offline regression gate before deploy
[LOGGING] stop reasons + tool calls + hook blocks
|
[ESCALATION] structured summary card → human, never on sentiment
Common Misconceptions
- “Prompt guardrails are enough for safety.” Policy hooks enforced in code are the correct place for hard constraints on sensitive tool calls. Prompts can drift, be reasoned around, or get overridden by long context. Code cannot.
- “More tools means a more capable agent.” Scope is a feature. A large tool list creates ambiguity, inflates context cost, and expands the policy surface. Four to five focused tools outperform forty loosely defined ones.
- “You should always use the best model.” Routing to the smallest model that can handle each step is an architectural discipline, not a cost-cutting compromise. It keeps the system fast and economically sustainable.
- “Escalation means the agent failed.” Clean escalation is a designed exit path. An agent that always tries to finish — regardless of complexity, risk, or policy — is an agent that eventually causes an incident.
Frequently Asked Questions
What is the difference between L1 and L3 in this architecture? L1/L3 refers to the agent loop itself — the core call-read-run-append cycle. L3 specifically also covers control mechanisms (policy hooks and sub-agents) that operate within or alongside the loop. The numbering follows the video’s layered framing rather than a strict hierarchy.
How does prompt caching actually work with the system prompt and tools?
The Anthropic API supports cache control headers on message blocks. When you mark your system prompt and tool definitions with cache_control, the API caches the computed KV (key-value) state for that prefix. Subsequent requests that share the same prefix hit the cache and are charged at roughly 10% of the normal input token cost, with no latency penalty after the first call.
When should a sub-agent be used instead of keeping work in the main loop? Use a sub-agent when a task would consume significant context (long document analysis, multi-step research, code generation with large inputs) and the parent loop only needs a clean summary of the result — not every intermediate step. Sub-agents keep the main context window lean and prevent context pollution from bleeding into subsequent decisions.
What goes into the structured summary card passed to a human on escalation? The summary card should contain: the original user intent, the steps the agent completed, the specific reason for escalation (policy block / complexity / risk / explicit request), and any relevant structured data the human needs to continue. It is not a transcript — it is a pre-digested context object.
Where This Fits in the Series
This tutorial is part of How Claude Actually Works — a course that builds from first principles (episodes 00 and 01 cover the foundations) up through production-grade agent design. If this is your entry point, start at episode 00 to ground the concepts in how the model itself operates before layering the architecture on top. The full series is available via all tutorials.
Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.
Subscribe on YouTube →