How AI Agents Use Tools: The Model vs. Orchestrator Split
▶ Watch on YouTube & subscribe to The Stack Underflow
You ask an AI coding agent to read a file, edit some code, and run the build. It does all of it. But here is the part that surprises most developers the first time they really think about it: the language model inside that agent never touched your file system. It never executed a shell command. It never called an API. All it did was write text.
Understanding this split — between what the model does and what the orchestrator does — is the foundation for understanding every AI coding agent in existence, from Cursor and Copilot to Claude Code and Windsurf.
The one-sentence version: A language model can only produce text, so when an agent “uses a tool,” the model emits a structured text object describing what it wants, and a separate piece of software called the orchestrator actually executes the action and hands the result back as more text.
The model is a text transformer, nothing more
A language model has one job: take text in, produce text out. That is its entire interface with the world. It cannot open a file descriptor. It cannot make an HTTP request. It cannot fork a process. Every capability an AI agent appears to have is built on top of — and despite — this fundamental constraint.
This is not a limitation that will be “fixed” in the next model version. It is the architecture. Inference is a forward pass through a neural network that produces a probability distribution over tokens. There is no syscall in there.
How the orchestrator introduces tools
Before any agentic loop begins, the orchestrator loads a set of tool definitions and injects them into the model’s context window. Each tool is described by a structured schema — typically JSON Schema — that specifies:
- the tool’s name
- a description of what it does
- the arguments it accepts and their types
{
"name": "read_file",
"description": "Read the contents of a file at the given path",
"parameters": {
"type": "object",
"properties": {
"path": { "type": "string", "description": "Absolute path to the file" }
},
"required": ["path"]
}
}
The model reads this schema the same way it reads any other text in its context. It now knows: “If I want to read a file, I should emit text that looks exactly like a call to read_file with a path argument.”
The tool-call loop, step by step
┌────────────────────────────────────────────────────────────────┐
│ ORCHESTRATOR │
│ │
│ 1. Inject tool schemas 4. Execute actual action │
│ into context window (disk / shell / API) │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ text out ┌──────────────────┐ │
│ │ MODEL │ ──────────► │ Parse & Validate │ │
│ │ (LLM only) │ │ JSON tool call │ │
│ └─────────────┘ └──────────────────┘ │
│ ▲ │ │
│ │ result as text │ │
│ └───────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
- Model decides it needs a tool. Rather than producing prose, it produces a structured JSON object — the tool name and arguments — as its output tokens.
- Orchestrator catches the output. It reads the model’s text, recognises it as a tool call, and validates the arguments against the schema.
- Orchestrator executes the action. The orchestrator is the component that actually opens the file, runs the shell command, or fires the HTTP request. The model is completely uninvolved at this point.
- Result wraps back as text. The orchestrator formats the result (file contents, command output, API response) into a new message and appends it to the context. From the model’s point of view, it just received more input tokens to process.
- Loop continues. The model reads the result and decides whether to call another tool, produce a final answer, or ask for clarification.
Everything interesting lives in the orchestrator
Because the model only writes text, every consequential decision about what is allowed to happen sits in the orchestrator layer:
| Concern | Where it lives |
|---|---|
| Permissions & sandboxing | Orchestrator |
| Tool schema validation | Orchestrator |
| Cost tracking (token usage) | Orchestrator |
| Rate limiting | Orchestrator |
| Safety filtering on results | Orchestrator |
| Retry logic on failed calls | Orchestrator |
| MCP protocol wiring | Orchestrator |
This is why understanding the model–orchestrator split is load-bearing knowledge. When you ask “why does Cursor need filesystem access?” or “how does Claude Code decide whether to run a destructive command?”, the answer is always in the orchestrator, not in the weights.
Common misconceptions
- “The model is executing code.” It is not. The model is generating text that describes code to execute. Execution is handled by the orchestrator, which may or may not honour the request depending on its permission model.
- “Tool calls are a special API feature, not real text.” They feel special in SDKs, but at the token level they are structured text. The model produces them the same way it would produce any other structured output; the API and orchestrator give them special treatment after the fact.
- “The model ‘sees’ the file system.” The model sees the contents of files only after the orchestrator has fetched them, serialised them to text, and inserted them into the context window. It never has a live view of anything.
- “If the orchestrator is doing all the work, the model is not important.” The model is doing all the reasoning: deciding which tool to call, with which arguments, in which order, and what to conclude from the results. The orchestrator is the hands; the model is the brain.
Frequently asked questions
Why does the model emit JSON for tool calls rather than natural language? Because JSON is unambiguous and machine-parseable. The orchestrator needs to extract the tool name and arguments reliably. If the model said “please read the file at src/index.ts” in plain English, parsing that reliably at scale is a nightmare. A JSON schema gives both the model and the orchestrator a shared contract.
What happens if the model produces a malformed tool call? The orchestrator catches the validation error. Depending on the implementation, it may retry by re-prompting the model with the error message, surface the error to the user, or fall back gracefully. The model does not “know” the call failed until the orchestrator tells it in the next context turn.
Do all agents — Cursor, Copilot, Claude Code — work this way? Yes. The model–orchestrator architecture is universal. The differences between agents are entirely in the orchestrator: which tools are exposed, what the permission model looks like, how results are chunked back into context, and which protocol (native function calling, MCP, or something custom) is used to wire tools in.
What is MCP and how does it fit? MCP (Model Context Protocol) is a standardised wire format that lets any tool provider plug into any orchestrator without custom glue code. The next episode in this series covers it in depth. Think of it as the USB-C of AI tool integration — a universal adapter for the orchestrator layer.
Where this fits in the series
This episode is the starting point for the “AI Agent Internals: How Coding Agents Really Work” course on How Claude Actually Works. Everything else in the series — MCP, permissions, cost structure, safety — builds on the model–orchestrator split introduced here. Once you can see this architectural seam clearly, the rest of the system clicks into place. Browse all tutorials to continue.
Found this useful? The deep version lives on YouTube — new breakdowns of how AI dev tools actually work, weekly.
Subscribe on YouTube →