Infrastructure

Code Execution Is Becoming the Tool Router for MCP Agents

Anthropic's MCP code execution pattern points to a bigger infrastructure shift: stop stuffing every tool into the prompt, and let agents discover, call, filter, and pass data through a controlled runtime.

Agent Mag Editorial

The Agent Mag editorial team covers the frontier of AI agent development.

May 12, 2026 · 8 min read
A labeled archive box with tool cards representing MCP interfaces routed through a runtime

TL;DR

Code execution with MCP helps agents scale across many tools by moving discovery, filtering, and data handoff into a controlled runtime instead of forcing every schema and payload through the model context.

The next scaling problem for AI agents is not only model quality. It is plumbing. Once an agent has access to hundreds of MCP tools, the old pattern of loading every tool definition into context becomes expensive, slow, and brittle. Anthropic's new engineering note on code execution with MCP is a useful signal because it names a problem many agent teams are already feeling: the prompt is becoming a bad place to store the whole integration graph.

The important idea is not vendor specific. The builder takeaway is that tool use is moving from direct model calls toward a runtime layer where the model writes small programs, imports only the interfaces it needs, and lets the execution environment move large objects between systems. In Anthropic's example, a Google Drive document can be read and attached to a Salesforce record without the full transcript being copied through the model context. The claimed token reduction in one scenario is 98.7 percent, from roughly 150,000 tokens to 2,000 tokens, but the deeper point is architectural: agent memory and tool routing need separation.
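The Drive-to-Salesforce handoff can be sketched in a few lines. This is a minimal illustration of the pattern, not Anthropic's implementation: the client wrappers (`get_document`, `attach_to_record`) are hypothetical stand-ins for code-generated MCP interfaces, and in a real runtime they would call the respective servers.

```python
# Hypothetical MCP client wrappers; names are illustrative, not a real API.
def get_document(doc_id: str) -> str:
    # In a real runtime this would call a Google Drive MCP server.
    return "..." * 50_000  # large transcript, simulated here

def attach_to_record(record_id: str, body: str) -> None:
    # In a real runtime this would call a Salesforce MCP server.
    pass

def run_task(doc_id: str, record_id: str) -> dict:
    transcript = get_document(doc_id)        # large payload stays in the runtime
    attach_to_record(record_id, transcript)  # moved system-to-system in code
    # Only a compact status object goes back into model context.
    return {"attached": True, "bytes": len(transcript)}
```

The transcript never becomes tokens; the model only sees the status dictionary, which is where the claimed savings come from.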

Key Takeaways

  • MCP adoption creates a new context bottleneck: every connected server can add tool schemas, descriptions, and results that compete with the actual user task.
  • Code execution turns MCP tools into callable interfaces that an agent can discover on demand, instead of frontloading the entire tool catalog into the prompt.
  • The biggest savings come from keeping large intermediate data inside the runtime, then returning only summaries, counts, selected records, or final status to the model.
  • This pattern improves scale, but it also creates new operational requirements: sandboxing, permission checks, audit logs, dependency control, test fixtures, and runtime observability.
A field report packet with clipped pages symbolizing selective tool discovery

The architecture change: from prompt inventory to runtime inventory

Direct tool calling made sense when agents had a handful of tools. The model could see every function description, choose one, receive the result, and decide the next step. That pattern starts to break when the agent is connected to dozens of MCP servers and thousands of actions. Tool descriptions become a permanent tax on every request. Worse, large responses get routed through the model even when the model does not need to read them. A transcript, spreadsheet, attachment, or JSON blob can be passed from one business system to another, but direct tool loops often force it through context like a toll booth.

Signal | Why it matters
Tools exposed as files or modules | The model can inspect a narrow interface only when the task requires it, reducing schema clutter in context.
Runtime handles intermediate objects | Large documents and datasets can move between MCP servers without being copied into model tokens.
Agent writes glue logic | Filtering, transformation, retries, pagination, and batching can happen in code instead of multi-turn tool chatter.
Tool search becomes infrastructure | Teams need indexing, naming discipline, and relevance ranking for tools, not just a pile of server connections.

This is progressive disclosure applied to agent tools. Instead of giving the model the complete operating manual at startup, the system gives it a map. The agent can list servers, search available capabilities, read a small interface, then write a short program to perform the job. That sounds mundane, but it changes the cost curve. The model no longer pays for every unused integration on every turn. It pays for discovery, planning, and the specific interfaces involved in the task. For builders, this suggests that the MCP client is becoming less like a static adapter and more like a tool operating system.
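On-demand discovery can be sketched with a small tool index. The index structure and field names below are assumptions for illustration, not part of the MCP specification; the point is that search returns names only, and a full schema is loaded for one tool at a time.

```python
# Illustrative tool index; fields and naming are assumptions.
TOOL_INDEX = {
    "gdrive.get_document": {
        "desc": "Fetch a document by ID", "schema": {"doc_id": "str"}},
    "salesforce.attach_file": {
        "desc": "Attach content to a record",
        "schema": {"record_id": "str", "body": "str"}},
    "jira.create_issue": {
        "desc": "Create an issue", "schema": {"project": "str", "title": "str"}},
}

def search_tools(query: str) -> list[str]:
    """Cheap lookup: return only matching tool names, not full schemas."""
    q = query.lower()
    return [name for name, meta in TOOL_INDEX.items()
            if q in name.lower() or q in meta["desc"].lower()]

def load_interface(name: str) -> dict:
    """Load one tool's full schema only when the task actually needs it."""
    return TOOL_INDEX[name]
```

With this shape, a request that touches two tools pays for two schemas plus a short search, regardless of how many servers are connected.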

The prompt should not be the warehouse for every tool definition and intermediate payload. It should be the planner, not the loading dock.

A worn machine part beside inspection tags representing sandbox risk in agent runtimes

Where the tokens actually disappear

There are two separate savings that teams should measure. First, interface savings: only load the tool descriptions relevant to a task. If an agent has 1,000 available tools but needs two, the model should not read 998 irrelevant schemas before it starts. Second, payload savings: do not ask the model to relay large data unless reasoning over the raw data is necessary. If the task is to attach a transcript, compute a count, filter pending rows, or copy an object to storage, the runtime can do that work and report a compact result. This is different from hiding information from the model. It is about not converting bytes into tokens unless language reasoning is needed.
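The payload-savings half of this can be made concrete. In the sketch below (hypothetical helper, not a real agent API), the runtime filters a large dataset and hands the model a count plus a small sample rather than the raw rows.

```python
# Sketch of payload savings: filter in the runtime, return a summary.
def summarize_pending(rows: list[dict], sample_size: int = 3) -> dict:
    pending = [r for r in rows if r.get("status") == "pending"]
    return {
        "total_rows": len(rows),
        "pending_count": len(pending),
        "sample": pending[:sample_size],  # a few records for spot checks
    }

# Simulated large intermediate result that never reaches model context.
rows = [{"id": i, "status": "pending" if i % 4 == 0 else "done"}
        for i in range(10_000)]
summary = summarize_pending(rows)
```

The model reasons over two integers and three records instead of ten thousand rows; nothing is hidden, it simply is not tokenized.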

Builder note

Do not adopt code execution only because a benchmark shows a large token reduction. Adopt it where your workflows have high tool cardinality, large intermediate payloads, repeated data transformations, or multi-system handoffs. If your agent only calls three small tools, direct tool calling may be simpler and safer. If your agent connects to customer data stores, CRMs, ticketing systems, email, cloud storage, and internal APIs, the runtime layer becomes a control plane for cost, latency, and correctness.

The tradeoff: cheaper context, larger blast radius

Code execution gives the agent more leverage. That is both the point and the risk. A direct tool call is relatively constrained: the model chooses a tool and arguments. A code runtime can loop, branch, transform, cache, retry, and compose actions. This reduces model turns, but it can also multiply mistakes if the generated program is wrong. The agent might filter the wrong column, mishandle time zones, overwrite a field, paginate incorrectly, or retry a non-idempotent operation. The failure mode shifts from token waste to software behavior. That means agent teams need engineering controls that look closer to production automation than chatbot prompting.

  • Sandbox the execution environment. Generated code should run with tight filesystem, network, time, memory, and package limits. Treat the runtime as an untrusted worker, even when the model is trusted.
  • Scope credentials by task. The runtime should receive least-privilege tokens for the MCP servers involved, not a broad ambient credential that can touch every connected system.
  • Make data movement visible. If a document moves from Drive to Salesforce, log source object, destination object, byte size, actor, tool version, and approval state without storing sensitive content unnecessarily.
  • Separate read planning from write execution. For risky workflows, let the model draft a plan, inspect selected interfaces, and produce a write set that can be validated before any mutation runs.
  • Design for idempotency. Generated code will retry. Your MCP tools should support request IDs, dry runs, conflict detection, and safe repeat behavior wherever state changes are possible.
  • Capture compact artifacts. Instead of dumping raw payloads back into context, return hashes, row counts, sample records, validation summaries, and links to stored artifacts for human review.
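The last control, compact artifacts, is easy to standardize. A minimal sketch, assuming the raw output is stored out of band (the storage URI here is a placeholder): hash and measure the payload in the runtime, and return only the receipt.

```python
import hashlib

# Sketch of a compact artifact receipt: the model and audit log get
# a hash, a size, and a storage link instead of the raw bytes.
def compact_artifact(payload: bytes, stored_at: str) -> dict:
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
        "stored_at": stored_at,  # link for human review, not the content
    }

report = compact_artifact(b"row1\nrow2\nrow3\n", "s3://audit/run-123/out.csv")
```

The hash also supports replay and tamper checks later, which ties into the audit requirements above.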

Adoption guidance for agent teams

  1. Inventory your token waste before changing architecture. Measure average tool schema tokens per request, intermediate result tokens, number of tool turns, and latency contribution from tool loops.
  2. Start with read-heavy workflows. Reporting, search, summarization over retrieved subsets, and data cleanup are good candidates because the runtime can filter large inputs while limiting destructive actions.
  3. Build a tool index, not just a folder. Use consistent naming, descriptions, tags, input schemas, output summaries, and examples so the agent can find the right capability without reading every file.
  4. Add a detail ladder for discovery. Let the agent request name only, name plus description, full schema, or examples. This keeps search cheap while preserving access to precision when needed.
  5. Introduce write gates. For CRM updates, ticket changes, emails, billing actions, or permission changes, require validation policies, human approval, or deterministic checks before execution.
  6. Evaluate with task suites that include ugly data. Test large spreadsheets, missing fields, duplicate records, malformed documents, partial outages, slow MCP servers, and permission errors. The runtime pattern is only useful if it fails predictably.
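The detail ladder in step 4 can be sketched as a single accessor. The levels and field names are illustrative assumptions, not an MCP convention; the design point is that each rung returns strictly more than the one below it, so searches stay cheap by default.

```python
# Illustrative detail ladder for tool discovery; levels are assumptions.
LEVELS = ("name", "summary", "schema", "examples")

def describe_tool(tool: dict, level: str) -> dict:
    """Return progressively more detail as the requested level rises."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    out = {"name": tool["name"]}
    if level in ("summary", "schema", "examples"):
        out["desc"] = tool["desc"]
    if level in ("schema", "examples"):
        out["schema"] = tool["schema"]
    if level == "examples":
        out["examples"] = tool["examples"]
    return out

tool = {"name": "gdrive.get_document", "desc": "Fetch a document by ID",
        "schema": {"doc_id": "str"}, "examples": [{"doc_id": "abc123"}]}
```

An agent can scan dozens of candidates at the name level, then pay for one full schema once it has chosen.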

Source Card

Code execution with MCP: Building more efficient agents

Anthropic's post matters because it reframes MCP scale as a context engineering problem, not just an integration problem. The source describes how agents can access MCP servers through code interfaces, load tool definitions on demand, and keep large intermediate results inside an execution environment. Agent builders should treat the post as an infrastructure pattern to evaluate, not as a universal prescription.

Anthropic Engineering

The open question is how this pattern standardizes. MCP gives teams a common way to connect tools and data, but code execution adds another layer where conventions are still forming: filesystem layouts, generated client libraries, permission models, audit formats, tool search, and policy enforcement. Anthropic points to Cloudflare's similar framing of code-mode style agents, which suggests the pattern is spreading. Still, the winning implementation will not be the one with the flashiest demo. It will be the one that lets operators answer boring questions fast: what did the agent run, which data did it touch, why was it allowed, and how do we replay or roll it back?

For founders and engineering leads, the practical decision is not whether code execution is more elegant. It is where the complexity belongs. Direct tool calling pushes complexity into the model context and pays for it with tokens, latency, and copying errors. Code execution pushes complexity into a controlled runtime and pays for it with sandboxing, observability, and software discipline. If you are building agents that must operate across many enterprise systems, that trade is likely coming either way. Better to make it explicit now than discover later that your context window has become your integration platform.

  • Anthropic Engineering, "Code execution with MCP: Building more efficient agents," published Nov. 4, 2025, https://www.anthropic.com/engineering/code-execution-with-mcp
  • Model Context Protocol documentation, linked from the Anthropic source, https://modelcontextprotocol.io/
  • Cloudflare Code Mode discussion, cited in the Anthropic source as a similar finding, https://blog.cloudflare.com/code-mode/

Frequently Asked

What is code execution with MCP?

It is a pattern where an agent interacts with MCP servers through code interfaces inside an execution environment, rather than loading every tool definition into the prompt and calling tools directly one by one.

Why does this reduce token usage?

It reduces token use by loading only relevant tool definitions and by keeping large intermediate data, such as documents or spreadsheets, inside the runtime unless the model needs to reason over the raw content.

When should builders avoid this pattern?

Teams should be cautious when workflows are simple, when write actions are high risk, or when they lack sandboxing, permission scoping, audit logs, and validation gates for generated code.

