Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows
Pith reviewed 2026-05-09 21:39 UTC · model grok-4.3
The pith
Tool Attention reduces per-turn tool tokens by 95% in simulated LLM agent workflows
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tool Attention generalizes the attention paradigm from tokens to tools through an Intent Schema Overlap score from sentence embeddings, a state-aware gating function enforcing preconditions and access scopes, and a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. On a simulated 120-tool, six-server benchmark calibrated to real MCP audits, it reduces per-turn tool tokens by 95.0% from 47.3k to 2.4k and raises effective context utilization from 24% to 91%, with end-to-end figures for task success, latency, cost, and reasoning quality reported as projections from these token measurements.
What carries the argument
Tool Attention middleware combining intent schema overlap scoring from sentence embeddings, state-aware gating for preconditions and scopes, and two-phase lazy schema loading to selectively manage tool schemas.
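The three pieces can be sketched in a few dozen lines. The code below is a minimal illustration, not the released implementation: it uses a toy bag-of-words cosine as a stand-in for the paper's sentence embeddings, and the `Tool` and `gate_tools` names are hypothetical.

```python
import math
from collections import Counter

# Toy stand-in for sentence embeddings: bag-of-words vectors with cosine
# similarity. The paper uses real sentence-embedding models; this keeps the
# sketch self-contained.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class Tool:
    def __init__(self, name, summary, schema_tokens, preconditions=(), scopes=()):
        self.name = name
        self.summary = summary              # compact line kept in context (phase 1)
        self.schema_tokens = schema_tokens  # cost of the full JSON schema (phase 2)
        self.preconditions = set(preconditions)
        self.scopes = set(scopes)

def gate_tools(intent, tools, agent_state, granted_scopes, k=3):
    """Score tools by intent-schema overlap, drop any whose preconditions or
    access scopes fail the state-aware gate, and promote only the top-k: only
    these would get their full JSON schema injected into context."""
    q = embed(intent)
    admissible = [
        t for t in tools
        if t.preconditions <= agent_state and t.scopes <= granted_scopes
    ]
    ranked = sorted(admissible, key=lambda t: cosine(q, embed(t.summary)),
                    reverse=True)
    return ranked[:k]

tools = [
    Tool("fs_read", "read a file from the local filesystem", 350, scopes={"fs"}),
    Tool("fs_write", "write text to a file", 420,
         preconditions={"file_open"}, scopes={"fs"}),
    Tool("sql_query", "run a SQL query against the warehouse", 610, scopes={"db"}),
    Tool("send_email", "send an email to a recipient", 480, scopes={"mail"}),
]
top = gate_tools("read the config file from disk", tools,
                 agent_state=set(), granted_scopes={"fs", "db"}, k=1)
print([t.name for t in top])  # -> ['fs_read']
```

Note how the gate removes `fs_write` (unmet precondition) and `send_email` (scope not granted) before ranking, so a high ISO score alone is never sufficient for promotion.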
If this is right
- Per-turn tool token usage drops 95.0% from 47.3k to 2.4k.
- Effective context utilization rises from 24% to 91%.
- Projected gains in task success, reduced latency, lower cost, and improved reasoning quality follow from the measured token savings.
- Protocol-level efficiency, rather than raw context length, acts as a binding constraint on scalable agentic systems.
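The headline numbers admit a back-of-envelope check. The per-tool figures below (~394 tokens per full schema, ~10 per one-line summary, k = 3 promoted schemas) are illustrative values chosen to be consistent with the reported totals, not the paper's calibrated audit numbers:

```python
# Eager injection sends every full schema every turn; lazy loading keeps only
# the summary pool plus the top-k promoted schemas in context.
N_TOOLS, SCHEMA_TOK, SUMMARY_TOK, K = 120, 394, 10, 3

eager = N_TOOLS * SCHEMA_TOK                   # 120 * 394  = 47,280 tokens
lazy = N_TOOLS * SUMMARY_TOK + K * SCHEMA_TOK  # 1,200 + 1,182 = 2,382 tokens

print(f"eager: {eager/1000:.1f}k, lazy: {lazy/1000:.1f}k, "
      f"saving: {100 * (1 - lazy/eager):.1f}%")  # -> eager: 47.3k, lazy: 2.4k, saving: 95.0%
```

Under these assumptions the 95.0% figure falls out directly, which suggests the reported reduction is mostly a structural consequence of the summary-to-schema size ratio rather than of any particular scoring choice.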
Where Pith is reading between the lines
- Agents could scale to hundreds of tools without token costs growing proportionally.
- Similar gating and lazy-loading logic could apply to other resource-heavy agent components such as external memory stores or knowledge bases.
- Widespread adoption might lower recurring operational costs for complex multi-tool workflows in production.
- The provided simulation benchmark offers a controlled testbed for evaluating alternative tool-management techniques before live deployment.
Load-bearing premise
That the simulated 120-tool, six-server benchmark, with token counts calibrated to public MCP audits, accurately represents real deployments, and that the projected end-to-end benefits will hold when measured on live LLM agents.
What would settle it
A direct measurement on live LLM agents comparing task success rates, latency, cost, and reasoning quality under Tool Attention versus standard eager schema injection in the same multi-server setup.
read the original abstract
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the MCP Tax or Tools Tax) that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable gentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Tool Attention, a middleware combining Intent Schema Overlap scoring from sentence embeddings, state-aware gating with preconditions, and two-phase lazy schema loading, eliminates the MCP/Tools Tax by reducing per-turn tool token overhead in LLM agents. On a simulated 120-tool, six-server benchmark calibrated to public audits of real MCP deployments, it reports a 95% reduction in tool tokens (47.3k to 2.4k) and an increase in effective context utilization from 24% to 91%, with all end-to-end metrics for task success, latency, cost, and reasoning quality presented as projections derived from these token counts plus external telemetry rather than direct live-agent measurements. The code is released publicly.
Significance. If the simulation generalizes and gating preserves downstream agent performance, the work could meaningfully improve scalability and reduce operational costs for MCP-based agentic systems by addressing context bloat at the protocol level rather than relying on longer contexts. Strengths include calibration of the benchmark to public audits, explicit labeling of all projections, and public code release, which aid reproducibility and external validation.
major comments (2)
- [Evaluation section] The 95% token reduction and context utilization gains are measured only in simulation on a calibrated 120-tool benchmark; no direct experiments are reported that exercise the full gating (ISO + state-aware + lazy loader) inside live LLM agent loops, so false-negative tool omissions or altered reasoning paths remain invisible to the token-count metric. This is load-bearing for the central thesis that the mechanism eliminates the tax in scalable workflows without performance degradation.
- [Abstract] End-to-end projections: All figures for task success, latency, cost, and reasoning quality are derived projections combining simulated token savings with published deployment telemetry; they are not measured on live agents. The central performance claims therefore rest on the untested assumption that the simulated benchmark accurately represents real MCP deployments and that token reductions translate directly to end-to-end benefits.
minor comments (2)
- [Abstract] 'gentic systems' is a typo and should read 'agentic systems'.
- [Evaluation] The precise definition or formula for 'effective context utilization' (described as a token-ratio quantity) should be stated explicitly, including how the 24% to 91% figures are computed, to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify the scope of our evaluation. We respond to each major comment below.
read point-by-point responses
-
Referee: [Evaluation section] The 95% token reduction and context utilization gains are measured only in simulation on a calibrated 120-tool benchmark; no direct experiments are reported that exercise the full gating (ISO + state-aware + lazy loader) inside live LLM agent loops, so false-negative tool omissions or altered reasoning paths remain invisible to the token-count metric. This is load-bearing for the central thesis that the mechanism eliminates the tax in scalable workflows without performance degradation.
Authors: We acknowledge that the evaluation relies on simulation rather than live agent loops and that this leaves certain downstream effects unmeasured. The simulation is calibrated to public audits of real MCP deployments to isolate the protocol-level token overhead, which is the defined scope of the MCP/Tools Tax. The public code release is intended to support independent live-agent validation. We will revise the Evaluation section to add an explicit limitations discussion covering risks such as false-negative omissions and the role of conservative gating thresholds in mitigating them. revision: partial
-
Referee: [Abstract] End-to-end projections: All figures for task success, latency, cost, and reasoning quality are derived projections combining simulated token savings with published deployment telemetry; they are not measured on live agents. The central performance claims therefore rest on the untested assumption that the simulated benchmark accurately represents real MCP deployments and that token reductions translate directly to end-to-end benefits.
Authors: The manuscript already labels these figures as projections and distinguishes them from the directly measured token reductions. The benchmark calibration draws on public deployment audits, and the projections incorporate established relationships between context utilization and performance from prior literature. We will revise the abstract and Evaluation section to state the assumptions more prominently and to include a brief sensitivity discussion on how variations in the calibration would affect the projections. revision: partial
Circularity Check
No significant circularity; empirical results from direct simulation measurements
full rationale
The paper's central claims rest on direct measurements of per-turn tool token counts and context utilization within a simulated 120-tool benchmark whose inputs are calibrated to external public audits. Token reductions (47.3k → 2.4k) and utilization gains (24% → 91%) are reported as observed outputs of the simulation, not as quantities derived from fitted parameters or internal equations. End-to-end projections for task success, latency, cost, and reasoning quality are explicitly labeled as combinations of these measured values with separate published deployment telemetry and are not presented as internal predictions. No self-citations, uniqueness theorems, or ansatzes appear as load-bearing steps; the generalization of the attention paradigm is a naming reference only. The derivation chain therefore consists of independent empirical evaluation rather than any reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- top-k selection threshold
axioms (2)
- domain assumption Sentence embeddings from standard models accurately measure intent schema overlap for tool relevance
- domain assumption The simulated token counts calibrated to public audits match real MCP multi-server deployments
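Since the top-k threshold is the lone free parameter, its cost side is easy to bound: per-turn tokens grow linearly in k on top of a fixed summary pool. The per-tool sizes below are the same illustrative values implied by the reported totals (~394 tokens per schema, ~10 per summary), not figures from the paper:

```python
# Token cost as a function of the free parameter k: a fixed summary pool
# plus k promoted full schemas per turn.
N_TOOLS, SCHEMA_TOK, SUMMARY_TOK = 120, 394, 10

def per_turn_tokens(k):
    return N_TOOLS * SUMMARY_TOK + k * SCHEMA_TOK

for k in (1, 3, 5, 10):
    print(k, per_turn_tokens(k))  # 1594, 2382, 3170, 5140
```

Even a generous k = 10 costs ~5.1k tokens against the 47.3k eager baseline, so the headline saving is robust to k; raising k mainly buys insurance against the false-negative tool omissions flagged in the referee report.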
Reference graph
Works this paper leans on
-
[1]
Introducing the Model Context Protocol
Anthropic. Introducing the Model Context Protocol. Anthropic Engineering Blog, Nov. 2024. URL https://www.anthropic.com/news/model-context-protocol. accessed 15 April 2026
2024
-
[2]
Claude code: Agentic coding at the terminal
Anthropic. Claude code: Agentic coding at the terminal. Anthropic Documentation, 2025. URL https://docs.claude.com/en/docs/claude-code/overview. accessed 15 April 2026
2025
-
[3]
Prompt caching for the Claude API
Anthropic. Prompt caching for the Claude API. Anthropic Documentation, 2025. URL https://platform.claude.com/docs/en/build-with-claude/prompt-caching. accessed 15 April 2026
2025
-
[4]
LLM context window limitations in 2026
Atlan. LLM context window limitations in 2026. Atlan Knowledge Base, 2026. URL https://atlan.com/know/llm-context-window-limitations/. accessed 15 April 2026
2026
-
[5]
Generating Long Sequences with Sparse Transformers
R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019
2019
-
[6]
Poison everywhere: No output from your MCP server is safe
CyberArk Threat Research. Poison everywhere: No output from your MCP server is safe. CyberArk Labs, 2025. URL https://www.cyberark.com/resources/threat-research-blog/poison-everywhere-no-output-from-your-mcp-server-is-safe. accessed 15 April 2026
2025
-
[7]
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022
2022
-
[8]
Agent communication gateway for semantic routing and working memory
IETF Agent-GW Authors. Agent communication gateway for semantic routing and working memory. Technical Report draft-agent-gw-01, IETF, 2026. URL https://datatracker.ietf.org/doc/draft-agent-gw/. accessed 15 April 2026
2026
-
[9]
Model context protocol and agent skills over media over QUIC transport
C. Jennings, I. Swett, J. Rosenberg, and S. Nandakumar. Model context protocol and agent skills over media over QUIC transport. Technical Report draft-jennings-ai-mcp-over-moq-00, IETF, 2025. URL https://datatracker.ietf.org/doc/draft-jennings-ai-mcp-over-moq/. accessed 15 April 2026
2025
-
[10]
Model context protocol over media over QUIC transport
C. Jennings, I. Swett, J. Rosenberg, and S. Nandakumar. Model context protocol over media over QUIC transport. Technical Report draft-jennings-mcp-over-moqt-00, IETF, 2025. URL https://datatracker.ietf.org/doc/draft-jennings-mcp-over-moqt/. accessed 15 April 2026
2025
-
[11]
SWE-bench: Can language models resolve real-world GitHub issues?
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024
2024
-
[12]
Billion-scale similarity search with GPUs
J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 2019
2019
-
[13]
Code execution with MCP: Building more efficient AI agents
A. Kaplan and Anthropic Engineering. Code execution with MCP: Building more efficient AI agents. Anthropic Engineering, Nov. 2025. URL https://www.anthropic.com/engineering/code-execution-with-mcp. accessed 15 April 2026
2025
-
[14]
MCP faces its reckoning as cracks show in Anthropic's universal protocol
M. Kloski. MCP faces its reckoning as cracks show in Anthropic's universal protocol. DEV Community, 2026. URL https://dev.to/mjkloski/mcp-faces-its-reckoning-as-cracks-show-in-anthropics-universal-protocol-1ghj. accessed 15 April 2026
2026
-
[15]
LangChain agents and middleware documentation
LangChain, Inc. LangChain agents and middleware documentation. LangChain Docs, 2026. URL https://docs.langchain.com/oss/python/langchain/agents. accessed 15 April 2026
2026
-
[16]
Retrieval-augmented generation for knowledge-intensive NLP tasks
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020
2020
-
[17]
Semantic Kernel agent orchestration
Microsoft. Semantic Kernel agent orchestration. Microsoft Learn, 2026. URL https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/agent-orchestration/. accessed 15 April 2026
2026
-
[18]
Claude Code MCP servers and token overhead: What you need to know
MindStudio Team. Claude Code MCP servers and token overhead: What you need to know. MindStudio Blog, Apr. 2026. URL https://www.mindstudio.ai/blog/claude-code-mcp-server-token-overhead. accessed 15 April 2026
2026
-
[20]
NoLiMa: Long-context evaluation beyond literal matching
NoLiMa: Long-context evaluation beyond literal matching. arXiv preprint arXiv:2502.05167, 2025. URL https://arxiv.org/abs/2502.05167. accessed 15 April 2026
2025
-
[21]
[RFC] secure model context protocol (SMCP) v1.0
Model Context Protocol Community. [RFC] secure model context protocol (SMCP) v1.0. GitHub Discussion #689, modelcontextprotocol organization, 2026. URL https://github.com/orgs/modelcontextprotocol/discussions/689. accessed 15 April 2026
2026
-
[22]
Model context protocol specification, 2025
Model Context Protocol Working Group. Model context protocol specification, 2025. URL https://modelcontextprotocol.io/docs/concepts/tools. accessed 15 April 2026
2025
-
[23]
Why your AI agent wastes most of its context window on tools
T. Pan. Why your AI agent wastes most of its context window on tools. TianPan.co Blog, Jan. 2026. URL https://tianpan.co/blog/2026-01-30-advanced-tool-use-production-ai-agents. accessed 15 April 2026
2026
-
[25]
LLM context windows: What they are and how they work
Redis. LLM context windows: What they are and how they work. Redis Engineering Blog, 2026. URL https://redis.io/blog/llm-context-windows/. accessed 15 April 2026
2026
-
[27]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP, 2019
2019
-
[28]
AI agent routing: Tutorial and examples
Safe Software. AI agent routing: Tutorial and examples. FME by Safe Software, 2026. URL https://fme.safe.com/guides/ai-agent-architecture/ai-agent-routing/. accessed 15 April 2026
2026
-
[29]
M. K. Saha. Within the context-engineered realm of agentic AI, can MCP reinvent en- terprise integration? AgenticAI—The Autonomous Intelligence, Medium, 2026. URL https://medium.com/p/4e2723a07ad6. accessed 15 April 2026
2026
-
[30]
Toolformer: Language models can teach themselves to use tools
T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023
2023
-
[31]
Attention is all you need
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017
2017
-
[33]
MindGuard: Intrinsic decision inspection for securing LLM agents against metadata poisoning
Z. Wang, H. Du, G. Shi, J. Zhang, H. Cheng, Y. Yao, K. Guo, and X.-Y. Li. MindGuard: Intrinsic decision inspection for securing LLM agents against metadata poisoning. arXiv preprint arXiv:2508.20412v3, 2026. URL https://arxiv.org/abs/2508.20412. accessed 15 April 2026
2026
-
[34]
ReAct: Synergizing reasoning and acting in language models
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023
2023
discussion (0)