pith. machine review for the scientific record.

arxiv: 2604.21816 · v1 · submitted 2026-04-23 · 💻 cs.AI

Recognition: unknown

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:39 UTC · model grok-4.3

classification 💻 cs.AI
keywords Tool Attention · Model Context Protocol · LLM agents · token efficiency · dynamic gating · lazy schema loading · agentic workflows · context utilization

The pith

Tool Attention reduces per-turn tool tokens by 95% in simulated LLM agent workflows

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Tool Attention as a middleware mechanism for LLM agents connected via the Model Context Protocol. It dynamically determines which tool schemas to load using embedding-based relevance scoring and state-aware gates, while lazily fetching full details only when necessary. This approach addresses the substantial token overhead from loading all tool schemas on every turn. In a calibrated simulation, it achieves a 95% reduction in tool tokens and raises effective context utilization from 24% to 91%, indicating that smarter protocol design can enhance agent scalability.
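Read mechanically, the two-phase design keeps a cheap summary pool permanently in context and defers full schemas until a tool is actually selected. A minimal sketch of that idea, assuming hypothetical `ToolSummary` and `fetch_full_schema` interfaces rather than the paper's actual middleware API:

```python
# Sketch of two-phase lazy schema loading as described above. All names
# (ToolSummary, fetch_full_schema) are illustrative assumptions; the real
# middleware sits between the agent loop and the MCP servers.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSummary:
    name: str
    one_liner: str  # compact description kept in context every turn
    server: str

class LazySchemaLoader:
    def __init__(self, summaries: list[ToolSummary],
                 fetch_full_schema: Callable[[str], dict]):
        self.summaries = summaries  # phase 1: always-resident summary pool
        self.fetch_full_schema = fetch_full_schema
        self._cache: dict[str, dict] = {}

    def context_block(self) -> str:
        """Phase 1: only compact one-line summaries enter the prompt."""
        return "\n".join(f"{s.name}: {s.one_liner}" for s in self.summaries)

    def promote(self, gated_names: list[str]) -> list[dict]:
        """Phase 2: full JSON schemas are fetched only for gated tools."""
        for name in gated_names:
            if name not in self._cache:
                self._cache[name] = self.fetch_full_schema(name)
        return [self._cache[name] for name in gated_names]
```

The token saving falls out of the asymmetry: a one-line summary costs tens of tokens where a full JSON schema can cost hundreds, so keeping only summaries resident turns the per-turn cost from catalog-sized to selection-sized.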

Core claim

Tool Attention generalizes the attention paradigm from tokens to tools through an Intent Schema Overlap score from sentence embeddings, a state-aware gating function enforcing preconditions and access scopes, and a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. On a simulated 120-tool, six-server benchmark calibrated to real MCP audits, it reduces per-turn tool tokens by 95.0% from 47.3k to 2.4k and raises effective context utilization from 24% to 91%, with end-to-end figures for task success, latency, cost, and reasoning quality reported as projections from these token measurements.
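For concreteness, the three named components could be wired together roughly as follows; the embedding model, the plain cosine-similarity ISO formula, the gate predicate, and the default k are illustrative assumptions, not choices confirmed by the paper:

```python
# Sketch of the gating pipeline from the core claim: ISO scoring via sentence
# embeddings, a state-aware gate for preconditions and scopes, and top-k
# promotion. Model name, score formula, and gate fields are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def iso_scores(intent: str, summaries: list[str]) -> np.ndarray:
    """Cosine similarity between the turn's intent and each tool summary."""
    vecs = model.encode([intent] + summaries)
    q, t = vecs[0], vecs[1:]
    return t @ q / (np.linalg.norm(t, axis=1) * np.linalg.norm(q) + 1e-9)

def state_gate(tool: dict, state: dict) -> bool:
    """All declared preconditions hold and the tool's scope is granted."""
    pre_ok = all(state.get(p, False) for p in tool.get("preconditions", []))
    scope_ok = tool.get("scope", "public") in state.get("granted_scopes", {"public"})
    return pre_ok and scope_ok

def gate_tools(intent: str, tools: list[dict], state: dict, k: int = 5) -> list[str]:
    """Return names of the top-k admissible tools to promote to full schemas."""
    scores = iso_scores(intent, [t["summary"] for t in tools])
    ranked = sorted(
        ((s, t) for s, t in zip(scores, tools) if state_gate(t, state)),
        key=lambda st: -st[0],
    )
    return [t["name"] for _, t in ranked[:k]]
```

Because only k schemas are ever promoted, promoted-schema tokens stay roughly constant as the catalog grows, which is the scaling behavior the 120-tool benchmark is designed to measure.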

What carries the argument

Tool Attention middleware combining intent schema overlap scoring from sentence embeddings, state-aware gating for preconditions and scopes, and two-phase lazy schema loading to selectively manage tool schemas.

If this is right

  • Per-turn tool token usage drops 95.0% from 47.3k to 2.4k.
  • Effective context utilization rises from 24% to 91%.
  • Projected gains in task success, reduced latency, lower cost, and improved reasoning quality follow from the measured token savings.
  • Protocol-level efficiency, rather than raw context length, acts as a binding constraint on scalable agentic systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents could scale to hundreds of tools without token costs growing proportionally.
  • Similar gating and lazy-loading logic could apply to other resource-heavy agent components such as external memory stores or knowledge bases.
  • Widespread adoption might lower recurring operational costs for complex multi-tool workflows in production.
  • The provided simulation benchmark offers a controlled testbed for evaluating alternative tool-management techniques before live deployment.

Load-bearing premise

That the simulated 120-tool, six-server benchmark, with token counts calibrated to public MCP audits, accurately represents real deployments, and that the projected end-to-end benefits will hold when measured on live LLM agents.

What would settle it

A direct measurement on live LLM agents comparing task success rates, latency, cost, and reasoning quality under Tool Attention versus standard eager schema injection in the same multi-server setup.
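At the token-accounting level, that comparison reduces to running the same turns through both payload builders and logging per-turn counts alongside task outcomes. A sketch assuming a tiktoken-style tokenizer and hypothetical `build_eager_payload` / `build_gated_payload` hooks; the live arms for success rate, latency, and cost would wrap this same loop per episode:

```python
# Sketch of the eager-vs-gated A/B harness the report calls for, restricted
# to the directly measurable quantity (per-turn tool tokens). Payload-builder
# hooks are hypothetical stand-ins for the two middleware configurations.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer

def per_turn_tool_tokens(turns, build_payload) -> list[int]:
    return [len(enc.encode(build_payload(turn))) for turn in turns]

def compare(turns, build_eager_payload, build_gated_payload) -> None:
    eager = per_turn_tool_tokens(turns, build_eager_payload)
    gated = per_turn_tool_tokens(turns, build_gated_payload)
    reduction = 1 - sum(gated) / max(sum(eager), 1)
    print(f"eager: {sum(eager) / len(turns):.0f} tok/turn, "
          f"gated: {sum(gated) / len(turns):.0f} tok/turn, "
          f"tool-token reduction: {reduction:.1%}")
```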

read the original abstract

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead, the "MCP Tax" or "Tools Tax," that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable gentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Tool Attention, a middleware combining Intent Schema Overlap scoring from sentence embeddings, state-aware gating with preconditions, and two-phase lazy schema loading, eliminates the MCP/Tools Tax by reducing per-turn tool token overhead in LLM agents. On a simulated 120-tool, six-server benchmark calibrated to public audits of real MCP deployments, it reports a 95% reduction in tool tokens (47.3k to 2.4k) and an increase in effective context utilization from 24% to 91%, with all end-to-end metrics for task success, latency, cost, and reasoning quality presented as projections derived from these token counts plus external telemetry rather than direct live-agent measurements. The code is released publicly.

Significance. If the simulation generalizes and gating preserves downstream agent performance, the work could meaningfully improve scalability and reduce operational costs for MCP-based agentic systems by addressing context bloat at the protocol level rather than relying on longer contexts. Strengths include calibration of the benchmark to public audits, explicit labeling of all projections, and public code release, which aid reproducibility and external validation.

major comments (2)
  1. [Evaluation section] The 95% token reduction and context utilization gains are measured only in simulation on a calibrated 120-tool benchmark; no direct experiments are reported that exercise the full gating (ISO + state-aware + lazy loader) inside live LLM agent loops, so false-negative tool omissions or altered reasoning paths remain invisible to the token-count metric. This is load-bearing for the central thesis that the mechanism eliminates the tax in scalable workflows without performance degradation.
  2. [Abstract] End-to-end projections: All figures for task success, latency, cost, and reasoning quality are derived projections combining simulated token savings with published deployment telemetry; they are not measured on live agents. The central performance claims therefore rest on the untested assumption that the simulated benchmark accurately represents real MCP deployments and that token reductions translate directly to end-to-end benefits.
minor comments (2)
  1. [Abstract] 'gentic systems' is a typo and should read 'agentic systems'.
  2. [Evaluation] The precise definition or formula for 'effective context utilization' (described as a token-ratio quantity) should be stated explicitly, including how the 24% and 91% figures are computed, to support reproducibility; one plausible token-ratio reading is sketched after this list.
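For reference, one plausible reading of the quantity the referee asks to see defined; this is an editorial assumption, not the authors' formula, and the abstract does not give enough detail to reproduce the reported 24% and 91% values from it:

```python
# Hypothetical token-ratio reading of "effective context utilization": the
# fraction of the context window left over for task-relevant content after
# tool schemas are injected. Not the paper's definition; an assumption.
def effective_context_utilization(window_tokens: int, tool_tokens: int) -> float:
    return max(window_tokens - tool_tokens, 0) / window_tokens
```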

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to clarify the scope of our evaluation. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Evaluation section] The 95% token reduction and context utilization gains are measured only in simulation on a calibrated 120-tool benchmark; no direct experiments are reported that exercise the full gating (ISO + state-aware + lazy loader) inside live LLM agent loops, so false-negative tool omissions or altered reasoning paths remain invisible to the token-count metric. This is load-bearing for the central thesis that the mechanism eliminates the tax in scalable workflows without performance degradation.

    Authors: We acknowledge that the evaluation relies on simulation rather than live agent loops and that this leaves certain downstream effects unmeasured. The simulation is calibrated to public audits of real MCP deployments to isolate the protocol-level token overhead, which is the defined scope of the MCP/Tools Tax. The public code release is intended to support independent live-agent validation. We will revise the Evaluation section to add an explicit limitations discussion covering risks such as false-negative omissions and the role of conservative gating thresholds in mitigating them. revision: partial

  2. Referee: [Abstract] End-to-end projections: All figures for task success, latency, cost, and reasoning quality are derived projections combining simulated token savings with published deployment telemetry; they are not measured on live agents. The central performance claims therefore rest on the untested assumption that the simulated benchmark accurately represents real MCP deployments and that token reductions translate directly to end-to-end benefits.

    Authors: The manuscript already labels these figures as projections and distinguishes them from the directly measured token reductions. The benchmark calibration draws on public deployment audits, and the projections incorporate established relationships between context utilization and performance from prior literature. We will revise the abstract and Evaluation section to state the assumptions more prominently and to include a brief sensitivity discussion on how variations in the calibration would affect the projections. revision: partial

Circularity Check

0 steps flagged

No significant circularity; the empirical results rest on direct simulation measurements.

full rationale

The paper's central claims rest on direct measurements of per-turn tool token counts and context utilization within a simulated 120-tool benchmark whose inputs are calibrated to external public audits. Token reductions (47.3k → 2.4k) and utilization gains (24% → 91%) are reported as observed outputs of the simulation, not as quantities derived from fitted parameters or internal equations. End-to-end projections for task success, latency, cost, and reasoning quality are explicitly labeled as combinations of these measured values with separate published deployment telemetry and are not presented as internal predictions. No self-citations, uniqueness theorems, or ansatzes appear as load-bearing steps; the generalization of the attention paradigm is a naming reference only. The derivation chain therefore consists of independent empirical evaluation rather than any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach assumes sentence embeddings reliably capture tool-task relevance and that the simulation faithfully models real deployments; no free parameters are explicitly fitted in the abstract, but top-k selection and gating thresholds are implicit design choices (surfaced as an explicit config sketch below the ledger).

free parameters (1)
  • top-k selection threshold
    Number of top-gated tools for which full schemas are loaded; value not stated in abstract but central to the 95% reduction claim.
axioms (2)
  • domain assumption: Sentence embeddings from standard models accurately measure intent schema overlap for tool relevance.
    Used to compute the ISO score that drives gating.
  • domain assumption: The simulated token counts calibrated to public audits match real MCP multi-server deployments.
    Basis for claiming the 95% reduction and the 24% to 91% utilization improvement.
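Read as engineering knobs, the ledger's implicit choices can be surfaced as explicit configuration; the names and defaults below are illustrative assumptions, not values stated in the paper:

```python
# The ledger's free parameter and implicit thresholds as explicit, auditable
# knobs. Defaults are placeholders, not numbers taken from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolAttentionConfig:
    top_k: int = 5                # tools promoted to full schemas per turn
    iso_threshold: float = 0.3    # minimum ISO score admitted by the gate
    embedding_model: str = "all-MiniLM-L6-v2"  # assumed sentence encoder
```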

pith-pipeline@v0.9.0 · 5657 in / 1482 out tokens · 117352 ms · 2026-05-09T21:39:54.046165+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

32 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Introducing the Model Context Protocol

    Anthropic. Introducing the Model Context Protocol. Anthropic Engineering Blog, Nov. 2024. URL https://www.anthropic.com/news/model-context-protocol. Accessed 15 April 2026.

  2. [2]

    Claude Code: Agentic coding at the terminal

    Anthropic. Claude Code: Agentic coding at the terminal. Anthropic Documentation, 2025. URL https://docs.claude.com/en/docs/claude-code/overview. Accessed 15 April 2026.

  3. [3]

    Prompt caching for the Claude API

    Anthropic. Prompt caching for the Claude API. Anthropic Documentation, 2025. URL https://platform.claude.com/docs/en/build-with-claude/prompt-caching. Accessed 15 April 2026.

  4. [4]

    LLM context window limitations in 2026

    Atlan. LLM context window limitations in 2026. Atlan Knowledge Base, 2026. URL https://atlan.com/know/llm-context-window-limitations/. Accessed 15 April 2026.

  5. [5]

    Generating Long Sequences with Sparse Transformers

    R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

  6. [6]

    Poison everywhere: No output from your MCP server is safe

    CyberArk Threat Research. Poison everywhere: No output from your MCP server is safe. CyberArk Labs, 2025. URL https://www.cyberark.com/resources/threat-research-blog/poison-everywhere-no-output-from-your-mcp-server-is-safe. Accessed 15 April 2026.

  7. [7]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  8. [8]

    Agent communication gateway for semantic routing and working memory

    IETF Agent-GW Authors. Agent communication gateway for semantic routing and working memory. Technical Report draft-agent-gw-01, IETF, 2026. URL https://datatracker.ietf.org/doc/draft-agent-gw/. Accessed 15 April 2026.

  9. [9]

    Model context protocol and agent skills over media over QUIC transport

    C. Jennings, I. Swett, J. Rosenberg, and S. Nandakumar. Model context protocol and agent skills over media over QUIC transport. Technical Report draft-jennings-ai-mcp-over-moq-00, IETF, 2025. URL https://datatracker.ietf.org/doc/draft-jennings-ai-mcp-over-moq/. Accessed 15 April 2026.

  10. [10]

    Model context protocol over media over QUIC transport

    C. Jennings, I. Swett, J. Rosenberg, and S. Nandakumar. Model context protocol over media over QUIC transport. Technical Report draft-jennings-mcp-over-moqt-00, IETF, 2025. URL https://datatracker.ietf.org/doc/draft-jennings-mcp-over-moqt/. Accessed 15 April 2026.

  11. [11]

    SWE-bench: Can language models resolve real-world GitHub issues?

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In International Conference on Learning Representations (ICLR), 2024.

  12. [12]

    Billion-scale similarity search with GPUs

    J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 2019.

  13. [13]

    Code execution with MCP: Building more efficient AI agents

    A. Kaplan and Anthropic Engineering. Code execution with MCP: Building more efficient AI agents. Anthropic Engineering, Nov. 2025. URL https://www.anthropic.com/engineering/code-execution-with-mcp. Accessed 15 April 2026.

  14. [14]

    MCP faces its reckoning as cracks show in Anthropic's universal protocol

    M. Kloski. MCP faces its reckoning as cracks show in Anthropic's universal protocol. DEV Community, 2026. URL https://dev.to/mjkloski/mcp-faces-its-reckoning-as-cracks-show-in-anthropics-universal-protocol-1ghj. Accessed 15 April 2026.

  15. [15]

    LangChain agents and middleware documentation

    LangChain, Inc. LangChain agents and middleware documentation. LangChain Docs, 2026. URL https://docs.langchain.com/oss/python/langchain/agents. Accessed 15 April 2026.

  16. [16]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

  17. [17]

    Semantic Kernel agent orchestration

    Microsoft. Semantic Kernel agent orchestration. Microsoft Learn, 2026. URL https://learn.microsoft.com/en-us/semantic-kernel/frameworks/agent/agent-orchestration/. Accessed 15 April 2026.

  18. [18]

    Claude Code MCP servers and token overhead: What you need to know

    MindStudio Team. Claude Code MCP servers and token overhead: What you need to know. MindStudio Blog, Apr. 2026. URL https://www.mindstudio.ai/blog/claude-code-mcp-server-token-overhead. Accessed 15 April 2026.

  19. [20]

    NoLiMa: Long-context evaluation beyond literal matching

    NoLiMa: Long-context evaluation beyond literal matching. arXiv preprint arXiv:2502.05167, 2025. URL https://arxiv.org/abs/2502.05167. Accessed 15 April 2026.

  20. [21]

    [RFC] Secure model context protocol (SMCP) v1.0

    Model Context Protocol Community. [RFC] Secure model context protocol (SMCP) v1.0. GitHub Discussion #689, modelcontextprotocol organization, 2026. URL https://github.com/orgs/modelcontextprotocol/discussions/689. Accessed 15 April 2026.

  21. [22]

    Model context protocol specification

    Model Context Protocol Working Group. Model context protocol specification, 2025. URL https://modelcontextprotocol.io/docs/concepts/tools. Accessed 15 April 2026.

  22. [23]

    Why your AI agent wastes most of its context window on tools

    T. Pan. Why your AI agent wastes most of its context window on tools. TianPan.co Blog, Jan. 2026. URL https://tianpan.co/blog/2026-01-30-advanced-tool-use-production-ai-agents. Accessed 15 April 2026.

  23. [25]

    LLM context windows: What they are and how they work

    Redis. LLM context windows: What they are and how they work. Redis Engineering Blog. URL https://redis.io/blog/llm-context-windows/. Accessed 15 April 2026.

  24. [27]

    Sentence-BERT: Sentence embeddings using Siamese BERT-networks

    N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP, 2019.

  25. [28]

    AI agent routing: Tutorial and examples

    Safe Software. AI agent routing: Tutorial and examples. FME by Safe Software, 2026. URL https://fme.safe.com/guides/ai-agent-architecture/ai-agent-routing/. Accessed 15 April 2026.

  26. [29]

    Within the context-engineered realm of agentic AI, can MCP reinvent enterprise integration?

    M. K. Saha. Within the context-engineered realm of agentic AI, can MCP reinvent enterprise integration? AgenticAI—The Autonomous Intelligence, Medium, 2026. URL https://medium.com/p/4e2723a07ad6. Accessed 15 April 2026.

  27. [30]

    Toolformer: Language models can teach themselves to use tools

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  28. [31]

    Attention is all you need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

  29. [33]

    MindGuard: Intrinsic decision inspection for securing LLM agents against metadata poisoning

    Z. Wang, H. Du, G. Shi, J. Zhang, H. Cheng, Y. Yao, K. Guo, and X.-Y. Li. MindGuard: Intrinsic decision inspection for securing LLM agents against metadata poisoning. arXiv preprint arXiv:2508.20412v3, 2026. URL https://arxiv.org/abs/2508.20412. Accessed 15 April 2026.

  30. [34]

    ReAct: Synergizing reasoning and acting in language models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. InInternational Conference on Learning Represen- tations (ICLR), 2023. A Reference Implementation The complete runnable implementation accompanying this paper is released as a companion code bundle. The core modules are reprodu...