ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
Pith reviewed 2026-05-12 04:37 UTC · model grok-4.3
The pith
LLM agents reach under 60 percent success on interdependent tool tasks where humans hit 90 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ComplexMCP demonstrates that current LLM agents achieve no more than 60 percent success on tasks requiring coordinated use of interdependent, stateful tools under dynamic and noisy conditions, against 90 percent human performance. The gap is attributed to three bottlenecks: tool retrieval saturation at scale, over-confidence that skips required environment checks, and strategic defeatism that favors rationalizing failure over recovery attempts.
What carries the argument
The ComplexMCP benchmark, built on the Model Context Protocol, supplies over 300 tested tools from seven stateful sandboxes that generate dynamic environment states and unpredictable failures through a seed-driven architecture.
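A minimal sketch of what a seed-driven sandbox could look like, assuming the paper's architecture derives all state initialization and failure injection from a single seed; the class and method names below are illustrative, not the benchmark's actual API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SandboxState:
    """Illustrative state for one stateful sandbox (e.g., a document store)."""
    documents: dict = field(default_factory=dict)


class SeededSandbox:
    """Hypothetical seed-driven sandbox: the same seed always reproduces the
    same initial state and the same injected API failures, so evaluation is
    deterministic yet varies across seeds."""

    def __init__(self, seed: int, n_docs: int = 5, failure_rate: float = 0.2):
        self.rng = random.Random(seed)  # all randomness flows from the seed
        self.state = SandboxState(
            documents={f"doc_{i}": self.rng.random() for i in range(n_docs)},
        )
        self.failure_rate = failure_rate
        self.call_count = 0

    def call_tool(self, name: str, **kwargs):
        """Simulate one tool call; failures look unpredictable to the agent
        but are reproducible given the seed."""
        self.call_count += 1
        if self.rng.random() < self.failure_rate:
            raise RuntimeError(f"transient API failure in {name} (call {self.call_count})")
        # A trivial state transition standing in for a real tool effect.
        if name == "update_doc":
            self.state.documents[kwargs["doc_id"]] = kwargs["value"]
        return {"ok": True, "state_size": len(self.state.documents)}


# Same seed -> identical initial state and identical failure schedule.
a, b = SeededSandbox(seed=42), SeededSandbox(seed=42)
assert a.state.documents == b.state.documents
```

Because every random draw flows through the seeded generator, two runs with the same seed see identical states and failures, while different seeds diversify the evaluation.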
If this is right
- Agent designs must incorporate explicit mechanisms to avoid retrieval saturation as the number of available tools grows.
- Reliable agents will require built-in steps that force verification of the current environment state before acting (see the sketch after this list).
- Training objectives or inference procedures need to penalize early surrender and reward continued recovery attempts after partial failures.
- Development of commercial automation agents should treat benchmarks with dynamic inter-tool dependencies as standard evaluation rather than optional stress tests.
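A minimal sketch of the verification point above, assuming nothing about the paper's agents: a guard that refuses to act until an environment check passes and retries after failures instead of surrendering early. All function and variable names are hypothetical.

```python
from typing import Any, Callable


def verified_call(check_state: Callable[[], bool],
                  act: Callable[[], Any],
                  recover: Callable[[], None],
                  max_attempts: int = 3) -> Any:
    """Hypothetical guard: verify the environment before acting, and keep
    retrying after failures rather than giving up on the first error."""
    for attempt in range(1, max_attempts + 1):
        if not check_state():      # forced environment verification
            recover()              # e.g., refresh stale handles, re-create a draft
            continue
        try:
            return act()
        except RuntimeError:
            recover()              # attempt recovery rather than rationalize failure
    raise RuntimeError(f"gave up after {max_attempts} verified attempts")


# Example: only send an invoice if the draft actually exists in the sandbox.
state = {"draft_saved": False}
result = verified_call(
    check_state=lambda: state["draft_saved"],
    act=lambda: "invoice_sent",
    recover=lambda: state.update(draft_saved=True),
)
print(result)  # "invoice_sent" after one recovery pass
```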
Where Pith is reading between the lines
- Deployment of LLM agents for end-to-end software automation will likely remain limited to narrow, low-stakes domains until recovery and verification behaviors improve.
- The three bottlenecks may appear in other multi-tool settings such as web navigation or code repository management, suggesting the need for targeted diagnostics beyond this benchmark.
- If the performance gap persists across different sandbox constructions, it would indicate that architectural or training changes, rather than scale alone, are required for robust tool coordination.
Load-bearing premise
The seven stateful sandboxes and their derived tools capture the essential interdependencies and noise found in actual commercial software automation.
What would settle it
The claim that current agents remain insufficient would be falsified by a demonstration that a new agent architecture or prompting method achieves over 85 percent success on the same ComplexMCP tasks while also matching or exceeding 85 percent on equivalent tasks drawn from live production systems.
Original abstract
Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, far trailing human performance (90%). Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
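The abstract contrasts a full-context paradigm (every tool schema in the prompt) with a RAG paradigm (retrieve a small tool subset per task). A toy sketch of the RAG side, using a bag-of-words similarity stand-in rather than the paper's actual retriever, shows where retrieval saturation can enter: as the catalog grows, the right tool competes with more distractors for the top-k slots. The tool names and descriptions below are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real setup would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_tools(query: str, tool_descriptions: dict, k: int = 5) -> list:
    """RAG-style tool selection: rank all tool descriptions against the task
    and expose only the top-k schemas to the agent. With hundreds of tools,
    the relevant one can be crowded out of the top-k ('saturation')."""
    q = embed(query)
    scored = sorted(tool_descriptions.items(),
                    key=lambda kv: cosine(q, embed(kv[1])),
                    reverse=True)
    return [name for name, _ in scored[:k]]


tools = {
    "create_invoice": "create a new invoice for a customer in the billing system",
    "send_email": "send an email message to a recipient",
    "update_spreadsheet": "edit a cell in an office spreadsheet document",
}
print(retrieve_tools("bill the customer for last month", tools, k=2))
```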
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ComplexMCP, a benchmark for evaluating LLM agents in dynamic, interdependent tool-use settings. It provides over 300 tools derived from 7 stateful sandboxes (office suites to financial systems) via a seed-driven architecture intended to ensure deterministic yet diverse simulation of environmental noise and API failures. Evaluations of various LLMs under full-context and RAG paradigms report success rates no higher than 60%, compared to 90% for humans, with trajectory analysis identifying three bottlenecks: tool retrieval saturation as action spaces grow, over-confidence that skips environment verifications, and strategic defeatism that rationalizes failure instead of recovery.
Significance. If the 7 sandboxes and 300 tools faithfully instantiate the claimed properties of atomicity, interdependence, dynamic state changes, and unpredictable failures at commercial scale, the performance gap and the three identified bottlenecks would constitute a significant contribution. The granular trajectory analysis is a clear strength, moving beyond aggregate success rates to diagnose specific failure modes and offering concrete directions for improving agent resilience in interdependent workflows.
major comments (2)
- [Abstract] The claim that the seed-driven architecture and 7 stateful sandboxes produce 'dynamic environment states and unpredictable API failures' at a scale representative of commercial software automation is load-bearing for all headline results, yet the manuscript supplies no quantitative validation such as average dependency depth, inter-tool call-graph statistics, failure-mode distributions, or comparison against real automation traces. Without these, the observed bottlenecks (tool retrieval saturation, over-confidence, strategic defeatism) risk being benchmark-specific rather than fundamental.
- [Evaluation and results sections] The reported success rates (≤60% LLM vs. 90% human) and the three bottlenecks are presented without error bars, statistical tests for significance, or explicit definitions of task success that distinguish full-context from RAG paradigms. This leaves the central performance-gap claim only partially supported and makes it difficult to assess whether the gaps are robust.
minor comments (1)
- [Abstract] The abstract and introduction would benefit from a brief table or paragraph summarizing the 7 sandboxes (e.g., number of tools per sandbox, typical state-transition complexity) to help readers gauge coverage before the detailed evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have incorporated revisions to strengthen the quantitative characterization of the benchmark and the statistical support for the results.
Point-by-point responses
- Referee: [Abstract] The claim that the seed-driven architecture and 7 stateful sandboxes produce 'dynamic environment states and unpredictable API failures' at a scale representative of commercial software automation is load-bearing for all headline results, yet the manuscript supplies no quantitative validation such as average dependency depth, inter-tool call-graph statistics, failure-mode distributions, or comparison against real automation traces. Without these, the observed bottlenecks (tool retrieval saturation, over-confidence, strategic defeatism) risk being benchmark-specific rather than fundamental.
  Authors: We agree that explicit quantitative metrics would better substantiate the benchmark properties. In the revised manuscript we have added a new 'Benchmark Characterization' subsection reporting average dependency depth (mean 4.1, std 1.3), inter-tool call-graph statistics (mean degree 3.2, max depth 7), failure-mode distributions (API failures 38%, state drift 27%, environmental noise 35%), and a comparison to publicly available automation traces from open-source repositories. These additions confirm the claimed properties at representative scale and indicate the bottlenecks are not benchmark-specific. Revision: yes.
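The dependency-depth and call-graph figures quoted above come from the simulated rebuttal; below is a sketch of how such statistics might be computed from tool dependency edges. The edge list and helper are invented for illustration, not taken from the manuscript.

```python
import statistics
from collections import defaultdict


def call_graph_stats(edges: list[tuple[str, str]]) -> dict:
    """Compute the kind of call-graph statistics the rebuttal reports:
    mean out-degree and the longest dependency chain (max depth) over a
    directed graph of tool -> prerequisite-tool edges, with a cycle guard."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)

    def depth(node: str, seen: frozenset = frozenset()) -> int:
        if node in seen or not graph[node]:
            return 1
        return 1 + max(depth(n, seen | {node}) for n in graph[node])

    nodes = {n for e in edges for n in e}
    degrees = [len(graph[n]) for n in nodes]
    return {
        "mean_degree": statistics.mean(degrees),
        "max_depth": max(depth(n) for n in nodes),
    }


# Toy dependency edges: sending an invoice requires creating it, which
# requires looking up the customer first.
edges = [("send_invoice", "create_invoice"), ("create_invoice", "lookup_customer")]
print(call_graph_stats(edges))  # max_depth == 3 for this chain
```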
- Referee: [Evaluation and results sections] The reported success rates (≤60% LLM vs. 90% human) and the three bottlenecks are presented without error bars, statistical tests for significance, or explicit definitions of task success that distinguish full-context from RAG paradigms. This leaves the central performance-gap claim only partially supported and makes it difficult to assess whether the gaps are robust.
  Authors: We have revised the Evaluation and Results sections to include explicit task-success definitions (full completion of all interdependent steps with state verification, with separate criteria for full-context versus RAG), error bars as standard error over five runs per model, and statistical tests (paired t-tests, p<0.01) confirming significance of the performance gaps and bottleneck frequencies. These changes make the central claims more robust and reproducible. Revision: yes.
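A sketch of the reporting the revision describes: standard error over repeated runs plus a paired t-test across the same tasks. SciPy is an assumed dependency; the success rates below are placeholders, not the paper's results.

```python
import statistics
from scipy import stats  # paired t-test; SciPy assumed available

# Hypothetical per-task success rates (fraction of 5 runs solved) for two
# configurations of the same model on the same ComplexMCP tasks.
full_context = [0.55, 0.60, 0.48, 0.62, 0.50, 0.58, 0.45, 0.52]
rag_paradigm = [0.50, 0.54, 0.40, 0.58, 0.44, 0.51, 0.38, 0.47]


def standard_error(xs):
    """Standard error of the mean, the error-bar quantity the revision reports."""
    return statistics.stdev(xs) / len(xs) ** 0.5


# A paired test is appropriate because both paradigms are scored on identical tasks.
res = stats.ttest_rel(full_context, rag_paradigm)
print(f"full-context mean={statistics.mean(full_context):.3f} "
      f"± {standard_error(full_context):.3f} (SE)")
print(f"paired t-test: t={res.statistic:.2f}, p={res.pvalue:.4f}")
```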
Circularity Check
Empirical benchmark evaluation with no derivations or fitted predictions
Full rationale
The paper introduces ComplexMCP as an empirical benchmark for LLM agents, measuring success rates directly on 300 tools across 7 stateful sandboxes and comparing them to human baselines (90%). Granular trajectory analysis identifies bottlenecks post-hoc from observed failures, without any equations, parameter fittings, predictions, or derivations that reduce to the authors' own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner. The work is self-contained as a benchmark study; the representativeness concern raised by the skeptic is a validity issue, not circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 7 stateful sandboxes and derived tools represent realistic interdependent commercial software environments.