MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Pith reviewed 2026-05-21 15:03 UTC · model grok-4.3
The pith
A benchmark of 1,000 expert tasks on 36 real servers shows frontier models reach 82.2 percent success on multi-step tool use, with most errors arising from cognitive issues rather than tool calls.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MCP-Atlas contains 1,000 natural-language tasks spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure.
What carries the argument
The claim-level rubric that scores final answers against atomic factual claims from tool outputs, paired with an 11-category diagnostic taxonomy separating tool-call errors from cognitive errors in understanding, synthesis, parsing, and stopping.
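The scoring idea can be illustrated with a minimal sketch. It assumes each claim receives a binary supported/unsupported judgment and that a task passes when coverage is at least the threshold; the names `ClaimJudgment`, `claim_coverage`, and `task_passes`, the example claim labels, and the choice of `>=` over `>` are hypothetical and are not taken from the released evaluator.

```python
from dataclasses import dataclass


@dataclass
class ClaimJudgment:
    """One atomic claim and whether the final answer covers it (assumed binary)."""
    claim: str
    supported: bool


def claim_coverage(judgments):
    """Fraction of rubric claims covered by the final answer."""
    if not judgments:
        raise ValueError("a task needs at least one claim")
    return sum(j.supported for j in judgments) / len(judgments)


def task_passes(judgments, threshold=0.75):
    """Pass when coverage reaches the threshold (assumption: inclusive bound)."""
    return claim_coverage(judgments) >= threshold


# Hypothetical judgments for a five-claim task; the tool-call trajectory
# is never inspected, only the final answer's coverage of the claims.
judgments = [
    ClaimJudgment("paper title identified", True),
    ClaimJudgment("abstract reported", True),
    ClaimJudgment("three-way tie at 15% engagement", True),
    ClaimJudgment("campaign start dates listed", True),
    ClaimJudgment("campaign localities listed", False),
]
coverage = claim_coverage(judgments)   # 4 of 5 claims covered
passed = task_passes(judgments)        # compared against 0.75
```

Because only the final answer is compared against the claims, two agents that reach the same answer through different tool sequences would receive the same score under this sketch.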
If this is right
- Different valid sequences of tool calls for the same task receive equal credit when they produce answers that cover the required factual claims.
- Automated diagnostics can separate failures caused by incorrect tool selection or formatting from those caused by misunderstanding the task or mishandling results.
- A three-tier performance structure appears consistently across providers when models are tested under identical task conditions.
- Several high-performing models still lose points by stopping before they have synthesized the necessary information even after successful tool executions.
Where Pith is reading between the lines
- Training methods that emphasize deciding when to stop and combining outputs from multiple sources could raise overall success rates more than further improvements in tool-calling accuracy alone.
- Claim-based scoring may reduce bias in other agent evaluations where style or length currently influences scores.
- Testing on live production servers rather than mocks surfaces variability that future agent work should address directly.
- The released harness and evaluator make it possible to add new servers or tasks while keeping the same scoring standard.
Load-bearing premise
The 1,000 tasks written and verified by human experts accurately represent realistic multi-step, cross-server tool-use scenarios and the claim-level rubric measures competency independently of agent verbosity or answer style.
What would settle it
Re-evaluating the same models on tasks rephrased to alter verbosity or stylistic cues and finding that pass rates or the share of cognitive failures shift by more than 10 percentage points.
Original abstract
The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level rubric, where final answers are scored against atomic factual claims grounded in tool outputs. This answer-centric scoring permits valid alternative tool-call trajectories to receive credit. We pair this with an 11-category diagnostic taxonomy that disentangles tool-call failures from cognitive failures in task understanding, synthesis, parsing, and stopping. Evaluating 20 frontier models from six providers under matched task-level conditions, we find pass rates up to 82.2% at a 0.75 claim coverage threshold and a clear three-tier performance structure. Automated diagnostics show that 63.3% of diagnosed failures are cognitive rather than tool-call related. Notably, several high-performing models fail after successful tool execution due to premature stopping or incorrect synthesis. We release the task schema, containerized harness, claim evaluator, and a 500-task public split, while reserving a 500-task private split to preserve leaderboard integrity. The code is at https://github.com/scaleapi/mcp-atlas.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MCP-Atlas, a benchmark for LLM agent tool-use competency consisting of 1,000 natural-language tasks written and verified by human experts across 36 real production MCP servers and 220 tools. Prompts require agents to discover relevant tools among distractors and compose multi-step cross-server workflows without explicit server or parameter hints. Scoring uses an answer-centric claim-level rubric that credits valid alternative trajectories, paired with an 11-category diagnostic taxonomy separating cognitive failures from tool-call errors. Evaluation of 20 frontier models from six providers under matched conditions reports pass rates up to 82.2% at a 0.75 claim-coverage threshold, a three-tier performance structure, and that 63.3% of diagnosed failures are cognitive rather than tool-call related. The authors release the task schema, containerized harness, claim evaluator, and a 500-task public split.
Significance. If the tasks and rubric are shown to be representative and style-independent, MCP-Atlas would fill a clear gap by providing the first large-scale benchmark grounded in authentic MCP servers rather than mocks, with reproducible claim-level scoring and failure diagnostics. The public release of code and partial data strengthens its potential utility for the community.
major comments (3)
- [Abstract / Task Construction] Abstract and task-construction description: no inter-rater reliability statistics, agreement metrics, or comparison against real MCP usage logs are reported for the 1,000 expert-authored tasks. This is load-bearing because the central claim that MCP-Atlas validly measures tool-use competency rests on these tasks accurately representing realistic multi-step cross-server workflows.
- [Scoring Methodology] Scoring methodology (claim-level rubric): no ablation or sensitivity analysis is supplied for the 0.75 claim-coverage threshold or the claim-extraction process. The reported 82.2% pass rate, three-tier structure, and 63.3% cognitive-failure statistic all inherit this choice; without evidence that the rubric scores competency independently of verbosity or trajectory style, the results remain difficult to interpret.
- [Evaluation and Diagnostics] Diagnostic taxonomy application: the 11-category taxonomy and automated diagnostics are used to conclude that 63.3% of failures are cognitive, yet no human validation or inter-annotator agreement on failure categorization is provided. This directly affects the reliability of the failure-mode breakdown.
minor comments (2)
- [Data Release] Clarify how the 500-task public split was sampled to ensure representativeness across servers and task complexity.
- [Scoring Rubric] Define the exact procedure for extracting atomic claims from tool outputs and final answers in the main text rather than only in supplementary material.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We have carefully considered each major comment and provide point-by-point responses below. We agree with the need for additional validation and will make revisions to address the concerns about task construction, scoring sensitivity, and diagnostic reliability. These changes will enhance the robustness of our claims regarding MCP-Atlas as a benchmark for tool-use competency.
Point-by-point responses
-
Referee: [Abstract / Task Construction] Abstract and task-construction description: no inter-rater reliability statistics, agreement metrics, or comparison against real MCP usage logs are reported for the 1,000 expert-authored tasks. This is load-bearing because the central claim that MCP-Atlas validly measures tool-use competency rests on these tasks accurately representing realistic multi-step cross-server workflows.
Authors: We acknowledge this limitation in our current manuscript. The 1,000 tasks were developed by multiple human experts with experience in MCP server development and usage. Each task was authored by one expert and verified by at least one other for realism, correctness, and multi-step nature. However, we did not report formal inter-rater reliability metrics such as Fleiss' kappa. We will revise the manuscript to include a detailed description of the task creation and verification protocol, along with agreement statistics computed on a held-out subset of tasks. Regarding comparison to real MCP usage logs, such logs are proprietary and not accessible for benchmarking purposes. We will add a discussion explaining why expert curation was used as a proxy for realism and note this as a limitation. This addresses the core concern while maintaining the benchmark's value.
Revision: partial
-
Referee: [Scoring Methodology] Scoring methodology (claim-level rubric): no ablation or sensitivity analysis is supplied for the 0.75 claim-coverage threshold or the claim-extraction process. The reported 82.2% pass rate, three-tier structure, and 63.3% cognitive-failure statistic all inherit this choice; without evidence that the rubric scores competency independently of verbosity or trajectory style, the results remain difficult to interpret.
Authors: We agree that a sensitivity analysis would improve interpretability. The 0.75 threshold was selected after initial pilot studies to allow for reasonable variation in answer completeness while maintaining high standards. We will add an ablation study in the revised manuscript, varying the threshold across 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0, and report how the pass rates, performance tiers, and failure distributions shift. Additionally, we will describe the claim-extraction process in greater detail, including how claims are automatically extracted from tool responses and manually spot-checked. This is intended to show that the scoring is robust and not overly sensitive to specific choices or agent verbosity.
Revision: yes
-
Referee: [Evaluation and Diagnostics] Diagnostic taxonomy application: the 11-category taxonomy and automated diagnostics are used to conclude that 63.3% of failures are cognitive, yet no human validation or inter-annotator agreement on failure categorization is provided. This directly affects the reliability of the failure-mode breakdown.
Authors: We concur that validation of the diagnostic taxonomy is important for the reliability of the 63.3% statistic. The taxonomy was iteratively refined by the research team based on manual inspection of model outputs. The automated diagnostics combine heuristic rules for tool-call errors with LLM-based classification for cognitive categories. To address this, we will conduct a human validation study on a sample of 200 failure cases, involving two independent annotators, and report inter-annotator agreement (e.g., Cohen's kappa) as well as agreement with the automated system. We will update the results and discussion accordingly in the revision. If agreement is substantial, it will support our conclusions; otherwise we will qualify the findings accordingly.
Revision: yes
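For readers unfamiliar with the agreement statistic named in the last response, the following is a minimal sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the toy labels are hypothetical; the actual validation study, its categories, and its results are not described in the source.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example with made-up failure labels (not data from the paper).
annotator_1 = ["cognitive", "cognitive", "tool", "tool"]
annotator_2 = ["cognitive", "tool", "tool", "tool"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The same function could be applied between a human annotator and the automated diagnostic labels to estimate agreement with the automated system.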
Circularity Check
No circularity: empirical benchmark construction and model evaluation
Full rationale
The paper introduces MCP-Atlas as an empirical benchmark consisting of 1,000 human-expert-authored tasks across real MCP servers, evaluated via direct model runs under matched conditions to produce pass rates, tiered performance, and diagnostic failure breakdowns. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Task construction, claim-level rubrics, and automated diagnostics are defined externally to the reported results; outcomes (e.g., 82.2% pass rate, 63.3% cognitive failures) are measured against production servers rather than fitted or self-referentially defined. The work contains no load-bearing self-citations, uniqueness theorems, or ansatzes that would trigger circularity patterns. This is a standard self-contained benchmark study.
Axiom & Free-Parameter Ledger
free parameters (1)
- claim coverage threshold
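A short sketch of how reported pass rates depend on this free parameter, assuming a task passes when its claim coverage is at least the threshold. The per-task coverage values below are made up for illustration and are not results from the paper; `pass_rate` and `sweep` are hypothetical helper names.

```python
def pass_rate(coverages, threshold):
    """Share of tasks whose claim coverage reaches the threshold."""
    if not coverages:
        raise ValueError("need at least one task")
    return sum(c >= threshold for c in coverages) / len(coverages)


def sweep(coverages, thresholds):
    """Pass rate at each candidate threshold."""
    return {t: pass_rate(coverages, t) for t in thresholds}


# Made-up per-task coverage scores for a single model.
coverages = [1.0, 0.8, 0.75, 0.6, 0.5, 0.2, 1.0, 0.9]
thresholds = [0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 1.0]

results = sweep(coverages, thresholds)
for t in thresholds:
    print(f"threshold={t:.2f}  pass_rate={results[t]:.3f}")
```

Pass rate can only stay flat or fall as the threshold rises, so a sweep like this shows whether model rankings or tier boundaries depend on the specific choice of 0.75.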
axioms (1)
- domain assumption: Human experts can create and verify natural-language tasks that require genuine tool discovery and multi-step orchestration among semantically plausible distractors.
Forward citations
Cited by 5 Pith papers
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perf...
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
Reward Hacking in Rubric-Based Reinforcement Learning
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
Reference graph
Works this paper leans on
- [1] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring Massive Multitask Language Understanding. In ICLR, 2021. arXiv:2009.03300
- [2] P. Liang et al. Holistic Evaluation of Language Models (HELM). arXiv:2211.09110, 2022
- [3] S. Zhou et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023
- [4] E. Z. Liu, K. Guu, P. Pasupat, T. Shi, and P. Liang. Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration. In ICLR, 2018. arXiv:1802.08802. (Introduces MiniWoB++)
- [5] T. Xie et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv:2404.07972, 2024
- [6] Y. Chai et al. A3: Android Agent Arena for Mobile GUI Agents. arXiv:2501.01149, 2025
- [7] Y. Qin et al. ToolLLM: Facilitating Large Language Models to Master 16,464 Real-World APIs. arXiv:2307.16789,
- [8] (Introduces ToolBench dataset)
- [9]
- [10] S. G. Patil et al. The Berkeley Function-Calling Leaderboard (BFCL): From Benchmarks to Real-World Evaluation. OpenReview, 2024/2025. (Leaderboard and methodology)
- [11] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045, 2024
- [12] C. E. Jimenez et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770, 2023
- [13] G. Mialon et al. GAIA: A Benchmark for General AI Assistants. arXiv:2311.12983, 2023. (ICLR 2024 version available)
- [14] Model Context Protocol (MCP) Specification. modelcontextprotocol.io/specification/2025-03-26, 2025
- [15] Anthropic. Introducing the Model Context Protocol. anthropic.com/news/model-context-protocol, Nov 2024
- [16]
- [17] Z. Wang et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Real MCP Servers and Fuzzy Prompts. arXiv:2508.20453, 2025
- [18]
- [19]
- [20]
- [21] M. Chen et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021. (Introduces HumanEval)
- [22]
- [23]
- [24] Y. Zhao et al. MCPVerse: Expanding the Action Space for Agentic LLMs. arXiv:2507.xxxxx, 2025
- [25] L. Chen et al. MSC-Bench: A Curriculum for Multi-Server Coordination in MCP Agents. arXiv:2508.xxxxx, 2025
- [26] H. Zhang et al. MCPToolBench++: Large-Scale Multilingual MCP Server Evaluation. arXiv:2509.xxxxx, 2025
- [27] Y. Huang et al. MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use. Proceedings of ICLR, 2024
- [28]
- [29] A. Yehudai et al. Survey on Evaluation of LLM-based Agents. arXiv preprint arXiv:2503.16416, 2025.
A. Appendix A: Environment Buckets and Detailed Diagnostics
Bucket Shares and Target Mix. The distribution of tasks across the environment buckets is as follows: BASIC (32%), ANALYTICS (12%), PRODUCTIVITY (22%), FINANCIAL (12%), and CODING (22%). Representative se...
- [30] Complexity: The task must require multiple tool calls (target 3-6) and ideally involve cross-server orchestration or conditional logic.
C.1 Example Task
To illustrate the task schema, consider the following example:
Prompt: “I’m researching papers on advertisement effectiveness and comparing it to our own online database advertising data. There’s a 2024 pap...
- [31] arxiv_search_papers (“jane castleman ad locality 2024”) → paper abstract
- [32]
- [33] notion_API-post-database-query (database_id: “21b97551-844e-8068-b387-fe7a56b04348”) → campaign date
Claims List:
- [34] “There’s a 2024 paper by Jane Castleman with the title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’.”
- [35] “The abstract of the paper with title ‘Why am I Still Seeing This: Measuring the Effectiveness Of Ad Controls and Explanations in AI-Mediated Ad Targeting Systems’ is: ‘Recently, Meta has shifted towards AI-mediated ad targeting mechanisms [... abridged for paper]’.”
- [36] “There’s a tie between three advertising campaigns with an engagement rate of 15%.”
- [37] “The starting dates of the three winning advertising campaigns are: 2022-06-24, 2019-09-20 and 2017-09-09.”
- [38] “The localities of the three winning advertisement campaigns are: ‘National’, ‘International’ and ‘International’.”
D. Appendix D: Extended Results
Per-Server Error Rates. We observe significant variation in syntax and type error rates across servers. Financial servers exhibit the highest error rates (up to 45%), often due to strict requirements for date f...