The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, et al. · 2025 · arXiv 2510.25726

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 dataset 2

citation-polarity summary

background 2 use dataset 2

representative citing papers

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

cs.AI · 2026-05-04 · conditional · novelty 8.0

PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

cs.CR · 2026-05-17 · unverdicted · novelty 7.0

Presents TRUST-Bench, a benchmark for hidden-trigger tool compromises in LLM agents, and VISTA-Guard, a framework for trajectory-aware risk scoring of final actions under untrusted feedback.

RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.

Qwen-AgentWorld: Language World Models for General Agents

cs.CL · 2026-06-23 · unverdicted · novelty 6.0

The Qwen-AgentWorld models are language world models that simulate multi-domain agent environments and boost general agent capabilities via decoupled RL simulation and unified foundation model training.

PhoneBuddy: Training Open Models for Agentic Phone Use

cs.CL · 2026-06-22 · unverdicted · novelty 6.0

PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

cs.AI · 2026-05-31 · unverdicted · novelty 6.0

SkillRevise iteratively refines initial LLM-generated agent skills using execution traces to diagnose defects and apply repairs, raising success rates from 36.05% to 61.63% on SkillsBench across three benchmarks and five LLMs.

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic judgment but inadequate for professional teaching standards.

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.

ComplexConstraints and Beyond: Expert Rubrics for RLVR

cs.AI · 2026-06-08 · unverdicted · novelty 5.0

Expert-curated rubrics in the new ComplexConstraints dataset improve LLM instruction following by 12-15% when used as RL training signals, with gains transferring to out-of-distribution agentic benchmarks.

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

cs.AI · 2026-05-16 · unverdicted · novelty 5.0

MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.

Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving

cs.DC · 2026-05-07 · unverdicted · novelty 5.0

Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.

GLM-5: from Vibe Coding to Agentic Engineering

cs.LG · 2026-02-17 · unverdicted · novelty 5.0

GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

cs.CL · 2025-12-02 · unverdicted · novelty 5.0

DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

citing papers explorer

Showing 16 of 16 citing papers.

PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments cs.AI · 2026-05-04 · conditional · none · ref 19
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback cs.CR · 2026-05-17 · unverdicted · none · ref 4
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents cs.AI · 2026-05-13 · unverdicted · none · ref 35
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks cs.AI · 2026-06-28 · unverdicted · none · ref 47
Qwen-AgentWorld: Language World Models for General Agents cs.CL · 2026-06-23 · unverdicted · none · ref 37
PhoneBuddy: Training Open Models for Agentic Phone Use cs.CL · 2026-06-22 · unverdicted · none · ref 26
SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision cs.AI · 2026-05-31 · unverdicted · none · ref 22
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows? cs.CL · 2026-05-15 · unverdicted · none · ref 33
Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows cs.AI · 2026-05-14 · unverdicted · none · ref 7
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence cs.AI · 2026-04-20 · unverdicted · none · ref 44
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents cs.AI · 2026-04-07 · unverdicted · none · ref 13
ComplexConstraints and Beyond: Expert Rubrics for RLVR cs.AI · 2026-06-08 · unverdicted · none · ref 12
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents cs.AI · 2026-05-16 · unverdicted · none · ref 7
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving cs.DC · 2026-05-07 · unverdicted · none · ref 18
GLM-5: from Vibe Coding to Agentic Engineering cs.LG · 2026-02-17 · unverdicted · none · ref 22
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models cs.CL · 2025-12-02 · unverdicted · none · ref 3
