arxiv: 2601.11044 · v4 · submitted 2026-01-16 · 💻 cs.AI

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Pith reviewed 2026-05-16 14:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords autonomous agents · LLM benchmarks · tool use · long-horizon tasks · agent scaffolds · closed-source vs open-source · user simulation

The pith

AgencyBench shows closed-source models outperforming open-source ones 48.4% to 32.1% on long-horizon tasks averaging one million tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AgencyBench introduces a benchmark for autonomous agents drawn from daily AI usage, testing six core capabilities across 32 real-world scenarios and 138 tasks. Each task demands an average of 90 tool calls, one million tokens, and hours of runtime, which existing single-capability benchmarks cannot capture. The benchmark replaces human-in-the-loop feedback with a user simulation agent and adds a Docker sandbox for visual and functional rubric scoring, allowing automated rollout and evaluation at scale. Experiments find closed-source models reach 48.4% success while open-source models reach 32.1%, with further gaps in resource efficiency, feedback-driven correction, and tool preferences. The results also show that proprietary models excel inside their native agent frameworks while open-source models peak in particular execution scaffolds, pointing to the need for joint model-and-framework optimization.

Core claim

AgencyBench establishes that closed-source models significantly outperform open-source models on realistic long-horizon agent tasks and that performance depends on alignment between model and agentic scaffold, with proprietary models strongest in native ecosystems and open-source models showing distinct framework-specific peaks.

What carries the argument

User simulation agent supplying iterative feedback together with a Docker sandbox for rubric-based visual and functional assessment, applied to 138 tasks that average 90 tool calls and one million tokens each.
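
The feedback-then-score loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' toolkit: `Task`, `run_task`, `toy_agent`, `toy_simulator`, and the `max_rounds` cap are hypothetical names, and plain Python predicates stand in for the Docker-sandboxed visual and functional rubric checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    query: str
    # Hypothetical stand-in for sandboxed rubric checks: each predicate
    # inspects the final deliverable and returns True if the item passes.
    rubric: List[Callable[[str], bool]]
    max_rounds: int = 3  # assumed cap on feedback rounds


def run_task(task: Task,
             agent: Callable[[str, str], str],
             simulate_user: Callable[[Task, str], str]) -> float:
    """Roll out one task and return the fraction of rubric items passed."""
    feedback = ""
    deliverable = ""
    for _ in range(task.max_rounds):
        deliverable = agent(task.query, feedback)
        feedback = simulate_user(task, deliverable)
        if not feedback:  # empty feedback means the simulated user is satisfied
            break
    passed = sum(check(deliverable) for check in task.rubric)
    return passed / len(task.rubric)


def toy_agent(query: str, feedback: str) -> str:
    # Adds the "undo" feature only after the simulated user asks for it.
    base = "board,click"
    return base + ",undo" if "undo" in feedback else base


def toy_simulator(task: Task, deliverable: str) -> str:
    return "" if "undo" in deliverable else "please add undo"


rubric = [lambda d: "board" in d, lambda d: "undo" in d]
score_with_feedback = run_task(Task("build gomoku", rubric), toy_agent, toy_simulator)
score_single_round = run_task(Task("build gomoku", rubric, max_rounds=1),
                              toy_agent, toy_simulator)
```

In this toy setup the score depends on how many feedback rounds the agent is allowed, which is why the fidelity of the simulated feedback matters for the reported numbers.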

If this is right

  • Closed-source models achieve substantially higher success rates than open-source models across the benchmark.
  • Models differ markedly in resource efficiency, ability to self-correct from feedback, and preferences for particular tools.
  • Proprietary models reach their highest performance when run inside their native agent frameworks.
  • Open-source models display clear performance peaks when paired with specific execution frameworks.
  • Future gains require co-optimizing model architecture together with the surrounding agentic framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open-source model developers could target feedback-driven self-correction as a concrete route to close the observed gap.
  • The 32 scenarios could serve as a fixed test set for measuring whether larger open models narrow the closed-source lead over time.
  • Embedding AgencyBench tasks directly into agent training loops might accelerate improvement on long-horizon consistency.

Load-bearing premise

The user simulation agent supplies realistic, unbiased iterative feedback that matches human judgment for long-horizon tasks.

What would settle it

A side-by-side run of a random subset of tasks using both the simulation agent and real human users, then comparing the agent success scores and the quality of feedback provided.
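
One way such a comparison might be summarized is with Pearson and Spearman correlations between simulator-loop scores and human-loop scores on the same tasks. The sketch below is illustrative only: the score lists are invented placeholder values, not data from the paper, and the helper names `pearson`, `ranks`, and `spearman` are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder per-task scores (made up for illustration only).
simulator_scores = [0.2, 0.5, 0.9, 0.4, 0.7]
human_scores = [0.1, 0.6, 0.8, 0.3, 0.9]

rho = spearman(simulator_scores, human_scores)
r = pearson(simulator_scores, human_scores)
```

A high rank correlation would suggest the simulator orders agents the way human users do, even if absolute scores differ; a low one would point to simulation-specific artifacts.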

Figures

Figures reproduced from arXiv: 2601.11044 by Dayuan Fu, Dequan Wang, Jie Sun, Junhao Shi, Keyu Li, Mohan Jiang, Pengfei Liu, Shijie Xia, Tianze Xu, Weiye Si, Wenjie Li, Xiaojie Cai, Yang Xiao, Yunze Wu.

Figure 1: Overview of AGENCYBENCH. Left: Distribution of the 32 scenarios and 138 tasks across 6 distinct agentic capabilities. Right: Comparison with existing benchmarks. AGENCYBENCH focuses on diverse, long-horizon real-world tasks, requiring an average of 1M tokens and 90 multi-turn tool uses. It integrates a user simulation agent for iterative feedback and a Docker-based sandbox for automated rubric-based assess…
Figure 2: AGENCYBENCH Rollout Generation and Evaluation Pipeline. Rollout generation takes place within workspace, where the agent receives task queries and deliverables, completing tasks through multi-turn interactions with the environment (e.g., tool execution results and feedback from the user simulation agent). Upon task completion, deliverables are synced to a Docker sandbox for operation execution (e.g., UI ac…
Figure 3: An Illustrative Evaluation Scenario in AGENCYBENCH: Developing a Gomoku Game. The scenario consists of five sequential tasks with increasing complexity, requiring the incremental addition of new features. The primary deliverables include HTML, CSS, and JS source code. Evaluation scripts execute these files within a remote Docker sandbox, performing interactive operations such as clicking, screen recording,…
Figure 4: Efficiency Comparison Across Models. Efficiency is calculated by dividing the average score by the number of attempts and average token consumption, respectively. GPT-5.2 achieves the highest attempt efficiency, while Qwen-3-235B-A22B-Thinking ranks the lowest. For token efficiency, Grok-4.1-Fast performs best, whereas Claude-4.5-Sonnet is the least efficient one. Metric: Average Score (SAvg) Calculated as…
Figure 5: Tool Invocation Patterns Across Models. Claude-4.5-Opus and GPT-5.2 show a preference for shell execution tools, while Gemini-3-Pro and Qwen-3-235B-A22B-Thinking favor file operation and memory management. Grok-4.1-Fast, GLM-4.6, and Deepseek-V3 series exhibit a strong preference for web search tools. balanced performance, while Qwen-3-235B-A22B-Thinking demonstrates relative strength in research despite …
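
The Figure 4 caption defines two efficiency ratios: average score divided by number of attempts, and average score divided by average token consumption. A small sketch of that arithmetic follows. The per-million-token normalization, the `RunStats` type, and the model names and numbers are assumptions made for illustration; they are not values reported in the paper.

```python
from typing import Dict, NamedTuple


class RunStats(NamedTuple):
    avg_score: float     # average rubric score, in percent
    avg_attempts: float  # average number of attempts per task
    avg_tokens: float    # average tokens consumed per task


def attempt_efficiency(s: RunStats) -> float:
    """Average score per attempt."""
    return s.avg_score / s.avg_attempts


def token_efficiency(s: RunStats, per_tokens: float = 1_000_000) -> float:
    """Average score per `per_tokens` tokens (normalization is an assumption)."""
    return s.avg_score / (s.avg_tokens / per_tokens)


# Hypothetical models with invented statistics.
stats: Dict[str, RunStats] = {
    "model_a": RunStats(avg_score=60.0, avg_attempts=1.5, avg_tokens=3_000_000),
    "model_b": RunStats(avg_score=40.0, avg_attempts=2.0, avg_tokens=500_000),
}

best_by_attempts = max(stats, key=lambda m: attempt_efficiency(stats[m]))
best_by_tokens = max(stats, key=lambda m: token_efficiency(stats[m]))
```

With these placeholder numbers the two rankings disagree, which mirrors the caption's point that the most attempt-efficient model need not be the most token-efficient one.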
Original abstract

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces AgencyBench, a benchmark for LLM-based autonomous agents on long-horizon real-world tasks. It comprises 32 scenarios and 138 tasks (averaging 90 tool calls and 1M tokens each) derived from daily AI usage, evaluating 6 core agentic capabilities. Automated evaluation is enabled via a user simulation agent for iterative feedback and a Docker sandbox for visual/functional rubric assessment. Key results show closed-source models outperforming open-source models (48.4% vs 32.1% success), with further analysis of resource efficiency, self-correction, tool-use preferences, and the effects of agentic scaffolds. The benchmark and toolkit are released publicly.

Significance. If the user simulation agent's feedback is shown to align with human judgment, AgencyBench would provide a scalable, automated testbed for complex agent capabilities that existing single-step benchmarks cannot capture. The reported performance gaps, efficiency disparities, and scaffold interactions could inform co-optimization of models and frameworks, while the public release supports reproducibility and further research in autonomous agents.

major comments (3)
  1. Abstract and §4 (Evaluation Setup): The headline result (closed-source 48.4% vs open-source 32.1%) is generated by an automated loop relying on the user simulation agent to supply iterative feedback over ~90 tool calls. No correlation coefficients, inter-rater agreement scores, or ablation comparing simulator feedback to human judgments on the same long-horizon trajectories are reported, leaving the performance ordering vulnerable to simulation-specific artifacts.
  2. §3 (Benchmark Construction): The 138 tasks and associated rubrics are described at a high level, but the manuscript provides no details on rubric derivation process, pilot validation against real-user outcomes, or measures of simulation fidelity for the 1M-token contexts. This weakens claims about the benchmark capturing genuine real-world agent performance.
  3. §5 (Analysis of Self-Correction and Tool Use): The reported disparities in feedback-driven self-correction and tool-use preferences across model families are presented without quantitative ablations isolating the contribution of the simulator versus intrinsic model differences; this makes it difficult to attribute the gaps solely to agent capability.
minor comments (3)
  1. [Abstract] The abstract refers to '6 core agentic capabilities' without enumerating them explicitly in the provided text.
  2. [Related Work] Consider adding citations to prior long-horizon agent benchmarks (e.g., WebArena, AgentBench) for clearer positioning in the related work section.
  3. [Figures and Tables] Figure captions and table headers could more explicitly state the number of runs or statistical significance for the reported percentages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We have carefully considered each point and provide point-by-point responses below, along with indications of revisions to be made in the updated manuscript.

Point-by-point responses
  1. Referee: Abstract and §4 (Evaluation Setup): The headline result (closed-source 48.4% vs open-source 32.1%) is generated by an automated loop relying on the user simulation agent to supply iterative feedback over ~90 tool calls. No correlation coefficients, inter-rater agreement scores, or ablation comparing simulator feedback to human judgments on the same long-horizon trajectories are reported, leaving the performance ordering vulnerable to simulation-specific artifacts.

    Authors: We agree that demonstrating alignment between the user simulation agent and human judgment is essential for validating the benchmark's reliability. In the revised version, we will add a dedicated subsection in §4 that presents a human evaluation study on a representative subset of tasks. This study will include correlation coefficients (e.g., Pearson and Spearman) between simulator-provided feedback and human assessments, as well as inter-rater agreement metrics. We will also include an ablation comparing performance under simulator vs. human feedback loops where feasible. These additions will directly address concerns about simulation-specific artifacts.

    Revision: yes

  2. Referee: §3 (Benchmark Construction): The 138 tasks and associated rubrics are described at a high level, but the manuscript provides no details on rubric derivation process, pilot validation against real-user outcomes, or measures of simulation fidelity for the 1M-token contexts. This weakens claims about the benchmark capturing genuine real-world agent performance.

    Authors: We appreciate this observation and will substantially expand §3 in the revision. The updated section will detail the rubric derivation process, which involved iterative refinement based on real-world AI usage logs and expert review. We will describe the pilot validation conducted with actual users to ensure rubrics reflect realistic outcomes. Additionally, we will report quantitative measures of simulation fidelity, such as the agreement rate between simulated user responses and human responses in sampled interactions, particularly for long-context scenarios. This will provide stronger evidence for the benchmark's real-world relevance.

    Revision: yes

  3. Referee: §5 (Analysis of Self-Correction and Tool Use): The reported disparities in feedback-driven self-correction and tool-use preferences across model families are presented without quantitative ablations isolating the contribution of the simulator versus intrinsic model differences; this makes it difficult to attribute the gaps solely to agent capability.

    Authors: We acknowledge the need for clearer isolation of factors in our analysis. In the revised manuscript, we will enhance §5 with additional ablations. Specifically, we will include experiments where the same models are evaluated under both the simulator and a human-in-the-loop setup on a subset of tasks to quantify the simulator's influence. We will also present results controlling for the simulator by using fixed feedback templates derived from human data. These ablations will help attribute observed disparities more confidently to differences in model capabilities.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark with direct rubric measurements

Full rationale

The paper presents AgencyBench as an empirical evaluation framework consisting of 138 tasks with fixed queries, deliverables, and rubrics. Performance is computed directly via rubric-based assessment inside a Docker sandbox after iterative feedback from the simulation agent. No equations, fitted parameters, or derivations are present that reduce the reported scores (48.4% closed-source vs 32.1% open-source) to inputs by construction. The simulation agent is a methodological tool for automation rather than a self-referential component whose outputs are defined in terms of the measured results. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The derivation chain is therefore self-contained against external task definitions and rubrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on standard assumptions about agent tool use and the fidelity of simulated users; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: LLM agents can execute multi-step tasks using external tools over long horizons
    Invoked in the definition of the six core agentic capabilities and task construction.
  • domain assumption: Simulated user feedback approximates human feedback for evaluation purposes
    Central to the automated evaluation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5655 in / 1292 out tokens · 30666 ms · 2026-05-16T14:07:59.175984+00:00 · methodology


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

    cs.SE 2026-05 unverdicted novelty 7.0

    SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

  2. How to Interpret Agent Behavior

    cs.AI 2026-05 conditional novelty 6.0

    ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.

  3. HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    HMACE deploys Proposer, Generator, Evaluator, and Reflector agents in an evolutionary loop to generate and refine heuristics for NP-hard problems, reporting lower optimality gaps and token costs than baselines on TSP ...

  4. Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems

    cs.MA 2026-04 unverdicted novelty 6.0

    Multi-agent systems amplify minor stochastic biases into systemic polarization via echo-chamber effects in structured workflows, even with neutral agents.

  5. FileGram: Grounding Agent Personalization in File-System Behavioral Traces

    cs.CV 2026-04 unverdicted novelty 6.0

    FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
