arxiv: 2601.11044 · v4 · submitted 2026-01-16 · 💻 cs.AI

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

Pith reviewed 2026-05-16 14:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords autonomous agents, LLM benchmarks, tool use, long-horizon tasks, agent scaffolds, closed-source vs open-source, user simulation

The pith

AgencyBench shows closed-source models outperforming open-source ones 48.4% to 32.1% on long-horizon tasks averaging one million tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AgencyBench introduces a benchmark for autonomous agents drawn from daily AI usage, testing six core capabilities across 32 real-world scenarios and 138 tasks. Each task demands an average of 90 tool calls, one million tokens, and hours of runtime, which existing single-capability benchmarks cannot capture. The benchmark replaces human-in-the-loop feedback with a user simulation agent and adds a Docker sandbox for visual and functional rubric scoring, allowing automated rollout and evaluation at scale. Experiments find closed-source models reach 48.4% success while open-source models reach 32.1%, with further gaps in resource efficiency, feedback-driven correction, and tool preferences. The results also show that proprietary models excel inside their native agent frameworks while open-source models peak in particular execution scaffolds, pointing to the need for joint model-and-framework optimization.

Core claim

AgencyBench establishes that closed-source models significantly outperform open-source models on realistic long-horizon agent tasks and that performance depends on alignment between model and agentic scaffold, with proprietary models strongest in native ecosystems and open-source models showing distinct framework-specific peaks.

What carries the argument

User simulation agent supplying iterative feedback together with a Docker sandbox for rubric-based visual and functional assessment, applied to 138 tasks that average 90 tool calls and one million tokens each.
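The feedback loop described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: `run_task`, the three callables it takes, the `max_attempts` cap, and the `pass_threshold` value are all hypothetical, and the idea that feedback is triggered by a low rubric score is a simplification made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    deliverable: str
    score: float               # rubric score in [0, 1] (assumed scale)
    feedback: Optional[str]    # simulated-user feedback issued after this attempt


def run_task(
    query: str,
    agent_step: Callable[[str, Optional[str]], str],
    score_rubric: Callable[[str], float],
    simulate_user: Callable[[str, float], str],
    max_attempts: int = 3,          # hypothetical cap
    pass_threshold: float = 0.6,    # hypothetical threshold
) -> List[Attempt]:
    """Produce a deliverable, score it, and revise on simulated-user feedback."""
    attempts: List[Attempt] = []
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        deliverable = agent_step(query, feedback)
        score = score_rubric(deliverable)
        if score >= pass_threshold:
            # Good enough: record the attempt and stop asking for revisions.
            attempts.append(Attempt(deliverable, score, None))
            break
        feedback = simulate_user(deliverable, score)
        attempts.append(Attempt(deliverable, score, feedback))
    return attempts


# Toy stand-ins so the sketch runs end to end.
def toy_agent(query: str, feedback: Optional[str]) -> str:
    return query + (" [revised]" if feedback else "")


def toy_rubric(deliverable: str) -> float:
    return 1.0 if "[revised]" in deliverable else 0.4


def toy_user(deliverable: str, score: float) -> str:
    return f"Score {score:.1f} is too low; please revise."


history = run_task("build a Gomoku board", toy_agent, toy_rubric, toy_user)
print([(a.score, a.feedback is not None) for a in history])
```

With the toy stand-ins, the first attempt fails the rubric and receives feedback, and the second attempt passes. In a real harness the rubric call would run inside the sandbox and the agent step would span many tool calls.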

If this is right

Closed-source models achieve substantially higher success rates than open-source models across the benchmark.
Models differ markedly in resource efficiency, ability to self-correct from feedback, and preferences for particular tools.
Proprietary models reach their highest performance when run inside their native agent frameworks.
Open-source models display clear performance peaks when paired with specific execution frameworks.
Future gains require co-optimizing model architecture together with the surrounding agentic framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Open-source model developers could target feedback-driven self-correction as a concrete route to close the observed gap.
The 32 scenarios could serve as a fixed test set for measuring whether larger open models narrow the closed-source lead over time.
Embedding AgencyBench tasks directly into agent training loops might accelerate improvement on long-horizon consistency.

Load-bearing premise

The user simulation agent supplies realistic, unbiased iterative feedback that matches human judgment for long-horizon tasks.

What would settle it

A side-by-side run of a random subset of tasks using both the simulation agent and real human users, then comparing the agent success scores and the quality of feedback provided.
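One way to summarize such a side-by-side run is raw agreement plus Cohen's kappa on per-task pass/fail outcomes. The sketch below assumes each task ends in a binary outcome under both the simulator loop and the human loop; the outcome lists are made-up placeholders, not results from the paper.

```python
from typing import Sequence


def agreement_rate(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of tasks on which both feedback loops give the same outcome."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement for two binary raters."""
    n = len(a)
    p_observed = agreement_rate(a, b)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if p_expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Placeholder outcomes for eight sampled tasks (True = task passed).
simulator_loop = [True, True, False, False, True, False, True, False]
human_loop = [True, False, False, False, True, False, True, True]

print(agreement_rate(simulator_loop, human_loop))  # 0.75
print(cohens_kappa(simulator_loop, human_loop))    # 0.5
```

Kappa is reported alongside raw agreement because two loops that both pass most tasks can agree often by chance alone.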

Figures

Figures reproduced from arXiv: 2601.11044 by Dayuan Fu, Dequan Wang, Jie Sun, Junhao Shi, Keyu Li, Mohan Jiang, Pengfei Liu, Shijie Xia, Tianze Xu, Weiye Si, Wenjie Li, Xiaojie Cai, Yang Xiao, Yunze Wu.

**Figure 1.** Overview of AGENCYBENCH. Left: Distribution of the 32 scenarios and 138 tasks across 6 distinct agentic capabilities. Right: Comparison with existing benchmarks. AGENCYBENCH focuses on diverse, long-horizon real-world tasks, requiring an average of 1M tokens and 90 multi-turn tool uses. It integrates a user simulation agent for iterative feedback and a Docker-based sandbox for automated rubric-based assess…

**Figure 2.** AGENCYBENCH Rollout Generation and Evaluation Pipeline. Rollout generation takes place within workspace, where the agent receives task queries and deliverables, completing tasks through multi-turn interactions with the environment (e.g., tool execution results and feedback from the user simulation agent). Upon task completion, deliverables are synced to a Docker sandbox for operation execution (e.g., UI ac…

**Figure 3.** An Illustrative Evaluation Scenario in AGENCYBENCH: Developing a Gomoku Game. The scenario consists of five sequential tasks with increasing complexity, requiring the incremental addition of new features. The primary deliverables include HTML, CSS, and JS source code. Evaluation scripts execute these files within a remote Docker sandbox, performing interactive operations such as clicking, screen recording,…

**Figure 4.** Efficiency Comparison Across Models. Efficiency is calculated by dividing the average score by the number of attempts and average token consumption, respectively. GPT-5.2 achieves the highest attempt efficiency, while Qwen-3-235B-A22B-Thinking ranks the lowest. For token efficiency, Grok-4.1-Fast performs best, whereas Claude-4.5-Sonnet is the least efficient one. Metric: Average Score (SAvg) Calculated as…

**Figure 5.** Tool Invocation Patterns Across Models. Claude-4.5-Opus and GPT-5.2 show a preference for shell execution tools, while Gemini-3-Pro and Qwen-3-235B-A22B-Thinking favor file operation and memory management. Grok-4.1-Fast, GLM-4.6, and Deepseek-V3 series exhibit a strong preference for web search tools. […] balanced performance, while Qwen-3-235B-A22B-Thinking demonstrates relative strength in research despite …
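The two efficiency ratios named in the Figure 4 caption can be written out as a short sketch. The caption only says the average score is divided by the number of attempts and by average token consumption; the per-million-token scaling, the function names, and the model figures below are assumptions chosen for illustration and are not values from the paper.

```python
def attempt_efficiency(avg_score: float, avg_attempts: float) -> float:
    """Average score earned per attempt."""
    return avg_score / avg_attempts


def token_efficiency(avg_score: float, avg_tokens: float,
                     per: float = 1_000_000) -> float:
    """Average score earned per `per` tokens (the scaling unit is an assumption)."""
    return avg_score / (avg_tokens / per)


# Placeholder inputs: (average score, average attempts, average tokens).
models = {
    "model_a": (50.0, 2.0, 2_000_000),
    "model_b": (30.0, 1.5, 500_000),
}

for name, (score, attempts, tokens) in models.items():
    print(name,
          attempt_efficiency(score, attempts),
          token_efficiency(score, tokens))
```

In this toy data `model_a` scores higher and has the better attempt efficiency, but `model_b` is more token-efficient, which is the kind of split between the two rankings that the caption reports.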

Original abstract

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AgencyBench gives a useful new long-horizon agent benchmark with concrete numbers, but the closed-vs-open gap rests on an unvalidated user simulator.

The main thing here is a new benchmark built from 32 real-world scenarios and 138 tasks that each run to about 90 tool calls and 1M tokens. They automate the loop with a user simulation agent plus Docker-based rubric checks, which removes the human bottleneck and lets them run the full set. That produces the headline split: closed-source models at 48.4% versus open-source at 32.1%, plus side findings on efficiency, self-correction, and how different scaffolds favor different model families. The release of the full benchmark and toolkit is the practical win; anyone building agents can now plug in and compare on longer trajectories than the usual short-task suites allow.

What the paper does cleanly is show measurable differences in resource use and framework interactions without obvious circularity in the metrics. The soft spot is exactly the one the stress-test flags. The performance ordering depends on the simulator producing feedback that tracks what real users would say across long sessions. The abstract gives no correlation numbers, no inter-rater checks against humans, and no ablation on the same trajectories, so we cannot yet tell whether the gap is model capability or simulator artifact. Rubric construction and scenario selection also get little space, which leaves open questions about coverage and bias. Minor, but worth a note in review.

This is the kind of resource paper that belongs in the agent evaluation literature. Researchers working on long-context agents or scaffold design will want the data and the code. It is solid enough on its own terms to deserve a serious referee, mainly to press on the simulation validation and to tighten the methods section. I would send it out rather than desk-reject.

Referee Report

3 major / 3 minor

Summary. The paper introduces AgencyBench, a benchmark for LLM-based autonomous agents on long-horizon real-world tasks. It comprises 32 scenarios and 138 tasks (averaging 90 tool calls and 1M tokens each) derived from daily AI usage, evaluating 6 core agentic capabilities. Automated evaluation is enabled via a user simulation agent for iterative feedback and a Docker sandbox for visual/functional rubric assessment. Key results show closed-source models outperforming open-source models (48.4% vs 32.1% success), with further analysis of resource efficiency, self-correction, tool-use preferences, and the effects of agentic scaffolds. The benchmark and toolkit are released publicly.

Significance. If the user simulation agent's feedback is shown to align with human judgment, AgencyBench would provide a scalable, automated testbed for complex agent capabilities that existing single-step benchmarks cannot capture. The reported performance gaps, efficiency disparities, and scaffold interactions could inform co-optimization of models and frameworks, while the public release supports reproducibility and further research in autonomous agents.

major comments (3)

Abstract and §4 (Evaluation Setup): The headline result (closed-source 48.4% vs open-source 32.1%) is generated by an automated loop relying on the user simulation agent to supply iterative feedback over ~90 tool calls. No correlation coefficients, inter-rater agreement scores, or ablation comparing simulator feedback to human judgments on the same long-horizon trajectories are reported, leaving the performance ordering vulnerable to simulation-specific artifacts.
§3 (Benchmark Construction): The 138 tasks and associated rubrics are described at a high level, but the manuscript provides no details on rubric derivation process, pilot validation against real-user outcomes, or measures of simulation fidelity for the 1M-token contexts. This weakens claims about the benchmark capturing genuine real-world agent performance.
§5 (Analysis of Self-Correction and Tool Use): The reported disparities in feedback-driven self-correction and tool-use preferences across model families are presented without quantitative ablations isolating the contribution of the simulator versus intrinsic model differences; this makes it difficult to attribute the gaps solely to agent capability.

minor comments (3)

[Abstract] The abstract refers to '6 core agentic capabilities' without enumerating them explicitly in the provided text.
[Related Work] Consider adding citations to prior long-horizon agent benchmarks (e.g., WebArena, AgentBench) for clearer positioning in the related work section.
[Figures and Tables] Figure captions and table headers could more explicitly state the number of runs or statistical significance for the reported percentages.
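The last minor comment asks for some statement of statistical uncertainty around the reported percentages. One common option is a percentile bootstrap over per-task scores, sketched below. The function name, the resample count, and the per-task scores are hypothetical placeholders; the paper's abstract does not describe any such analysis.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_gap_ci(
    group_a: List[float],
    group_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed gap, lower bound, upper bound) for mean(a) - mean(b)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    gaps = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = [rng.choice(group_a) for _ in group_a]
        resample_b = [rng.choice(group_b) for _ in group_b]
        gaps.append(mean(resample_a) - mean(resample_b))
    gaps.sort()
    low_index = int((alpha / 2) * n_resamples)
    high_index = n_resamples - 1 - low_index
    return mean(group_a) - mean(group_b), gaps[low_index], gaps[high_index]


# Placeholder per-task scores in [0, 1] for two model groups.
closed_scores = [0.6, 0.5, 0.4, 0.7, 0.3, 0.5]
open_scores = [0.3, 0.4, 0.2, 0.5, 0.3, 0.1]

gap, low, high = bootstrap_gap_ci(closed_scores, open_scores)
print(f"gap={gap:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

An interval that excludes zero would support calling a gap significant; with only a handful of tasks per group, as in this toy data, the interval is wide.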

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We have carefully considered each point and provide point-by-point responses below, along with indications of revisions to be made in the updated manuscript.

Referee: Abstract and §4 (Evaluation Setup): The headline result (closed-source 48.4% vs open-source 32.1%) is generated by an automated loop relying on the user simulation agent to supply iterative feedback over ~90 tool calls. No correlation coefficients, inter-rater agreement scores, or ablation comparing simulator feedback to human judgments on the same long-horizon trajectories are reported, leaving the performance ordering vulnerable to simulation-specific artifacts.

Authors: We agree that demonstrating alignment between the user simulation agent and human judgment is essential for validating the benchmark's reliability. In the revised version, we will add a dedicated subsection in §4 that presents a human evaluation study on a representative subset of tasks. This study will include correlation coefficients (e.g., Pearson and Spearman) between simulator-provided feedback and human assessments, as well as inter-rater agreement metrics. We will also include an ablation comparing performance under simulator vs. human feedback loops where feasible. These additions will directly address concerns about simulation-specific artifacts.

Revision: yes

Referee: §3 (Benchmark Construction): The 138 tasks and associated rubrics are described at a high level, but the manuscript provides no details on rubric derivation process, pilot validation against real-user outcomes, or measures of simulation fidelity for the 1M-token contexts. This weakens claims about the benchmark capturing genuine real-world agent performance.

Authors: We appreciate this observation and will substantially expand §3 in the revision. The updated section will detail the rubric derivation process, which involved iterative refinement based on real-world AI usage logs and expert review. We will describe the pilot validation conducted with actual users to ensure rubrics reflect realistic outcomes. Additionally, we will report quantitative measures of simulation fidelity, such as the agreement rate between simulated user responses and human responses in sampled interactions, particularly for long-context scenarios. This will provide stronger evidence for the benchmark's real-world relevance.

Revision: yes

Referee: §5 (Analysis of Self-Correction and Tool Use): The reported disparities in feedback-driven self-correction and tool-use preferences across model families are presented without quantitative ablations isolating the contribution of the simulator versus intrinsic model differences; this makes it difficult to attribute the gaps solely to agent capability.

Authors: We acknowledge the need for clearer isolation of factors in our analysis. In the revised manuscript, we will enhance §5 with additional ablations. Specifically, we will include experiments where the same models are evaluated under both the simulator and a human-in-the-loop setup on a subset of tasks to quantify the simulator's influence. We will also present results controlling for the simulator by using fixed feedback templates derived from human data. These ablations will help attribute observed disparities more confidently to differences in model capabilities.

Revision: yes
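The Pearson and Spearman coefficients mentioned in the first response could be computed along the following lines. This is a standard-library sketch on made-up ratings, not an analysis from the paper; the rating lists and function names are placeholders.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder ratings of the same five trajectories.
simulator_ratings = [1, 2, 3, 4, 5]
human_ratings = [1, 2, 3, 4, 10]

print(round(pearson(simulator_ratings, human_ratings), 4))
print(spearman(simulator_ratings, human_ratings))
```

Reporting both is useful because they answer different questions: in this toy data the two raters order the trajectories identically (Spearman of 1) while disagreeing on scale (Pearson below 1).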

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark with direct rubric measurements

The paper presents AgencyBench as an empirical evaluation framework consisting of 138 tasks with fixed queries, deliverables, and rubrics. Performance is computed directly via rubric-based assessment inside a Docker sandbox after iterative feedback from the simulation agent. No equations, fitted parameters, or derivations are present that reduce the reported scores (48.4% closed-source vs 32.1% open-source) to inputs by construction. The simulation agent is a methodological tool for automation rather than a self-referential component whose outputs are defined in terms of the measured results. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps. The derivation chain is therefore self-contained against external task definitions and rubrics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on standard assumptions about agent tool use and the fidelity of simulated users; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: LLM agents can execute multi-step tasks using external tools over long horizons.
Invoked in the definition of the six core agentic capabilities and task construction.

domain assumption: Simulated user feedback approximates human feedback for evaluation purposes.
Central to the automated evaluation pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5655 in / 1292 out tokens · 30666 ms · 2026-05-16T14:07:59.175984+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
cs.SE 2026-05 unverdicted novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

How to Interpret Agent Behavior
cs.AI 2026-05 conditional novelty 6.0

ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.

HMACE: Heterogeneous Multi-Agent Collaborative Evolution for Combinatorial Optimization
cs.AI 2026-05 unverdicted novelty 6.0

HMACE deploys Proposer, Generator, Evaluator, and Reflector agents in an evolutionary loop to generate and refine heuristics for NP-hard problems, reporting lower optimality gaps and token costs than baselines on TSP ...

Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems
cs.MA 2026-04 unverdicted novelty 6.0

Multi-agent systems amplify minor stochastic biases into systemic polarization via echo-chamber effects in structured workflows, even with neutral agents.

FileGram: Grounding Agent Personalization in File-System Behavioral Traces
cs.CV 2026-04 unverdicted novelty 6.0

FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
