pith. machine review for the scientific record.

arxiv: 2605.10912 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: no theorem link

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords agent evaluation · long-horizon tasks · CLI agents · native runtime · benchmark · frontier models · hybrid grading · multimodal tasks

The pith

No frontier model exceeds 62.2 percent on WildClawBench, a native-runtime test of long-horizon CLI agent tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WildClawBench to fill gaps left by synthetic, short-horizon agent benchmarks that rely on mock services and final-answer checks. It supplies 60 human-authored tasks that run inside reproducible Docker containers with genuine CLI harnesses and real tools, each task lasting roughly eight minutes and requiring over twenty tool calls. When nineteen frontier models are tested, Claude Opus 4.7 reaches the highest score of 62.2 percent under the OpenClaw harness, every other model stays below 60 percent, and simply swapping the harness can change one model's result by up to 18 points. These outcomes demonstrate that reliable completion of extended, real-world command-line work remains out of reach for current models.
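
To make the setup concrete, a minimal sketch of how one such task might be specified follows, assuming a simple dataclass schema. The field names, the example values, and the language pair are illustrative guesses, not the paper's released format.

```python
# Hypothetical sketch of a WildClawBench-style task specification.
# The schema and all example values below are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    task_id: str
    category: str                       # one of the six thematic categories
    languages: tuple[str, ...]          # the paper says bilingual; the pair is assumed here
    docker_image: str                   # reproducible native runtime for the task
    harness: str                        # e.g. "openclaw", "claude-code", "codex", "hermes-agent"
    wall_clock_budget_min: float        # tasks average roughly 8 minutes
    expected_min_tool_calls: int        # tasks average more than 20 tool calls
    rule_checks: list[str] = field(default_factory=list)   # deterministic checks
    state_audits: list[str] = field(default_factory=list)  # side-effect audits
    judge_rubric: str = ""              # prompt for the LLM/VLM semantic judge


example = TaskSpec(
    task_id="wiki-biography-extraction",
    category="data-analysis",
    languages=("en", "zh"),
    docker_image="wildclawbench/task-wiki:latest",
    harness="openclaw",
    wall_clock_budget_min=8.0,
    expected_min_tool_calls=20,
    rule_checks=["results/ directory exists", "one Markdown file per expected person"],
    state_audits=["no unexpected files created under results/"],
    judge_rubric="Does each saved biography match the ground-truth section content?",
)
```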

Core claim

WildClawBench is a benchmark of 60 human-authored, bilingual, multimodal tasks across six categories that execute natively inside Docker containers hosting actual CLI agent harnesses such as OpenClaw, Claude Code, Codex, or Hermes Agent. Each task averages eight minutes of wall-clock time and more than twenty tool calls with access to real tools. Grading combines deterministic rule-based checks, environment-state auditing of side effects, and an LLM or VLM judge for semantic verification. Across nineteen frontier models the highest score is 62.2 percent for Claude Opus 4.7 under OpenClaw, all others remain below 60 percent, and harness choice alone shifts individual model scores by as much as 18 points.
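
A minimal sketch of that hybrid grading logic follows, assuming the three components are combined conjunctively; the function names and the all-must-pass aggregation rule are assumptions, since the exact procedure is not spelled out here.

```python
# Sketch of hybrid grading: deterministic rule checks, an audit of environment
# side effects, and an LLM/VLM judge verdict. The aggregation rule is assumed.
from typing import Callable


def grade_task(
    rule_checks: list[Callable[[], bool]],
    state_audits: list[Callable[[], bool]],
    judge_semantics: Callable[[], bool],
) -> dict:
    rules_ok = all(check() for check in rule_checks)    # e.g. expected files exist
    state_ok = all(audit() for audit in state_audits)   # e.g. no stray side effects
    judge_ok = judge_semantics()                        # semantic verdict on content
    return {
        "rules": rules_ok,
        "state": state_ok,
        "judge": judge_ok,
        "success": rules_ok and state_ok and judge_ok,
    }


# Placeholder predicates stand in for real filesystem checks and a judge call.
verdict = grade_task(
    rule_checks=[lambda: True],
    state_audits=[lambda: True],
    judge_semantics=lambda: True,
)
print(verdict["success"])  # True for these placeholders
```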

What carries the argument

WildClawBench, the native-runtime benchmark that places tasks inside reproducible Docker containers running real CLI harnesses and applies hybrid grading of rules, state audits, and LLM/VLM judgment.

Load-bearing premise

The 60 human-authored tasks and hybrid grading procedure accurately represent the distribution and difficulty of real-world long-horizon CLI work without selection bias or judge error.
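
The judge-error half of this premise could be probed by scoring the LLM/VLM judge against human raters on held-out trajectories. A small sketch using Cohen's kappa on binary success labels; the labels and the choice of metric are assumptions, not results from the paper.

```python
# Sketch: agreement between an LLM/VLM judge and human raters on binary
# success labels, measured with Cohen's kappa. The label lists are invented.

def cohen_kappa(a: list[int], b: list[int]) -> float:
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)    # chance agreement
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)


judge_labels = [1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical judge verdicts
human_labels = [1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical human verdicts
print(f"judge-vs-human kappa: {cohen_kappa(judge_labels, human_labels):.2f}")
```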

What would settle it

A new frontier model that consistently completes more than 80 percent of the WildClawBench tasks across multiple harnesses and independent runs would indicate that native-runtime long-horizon evaluation is no longer unresolved.
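
That criterion is mechanical enough to check directly, assuming per-harness success rates from repeated independent runs are available; the score table below is invented for illustration.

```python
# Sketch of the settlement criterion: every harness and every independent run
# must exceed the 80% bar. The harness names and numbers are invented.

def settles_the_question(scores: dict[str, list[float]], bar: float = 0.80) -> bool:
    # scores maps harness name -> success rates from independent runs
    return all(rate > bar for runs in scores.values() for rate in runs)


hypothetical_scores = {
    "openclaw":    [0.83, 0.85, 0.82],
    "claude-code": [0.81, 0.84, 0.83],
    "codex":       [0.86, 0.82, 0.85],
}
print(settles_the_question(hypothetical_scores))  # True for this invented table
```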

original abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
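
For readers unfamiliar with native-runtime evaluation, a rough sketch of what per-task containerized execution could look like: each task gets its own container with the real harness installed. The image name and the harness command line are hypothetical; only the standard docker run flags are assumed.

```python
# Sketch of launching one task in its own Docker container. The image and the
# harness invocation are hypothetical; the docker flags themselves are standard.
import subprocess


def run_task_in_container(image: str, task_dir: str, harness_cmd: list[str],
                          timeout_s: int = 8 * 60) -> int:
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{task_dir}:/workspace",   # mount the task's working directory
        "-w", "/workspace",               # start the harness inside it
        image,
        *harness_cmd,
    ]
    completed = subprocess.run(cmd, timeout=timeout_s)
    return completed.returncode


# Hypothetical invocation; "openclaw --task task.md" is not a documented CLI.
# rc = run_task_in_container("wildclawbench/task-poster:latest",
#                            "/tmp/poster_task", ["openclaw", "--task", "task.md"])
```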

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WildClawBench, a native-runtime benchmark consisting of 60 human-authored, bilingual, multimodal CLI tasks spanning six categories. Each task runs in a reproducible Docker container with real tools (no mocks), averages ~8 minutes and >20 tool calls, and is graded via a hybrid procedure (rule-based checks + environment-state audit + LLM/VLM semantic judge). Experiments across 19 frontier models show Claude Opus 4.7 reaching 62.2% success under the OpenClaw harness while all others remain below 60%; switching harnesses alone can shift a model's score by up to 18 points. The authors conclude that long-horizon, native-runtime agent evaluation remains far from resolved and release the tasks, code, and containers for reproducibility.

Significance. If the tasks and hybrid grading faithfully capture real-world long-horizon CLI difficulty, the benchmark would provide a valuable, reproducible signal that current frontier models still struggle with extended, multi-step agent workflows in production-like environments. The release of containerized tooling and the demonstration of large harness sensitivity are concrete strengths that could accelerate progress measurement beyond synthetic or short-horizon suites.
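
The harness sensitivity singled out here is simply the per-model spread of scores across harnesses. A small sketch with invented numbers, not the paper's results.

```python
# Sketch: per-model spread across harnesses (max minus min score).
# The harness names and scores below are invented for illustration.

def harness_spread(scores_by_harness: dict[str, float]) -> float:
    values = list(scores_by_harness.values())
    return max(values) - min(values)


hypothetical_model_scores = {"openclaw": 58.0, "claude-code": 47.0, "codex": 40.0}
print(f"spread: {harness_spread(hypothetical_model_scores):.1f} points")  # 18.0
```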

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Task Construction): the central claim that 62.2% demonstrates a genuine capability gap rests on the 60 tasks accurately representing real-world long-horizon CLI distributions, yet no quantitative comparison to usage logs, no inter-annotator agreement statistics on task difficulty or success criteria, and no details on how the six thematic categories were sampled are provided.
  2. [§4] §4 (Evaluation Procedure): the hybrid grading (rule-based + state audit + LLM/VLM judge) is load-bearing for all reported numbers, but the manuscript supplies no calibration of the LLM/VLM judge against human raters on held-out trajectories, no inter-rater reliability for the semantic component, and no error analysis of judge disagreements.
  3. [§5] §5 (Results): the 18-point harness shift is presented as evidence of evaluation sensitivity, but without judge validation this same sensitivity could mean the 62.2% figure itself is inflated or deflated by systematic judge bias, directly affecting the 'far-from-resolved' conclusion.
minor comments (2)
  1. [Abstract] The abstract states clear performance numbers and release plans but defers all methodological detail; a short methods summary paragraph would improve readability.
  2. [Table 1] Table 1 (model results) would benefit from explicit confidence intervals or per-category breakdowns to clarify whether the 62.2% lead is robust across task types; a hedged sketch of such an interval follows this list.
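
As referenced in minor comment 2, a hedged sketch of one such interval: a 95% Wilson score interval around 62.2% over 60 tasks, assuming one attempt per task and independence across tasks (which the actual protocol may not satisfy).

```python
# Sketch: 95% Wilson score interval for 62.2% success over 60 tasks.
# Assumes a single binomial sample; the paper's run structure may differ.
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


low, high = wilson_interval(0.622, 60)
print(f"95% Wilson interval: [{low:.2f}, {high:.2f}]")  # roughly [0.50, 0.73]
```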

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the validation aspects of the manuscript without altering the core experimental results or conclusions.

point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Task Construction): the central claim that 62.2% demonstrates a genuine capability gap rests on the 60 tasks accurately representing real-world long-horizon CLI distributions, yet no quantitative comparison to usage logs, no inter-annotator agreement statistics on task difficulty or success criteria, and no details on how the six thematic categories were sampled are provided.

    Authors: We agree that explicit details on task construction would strengthen the manuscript. The six thematic categories were selected by domain experts to span representative real-world CLI scenarios (software engineering, system administration, data analysis, networking, security, and multimedia processing) drawn from common production workflows. We will revise §3 to document the sampling rationale, authoring process, and inter-annotator agreement statistics on success criteria (computed via multiple expert reviews). A direct quantitative comparison to usage logs is not possible because such logs are proprietary and not publicly available; however, the native-runtime execution with real tools and >20-step average horizon already provides a stronger proxy for real-world difficulty than synthetic benchmarks. This will be a partial revision focused on added documentation. revision: partial

  2. Referee: [§4] §4 (Evaluation Procedure): the hybrid grading (rule-based + state audit + LLM/VLM judge) is load-bearing for all reported numbers, but the manuscript supplies no calibration of the LLM/VLM judge against human raters on held-out trajectories, no inter-rater reliability for the semantic component, and no error analysis of judge disagreements.

    Authors: We acknowledge that systematic validation of the LLM/VLM judge is necessary for full confidence in the hybrid scores. We will add a dedicated subsection to §4 that reports calibration results comparing the judge to human raters on held-out trajectories, inter-rater reliability metrics for the semantic component, and an error analysis of disagreement cases. These additions will be based on additional analysis performed for the revision and will not change any of the primary experimental numbers. revision: yes

  3. Referee: [§5] §5 (Results): the 18-point harness shift is presented as evidence of evaluation sensitivity, but without judge validation this same sensitivity could mean the 62.2% figure itself is inflated or deflated by systematic judge bias, directly affecting the 'far-from-resolved' conclusion.

    Authors: The harness-sensitivity result and the 62.2% ceiling are both derived from the same hybrid grading pipeline. Once the judge calibration, reliability metrics, and error analysis are added to §4 as described above, we will update the discussion in §5 to explicitly reference these validation results when interpreting both the absolute scores and the harness-induced variance. This will confirm that the observed performance gap and sensitivity are not artifacts of unvalidated judge bias, thereby reinforcing rather than weakening the conclusion that long-horizon native-runtime evaluation remains far from resolved. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model evaluations

full rationale

The paper releases a benchmark of 60 human-authored tasks and reports success rates from running 19 frontier models on them under different harnesses, using hybrid grading. No mathematical derivations, fitted parameters, predictions, or self-citation chains are present. The central results (e.g., Claude Opus 4.7 at 62.2%) are obtained by direct execution on the released tasks and containers, making the work self-contained as an empirical evaluation release rather than a derived claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim that current models fall short rests on the assumption that the chosen tasks and evaluation protocol faithfully capture real deployment conditions; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption The 60 human-authored tasks are representative of realistic long-horizon CLI work
    The benchmark’s validity depends on this representativeness claim stated in the abstract.
  • domain assumption Hybrid rule-based plus LLM/VLM judging produces reliable success labels
    Grading method is described but not validated in the abstract.

pith-pipeline@v0.9.0 · 5606 in / 1321 out tokens · 53104 ms · 2026-05-12T03:35:32.764186+00:00 · methodology

discussion (0)

