pith. machine review for the scientific record.

arxiv: 2605.10912 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: no theorem link

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords agent evaluation · long-horizon tasks · CLI agents · native runtime · benchmark · frontier models · hybrid grading · multimodal tasks

The pith

No frontier model exceeds 62.2 percent on WildClawBench, a native-runtime test of long-horizon CLI agent tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WildClawBench to fill gaps left by synthetic, short-horizon agent benchmarks that rely on mock services and final-answer checks. It supplies 60 human-authored tasks that run inside reproducible Docker containers with genuine CLI harnesses and real tools, each task lasting roughly eight minutes and requiring over twenty tool calls. When nineteen frontier models are tested, Claude Opus 4.7 reaches the highest score of 62.2 percent under the OpenClaw harness, every other model stays below 60 percent, and simply swapping the harness can change one model's result by up to 18 points. These outcomes demonstrate that reliable completion of extended, real-world command-line work remains out of reach for current models.
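
To make the setup concrete, a minimal sketch of how one such task might be specified follows, assuming a simple dataclass schema. The field names, the example values, and the language pair are illustrative guesses, not the paper's released format.

```python
# Hypothetical sketch of a WildClawBench-style task specification.
# The schema and all example values below are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    task_id: str
    category: str                       # one of the six thematic categories
    languages: tuple[str, ...]          # the paper says bilingual; the pair is assumed here
    docker_image: str                   # reproducible native runtime for the task
    harness: str                        # e.g. "openclaw", "claude-code", "codex", "hermes-agent"
    wall_clock_budget_min: float        # tasks average roughly 8 minutes
    expected_min_tool_calls: int        # tasks average more than 20 tool calls
    rule_checks: list[str] = field(default_factory=list)   # deterministic checks
    state_audits: list[str] = field(default_factory=list)  # side-effect audits
    judge_rubric: str = ""              # prompt for the LLM/VLM semantic judge


example = TaskSpec(
    task_id="wiki-biography-extraction",
    category="data-analysis",
    languages=("en", "zh"),
    docker_image="wildclawbench/task-wiki:latest",
    harness="openclaw",
    wall_clock_budget_min=8.0,
    expected_min_tool_calls=20,
    rule_checks=["results/ directory exists", "one Markdown file per expected person"],
    state_audits=["no unexpected files created under results/"],
    judge_rubric="Does each saved biography match the ground-truth section content?",
)
```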

Core claim

WildClawBench is a benchmark of 60 human-authored, bilingual, multimodal tasks across six categories that execute natively inside Docker containers hosting actual CLI agent harnesses such as OpenClaw, Claude Code, Codex, or Hermes Agent. Each task averages eight minutes of wall-clock time and more than twenty tool calls with access to real tools. Grading combines deterministic rule-based checks, environment-state auditing of side effects, and an LLM or VLM judge for semantic verification. Across nineteen frontier models the highest score is 62.2 percent for Claude Opus 4.7 under OpenClaw, all others remain below 60 percent, and harness choice alone shifts individual model scores by as much as 18 points.
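
A minimal sketch of that hybrid grading logic follows, assuming the three components are combined conjunctively; the function names and the all-must-pass aggregation rule are assumptions, since the exact procedure is not spelled out here.

```python
# Sketch of hybrid grading: deterministic rule checks, an audit of environment
# side effects, and an LLM/VLM judge verdict. The aggregation rule is assumed.
from typing import Callable


def grade_task(
    rule_checks: list[Callable[[], bool]],
    state_audits: list[Callable[[], bool]],
    judge_semantics: Callable[[], bool],
) -> dict:
    rules_ok = all(check() for check in rule_checks)    # e.g. expected files exist
    state_ok = all(audit() for audit in state_audits)   # e.g. no stray side effects
    judge_ok = judge_semantics()                        # semantic verdict on content
    return {
        "rules": rules_ok,
        "state": state_ok,
        "judge": judge_ok,
        "success": rules_ok and state_ok and judge_ok,
    }


# Placeholder predicates stand in for real filesystem checks and a judge call.
verdict = grade_task(
    rule_checks=[lambda: True],
    state_audits=[lambda: True],
    judge_semantics=lambda: True,
)
print(verdict["success"])  # True for these placeholders
```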

What carries the argument

WildClawBench, the native-runtime benchmark that places tasks inside reproducible Docker containers running real CLI harnesses and applies hybrid grading of rules, state audits, and LLM/VLM judgment.

Load-bearing premise

The 60 human-authored tasks and hybrid grading procedure accurately represent the distribution and difficulty of real-world long-horizon CLI work without selection bias or judge error.
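
The judge-error half of this premise could be probed by scoring the LLM/VLM judge against human raters on held-out trajectories. A small sketch using Cohen's kappa on binary success labels; the labels and the choice of metric are assumptions, not results from the paper.

```python
# Sketch: agreement between an LLM/VLM judge and human raters on binary
# success labels, measured with Cohen's kappa. The label lists are invented.

def cohen_kappa(a: list[int], b: list[int]) -> float:
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1, p_b1 = sum(a) / n, sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)    # chance agreement
    return 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)


judge_labels = [1, 1, 0, 1, 0, 0, 1, 1]   # hypothetical judge verdicts
human_labels = [1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical human verdicts
print(f"judge-vs-human kappa: {cohen_kappa(judge_labels, human_labels):.2f}")
```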

What would settle it

A new frontier model that consistently completes more than 80 percent of the WildClawBench tasks across multiple harnesses and independent runs would indicate that native-runtime long-horizon evaluation is no longer unresolved.
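
That criterion is mechanical enough to check directly, assuming per-harness success rates from repeated independent runs are available; the score table below is invented for illustration.

```python
# Sketch of the settlement criterion: every harness and every independent run
# must exceed the 80% bar. The harness names and numbers are invented.

def settles_the_question(scores: dict[str, list[float]], bar: float = 0.80) -> bool:
    # scores maps harness name -> success rates from independent runs
    return all(rate > bar for runs in scores.values() for rate in runs)


hypothetical_scores = {
    "openclaw":    [0.83, 0.85, 0.82],
    "claude-code": [0.81, 0.84, 0.83],
    "codex":       [0.86, 0.82, 0.85],
}
print(settles_the_question(hypothetical_scores))  # True for this invented table
```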

original abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
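
For readers unfamiliar with native-runtime evaluation, a rough sketch of what per-task containerized execution could look like: each task gets its own container with the real harness installed. The image name and the harness command line are hypothetical; only the standard docker run flags are assumed.

```python
# Sketch of launching one task in its own Docker container. The image and the
# harness invocation are hypothetical; the docker flags themselves are standard.
import subprocess


def run_task_in_container(image: str, task_dir: str, harness_cmd: list[str],
                          timeout_s: int = 8 * 60) -> int:
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{task_dir}:/workspace",   # mount the task's working directory
        "-w", "/workspace",               # start the harness inside it
        image,
        *harness_cmd,
    ]
    completed = subprocess.run(cmd, timeout=timeout_s)
    return completed.returncode


# Hypothetical invocation; "openclaw --task task.md" is not a documented CLI.
# rc = run_task_in_container("wildclawbench/task-poster:latest",
#                            "/tmp/poster_task", ["openclaw", "--task", "task.md"])
```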

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces WildClawBench, a native-runtime benchmark consisting of 60 human-authored, bilingual, multimodal CLI tasks spanning six categories. Each task runs in a reproducible Docker container with real tools (no mocks), averages ~8 minutes and >20 tool calls, and is graded via a hybrid procedure (rule-based checks + environment-state audit + LLM/VLM semantic judge). Experiments across 19 frontier models show Claude Opus 4.7 reaching 62.2% success under the OpenClaw harness while all others remain below 60%; switching harnesses alone can shift a model's score by up to 18 points. The authors conclude that long-horizon, native-runtime agent evaluation remains far from resolved and release the tasks, code, and containers for reproducibility.

Significance. If the tasks and hybrid grading faithfully capture real-world long-horizon CLI difficulty, the benchmark would provide a valuable, reproducible signal that current frontier models still struggle with extended, multi-step agent workflows in production-like environments. The release of containerized tooling and the demonstration of large harness sensitivity are concrete strengths that could accelerate progress measurement beyond synthetic or short-horizon suites.
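
The harness sensitivity singled out here is simply the per-model spread of scores across harnesses. A small sketch with invented numbers, not the paper's results.

```python
# Sketch: per-model spread across harnesses (max minus min score).
# The harness names and scores below are invented for illustration.

def harness_spread(scores_by_harness: dict[str, float]) -> float:
    values = list(scores_by_harness.values())
    return max(values) - min(values)


hypothetical_model_scores = {"openclaw": 58.0, "claude-code": 47.0, "codex": 40.0}
print(f"spread: {harness_spread(hypothetical_model_scores):.1f} points")  # 18.0
```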

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Task Construction): the central claim that 62.2% demonstrates a genuine capability gap rests on the 60 tasks accurately representing real-world long-horizon CLI distributions, yet no quantitative comparison to usage logs, no inter-annotator agreement statistics on task difficulty or success criteria, and no details on how the six thematic categories were sampled are provided.
  2. [§4] §4 (Evaluation Procedure): the hybrid grading (rule-based + state audit + LLM/VLM judge) is load-bearing for all reported numbers, but the manuscript supplies no calibration of the LLM/VLM judge against human raters on held-out trajectories, no inter-rater reliability for the semantic component, and no error analysis of judge disagreements.
  3. [§5] §5 (Results): the 18-point harness shift is presented as evidence of evaluation sensitivity, but without judge validation this same sensitivity could mean the 62.2% figure itself is inflated or deflated by systematic judge bias, directly affecting the 'far-from-resolved' conclusion.
minor comments (2)
  1. [Abstract] The abstract states clear performance numbers and release plans but defers all methodological detail; a short methods summary paragraph would improve readability.
  2. [Table 1] Table 1 (model results) would benefit from explicit confidence intervals or per-category breakdowns to clarify whether the 62.2% lead is robust across task types; a hedged sketch of such an interval follows this list.
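
As referenced in minor comment 2, a hedged sketch of one such interval: a 95% Wilson score interval around 62.2% over 60 tasks, assuming one attempt per task and independence across tasks (which the actual protocol may not satisfy).

```python
# Sketch: 95% Wilson score interval for 62.2% success over 60 tasks.
# Assumes a single binomial sample; the paper's run structure may differ.
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


low, high = wilson_interval(0.622, 60)
print(f"95% Wilson interval: [{low:.2f}, {high:.2f}]")  # roughly [0.50, 0.73]
```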

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the validation aspects of the manuscript without altering the core experimental results or conclusions.

point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Task Construction): the central claim that 62.2% demonstrates a genuine capability gap rests on the 60 tasks accurately representing real-world long-horizon CLI distributions, yet no quantitative comparison to usage logs, no inter-annotator agreement statistics on task difficulty or success criteria, and no details on how the six thematic categories were sampled are provided.

    Authors: We agree that explicit details on task construction would strengthen the manuscript. The six thematic categories were selected by domain experts to span representative real-world CLI scenarios (software engineering, system administration, data analysis, networking, security, and multimedia processing) drawn from common production workflows. We will revise §3 to document the sampling rationale, authoring process, and inter-annotator agreement statistics on success criteria (computed via multiple expert reviews). A direct quantitative comparison to usage logs is not possible because such logs are proprietary and not publicly available; however, the native-runtime execution with real tools and >20-step average horizon already provides a stronger proxy for real-world difficulty than synthetic benchmarks. This will be a partial revision focused on added documentation. revision: partial

  2. Referee: [§4] §4 (Evaluation Procedure): the hybrid grading (rule-based + state audit + LLM/VLM judge) is load-bearing for all reported numbers, but the manuscript supplies no calibration of the LLM/VLM judge against human raters on held-out trajectories, no inter-rater reliability for the semantic component, and no error analysis of judge disagreements.

    Authors: We acknowledge that systematic validation of the LLM/VLM judge is necessary for full confidence in the hybrid scores. We will add a dedicated subsection to §4 that reports calibration results comparing the judge to human raters on held-out trajectories, inter-rater reliability metrics for the semantic component, and an error analysis of disagreement cases. These additions will be based on additional analysis performed for the revision and will not change any of the primary experimental numbers. revision: yes

  3. Referee: [§5] §5 (Results): the 18-point harness shift is presented as evidence of evaluation sensitivity, but without judge validation this same sensitivity could mean the 62.2% figure itself is inflated or deflated by systematic judge bias, directly affecting the 'far-from-resolved' conclusion.

    Authors: The harness-sensitivity result and the 62.2% ceiling are both derived from the same hybrid grading pipeline. Once the judge calibration, reliability metrics, and error analysis are added to §4 as described above, we will update the discussion in §5 to explicitly reference these validation results when interpreting both the absolute scores and the harness-induced variance. This will confirm that the observed performance gap and sensitivity are not artifacts of unvalidated judge bias, thereby reinforcing rather than weakening the conclusion that long-horizon native-runtime evaluation remains far from resolved. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model evaluations

full rationale

The paper releases a benchmark of 60 human-authored tasks and reports success rates from running 19 frontier models on them under different harnesses, using hybrid grading. No mathematical derivations, fitted parameters, predictions, or self-citation chains are present. The central results (e.g., Claude Opus 4.7 at 62.2%) are obtained by direct execution on the released tasks and containers, making the work self-contained as an empirical evaluation release rather than a derived claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim that current models fall short rests on the assumption that the chosen tasks and evaluation protocol faithfully capture real deployment conditions; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption The 60 human-authored tasks are representative of realistic long-horizon CLI work
    The benchmark’s validity depends on this representativeness claim stated in the abstract.
  • domain assumption Hybrid rule-based plus LLM/VLM judging produces reliable success labels
    Grading method is described but not validated in the abstract.

pith-pipeline@v0.9.0 · 5606 in / 1321 out tokens · 53104 ms · 2026-05-12T03:35:32.764186+00:00 · methodology

discussion (0)

