Recognition: no theorem link
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Pith reviewed 2026-05-12 03:35 UTC · model grok-4.3
The pith
No frontier model exceeds 62.2 percent on WildClawBench, a native-runtime test of long-horizon CLI agent tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WildClawBench is a benchmark of 60 human-authored, bilingual, multimodal tasks across six categories that execute natively inside Docker containers hosting actual CLI agent harnesses such as OpenClaw, Claude Code, Codex, or Hermes Agent. Each task averages eight minutes of wall-clock time and more than twenty tool calls with access to real tools. Grading combines deterministic rule-based checks, environment-state auditing of side effects, and an LLM or VLM judge for semantic verification. Across nineteen frontier models the highest score is 62.2 percent for Claude Opus 4.7 under OpenClaw, all others remain below 60 percent, and harness choice alone shifts individual model scores by as much as 18 points.
What carries the argument
WildClawBench, the native-runtime benchmark that places tasks inside reproducible Docker containers running real CLI harnesses and applies hybrid grading of rules, state audits, and LLM/VLM judgment.
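The hybrid grading pipeline described above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the paper's actual grading code: all class, function, and field names here are hypothetical, and the conjunctive pass rule (all three components must agree) is an assumption about how the components combine.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GradeResult:
    """Outcome of the three grading components (names are illustrative)."""
    rules_pass: bool   # deterministic rule-based checks
    state_pass: bool   # environment-state audit of side effects
    judge_pass: bool   # LLM/VLM semantic verification

    @property
    def success(self) -> bool:
        # Assumed combination rule: a task counts as solved only if
        # every component agrees.
        return self.rules_pass and self.state_pass and self.judge_pass

def grade_task(rule_checks: list[Callable[[], bool]],
               audit_state: Callable[[], bool],
               judge_verdict: Callable[[], bool]) -> GradeResult:
    return GradeResult(
        rules_pass=all(check() for check in rule_checks),
        state_pass=audit_state(),
        judge_pass=judge_verdict(),
    )

# Toy example: rules and state audit pass, but the judge rejects
# the semantics, so the task fails overall.
result = grade_task(
    rule_checks=[lambda: True],   # e.g. "output file exists"
    audit_state=lambda: True,     # e.g. "no unexpected side effects"
    judge_verdict=lambda: False,  # e.g. judge rejects the content
)
print(result.success)  # False
```

The conjunctive rule makes each component load-bearing, which is why the referee's concerns about judge calibration propagate to every reported score.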
Load-bearing premise
The 60 human-authored tasks and hybrid grading procedure accurately represent the distribution and difficulty of real-world long-horizon CLI work without selection bias or judge error.
What would settle it
A new frontier model that consistently completes more than 80 percent of the WildClawBench tasks across multiple harnesses and independent runs would indicate that native-runtime, long-horizon evaluation is no longer an open problem for current frontier models.
read the original abstract
Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WildClawBench, a native-runtime benchmark consisting of 60 human-authored, bilingual, multimodal CLI tasks spanning six categories. Each task runs in a reproducible Docker container with real tools (no mocks), averages ~8 minutes and >20 tool calls, and is graded via a hybrid procedure (rule-based checks + environment-state audit + LLM/VLM semantic judge). Experiments across 19 frontier models show Claude Opus 4.7 reaching 62.2% success under the OpenClaw harness while all others remain below 60%; switching harnesses alone can shift a model's score by up to 18 points. The authors conclude that long-horizon, native-runtime agent evaluation remains far from resolved and release the tasks, code, and containers for reproducibility.
Significance. If the tasks and hybrid grading faithfully capture real-world long-horizon CLI difficulty, the benchmark would provide a valuable, reproducible signal that current frontier models still struggle with extended, multi-step agent workflows in production-like environments. The release of containerized tooling and the demonstration of large harness sensitivity are concrete strengths that could accelerate progress measurement beyond synthetic or short-horizon suites.
major comments (3)
- [Abstract, §3] Abstract and §3 (Task Construction): the central claim that 62.2% demonstrates a genuine capability gap rests on the 60 tasks accurately representing real-world long-horizon CLI distributions, yet no quantitative comparison to usage logs, no inter-annotator agreement statistics on task difficulty or success criteria, and no details on how the six thematic categories were sampled are provided.
- [§4] §4 (Evaluation Procedure): the hybrid grading (rule-based + state audit + LLM/VLM judge) is load-bearing for all reported numbers, but the manuscript supplies no calibration of the LLM/VLM judge against human raters on held-out trajectories, no inter-rater reliability for the semantic component, and no error analysis of judge disagreements.
- [§5] §5 (Results): the 18-point harness shift is presented as evidence of evaluation sensitivity, but without judge validation this same sensitivity could mean the 62.2% figure itself is inflated or deflated by systematic judge bias, directly affecting the 'far-from-resolved' conclusion.
minor comments (2)
- [Abstract] The abstract states clear performance numbers and release plans but defers all methodological detail; a short methods summary paragraph would improve readability.
- [Table 1] Table 1 (model results) would benefit from explicit confidence intervals or per-category breakdowns to clarify whether the 62.2% lead is robust across task types.
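The minor comment about confidence intervals is easy to quantify: with only 60 tasks, a single success rate carries substantial sampling uncertainty. The sketch below computes a 95% Wilson score interval; the 37-success count is an illustrative reconstruction of roughly 62.2% on 60 tasks, not a figure reported in the paper.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Roughly 62.2% of 60 tasks ~ 37 successes (illustrative count).
lo, hi = wilson_interval(37, 60)
print(f"95% CI: {lo:.3f} - {hi:.3f}")
```

The resulting interval spans roughly 24 points, wide enough that the gap between the leader and the sub-60% pack could plausibly vanish under resampling, which supports the referee's request for explicit intervals or per-category breakdowns.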
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the validation aspects of the manuscript without altering the core experimental results or conclusions.
read point-by-point responses
-
Referee: [Abstract, §3] Abstract and §3 (Task Construction): the central claim that 62.2% demonstrates a genuine capability gap rests on the 60 tasks accurately representing real-world long-horizon CLI distributions, yet no quantitative comparison to usage logs, no inter-annotator agreement statistics on task difficulty or success criteria, and no details on how the six thematic categories were sampled are provided.
Authors: We agree that explicit details on task construction would strengthen the manuscript. The six thematic categories were selected by domain experts to span representative real-world CLI scenarios (software engineering, system administration, data analysis, networking, security, and multimedia processing) drawn from common production workflows. We will revise §3 to document the sampling rationale, authoring process, and inter-annotator agreement statistics on success criteria (computed via multiple expert reviews). A direct quantitative comparison to usage logs is not possible because such logs are proprietary and not publicly available; however, the native-runtime execution with real tools and >20-step average horizon already provides a stronger proxy for real-world difficulty than synthetic benchmarks. This will be a partial revision focused on added documentation. revision: partial
-
Referee: [§4] §4 (Evaluation Procedure): the hybrid grading (rule-based + state audit + LLM/VLM judge) is load-bearing for all reported numbers, but the manuscript supplies no calibration of the LLM/VLM judge against human raters on held-out trajectories, no inter-rater reliability for the semantic component, and no error analysis of judge disagreements.
Authors: We acknowledge that systematic validation of the LLM/VLM judge is necessary for full confidence in the hybrid scores. We will add a dedicated subsection to §4 that reports calibration results comparing the judge to human raters on held-out trajectories, inter-rater reliability metrics for the semantic component, and an error analysis of disagreement cases. These additions will be based on additional analysis performed for the revision and will not change any of the primary experimental numbers. revision: yes
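The inter-rater reliability the authors commit to reporting is typically summarized with a chance-corrected agreement statistic. A minimal sketch of Cohen's kappa for binary pass/fail labels follows; the trajectory labels are made up for illustration and do not come from the paper.

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary raters (e.g. LLM judge vs. human)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa1 = sum(a) / n                      # rater A's pass rate
    pb1 = sum(b) / n                      # rater B's pass rate
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)  # chance agreement
    return (observed - expected) / (1 - expected)

# Toy pass/fail labels on 10 held-out trajectories (illustrative only).
judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
print(round(cohens_kappa(judge, human), 3))  # 0.583
```

Raw agreement alone (here 80%) overstates reliability when pass rates are unbalanced, which is why the referee asks for chance-corrected metrics rather than a simple match rate.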
-
Referee: [§5] §5 (Results): the 18-point harness shift is presented as evidence of evaluation sensitivity, but without judge validation this same sensitivity could mean the 62.2% figure itself is inflated or deflated by systematic judge bias, directly affecting the 'far-from-resolved' conclusion.
Authors: The harness-sensitivity result and the 62.2% ceiling are both derived from the same hybrid grading pipeline. Once the judge calibration, reliability metrics, and error analysis are added to §4 as described above, we will update the discussion in §5 to explicitly reference these validation results when interpreting both the absolute scores and the harness-induced variance. This will confirm that the observed performance gap and sensitivity are not artifacts of unvalidated judge bias, thereby reinforcing rather than weakening the conclusion that long-horizon native-runtime evaluation remains far from resolved. revision: partial
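One way to check whether an 18-point harness shift exceeds sampling noise on 60 tasks is a paired bootstrap over tasks. The sketch below uses made-up per-task outcome vectors with an ~18-point gap; it is an illustration of the method, not an analysis from the paper.

```python
import random

def bootstrap_shift_ci(outcomes_a: list[int], outcomes_b: list[int],
                       iters: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """95% bootstrap CI for the success-rate gap between two harnesses
    evaluated on the same per-task 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes_a)
    diffs = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample tasks with replacement
        gap = sum(outcomes_a[i] - outcomes_b[i] for i in idx) / n
        diffs.append(gap)
    diffs.sort()
    return diffs[int(0.025 * iters)], diffs[int(0.975 * iters)]

# Toy data: 60 tasks where harness A solves 37 and harness B solves 26,
# an ~18-point gap (counts are illustrative, not taken from the paper).
a = [1] * 37 + [0] * 23
b = [1] * 26 + [0] * 34
lo, hi = bootstrap_shift_ci(a, b)
print(f"gap CI: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the harness effect is unlikely to be resampling noise, though systematic judge bias (the referee's concern) would shift both endpoints rather than widen them.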
Circularity Check
No circularity: empirical benchmark with direct model evaluations
full rationale
The paper releases a benchmark of 60 human-authored tasks and reports success rates from running 19 frontier models on them under different harnesses, using hybrid grading. No mathematical derivations, fitted parameters, predictions, or self-citation chains are present. The central results (e.g., Claude Opus 4.7 at 62.2%) are obtained by direct execution on the released tasks and containers, making the work self-contained as an empirical evaluation release rather than a derived claim.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The 60 human-authored tasks are representative of realistic long-horizon CLI work
- domain assumption Hybrid rule-based plus LLM/VLM judging produces reliable success labels
Reference graph
Works this paper leans on
-
[1]
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents. arXiv preprint arXiv:2410.09024, 2024
work page internal anchor Pith review arXiv 2024
-
[2]
Demystifying evals for AI agents, Jan 2026
Anthropic. Demystifying evals for AI agents, Jan 2026. URL https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
work page 2026
-
[3]
Introducing Claude Opus 4.6, February 2026
Anthropic. Introducing Claude Opus 4.6, February 2026. URL https://www.anthropic.com/news/claude-opus-4-6
work page 2026
-
[4]
Introducing Claude Opus 4.7, April 2026
Anthropic. Introducing Claude Opus 4.7, April 2026. URL https://www.anthropic.com/news/claude-opus-4-7
work page 2026
-
[5]
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264, 2024
-
[6]
Claude Code Team. Claude Code. https://github.com/anthropics/claude-code, 2026
work page 2026
-
[7]
Codex Team. Codex. https://github.com/openai/codex, 2026
work page 2026
-
[8]
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents. Advances in Neural Information Processing Systems, 37:82895–82920, 2024
work page 2024
-
[9]
Deepseek-v3.2: Pushing the frontier of open large language models, 2025
DeepSeek-AI. Deepseek-v3.2: Pushing the frontier of open large language models, 2025
work page 2025
-
[10]
Deepseek-v4: Towards highly efficient million-token context intelligence, 2026
DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026
work page 2026
-
[11]
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024
work page internal anchor Pith review arXiv 2024
-
[12]
Gemini 3.1 Pro: A smarter model for your most complex tasks, February 2026
Google. Gemini 3.1 Pro: A smarter model for your most complex tasks, February 2026. URL https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
work page 2026
-
[13]
Hermes Team. Hermes. https://github.com/nousresearch/hermes-agent, 2026
work page 2026
-
[14]
Os agents: A survey on mllm-based agents for computer, phone and browser use
Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, et al. Os agents: A survey on mllm-based agents for computer, phone and browser use. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7436–7465, 2025
work page 2025
-
[15]
Understanding the planning of LLM agents: A survey
Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey. arXiv preprint arXiv:2402.02716, 2024
work page internal anchor Pith review arXiv 2024
-
[16]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
Pinchbench: Benchmarking system for evaluating LLM models as OpenClaw coding agents
Kilo AI team. Pinchbench: Benchmarking system for evaluating LLM models as OpenClaw coding agents. https://github.com/pinchbench/skill, 2026. GitHub repository
work page 2026
-
[20]
Visualwebarena: Evaluating multimodal agents on realistic visual web tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024
work page 2024
-
[21]
Api-bank: A comprehensive benchmark for tool-augmented llms
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 3102–3116, 2023
work page 2023
-
[22]
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[25]
Gaia: a benchmark for general ai assistants
Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[26]
Minimax Team. Minimax-m2.5. https://www.minimax.io/news/minimax-m25, 2026
work page 2026
-
[27]
Minimax Team. Minimax-m2.7. https://huggingface.co/MiniMaxAI/MiniMax-M2.7, 2026
work page 2026
-
[28]
Introducing gpt‑5.4, March 2026
OpenAI. Introducing gpt‑5.4, March 2026. URL https://openai.com/index/introducing-gpt-5-4/
work page 2026
-
[29]
Introducing gpt‑5.5, April 2026
OpenAI. Introducing gpt‑5.5, April 2026. URL https://openai.com/index/introducing-gpt-5-5/
work page 2026
-
[31]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[32]
Qwen3.5: Towards native multimodal agents, February 2026
Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5
work page 2026
-
[33]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024
work page internal anchor Pith review arXiv 2024
-
[34]
Identifying the Risks of LM Agents with an LM-Emulated Sandbox
Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox. arXiv preprint arXiv:2309.15817, 2023
work page internal anchor Pith review arXiv 2023
-
[35]
Toolformer: Language models can teach themselves to use tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539–68551, 2023
work page 2023
-
[36]
StepFun Team. step-3.5-flash. https://static.stepfun.com/blog/step-3.5-flash/, 2026
work page 2026
-
[37]
Appworld: A controllable world of apps and people for benchmarking interactive coding agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) ,...
work page 2024
-
[38]
Odysseybench: Evaluating llm agents on long-horizon complex office application workflows
Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, and Saravan Rajmohan. Odysseybench: Evaluating llm agents on long-horizon complex office application workflows. arXiv preprint arXiv:2508.09124, 2025
-
[39]
Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. Mcp-bench: Benchmarking tool-using llm agents with complex real-world tasks via mcp servers. arXiv preprint arXiv:2508.20453 , 2025
-
[40]
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv preprint arXiv:2504.12516, 2025
work page internal anchor Pith review arXiv 2025
-
[42]
Xiaomi MiMo Team. Mimo-v2-flash. https://mimo.xiaomi.com/blog/mimo-v2-flash , 2025
work page 2025
-
[43]
Xiaomi MiMo Team. Mimo-v2.5-pro. https://huggingface.co/collections/XiaomiMiMo/mimo-v25, 2026
work page 2026
-
[44]
Xiaomi MiMo Team. Mimo-v2-pro. https://mimo.xiaomi.com/mimo-v2-pro , 2026
work page 2026
-
[45]
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments
Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024
work page 2024
-
[46]
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. Theagentcompany: benchmarking llm agents on consequential real world tasks. arXiv preprint arXiv:2412.14161 , 2024
work page internal anchor Pith review arXiv 2024
-
[47]
Swe-agent: Agent-computer interfaces enable automated software engineering
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems , 37:50528–50652, 2024
work page 2024
-
[48]
Webshop: Towards scalable real-world web interaction with grounded language agents
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems , 35: 20744–20757, 2022
work page 2022
-
[49]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022
work page 2022
-
[50]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, et al. Claw-eval: Toward trustworthy evaluation of autonomous agents. arXiv preprint arXiv:2604.06132, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[52]
GLM-5: from Vibe Coding to Agentic Engineering
Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[53]
Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024 , pages 10471–10506, 2024
work page 2024
-
[54]
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Hanrong Zhang, Jingyuan Huang, Kai Mei, Yifei Yao, Zhenting Wang, Chenlu Zhan, Hongwei Wang, and Yongfeng Zhang. Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents. arXiv preprint arXiv:2410.02644, 2024
work page internal anchor Pith review arXiv 2024
-
[55]
ClawBench: Can AI Agents Complete Everyday Online Tasks?
Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, et al. Clawbench: Can ai agents complete everyday online tasks? arXiv preprint arXiv:2604.08523, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[56]
A survey on the memory mechanism of large language model-based agents
Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025
work page 2025
-
[57]
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-safetybench: Evaluating the safety of llm agents. arXiv preprint arXiv:2412.14470 , 2024
work page internal anchor Pith review arXiv 2024
-
[58]
Glm-5.1: Towards long-horizon tasks, April 2026
Zhipu. Glm-5.1: Towards long-horizon tasks, April 2026. URL https://z.ai/blog/glm-5.1/
work page 2026
-
[59]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
discussion (0)