pith. machine review for the scientific record.

arxiv: 2604.23781 · v2 · submitted 2026-04-26 · 💻 cs.CV · cs.SE

Recognition: unknown

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 06:33 UTC · model grok-4.3

classification 💻 cs.CV cs.SE
keywords multi-turn agents · multi-day workflows · stateful environments · agent benchmarks · coworker agents · deterministic evaluation · multimodal agents · exogenous updates

The pith

Current frontier AI agents fully complete at most 20% of multi-turn, multi-day coworker tasks when the surrounding environment evolves independently of the agent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark to evaluate AI agents that act as persistent coworkers over multiple working days. It incorporates tasks where services such as email, calendars, and files change on their own between agent actions. Evaluation relies on deterministic checkers applied to the final state of five sandboxed services rather than subjective judging. Results across seven models show substantial partial progress but rare full success, with a clear drop after the first exogenous update.

Core claim

The benchmark contains 100 tasks across 13 professional scenarios executed against five stateful services and scored by 1537 deterministic Python checkers. Benchmarking frontier agent systems yields a maximum weighted score of 75.8 yet only 20.0% strict task success. Turn-level analysis shows performance declines after the first exogenous environment update.
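
The abstract never states the weighted-score formula (the referee's first minor comment below flags this). A minimal sketch, assuming weighted score is the share of checker weight earned and strict Task Success requires every checker to pass; both assumptions are ours, not the paper's:

```python
from dataclasses import dataclass

@dataclass
class CheckerResult:
    name: str
    weight: float
    passed: bool

def weighted_score(results: list[CheckerResult]) -> float:
    """Partial credit: share of total checker weight earned, scaled to 0-100."""
    total = sum(r.weight for r in results)
    earned = sum(r.weight for r in results if r.passed)
    return 100.0 * earned / total if total else 0.0

def strict_success(results: list[CheckerResult]) -> bool:
    """All-or-nothing: a task counts only if every checker passes."""
    return all(r.passed for r in results)
```

Under a scheme like this, a model can earn most of the checker weight on every task (75.8) while strictly completing very few tasks end to end (20.0%), which is exactly the gap the headline numbers describe.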

What carries the argument

A stateful sandboxed service environment whose state evolves between turns independently of the agent, together with rule-based verification by deterministic Python checkers.
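
To make "deterministic checker" concrete, here is a toy sketch of a rule that inspects only final service state. Every type and the rule itself are invented for illustration; the released benchmark defines its own service interfaces:

```python
from dataclasses import dataclass, field

# Invented, simplified stand-ins for two of the five sandboxed services.
@dataclass
class Event:
    title: str
    weekday: str

@dataclass
class Email:
    to: str
    body: str

@dataclass
class FinalState:
    calendar: list[Event] = field(default_factory=list)
    outbox: list[Email] = field(default_factory=list)

def check_meeting_moved_and_client_notified(state: FinalState) -> bool:
    """Pass iff the claim-review meeting now sits on Friday AND a
    notification email reached the client. Pure inspection of
    post-execution state; no LLM-as-judge anywhere."""
    moved = any(e.title == "Claim review" and e.weekday == "Friday"
                for e in state.calendar)
    notified = any(m.to == "client@example.com" and "reschedul" in m.body.lower()
                   for m in state.outbox)
    return moved and notified
```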

Load-bearing premise

The 100 tasks, 13 scenarios, five stateful services, and 1537 deterministic checkers accurately capture the core challenges of real-world multi-day coworker agent performance in evolving environments.

What would settle it

An agent system that maintains above 50% strict task success across the full set of 100 tasks even after multiple independent service updates would indicate that adaptation to changing state is not the primary barrier.

Figures

Figures reproduced from arXiv: 2604.23781 by Ailing Yu, Bo Peng, Bowei Xia, Charles Chen, Chonghe Jiang, Cihang Xie, Fanqing Meng, Guanzheng Chen, Hannah Yao, Hao Sun, Haotian Liang, Jiaheng Zhang, Jiajun Chen, Jiajun Song, Jiaqi Liao, Jiawei Gu, Jiayuan Zhuo, Jinkai Huang, Ji Xie, Lingxiao Du, Linyu Wu, Liu Yang, Mengkang Hu, Michael Qizhe Shieh, Ming Xu, Pengfei Zhou, Qionglin Qiu, Rui Huang, Runhao Fu, Shengfang Zhai, Shengyuan Ding, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Xiangyan Liu, Yang Dai, Yan Wang, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Yuyin Zhou, Zeyu Zheng, Zhenglin Wan, Zhennan Shen, Zijian Wu, Ziqi Zhao.

Figure 1: ClawMark results overview. Left: main leaderboard across seven frontier models under the single-run protocol (§5.1); Claude Sonnet 4.6 leads at 75.8 weighted score and the top strict Task Success is 20.0; both metrics leave room to improve. Right: distribution of the 100 tasks across the 13 professional scenarios; the benchmark covers specialised domains including legal assistance, investment analysis, and…
Figure 2: Anatomy of a ClawMark task. Example: insurance_task5 (Enterprise Property Insurance Claim), a six-turn adjudication of a ¥1.2 M fire-damage claim with 22 weighted checkers across five backends; turns 1–3 are shown here; the remaining three turns follow the same template (wake-up prompt, loud/silent events, per-turn checkers). Each card is one in-universe working day. Coloured pills list the backends the tu…
Figure 3: ClawMark construction pipeline. Four phases: task authoring, task-driven evidence sourcing, a review loop (task review + trajectory review) that iterates 3–5 rounds per task, and a release gate. A task enters the release corpus only when all four release-gate conditions hold simultaneously. Phase 3: Review loop (3–5 rounds). Every task alternates between task review and trajectory review. Task review comb…
Figure 4: Day-by-day trajectory on the 73 tasks with exactly three turns. Day 2 is where the first external mutation lands: six of seven models drop there, while Qwen 3.6 Plus is the only model with a small Day-2 gain. By Day 3 recovery is partial, with most models still below their Day-1 baseline. The largest Day-1 → Day-2 d…
Figure 5: Implementation-level view of a ClawMark task. A task is defined by a compact file bundle: task.py specifies per-turn prompts, service seed hooks, and the checker rubric, while assets/ and inject/stage{k}/ (legacy field name; one entry per turn) provide static evidence and between-turn updates. The loader parses these files into runtime task objects, after which the orchestrator executes turns against the s…
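
Figure 5 names the bundle's moving parts. A hypothetical sketch of the shape task.py might take; only task.py, assets/, inject/stage{k}/, the per-turn prompts, seed hooks, and checker rubric come from the caption, and every concrete value below is invented:

```python
# Hypothetical task bundle, shaped after Figure 5's caption.
TASK = {
    "task_id": "insurance_task5",
    "seed_hooks": {                       # populate initial service state
        "email": "seed_claimant_inbox",   # invented hook names
        "filesystem": "stage_assets",     # e.g. copy evidence from assets/
    },
    "turns": [
        {
            "prompt": "Day 1: open the fire-damage claim and ...",  # wake-up prompt
            "inject": "inject/stage1/",   # between-turn updates (legacy field name)
            "checkers": [                 # per-turn rubric (invented entries)
                {"name": "claim_record_created", "weight": 2.0},
                {"name": "adjuster_notified_by_email", "weight": 1.0},
            ],
        },
        # ... one entry per in-universe working day
    ],
}
```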
Original abstract

Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches 75.8 weighted score, but the best strict Task Success is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.
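
The abstract implies a simple turn loop: mutate service state exogenously, let the agent act for one in-universe day, then score deterministically. A hedged sketch, with the harness machinery injected as callables because none of these names are the released harness's API:

```python
def evaluate_task(task, agent, services, *, apply_updates, run_agent, run_checkers):
    """Sketch of the implied loop; `task.turns` carries per-turn prompts,
    exogenous updates, and checkers (all illustrative attribute names)."""
    results = []
    for turn in task.turns:
        # State evolves independently of the agent: new emails arrive,
        # calendar entries shift, knowledge-base records are updated.
        apply_updates(services, turn.exogenous_updates)
        # The agent acts for one working day via service tools.
        run_agent(agent, turn.prompt, services)
        # Deterministic per-turn checkers inspect post-execution state only.
        results.extend(run_checkers(turn.checkers, services))
    # Weighted score and strict success follow from `results` as in the
    # scoring sketch earlier on this page.
    return results
```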

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ClawMark, a benchmark for multi-turn, multi-day multimodal coworker agents operating in a living-world setting. It features 100 tasks across 13 professional scenarios executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet), with scoring performed by 1537 deterministic Python checkers on post-execution state and no LLM judges. Seven frontier agent systems are evaluated; the strongest achieves a 75.8 weighted score but only 20.0% strict Task Success, with turn-level analysis indicating performance drops after the first exogenous environment update.

Significance. If the tasks and checkers are representative, the benchmark provides a valuable, reproducible platform for evaluating long-horizon agent adaptation in dynamic, multimodal environments, addressing a clear gap in existing static and text-centric evaluations. The open release of the benchmark, evaluation harness, and construction pipeline is a notable strength that enables direct inspection and extension by the community.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The manuscript provides no description of the process used to validate the 1537 deterministic checkers for correctness across state transitions or to ensure the 100 tasks require genuine multi-day adaptation to exogenous changes. This is load-bearing for the central empirical claim that strict end-to-end completion remains rare (20%) while partial progress is common.
  2. [§4.3] §4.3 (Turn-level Analysis): The reported performance drop after the first exogenous update is not accompanied by quantitative breakdowns (e.g., per-service or per-scenario deltas, or comparison to intra-turn baselines), making it difficult to confirm that adaptation to changing state is the primary gap rather than other factors such as initial planning or tool use.
minor comments (2)
  1. [Abstract] The abstract and §1 would benefit from a brief explicit statement of the weighted scoring formula to clarify how it differs from strict Task Success.
  2. [§4] Figure captions for the turn-level plots should include the exact number of turns analyzed and any filtering criteria applied to the 100 tasks.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We address each major comment below and describe the changes we will incorporate into the revised manuscript.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction): The manuscript provides no description of the process used to validate the 1537 deterministic checkers for correctness across state transitions or to ensure the 100 tasks require genuine multi-day adaptation to exogenous changes. This is load-bearing for the central empirical claim that strict end-to-end completion remains rare (20%) while partial progress is common.

    Authors: We agree that explicit documentation of the checker validation process and task design criteria would strengthen the paper. In the revised §3 we will add a dedicated subsection describing the validation workflow: each checker was subjected to unit tests exercising all relevant state transitions (including exogenous updates), followed by author-led manual inspection of 20% of tasks for semantic correctness. We will also report aggregate statistics on exogenous changes per task (mean 3.2 updates across the 100 tasks) together with two concrete task examples that illustrate required multi-day adaptation, thereby supporting the claim that low strict success rates reflect adaptation challenges rather than checker artifacts (a toy sketch of such a checker unit test follows this exchange). revision: yes

  2. Referee: §4.3 (Turn-level Analysis): The reported performance drop after the first exogenous update is not accompanied by quantitative breakdowns (e.g., per-service or per-scenario deltas, or comparison to intra-turn baselines), making it difficult to confirm that adaptation to changing state is the primary gap rather than other factors such as initial planning or tool use.

    Authors: We accept that the current turn-level analysis would be more convincing with finer-grained quantitative support. In the revised §4.3 we will insert two new tables: one showing per-service and per-scenario success-rate deltas between the turn immediately preceding and following the first exogenous update, and a second comparing overall task success on the subset of tasks that contain at least one exogenous change versus an intra-turn baseline constructed from the same tasks with updates artificially removed. These additions will allow readers to assess whether the observed drop is primarily attributable to state adaptation (a sketch of the delta computation follows below). revision: yes
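
Response 1 describes unit tests that exercise each checker across state transitions. A toy version, reusing the invented FinalState/Event/Email types and checker from the sketch earlier on this page:

```python
def test_checker_requires_both_conditions():
    # Empty final state: the checker must fail.
    state = FinalState()
    assert not check_meeting_moved_and_client_notified(state)

    # Calendar updated, but the client never emailed: still a fail.
    state.calendar.append(Event(title="Claim review", weekday="Friday"))
    assert not check_meeting_moved_and_client_notified(state)

    # Both conditions hold in the final state: the checker passes.
    state.outbox.append(Email(to="client@example.com",
                              body="Your claim review was rescheduled to Friday."))
    assert check_meeting_moved_and_client_notified(state)
```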
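Response 2's proposed tables reduce to a small group-by over per-turn results. A sketch assuming a frame with one row per (task, turn) and illustrative column names scenario, turn, passed (0/1), and first_update_turn:

```python
import pandas as pd

def pre_post_update_deltas(df: pd.DataFrame) -> pd.Series:
    """Per-scenario success-rate delta between the turn immediately
    before and the turn immediately after the first exogenous update."""
    pre = df[df["turn"] == df["first_update_turn"] - 1]
    post = df[df["turn"] == df["first_update_turn"]]
    return (post.groupby("scenario")["passed"].mean()
            - pre.groupby("scenario")["passed"].mean()).rename("delta")
```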

Circularity Check

0 steps flagged

No significant circularity

Full rationale

This is an empirical benchmark paper with no derivations, equations, fitted parameters, or predictive claims. The central results (75.8 weighted score, 20% strict success, performance drop after exogenous updates) are direct measurements obtained by running seven agent systems against the released 100 tasks, 1537 deterministic checkers, and five stateful services. Task definitions, scoring logic, and state transitions are explicitly constructed and released for inspection; no step reduces to a self-definition, a fitted input renamed as prediction, or a load-bearing self-citation chain. The paper is self-contained against its own released artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claims rest on the unstated premise that the chosen tasks and checkers validly represent coworker agent challenges; no free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5742 in / 1083 out tokens · 26471 ms · 2026-05-08T06:33:29.883820+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 13 canonical work pages · 11 internal anchors

  1. [1]

    τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  2. [2]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024

  3. [3]

    TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

    Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. arXiv preprint arXiv:2412.14161, 2024

  4. [4]

    Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 881–905, 2024

  5. [5]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  6. [6]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023

  7. [7]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026

  8. [8]

    MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

    Zijian Wu, Xiangyan Liu, Xinyuan Zhang, Lingjun Chen, Fanqing Meng, Lingxiao Du, Yiran Zhao, Fanshi Zhang, Yaoqi Ye, Jiawei Wang, et al. Mcpmark: A benchmark for stress-testing realistic and comprehensive mcp use. arXiv preprint arXiv:2509.24002, 2025

  9. [9]

    Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

    Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An, Lei Li, Lingpeng Kong, et al. Claw-eval: Toward trustworthy evaluation of autonomous agents. arXiv preprint arXiv:2604.06132, 2026

  10. [10]

    ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

    Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, et al. Clawsbench: Evaluating capability and safety of llm productivity agents in simulated workspaces. arXiv preprint arXiv:2604.05172, 2026

  11. [11]

    Mind2Web: Towards a Generalist Agent for the Web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  12. [12]

    MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

    Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. Mcp-bench: Benchmarking tool-using llm agents with complex real-world tasks via mcp servers. arXiv preprint arXiv:2508.20453, 2025

  13. [13]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  14. [14]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023

  15. [15]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations, 2023

  16. [16]

    ClawArena: Benchmarking AI Agents in Evolving Information Environments

    Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, et al. Clawarena: Benchmarking ai agents in evolving information environments. arXiv preprint arXiv:2604.04202, 2026

  17. [17]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems, 37:50528–50652, 2024

  18. [18]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024

  19. [19]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

  20. [20]

    CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008, 2023