pith. machine review for the scientific record. sign in

arxiv: 2604.11201 · v2 · submitted 2026-04-13 · 💻 cs.CL · cs.AI

Recognition: unknown

CocoaBench: Evaluating Unified Digital Agents in the Wild

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords unified digital agentsbenchmarkLLM agentslong-horizon taskstool usevisual groundingreasoningevaluation
0
0 comments X

The pith

A new benchmark of long-horizon tasks shows even the best unified digital agents succeed only 45.1 percent of the time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CocoaBench to evaluate unified digital agents on tasks that demand flexible mixing of vision, search, and coding over extended sequences. Earlier tests examined these skills in isolation, yet practical applications require agents to switch between them within one workflow. Results indicate top systems reach just 45.1 percent success, with breakdowns traced to shortfalls in reasoning and planning, tool calling and execution, and accurate visual interpretation. Tasks are defined only by natural-language instructions plus automatic checks on the final output, allowing consistent measurement across varied agent designs.

Core claim

Unified digital agents that integrate vision, search, and coding into single systems remain unreliable on diverse, long-horizon tasks. On CocoaBench, constructed from human-designed scenarios and scored by automatic evaluation of final outputs, the strongest evaluated agent achieves only 45.1 percent success. Analysis identifies clear gaps in reasoning and planning, tool use and execution, and visual grounding as primary barriers to higher performance.

What carries the argument

CocoaBench, a collection of long-horizon tasks given solely by an instruction and scored automatically on the final output state, together with CocoaAgent, a lightweight shared scaffold for comparing different model backbones.

If this is right

  • Agents require targeted advances in reasoning and planning to manage extended task sequences reliably.
  • Tool use and execution must improve in accuracy and robustness to raise overall success rates.
  • Visual grounding needs strengthening so agents can better interpret and act on screen content or images.
  • A shared lightweight scaffold enables controlled experiments that isolate the contribution of different model backbones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training approaches may need to emphasize joint optimization across vision, search, and coding rather than separate skill modules.
  • Automatic evaluation pipelines could be adapted to new domains to test whether the observed performance gaps generalize.
  • Persistent shortfalls suggest near-term applications may still depend on human oversight for critical steps.

Load-bearing premise

Human-designed tasks and their automatic evaluation functions accurately capture the capabilities needed for real-world unified agent use cases without introducing bias or false positives in success measurement.

What would settle it

A side-by-side comparison in which human experts complete a random sample of the same tasks and their success rate is measured against the automatic scoring to check for systematic over- or under-counting of correct outcomes.

Figures

Figures reproduced from arXiv: 2604.11201 by Boyuan Zheng, CocoaBench Team: Shibo Hao, Eric P. Xing, Feng Yao, Haoxiang Zhang, Hexi Jin, Jingbo Shang, Jixuan Chen, Julian McAuley, Junli Wang, Kun Zhou, Lianhui Qin, Licheng Liu, Pracha Promthaw, Qiyue Gao, Rupesh Kumar Srivastava, Tianyang Liu, Tommaso Cerruti, Xiaohan Fu, Yijiang Li, Yuheng Zha, Yu Wang, Zhengtao Han, Zhengzhong Liu, Zhifei Li, Zhining Zhang, Zhiqi Liang, Zhiting Hu, Zhoujun Cheng, Zilong Wang, Ziqiao Ma.

Figure 1
Figure 1. Figure 1: COCOABENCH evaluates agents on complex digital tasks that require flexible composition of core capabilities such as vision, search, coding. The shopping example shown here illustrates one such task and highlights the multi step, compositional nature of the benchmark. design outcome based proxy evaluators whose success strongly implies correct execution. This process to outcome transformation preserves open… view at source ↗
Figure 2
Figure 2. Figure 2: Statistics of COCOABENCH. (a) Distribution of task domains, covering a wide range of everyday topics. (b) Distribution of resource types required by the tasks. Docu￾ments include .csv, .pdf, and .bibtex. (c) Human-annotated key capabilities required for each task, including Vision, Search, and Coding. The Vision, Search, and Coding types are not mutually exclusive; numbers on the bars indicate the proporti… view at source ↗
Figure 3
Figure 3. Figure 3: Co-occurrence matrix of re￾quired capabilities (Vision, Search, and Coding). We evaluate representative agent systems on COCOABENCH to cover a range of agent designs and capability profiles. (1) ChatGPT Agent Mode (OpenAI, 2025b) is one of the earliest uni￾fied digital agents, with support for browsing, coding, and visual interaction in a sandbox en￾vironment. (2) OpenClaw (OpenClaw, 2026) is an open sourc… view at source ↗
Figure 4
Figure 4. Figure 4: Overall performance on COCOABENCH for representative agent systems and model backbones under the shared COCOA-AGENT scaffold. $0.5 $1 $1.5 $2 Average Cost per Task (USD) 0% 10% 20% 30% 40% 50% 60% CodeX 500 1000 1500 2000 2500 3000 3500 Average Time per Task (seconds) 0% 10% 20% 30% 40% 50% 60% CodeX Agent Cocoa Agent OpenClaw CodeX Claude Code Model GPT-5.4 Claude-Sonnet-4.6 Gemini-3.1-pro Gemini-Flash-3.… view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy-cost (left) and accuracy–time (right) trade-offs across agents. Marker [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Per-model distribution of tool calls across the three capability categories under [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 6
Figure 6. Figure 6: Top 10 most frequently used tools, ranked by total call count. Tool call distribution [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Left: Error type distribution across all six models evaluated on COCOABENCH using the COCOA-AGENT framework (based on 712 failure trajectories). Right: Comparisons of error distributions between GPT-5.4 and Kimi K2.5. causes are annotated by an LLM as judge (Claude Sonnet 4.6, grade prompt is provided in Appendix C.5). We organize the annotated error types into three classes. Reasoning & Planning (E1) desc… view at source ↗
Figure 9
Figure 9. Figure 9: Aggregate failure distribution across all 6 models evaluated under [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: COCOA-AGENT with Claude Sonnet 4.6 failures on COCOABENCH mapped to the failure taxonomy. The inner ring: Reasoning & Planning (50%), Tool & Execution (17%), Visual Grounding (33%). The outer ring details 9 active subcategories. Displayed percentages reflect each subcategory’s share of total failure-mode mentions across all 114 failed runs. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: COCOA-AGENT with GPT-5.4 failures on COCOABENCH mapped to the failure taxonomy. The inner ring: Reasoning & Planning (57%), Tool & Execution (11%), Visual Grounding (32%). The outer ring details 8 active subcategories. Displayed percentages reflect each subcategory’s share of the total failure-mode mentions across all 96 failed runs [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: COCOA-AGENT with Kimi-k2.5 failures on COCOABENCH mapped to the failure taxonomy. The inner ring: Reasoning & Planning (55%), Tool & Execution (15%), Visual Grounding (30%). The outer ring details 9 active subcategories. Displayed percentages reflect each subcategory’s share of the total failure-mode mentions across all 137 failed runs. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: COCOA-AGENT with Qwen3.5-397B failures on COCOABENCH mapped to the failure taxonomy. The inner ring: Reasoning & Planning (56%), Tool & Execution (19%), Vi￾sual Grounding (25%). The outer ring details 9 active subcategories. Displayed percentages reflect each subcategory’s share of the total failure-mode mentions across all 139 failed runs [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: COCOA-AGENT with Gemini 3.1 Pro Thinking failures on COCOABENCH mapped to the failure taxonomy. The inner ring: Reasoning & Planning (55%), Tool & Execution (20%), Visual Grounding (25%). The outer ring details 9 active subcategories. Displayed percentages reflect each subcategory’s share of the total failure-mode mentions across all 107 failed runs. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: COCOA-AGENT with Gemini 3 Flash failures on COCOABENCH mapped to the failure taxonomy. The inner ring: Reasoning & Planning (53%), Tool & Execution (17%), Visual Grounding (30%). The outer ring details 9 active subcategories. Displayed percentages reflect each subcategory’s share of the total failure-mode mentions across all 125 failed runs [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: OpenAI Codex failures on COCOABENCH mapped to the failure taxonomy. The inner ring: Reasoning & Planning (59%), Tool & Execution (5%), Visual Grounding (36%). The outer ring details 8 active subcategories. Displayed percentages reflect each subcategory’s share of total failure-mode mentions across all 86 failed runs. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗
read the original abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces CocoaBench, a benchmark consisting of human-designed long-horizon tasks that require agents to flexibly compose vision, search, and coding capabilities. Tasks are defined solely by a natural-language instruction plus an automatic evaluation function over the final output state, with the goal of enabling scalable and reliable assessment across different agent architectures. The authors also contribute CocoaAgent, a lightweight shared scaffold intended to support controlled comparisons. Experiments evaluate multiple agent systems and report that the strongest system reaches only 45.1% success; the analysis attributes the shortfall to deficiencies in reasoning/planning, tool use/execution, and visual grounding.

Significance. If the automatic evaluation functions are shown to be faithful, CocoaBench would fill a genuine gap by testing integrated rather than isolated agent capabilities and would supply a reproducible, scalable testbed. The reported performance ceiling and the capability-gap diagnosis would then constitute a useful empirical signal for the field. The decision to release tasks with only instruction-plus-final-predicate specifications is a methodological strength that supports broad adoption.

major comments (3)
  1. [§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation Protocol): No inter-rater agreement statistics, human validation study, or false-positive audit of the automatic evaluation functions are reported. Because success is determined exclusively by a final-output predicate on long-horizon trajectories, the 45.1% headline figure and the subsequent attribution of specific capability gaps rest on an unverified assumption that the predicates credit only trajectories that exercised the intended reasoning, tool-use, and visual-grounding steps.
  2. [§5 and Table 1] §5 (Experiments) and Table 1: The paper states that the best system achieves 45.1% success but does not report the total number of tasks, their distribution across capability categories, or any measure of task diversity or difficulty calibration. Without these quantities it is impossible to assess whether the claimed “substantial room for improvement” in reasoning, tool use, and visual grounding is proportionate to the data or driven by a small number of outlier tasks.
  3. [§6] §6 (Analysis): The diagnosis that agents fail primarily on reasoning/planning, tool execution, and visual grounding is derived from post-hoc inspection of final outputs. Because the evaluation supplies no process supervision or step-level logging, it remains possible that some accepted trajectories reached the predicate via shortcuts or partial tool misuse, which would inflate measured success and overstate the severity of the identified gaps.
minor comments (3)
  1. [Figure 2 and §5.2] Figure 2 and §5.2: The per-capability breakdown bars are not accompanied by confidence intervals or per-task variance, making it difficult to judge whether differences between agents are statistically meaningful.
  2. [§2] §2 (Related Work): Several recent GUI-agent and unified-agent papers from 2024 are omitted; adding them would better situate CocoaBench relative to contemporaneous benchmarks.
  3. [Appendix A] Appendix A: The task templates contain a small number of typographical inconsistencies in the instruction phrasing that could be cleaned up for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of benchmark validation and reporting that we will address in the revision. Below we respond point by point to the major comments, indicating planned changes to the manuscript.

read point-by-point responses
  1. Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation Protocol): No inter-rater agreement statistics, human validation study, or false-positive audit of the automatic evaluation functions are reported. Because success is determined exclusively by a final-output predicate on long-horizon trajectories, the 45.1% headline figure and the subsequent attribution of specific capability gaps rest on an unverified assumption that the predicates credit only trajectories that exercised the intended reasoning, tool-use, and visual-grounding steps.

    Authors: We agree that explicit validation of the automatic predicates is necessary to support the reliability claims. In the revised manuscript we will add a dedicated subsection in §4 describing a human validation study: two independent annotators will review a stratified sample of 100 trajectories (including both successes and failures) and compute inter-rater agreement (Cohen’s κ) as well as agreement with the automatic predicates. We will also report estimated false-positive rates for each predicate category. This addition directly addresses the concern that the 45.1% figure may be inflated by unverified shortcuts. revision: yes

  2. Referee: [§5 and Table 1] §5 (Experiments) and Table 1: The paper states that the best system achieves 45.1% success but does not report the total number of tasks, their distribution across capability categories, or any measure of task diversity or difficulty calibration. Without these quantities it is impossible to assess whether the claimed “substantial room for improvement” in reasoning, tool use, and visual grounding is proportionate to the data or driven by a small number of outlier tasks.

    Authors: We acknowledge the reporting gap. The current CocoaBench contains 200 tasks; we will insert a new table (or expanded Table 1) that lists the total count, the breakdown by primary capability (vision-only, search-only, coding-only, and integrated), and summary statistics on task length and required tool calls. We will also describe the difficulty calibration procedure used during task design (pilot testing with human experts to ensure no task is trivially solvable or impossible). These additions will allow readers to judge whether the observed performance ceiling reflects broad capability gaps. revision: yes

  3. Referee: [§6] §6 (Analysis): The diagnosis that agents fail primarily on reasoning/planning, tool execution, and visual grounding is derived from post-hoc inspection of final outputs. Because the evaluation supplies no process supervision or step-level logging, it remains possible that some accepted trajectories reached the predicate via shortcuts or partial tool misuse, which would inflate measured success and overstate the severity of the identified gaps.

    Authors: The analysis in §6 is indeed based on manual review of final states and error logs rather than step-level supervision, which is a deliberate design choice to keep evaluation scalable. We will revise §6 to (1) explicitly state this limitation, (2) describe the criteria used during inspection to flag likely shortcuts (e.g., final state achieved without evidence of required visual grounding), and (3) add a small-scale qualitative audit of 30 accepted trajectories to quantify how often shortcuts appear to have been used. While we cannot retroactively provide full process supervision for all runs, these clarifications and the added audit will temper the strength of the gap diagnosis. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain

full rationale

The paper introduces CocoaBench as a collection of human-designed long-horizon tasks specified by instructions plus automatic final-output predicates, together with a lightweight scaffold (CocoaAgent) for controlled evaluation. All reported results (e.g., 45.1 % success) are direct empirical measurements on externally defined tasks and scoring functions. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology; the work therefore contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that human-designed tasks with automatic final-output evaluators can reliably measure integrated agent performance; no free parameters, invented entities, or additional axioms are described in the abstract.

axioms (1)
  • domain assumption Tasks specified only by instruction plus automatic evaluation function over final output enable reliable and scalable evaluation across diverse agent infrastructures
    Stated directly in the abstract as the basis for the benchmark design

pith-pipeline@v0.9.0 · 5588 in / 1215 out tokens · 25416 ms · 2026-05-10T15:07:15.876998+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

    cs.CL 2026-05 unverdicted novelty 7.0

    LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  3. [3]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  4. [4]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...