CocoaBench: Evaluating Unified Digital Agents in the Wild
Pith reviewed 2026-05-10 15:07 UTC · model grok-4.3
The pith
A new benchmark of long-horizon tasks shows even the best unified digital agents succeed only 45.1 percent of the time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unified digital agents that integrate vision, search, and coding into single systems remain unreliable on diverse, long-horizon tasks. On CocoaBench, constructed from human-designed scenarios and scored by automatic evaluation of final outputs, the strongest evaluated agent achieves only 45.1 percent success. Analysis identifies clear gaps in reasoning and planning, tool use and execution, and visual grounding as primary barriers to higher performance.
What carries the argument
CocoaBench, a collection of long-horizon tasks given solely by an instruction and scored automatically on the final output state, together with CocoaAgent, a lightweight shared scaffold for comparing different model backbones.
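To make the task format concrete, the following is a minimal sketch of what an instruction-plus-final-predicate specification could look like. The `Task` dataclass, the `score` helper, and the example predicate are illustrative assumptions; the paper's actual schema is not published in the material above.

```python
# Hypothetical sketch of an instruction-plus-final-predicate task record.
# Field names and the Task/score interface are illustrative, not the paper's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    task_id: str
    instruction: str                   # the only specification the agent receives
    evaluate: Callable[[dict], bool]   # automatic predicate over the final output state

def score(task: Task, final_state: dict) -> bool:
    # Success is decided solely by the predicate on the final output;
    # intermediate steps are never inspected.
    return task.evaluate(final_state)

# Illustrative task: the agent must leave a results.csv whose header has a 'score' column.
example = Task(
    task_id="demo-csv",
    instruction="Fetch the dataset and save it as results.csv with a 'score' column.",
    evaluate=lambda files: "score" in files.get("results.csv", "").split("\n")[0],
)
```

The design choice is visible in `score`: only the final output state is ever inspected, which is what makes the evaluation portable across arbitrary agent infrastructures.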
If this is right
- Agents require targeted advances in reasoning and planning to manage extended task sequences reliably.
- Tool use and execution must improve in accuracy and robustness to raise overall success rates.
- Visual grounding needs strengthening so agents can better interpret and act on screen content or images.
- A shared lightweight scaffold enables controlled experiments that isolate the contribution of different model backbones.
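On the last point, a shared scaffold can be sketched as a single control loop that is identical for every backbone, so that observed differences in success rate are attributable to the model rather than the harness. This is not CocoaAgent's actual implementation; the `Backbone` protocol, the action string format, and the tool interface below are illustrative assumptions.

```python
# A generic sketch of a lightweight shared scaffold. The same loop is reused
# for every model backbone; only `backbone` varies between experiments.
from typing import Callable, Protocol

class Backbone(Protocol):
    """Any model backbone that maps (instruction, history) to the next action string."""
    def next_action(self, instruction: str, history: list[str]) -> str: ...

def run_episode(backbone: Backbone, instruction: str,
                tools: dict[str, Callable[[str], str]], max_steps: int = 50) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        action = backbone.next_action(instruction, history)
        if action.startswith("FINISH"):
            break
        name, _, arg = action.partition(" ")      # e.g. "web_search cocoa prices"
        handler = tools.get(name)
        observation = handler(arg) if handler else f"unknown tool: {name}"
        history.append(f"{action} -> {observation}")
    return history
```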
Where Pith is reading between the lines
- Training approaches may need to emphasize joint optimization across vision, search, and coding rather than separate skill modules.
- Automatic evaluation pipelines could be adapted to new domains to test whether the observed performance gaps generalize.
- Persistent shortfalls suggest near-term applications may still depend on human oversight for critical steps.
Load-bearing premise
Human-designed tasks and their automatic evaluation functions accurately capture the capabilities needed for real-world unified agent use cases without introducing bias or false positives in success measurement.
What would settle it
A side-by-side comparison in which human experts complete a random sample of the same tasks and their success rate is measured against the automatic scoring to check for systematic over- or under-counting of correct outcomes.
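Such a comparison reduces to a simple paired audit. The sketch below, with illustrative variable names, computes the two quantities that would settle the question: `overcount_rate` measures how often the automatic predicate credits an outcome human experts reject, and `undercount_rate` the reverse.

```python
# Minimal sketch of the settling experiment: paired automatic-vs-human verdicts
# on the same sampled tasks. Names are illustrative.
def audit(auto: list[bool], human: list[bool]) -> dict[str, float]:
    assert len(auto) == len(human) and auto, "paired, non-empty verdict lists"
    n = len(auto)
    fp = sum(a and not h for a, h in zip(auto, human))  # predicate credits, human rejects
    fn = sum(h and not a for a, h in zip(auto, human))  # human credits, predicate rejects
    return {
        "auto_success_rate": sum(auto) / n,
        "human_success_rate": sum(human) / n,
        "overcount_rate": fp / n,   # systematic inflation of the headline number
        "undercount_rate": fn / n,  # systematic deflation
    }
```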
Original abstract
LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CocoaBench, a benchmark consisting of human-designed long-horizon tasks that require agents to flexibly compose vision, search, and coding capabilities. Tasks are defined solely by a natural-language instruction plus an automatic evaluation function over the final output state, with the goal of enabling scalable and reliable assessment across different agent architectures. The authors also contribute CocoaAgent, a lightweight shared scaffold intended to support controlled comparisons. Experiments evaluate multiple agent systems and report that the strongest system reaches only 45.1% success; the analysis attributes the shortfall to deficiencies in reasoning/planning, tool use/execution, and visual grounding.
Significance. If the automatic evaluation functions are shown to be faithful, CocoaBench would fill a genuine gap by testing integrated rather than isolated agent capabilities and would supply a reproducible, scalable testbed. The reported performance ceiling and the capability-gap diagnosis would then constitute a useful empirical signal for the field. The decision to release tasks with only instruction-plus-final-predicate specifications is a methodological strength that supports broad adoption.
Major comments (3)
- §3 (Benchmark Construction) and §4 (Evaluation Protocol): No inter-rater agreement statistics, human validation study, or false-positive audit of the automatic evaluation functions are reported. Because success is determined exclusively by a final-output predicate on long-horizon trajectories, the 45.1% headline figure and the subsequent attribution of specific capability gaps rest on an unverified assumption that the predicates credit only trajectories that exercised the intended reasoning, tool-use, and visual-grounding steps.
- §5 (Experiments) and Table 1: The paper states that the best system achieves 45.1% success but does not report the total number of tasks, their distribution across capability categories, or any measure of task diversity or difficulty calibration. Without these quantities it is impossible to assess whether the claimed “substantial room for improvement” in reasoning, tool use, and visual grounding is proportionate to the data or driven by a small number of outlier tasks.
- §6 (Analysis): The diagnosis that agents fail primarily on reasoning/planning, tool execution, and visual grounding is derived from post-hoc inspection of final outputs. Because the evaluation supplies no process supervision or step-level logging, it remains possible that some accepted trajectories reached the predicate via shortcuts or partial tool misuse, which would inflate measured success and overstate the severity of the identified gaps.
Minor comments (3)
- Figure 2 and §5.2: The per-capability breakdown bars are not accompanied by confidence intervals or per-task variance, making it difficult to judge whether differences between agents are statistically meaningful (a sketch of such intervals follows this list).
- §2 (Related Work): Several recent GUI-agent and unified-agent papers from 2024 are omitted; adding them would better situate CocoaBench relative to contemporaneous benchmarks.
- Appendix A: The task templates contain a small number of typographical inconsistencies in the instruction phrasing that could be cleaned up for reproducibility.
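On the first minor comment, one standard way to attach uncertainty to per-capability success bars is a percentile bootstrap over per-task 0/1 outcomes. The sketch below assumes that setup; the paper does not specify a statistical procedure, so this is an illustration rather than its method.

```python
# Percentile bootstrap CI for a success rate over per-task binary outcomes.
import random

def bootstrap_ci(outcomes: list[int], n_boot: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    means = sorted(
        sum(random.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]          # 2.5th percentile
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi
```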
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of benchmark validation and reporting that we will address in the revision. Below we respond point by point to the major comments, indicating planned changes to the manuscript.
Point-by-point responses
Referee: §3 (Benchmark Construction) and §4 (Evaluation Protocol): No inter-rater agreement statistics, human validation study, or false-positive audit of the automatic evaluation functions are reported. Because success is determined exclusively by a final-output predicate on long-horizon trajectories, the 45.1% headline figure and the subsequent attribution of specific capability gaps rest on an unverified assumption that the predicates credit only trajectories that exercised the intended reasoning, tool-use, and visual-grounding steps.
Authors: We agree that explicit validation of the automatic predicates is necessary to support the reliability claims. In the revised manuscript we will add a dedicated subsection in §4 describing a human validation study: two independent annotators will review a stratified sample of 100 trajectories (including both successes and failures) and compute inter-rater agreement (Cohen’s κ) as well as agreement with the automatic predicates. We will also report estimated false-positive rates for each predicate category. This addition directly addresses the concern that the 45.1% figure may be inflated by unverified shortcuts. revision: yes
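For reference, the planned inter-rater statistic is straightforward to compute from two annotators' binary success labels. The sketch below spells out the formula under the assumption of equal-length label lists; it is not code from the paper.

```python
# Cohen's kappa for two annotators' binary success labels.
def cohens_kappa(a: list[int], b: list[int]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n     # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                 # each annotator's rate of "success"
    p_e = pa * pb + (1 - pa) * (1 - pb)             # agreement expected by chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

On a κ scale, values near 0 indicate chance-level agreement, and values above roughly 0.8 are conventionally read as strong agreement, which is the regime the validation study would need to reach.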
Referee: §5 (Experiments) and Table 1: The paper states that the best system achieves 45.1% success but does not report the total number of tasks, their distribution across capability categories, or any measure of task diversity or difficulty calibration. Without these quantities it is impossible to assess whether the claimed “substantial room for improvement” in reasoning, tool use, and visual grounding is proportionate to the data or driven by a small number of outlier tasks.
Authors: We acknowledge the reporting gap. The current CocoaBench contains 200 tasks; we will insert a new table (or expanded Table 1) that lists the total count, the breakdown by primary capability (vision-only, search-only, coding-only, and integrated), and summary statistics on task length and required tool calls. We will also describe the difficulty calibration procedure used during task design (pilot testing with human experts to ensure no task is trivially solvable or impossible). These additions will allow readers to judge whether the observed performance ceiling reflects broad capability gaps. revision: yes
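The promised breakdown is easy to specify precisely. A minimal sketch, assuming per-task records tagged with the rebuttal's four category labels (the record format is illustrative):

```python
# Per-category success rates from per-task records such as
# {"category": "vision" | "search" | "coding" | "integrated", "success": bool}.
from collections import defaultdict

def breakdown(results: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        wins[r["category"]] += int(r["success"])
    return {c: wins[c] / totals[c] for c in totals}
```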
Referee: §6 (Analysis): The diagnosis that agents fail primarily on reasoning/planning, tool execution, and visual grounding is derived from post-hoc inspection of final outputs. Because the evaluation supplies no process supervision or step-level logging, it remains possible that some accepted trajectories reached the predicate via shortcuts or partial tool misuse, which would inflate measured success and overstate the severity of the identified gaps.
Authors: The analysis in §6 is indeed based on manual review of final states and error logs rather than step-level supervision, which is a deliberate design choice to keep evaluation scalable. We will revise §6 to (1) explicitly state this limitation, (2) describe the criteria used during inspection to flag likely shortcuts (e.g., final state achieved without evidence of required visual grounding), and (3) add a small-scale qualitative audit of 30 accepted trajectories to quantify how often shortcuts appear to have been used. While we cannot retroactively provide full process supervision for all runs, these clarifications and the added audit will temper the strength of the gap diagnosis. revision: partial
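The proposed shortcut criterion can also be made mechanical. A minimal sketch, assuming each logged step begins with the tool name and using an illustrative capability-to-evidence mapping: a trajectory that satisfies the final predicate while showing no evidence of a required capability gets flagged for the manual audit.

```python
# Map each required capability to tool calls that would count as evidence for it.
# The mapping and the log format are illustrative assumptions.
REQUIRED_EVIDENCE = {
    "visual_grounding": {"screenshot", "click"},
    "search": {"web_search"},
    "coding": {"run_code"},
}

def flag_shortcuts(trajectory: list[str], required: set[str]) -> list[str]:
    """Return required capabilities with no supporting tool call in the log,
    assuming each logged step begins with the tool name."""
    used = {step.split(" ", 1)[0] for step in trajectory}
    return [cap for cap in required if not (REQUIRED_EVIDENCE[cap] & used)]
```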
Circularity Check
Empirical benchmark with no derivation chain
Full rationale
The paper introduces CocoaBench as a collection of human-designed long-horizon tasks specified by instructions plus automatic final-output predicates, together with a lightweight scaffold (CocoaAgent) for controlled evaluation. All reported results (e.g., 45.1% success) are direct empirical measurements on externally defined tasks and scoring functions. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described methodology; the work therefore contains no load-bearing steps that reduce to their own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: tasks specified only by an instruction plus an automatic evaluation function over the final output enable reliable and scalable evaluation across diverse agent infrastructures.
Forward citations
Cited by 1 Pith paper
- LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues
LongMemEval-V2 is a new benchmark where AgentRunbook-C reaches 72.5% accuracy on long-term agent memory tasks, beating RAG baselines at 48.5% and basic coding agents at 69.3%.