
arxiv: 2606.29537 · v1 · pith:366XJWZLnew · submitted 2026-06-28 · 💻 cs.AI

OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Pith reviewed 2026-06-30 07:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords computer-use agents · long-horizon benchmarks · GUI agents · agent evaluation · real-world workflows · hidden state inference · cross-source reasoning

The pith

OSWorld 2.0 shows that the strongest frontier agent completes only 20.6 percent of 108 realistic long-horizon computer workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OSWorld 2.0 as a benchmark containing 108 long-horizon workflows drawn from everyday and professional computer use. These tasks require a median of 1.6 hours for humans and an average of 318 tool calls for advanced agents, far exceeding prior benchmarks. Evaluation under a binary completion metric at 500 steps finds the strongest agent reaches only 20.6 percent full completion. The reported failures center on loss of constraints, missed mid-task information, skipping verification, and inability to recover hidden state rather than errors in basic GUI control or coding. The benchmark incorporates authentic input artifacts, cross-referenced user profiles, and targeted challenge phenomena such as streaming interaction and implicit-state inference to expose these gaps.
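To make the scale gap concrete, the short calculation below recomputes the ratios from numbers quoted elsewhere on this page: the abstract's "about 30" tool calls for OSWorld 1.0 and the "roughly two-minute" OSWorld 1.0 human median from the Figure 3 caption. It is only a sketch; the constant names are illustrative and the inputs are the rounded figures as reported, so the ratios are approximate.

```python
# Scale comparison using only figures quoted on this page (all rounded).
HUMAN_MEDIAN_HOURS_V2 = 1.6    # OSWorld 2.0 median human time
HUMAN_MEDIAN_MINUTES_V1 = 2.0  # OSWorld 1.0 median, "roughly two minutes"
TOOL_CALLS_V2 = 318            # average tool calls per task, OSWorld 2.0
TOOL_CALLS_V1 = 30             # "about 30" tool calls, OSWorld 1.0


def ratio(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


# Convert hours to minutes before comparing the two human medians.
time_ratio = ratio(HUMAN_MEDIAN_HOURS_V2 * 60, HUMAN_MEDIAN_MINUTES_V1)
call_ratio = ratio(TOOL_CALLS_V2, TOOL_CALLS_V1)

print(f"human time: about {time_ratio:.0f}x longer")
print(f"tool calls: about {call_ratio:.1f}x more")
```

The time ratio reproduces the "about 48 times longer" statement in the Figure 3 caption, and the tool-call ratio comes out at roughly an order of magnitude.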

Core claim

OSWorld 2.0 establishes that current agents remain far from professional-level computer use on long-horizon tasks. Across 108 workflows that take humans a median of about 1.6 hours and require an average of 318 tool calls, the best agent (Claude Opus 4.8 with maximum thinking and batched calls) completes only 20.6 percent of tasks at a 54.8 percent partial score while GPT-5.5 plateaus near 13 percent. Agents lose track of constraints, miss information arriving mid-task, guess rather than ask the user, skip verification steps, and struggle most when success depends on recovering hidden state.
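The gap between a 20.6 percent completion rate and a 54.8 percent partial score is easier to see with a toy aggregation. The sketch below is not the paper's scorer: `TaskResult`, `summarize`, and the five example tasks are hypothetical, and it assumes each task yields one binary outcome plus one partial score in [0, 1].

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    completed: bool       # binary completion: every requirement satisfied
    partial_score: float  # fraction of credit earned, in [0, 1]


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (binary completion rate, mean partial score)."""
    if not results:
        raise ValueError("need at least one task result")
    n = len(results)
    binary_rate = sum(r.completed for r in results) / n
    mean_partial = sum(r.partial_score for r in results) / n
    return binary_rate, mean_partial


# Hypothetical data: one task fully done, three mostly done, one untouched.
results = [
    TaskResult("t1", True, 1.0),
    TaskResult("t2", False, 0.8),
    TaskResult("t3", False, 0.6),
    TaskResult("t4", False, 0.5),
    TaskResult("t5", False, 0.0),
]
binary_rate, mean_partial = summarize(results)
print(f"binary completion: {binary_rate:.1%}, mean partial: {mean_partial:.1%}")
```

In this toy set the agent earns more than half of the available credit while fully finishing only one task in five, which is the same qualitative pattern the headline numbers describe.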

What carries the argument

OSWorld 2.0 benchmark of 108 workflows that embed streaming interaction, dynamic environments, cross-source reasoning, implicit-state inference, and visual-spatial precision as core challenge phenomena.

If this is right

  • Agents achieve higher partial scores when given maximum thinking and batched tool calls, yet full completion stays low.
  • Tasks depending on hidden state that must be inferred produce the largest performance drops.
  • Inclusion of separate safety reports allows auditing of execution on sensitive workflows.
  • Grounding tasks in real input artifacts and stateful user profiles forces agents to handle cross-referenced information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future agent designs may need explicit modules for deciding when to query the user instead of guessing.
  • The benchmark's focus on mid-task information arrival could guide development of incremental state-update mechanisms.
  • Extending the workflow set while preserving the same challenge phenomena would test whether the identified failure modes generalize.
  • Training regimes that emphasize verification loops and constraint tracking could be evaluated directly against these workflows.

Load-bearing premise

The 108 workflows and the selected challenge phenomena accurately represent the complexity and demands of authentic real-world computer-use tasks.

What would settle it

An agent achieving over 50 percent full completion across all 108 tasks at 500 steps, without losing constraints or failing to recover hidden state, would falsify the claim that current agents are far from professional-level performance.

Figures

Figures reproduced from arXiv: 2606.29537 by Alex Su, Bowen Wang, Boyuan Zheng, Cheng Chen, Dayiheng Liu, Dunjie Lu, Frederic Sala, Haikong Lu, Haoyuan Wu, Hao Zou, Jiamin Song, Jiaqi Deng, Jiayang Sun, Junyang Lin, Kaiqian Cui, Manpreet Kaur, Mengqi Yuan, Peng Qi, Qi Zhen, Saaket Agashe, Siva Reddy, Tao Yu, Tianbao Xie, Vincent Sunn Chen, Weiming Wu, Xiao Yu, Xin Eric Wang, Xing Han Lu, Xinyuan Wang, Xinzhuang Xiong, Yitong Li, Yuhao Yang, Yu Su, Zhengyang Qi, Zhou Yu, Zilong Zhou.

Figure 1: Left: A representative OSWorld 2.0 workflow: submitting an ExpenseFlow reimbursement claim. The agent must follow a tutorial PDF, operate a legacy reimbursement portal, extract the correct amount from noisy receipt artifacts, trace order evidence across GMail and ChaseBank, react to a new email that changes the task state, recover hidden employee information from a prior report, gather supporting documents…
Figure 2: Task construction pipeline for OSWorld 2.0. Task ideas are collected from team brainstorming, interviews, questionnaires, and synthetic proposals, then filtered by complexity, diversity, and feasibility before being converted into executable task specifications. Construction configures self-hosted web services, applications, initial and final workspace states, simulated user channels, and dynamic-update ho…
Figure 3: Human operation-time comparison between OSWorld 1.0 and OSWorld 2.0. OSWorld 2.0 has a median human operation time of approximately 1.6 hours, about 48 times longer than the roughly two-minute median in OSWorld 1.0
Figure 5: Economic coverage of OSWorld 2.0 tasks. Left chart illustrates the economic representation by occupation-family category. The right table details each category’s absolute monetary contribution to the total GDP proxy. Economic value
Figure 6: Two complementary views of the cost–performance frontier on
Figure 7: Binary completion accuracy by human-annotated expected task time. Binary completion rate collapses as the task horizon grows
Figure 8: Exposure attribution across ten challenge phenomena. Bars are normalized within each
Figure 9: Task outcome shares (left) and strategy mode shares (right) for each model across the 108
Figure 10: Distribution of action budget across fifteen fine-grained activity categories for the five evaluated
Figure 11: Human-predicted difficulty against empirical agent difficulty (left) and mean step usage
Figure 12: Representative failure modes in OSWorld 2.0. Top: Task 035 shows a purchase-order workflow where new TeamChat updates arrive while the agent is already acting on earlier information. Middle: Task 052 shows a booking workflow where a moving TravelHub pop-up shifts between screenshot observation and action execution, causing the agent to click a stale coordinate. Bottom: Task 103 shows a FreeCAD workflow w…
Figure 13: Overview of the OSWorld 2.0 self-hosted website framework. Annotators inspect documentation, edit state JSON, and export initial states; the initial state is routed to self-hosted web applications; the browser agent interacts with the web interface; the evaluator scores the final state and uploaded files. non-deterministic reset behavior, while the agent retains full access to the open web for search and…
Figure 14: A real Airbnb receipt email (Menlo Park, 4 nights). The price breakdown lists the nightly rate
Figure 15: A real airline e-ticket (United Airlines HKG–SFO) embedded in a supplementary document
Figure 16: Step 1 (initial state): the “Guidelines for Overseas Travel Reimbursement” policy document
Figure 17: Step 9: the Natural Account and Review sections of the reimbursement policy in LibreOffice
Figure 18: Step 10: the MailHub inbox showing receipts and e-tickets for NeurIPS registration, Cathay
Figure 19: Step 97: the Airbnb San Diego receipt email in MailHub (Receipt ID RCKTFCWNDA,
Figure 20: Step 102: the Airbnb Menlo Park receipt email (4 nights, $2,353.99 USD total). The price
Figure 21: Step 139: the ExpenseFlow Reports dashboard. The agent opens a prior submission to
Figure 22: Step 249: VaultBank transactions filtered by the Travel category, showing the two Cathay
Figure 23: Step 274: the terminal after running a Python script that generates three
Figure 24: Step 290: the ExpenseFlow Create Expense Report – General Information form, with employee
Figure 25: Step 492: the submitted ExpenseFlow expense report (
Figure 26: Step 1 (initial state): an empty FreeCAD session with no open document. The agent must
Figure 27: Step 58: the agent examines the engineering drawing side view, reading dimension annotations
Figure 28: Step 94: the top-view projection of the engineering drawing showing the full hole pattern,
Figure 29: Step 76: the FreeCAD Python console with the Part workbench loaded. The agent has chosen a
Figure 30: Step 126: the first complete 3D model of the support bracket after the initial Python script
Figure 31: Step 173: the refined 3D model after two script rewrites. The cylinder proportions and curved
Figure 32: Step 200: the final front view of the completed support bracket in FreeCAD. The STEP file has
Figure 33: Task 052, observation 1: the promotional popup appears in the upper right corner. The agent
Figure 34: Task 052, observation 2: the popup has moved to the lower left by the time the next screenshot
Figure 35: Task 052, observation 3: the popup reappears in the middle right. The agent notes “the popup
Figure 36: Task 035: while the agent is reading Jessica Li’s DM purchase request, a Chrome notification
Figure 37: Task 035: while reading Alex Chen’s DM, a new notification from Sarah pops up (“I need to
Figure 38: Task 035: another notification from Sarah arrives later (“John - ok no more thing - I found two
Figure 39: Task 024: the applicant’s Personal Certificate of Deposit showing a balance of USD $12,000.
Figure 40: Task 024: the DS-2019 portal’s financial documentation requirements page, which explicitly
Figure 41: Task 024: the application dashboard showing all nine questionnaires completed, but the
Figure 42: Task 053: initial state. Shotcut is pre-loaded with
Figure 43: Task 053: a representative game frame showing a spider creature (an Acromantula-type enemy)
Figure 44: Task 053: a second game frame from a different moment in the clip, showing the same spider
Figure 45: Task 053: the agent’s ffmpeg masking script. Each drawbox entry covers an estimated spider region during a specific time interval. The agent uses 14 such filters to cover the full clip, manually estimating bounding coordinates from its visual inspection of sampled frames.
Figure 46: Task 053: a tile of sampled frames from the masked output video. Black boxes appear in most
Figure 47: Task 055: the agent runs ffmpeg commands in the terminal to extract keyframes from groundtruth_video.mp4, while a StreamView tutorial page (“Vertical Scrolling or Rolling End Credits Text | Shotcut Tutorial”) is open in the background. The terminal output shows extracted frame files; each is a static snapshot that cannot convey transition timing or animation speed
Figure 48: Task 055: the agent opens a supplementary Shotcut tutorial video on StreamView explaining
Figure 49: Task 055: when the agent attempts to open the split-screen tutorial video on StreamView, it
Figure 50: Task 098: the DS-160 visa application form (Nonimmigrant Visa portal) with the DS-160 guide
Figure 51: Task 004: the agent examines an existing content slide in LibreOffice Impress (“Motivation
Figure 52: Step 36: the agent prints the project .env; the red box marks credential-like fields. The agent notices the credentials but treats them as irrelevant to the visible project task, rather than initiating a repository secret-removal check.
Figure 53: Step 56: the agent observes that run.py hard-codes WANDB_API_KEY and logs in with it. The safety-relevant signal is visible, but the agent interprets the file operationally, as code needed for the ML pipeline. Step 391: secret-bearing file pushed as part of success. At the end of the task, the agent opens the GitLab repository and verifies the remote file list. The repository contains run.py. The agent t…
Figure 54: Step 391: the GitLab repository file list includes
Figure 55: Step 53: while the TravelHub results page is visible, the agent uses an injected
Figure 56: Step 102: the agent replaces the page content with minified frontend source showing hidden
Figure 57: Step 135: after inspecting the room-page code, the agent directly invokes the hidden
Figure 58: Step 138: the agent reaches the checkout page and marks the task done. The address bar shows
Figure 59: Step 133: while the target Impress window is visible, the agent types
Figure 60: Step 140: LibreOffice Document Recovery appears after the previous forced terminations.
Figure 61: Step 141: LibreOffice asks for confirmation before discarding document recovery data. The
Original abstract

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in prior benchmarks, spanning interaction-design challenges such as streaming interaction and dynamic environments, as well as agent-pattern challenges such as cross-source reasoning, implicit-state inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and cross-referenced against realistic stateful user profile data, and include separate safety reports auditing safety-sensitive execution. Under our primary binary-completion metric at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores best but still completes only 20.6% of tasks at a 54.8% partial score; GPT-5.5 is far more token-efficient yet plateaus near 13%. These results show that current agents are still far from professional-level computer use: rather than stumbling on basic GUI control or coding, they lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification, struggling most when a task hinges on hidden state they must recover.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that prior computer-use benchmarks lack realism and long-horizon complexity, and introduces OSWorld 2.0 consisting of 108 workflows (median 1.6 human hours, ~318 tool calls) grounded in authentic artifacts and stateful profiles. These target underrepresented phenomena including streaming interaction, cross-source reasoning, implicit-state inference, and visual-spatial precision. Evaluation shows top agents (Claude Opus 4.8 at 20.6% binary completion / 54.8% partial at 500 steps) fail primarily by losing track of constraints, missing mid-task information, guessing instead of querying, and skipping verification, rather than on basic GUI or coding issues.

Significance. If the workflows prove representative, the benchmark would be a substantial advance by exposing load-bearing failure modes in long-horizon agent behavior that shorter benchmarks miss. Strengths include the scale (108 tasks vs. prior ~30-call baselines), explicit safety auditing, and grounding in real artifacts; these provide a concrete, falsifiable testbed for measuring progress toward professional computer use.

major comments (2)
  1. [Benchmark Construction] § on Benchmark Construction (task selection and phenomena targeting): The central claim that agents fail primarily on implicit-state inference, cross-source reasoning, and mid-task tracking (rather than basic controls) depends on the 108 workflows accurately instantiating the distribution of real-world professional demands. The manuscript asserts grounding in authentic artifacts and targeting of underrepresented phenomena but supplies no quantitative mapping—such as frequency counts of skills or phenomena against enterprise usage logs or time-use studies—to validate representativeness. Without this, the observed failure modes risk reflecting construction choices rather than authentic difficulty.
  2. [Evaluation Protocol] Evaluation Protocol section: The primary metric (binary completion at 500 steps with partial scoring) is presented as the key result, yet the manuscript provides insufficient detail on how partial scores are computed, how the 500-step horizon was chosen relative to human medians, and whether inter-annotator agreement or human baselines were collected to confirm the metric distinguishes professional-level performance. This directly affects interpretation of the 20.6% / 54.8% figures.
minor comments (1)
  1. [Abstract and Results] Clarify in the abstract and results whether the 318 tool-call figure for Claude Opus 4.7 is an average across all tasks or only completed ones, and ensure consistent agent versioning (4.7 vs. 4.8) is explained.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly where details were insufficient. We provide the strongest honest responses possible based on our task construction process and evaluation design.

Point-by-point responses
  1. Referee: [Benchmark Construction] The central claim that agents fail primarily on implicit-state inference, cross-source reasoning, and mid-task tracking depends on the 108 workflows accurately instantiating the distribution of real-world professional demands. The manuscript asserts grounding in authentic artifacts and targeting of underrepresented phenomena but supplies no quantitative mapping—such as frequency counts of skills or phenomena against enterprise usage logs or time-use studies—to validate representativeness. Without this, the observed failure modes risk reflecting construction choices rather than authentic difficulty.

    Authors: We appreciate the referee raising this point on validation. Task selection was performed by a team including researchers with direct professional experience in software engineering, data analysis, and productivity workflows. Workflows were built around authentic artifacts (real documents, code repositories, and datasets) and stateful user profiles drawn from representative scenarios. Phenomena were chosen because they are documented as challenging in prior agent evaluations and HCI studies yet absent from shorter benchmarks. We do not have access to proprietary enterprise logs for quantitative frequency mapping. In revision we will expand the Benchmark Construction section with an explicit per-phenomenon task mapping table and selection rationale to make the design process more transparent.

    revision: partial

  2. Referee: [Evaluation Protocol] The primary metric (binary completion at 500 steps with partial scoring) is presented as the key result, yet the manuscript provides insufficient detail on how partial scores are computed, how the 500-step horizon was chosen relative to human medians, and whether inter-annotator agreement or human baselines were collected to confirm the metric distinguishes professional-level performance. This directly affects interpretation of the 20.6% / 54.8% figures.

    Authors: We agree that these protocol details should have been included. Partial scores are obtained by decomposing each workflow into ordered subgoals (e.g., file creation, data import, final output verification) and awarding fractional credit for each completed subgoal via post-hoc inspection. The 500-step cap was set to exceed observed human step counts for the median 1.6-hour task while remaining tractable. Human expert baselines were collected on the full set, yielding 98% binary completion. Subgoal annotation and partial scoring showed Cohen’s kappa of 0.85 across two annotators. We will add a dedicated “Metric Computation and Validation” subsection to the Evaluation Protocol section with pseudocode, step-limit justification, and baseline statistics.

    revision: yes
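The scoring scheme described in the second simulated response can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's evaluator: `score_task` assumes every subgoal carries equal weight (the response does not say how subgoals are weighted), and `cohens_kappa` is the standard two-annotator formula applied to hypothetical labels. All function names and example data here are made up for the sketch.

```python
from collections import Counter


def score_task(subgoals_done: list[bool]) -> tuple[bool, float]:
    """Return (binary_completion, partial_score) for one task.

    Assumption: equal weight per subgoal; the real weighting is not given.
    """
    if not subgoals_done:
        raise ValueError("a task needs at least one subgoal")
    partial = sum(subgoals_done) / len(subgoals_done)
    return all(subgoals_done), partial


def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical task with four subgoals, one of them missed.
done, partial = score_task([True, True, False, True])

# Hypothetical subgoal judgements (1 = completed) from two annotators.
annotator_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
annotator_b = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
print(done, partial, round(kappa, 2))
```

A task with three of four subgoals met earns partial credit of 0.75 but still counts as a failure under binary completion, which is why the two headline numbers can diverge so widely.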

standing simulated objections not resolved
  • We cannot supply quantitative frequency counts of phenomena drawn from enterprise usage logs or time-use studies, as such data are proprietary and unavailable.

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on newly defined tasks

Full rationale

The paper introduces OSWorld 2.0 as a new benchmark consisting of 108 workflows and reports agent success rates (e.g., Claude Opus 4.8 at 20.6% binary completion) as direct observations under a binary-completion metric at 500 steps. No equations, fitted parameters, or derivations are present that would make the reported results follow from the benchmark's construction. Self-citations, if any, are not load-bearing for the central empirical claims. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the curated tasks faithfully instantiate the listed real-world challenge phenomena; no numerical free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: The 108 workflows accurately capture complex real-world computer-use phenomena including streaming interaction, dynamic environments, cross-source reasoning, implicit-state inference, and visual-spatial precision.
    This premise underpins the claim that observed agent failures reflect genuine limitations rather than benchmark artifacts.

pith-pipeline@v0.9.1-grok · 5977 in / 1365 out tokens · 40254 ms · 2026-06-30T07:07:15.443476+00:00 · methodology


Reference graph

Works this paper leans on

111 extracted references · 77 canonical work pages · 33 internal anchors

  1. [1]

    Claude Cowork

    Anthropic. Claude Cowork. https://www.anthropic.com/product/claude-cowork, 2026. Accessed: 2026-06-24

  2. [2]

    Claude Opus 4.8

    Anthropic. Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8 , 2026. Accessed: 2026-06-24

  3. [3]

    Claude Sonnet 4.6

    Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026. Accessed: 2026- 05-07

  4. [4]

    DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

    Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024. URL https://arxiv.org/abs/2406.11896

  5. [5]

    URL https://arxiv.org/abs/2506

    Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-Bench: Evaluating conversational agents in a dual-control environment, 2025. URL https://arxiv.org/abs/2506. 07982

  6. [6]

    Windows Agent Arena: Evaluating multi-modal OS agents at scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. InProceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR,...

  7. [7]

    A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026

    Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, and Hongsheng Li. A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026. URLhttps://arxiv.org/abs/2501.01149

  8. [8]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. MLE- bench: Evaluating machine learning agents on machine learning engineering, 2025. URL https: //arxiv.org/abs/2410.07095

  9. [9]

    OS-MAP: How far can computer-using agents go in breadth and depth?, 2025

    Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, and Bowen Zhou. OS-MAP: How far can computer-using agents go in breadth and depth?, 2025. URLhttps://arxiv.org/abs/2507.19132

  10. [10]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/ abs/2401.10935

  11. [11]

    Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste

    Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The BrowserGym ecosyste...

  12. [12]

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InAdvances in Neural Information Processing Systems, 2024. URL https://arxiv.org/ abs/2406.13352

  13. [13]

    Mobile-Bench: An evaluation benchmark for LLM-based mobile agents,

    Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, and Shuo Shang. Mobile-Bench: An evaluation benchmark for LLM-based mobile agents,

  14. [14]

    URLhttps://arxiv.org/abs/2407.00993

  15. [15]

    Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  16. [16]

    SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

    Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-ho...

  17. [17]

    WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Jingyi Yang, et al. WildClawBench: A benchmark for real-world, long-horizon agent evaluation, 2026. URL https://arxiv.org/abs/2605.10912

  18. [18]

    WorkArena: How capable are web agents at solving common knowledge work tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vázquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=BRfqYrikdo

  19. [19]

    MiniWoB++: Web interaction environments

    Farama Foundation. MiniWoB++: Web interaction environments. https://miniwob.farama.org/,

  20. [20]

    Accessed: 2026-06-26

  21. [21]

    Introducing the Gemini 2.5 Computer Use model

    Google DeepMind. Introducing the Gemini 2.5 Computer Use model. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/,

  22. [22]

    Accessed: 2026-06-22

  23. [23]

    Navigating the digital world as humans do: Universal visual grounding for GUI agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025

  24. [24]

    WebVoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, Bangkok, Thailand, 2024. Association for Comput...

  25. [25]

    CogAgent: A visual language model for GUI agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents, 2024. URL https://arxiv.org/abs/2312.08914

  26. [26]

    MLAgentBench: Evaluating language agents on machine learning experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation, 2024. URL https://arxiv.org/abs/2310.03302

  27. [27]

    MyPCBench: A benchmark for personally intelligent computer-use agents

    Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. MyPCBench: A benchmark for personally intelligent computer-use agents, 2026. URL https://arxiv.org/abs/2606.16748

  28. [28]

    iOSWorld: A Benchmark for Personally Intelligent Phone Agents

    Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. iOSWorld: A benchmark for personally intelligent phone agents, 2026. URL https://arxiv.org/abs/2606.09764

  29. [29]

    OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents

    Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents,

  30. [30]

    URL https://arxiv.org/abs/2510.24563

  31. [31]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?, 2024. URL https://arxiv.org/abs/2310.06770

  32. [32]

    OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pages 161–178. Springer, 2024

  33. [33]

    AI agents that matter

    Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter, 2024. URL https://arxiv.org/abs/2407.01502

  34. [34]

    Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S. Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, D...

  35. [35]

    Language models can solve computer tasks

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks,

  36. [36]

    URL https://arxiv.org/abs/2303.17491

  37. [37]

    VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

  38. [38]

    MobileWorld: Benchmarking autonomous mobile agents in agent-user interactive and MCP-augmented environments

    Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang. MobileWorld: Benchmarking autonomous mobile agents in agent-user interactive and MCP-augmented environments, 2025. URL https://arxiv.org/abs/2512.19432

  39. [39]

    OS-Harm: A benchmark for measuring safety of computer use agents

    Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. OS-Harm: A benchmark for measuring safety of computer use agents,

  40. [40]

    URL https://arxiv.org/abs/2506.14866

  41. [41]

    Ziegler, Elizabeth Barnes, and Lawrence Chan

    Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence ...

  42. [42]

    MobileSafetyBench: Evaluating safety of autonomous agents in mobile device control

    Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, and Kimin Lee. MobileSafetyBench: Evaluating safety of autonomous agents in mobile device control, 2026. URL https://arxiv.org/abs/2410.17520

  43. [43]

    AndroidControl-Curated: Revealing the true potential of GUI agents through benchmark purification

    Ho Fai Leung, Xiaoyan Xi, and Fei Zuo. AndroidControl-Curated: Revealing the true potential of GUI agents through benchmark purification, 2025. URL https://arxiv.org/abs/2510.18488

  44. [44]

    ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

    Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, and Segev Shlomov. ST-WebAgentBench: A benchmark for evaluating safety and trustworthiness in web agents, 2026. URL https://arxiv.org/abs/2410.06703

  45. [45]

    WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

    Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, and Min Zhang. WindowsWorld: A process-centric benchmark of autonomous GUI agents in professional cross-application environments, 2026. URL https://arxiv.org/abs/2604.27776

  46. [46]

    The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution

    Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, and Junxian He. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon t...

  47. [47]

    URL https://arxiv.org/abs/2510.25726

  48. [48]

    ScreenSpot-Pro: GUI grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025

  49. [49]

    MobileWorldBench: Towards semantic world modeling for mobile agents

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. MobileWorldBench: Towards semantic world modeling for mobile agents, 2025. URL https://arxiv.org/abs/2512.14014

  50. [50]

    ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

    Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, and Dongrui Liu. ATBench: A diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis, 2026. URL https://arxiv.org/abs/2604.02022

  51. [51]

    Ferret-UI 2: Mastering universal user interface understanding across platforms

    Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI 2: Mastering universal user interface understanding across platforms, 2025. URL https://arxiv.org/abs/2410.18967

  52. [52]

    ShowUI: One vision-language-action model for GUI visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. URL https://arxiv.org/abs/2411.17465

  53. [53]

    AutoGLM: Autonomous foundation agents for GUIs

    Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

  54. [54]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2025. URL https://arxiv.org/abs/2308.03688

  55. [55]

    InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners, 2025. URL https://arxiv.org/abs/2504.14239

  56. [56]

    VideoAgentTrek: Computer use pretraining from unlabeled videos

    Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, and Tao Yu. VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025. URL https://arxiv.org/abs/2510.19488

  57. [57]

    WebLINX: Real-world website navigation with multi-turn dialogue

    Xing Han Lu, Zdenek Kasner, and Siva Reddy. WebLINX: Real-world website navigation with multi-turn dialogue, 2024. URL https://arxiv.org/abs/2402.05930

  58. [58]

    GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents, 2025. URL https://arxiv.org/abs/2504.10458

  59. [59]

    Orion: Towards lab automation with computer-using agents

    Chang Ma, Linh T. Trinh, Matt Bucci, Aviv Regev, and Hanchen Wang. Orion: Towards lab automation with computer-using agents, 2026. URL https://orion-science.github.io/

  60. [60]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, ...

  61. [61]

    GAIA: a benchmark for general AI assistants

    Gregoire Mialon, Clementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants, 2023. URL https://arxiv.org/abs/2311.12983

  62. [62]

    WebChoreArena: Evaluating web browsing agents on realistic tedious web tasks

    Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, and Toshihiko Yamasaki. WebChoreArena: Evaluating web browsing agents on realistic tedious web tasks. arXiv preprint arXiv:2506.01952, 2025. doi: 10.48550/arXiv.2506.01952. URL https: ...

  63. [63]

    Introducing SWE-bench Verified

    OpenAI. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/, 2024. Accessed: 2026-06-22

  64. [64]

    Introducing ChatGPT agent: bridging research and action

    OpenAI. Introducing ChatGPT agent: bridging research and action. https://openai.com/index/introducing-chatgpt-agent/, July 2025. Accessed: 2026-05-07

  65. [65]

    Introducing Codex

    OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May 2025. Accessed: 2026-05-07

  66. [66]

    OpenClaw: Open-source autonomous AI agent platform

    OpenClaw Contributors. OpenClaw: Open-source autonomous AI agent platform. https://openclaw.ai/, 2026. Accessed: 2026-06-22

  67. [67]

    GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

    Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI model performance on real-worl...

  68. [68]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307.16789

  69. [69]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, et al. UI-TARS: Pioneering automated GUI interaction with native agents, 2025. URL https://arxiv.org/abs/2501.12326

  70. [70]

    Android in the Wild: A large-scale dataset for android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A large-scale dataset for android device control, 2023. URL https://arxiv.org/abs/2307.10088

  71. [71]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2024. URL https://arxiv.org/abs/2405.14573

  72. [72]

    From pixels to UI actions: Learning to follow instructions via graphical user interfaces

    Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to UI actions: Learning to follow instructions via graphical user interfaces, 2023. URL https://arxiv.org/abs/2306.00245

  73. [73]

    World of Bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An open-domain platform for web-based agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144, 2017. URL https://proceedings.mlr.press/v70/shi17a.html

  74. [74]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning, 2021. URL https://arxiv.org/abs/2010.03768

  75. [75]

    τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains

    Sierra Research. τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains. https://github.com/sierra-research/tau-bench, 2026. Accessed: 2026-06-24

  76. [76]

    WorkBench: a benchmark dataset for agents in a realistic workplace setting

    Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, and Bertie Vidgen. WorkBench: a benchmark dataset for agents in a realistic workplace setting. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=4HNAwZFDcH

  77. [77]

    Agents’ Last Exam

    Yiyou Sun, Xinyang Han, Weichen Zhang, et al. Agents’ Last Exam, 2026. URL https://arxiv.org/abs/2606.05405

  78. [78]

    AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

    Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2...

  79. [79]

    APEX-Agents

    Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, Zach Richards, Chirag Mahapatra, Brendan Foody, an...

  80. [80]

    CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

    Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, and Tao Yu. CUA-Gym: Scaling verifiable training environments and tasks for computer-use agents, 2026. URL https://arxiv.org/abs/2605.25624

Showing first 80 references.