OSWorld 2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Alex Su; Bowen Wang; Boyuan Zheng; Cheng Chen; Dayiheng Liu; Dunjie Lu; Frederic Sala; Haikong Lu; Haoyuan Wu; Hao Zou

arxiv: 2606.29537 · v1 · pith:366XJWZLnew · submitted 2026-06-28 · 💻 cs.AI

Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song, Kaiqian Cui, Bowen Wang, Haoyuan Wu, Yitong Li, Dunjie Lu, Haikong Lu, Qi Zhen, Xinyuan Wang, Jiaqi Deng, Yuhao Yang, Cheng Chen, Boyuan Zheng, Alex Su, Xiao Yu, Hao Zou, Saaket Agashe, Xing Han Lu, Manpreet Kaur, Zhengyang Qi, Vincent Sunn Chen, Frederic Sala, Dayiheng Liu, Junyang Lin, Zhou Yu, Yu Su, Siva Reddy, Xin Eric Wang, Peng Qi, Tianbao Xie, Tao Yu

Pith reviewed 2026-06-30 07:07 UTC · model grok-4.3

classification 💻 cs.AI

keywords: computer-use agents, long-horizon benchmarks, GUI agents, agent evaluation, real-world workflows, hidden state inference, cross-source reasoning

The pith

OSWorld 2.0 shows frontier agents complete only 20.6 percent of 108 realistic long-horizon computer workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OSWorld 2.0 as a benchmark containing 108 long-horizon workflows drawn from everyday and professional computer use. These tasks require a median of 1.6 hours for humans and an average of 318 tool calls for advanced agents, far exceeding prior benchmarks. Evaluation under a binary completion metric at 500 steps finds the strongest agent reaches only 20.6 percent full completion. The reported failures center on loss of constraints, missed mid-task information, skipping verification, and inability to recover hidden state rather than errors in basic GUI control or coding. The benchmark incorporates authentic input artifacts, cross-referenced user profiles, and targeted challenge phenomena such as streaming interaction and implicit-state inference to expose these gaps.

Core claim

OSWorld 2.0 establishes that current agents remain far from professional-level computer use on long-horizon tasks. Across 108 workflows that take humans a median of about 1.6 hours and require an average of 318 tool calls, the best agent (Claude Opus 4.8 with maximum thinking and batched calls) completes only 20.6 percent of tasks at a 54.8 percent partial score while GPT-5.5 plateaus near 13 percent. Agents lose track of constraints, miss information arriving mid-task, guess rather than ask the user, skip verification steps, and struggle most when success depends on recovering hidden state.

What carries the argument

The OSWorld 2.0 benchmark itself: 108 workflows that embed streaming interaction, dynamic environments, cross-source reasoning, implicit-state inference, and visual-spatial precision as core challenge phenomena.

If this is right

Agents achieve higher partial scores when given maximum thinking and batched tool calls, yet full completion stays low.
Tasks depending on hidden state that must be inferred produce the largest performance drops.
Inclusion of separate safety reports allows auditing of execution on sensitive workflows.
Grounding tasks in real input artifacts and stateful user profiles forces agents to handle cross-referenced information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent designs may need explicit modules for deciding when to query the user instead of guessing.
The benchmark's focus on mid-task information arrival could guide development of incremental state-update mechanisms.
Extending the workflow set while preserving the same challenge phenomena would test whether the identified failure modes generalize.
Training regimes that emphasize verification loops and constraint tracking could be evaluated directly against these workflows.
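
The editorial extensions above (deciding when to query the user, updating state as information arrives mid-task, and verifying before finishing) can be made concrete with a small sketch. Everything in it is hypothetical: the `ConstraintLedger` class, its method names, and the example keys are illustrations of the idea, not anything described in the paper or used by the evaluated agents.

```python
from dataclasses import dataclass, field


@dataclass
class ConstraintLedger:
    """Tracks task constraints as they arrive (hypothetical design, not from the paper)."""
    constraints: dict[str, str] = field(default_factory=dict)
    unknown: set[str] = field(default_factory=set)

    def require(self, key: str) -> None:
        """Declare that the task needs a value for `key`."""
        if key not in self.constraints:
            self.unknown.add(key)

    def update(self, key: str, value: str) -> None:
        """Record a new or revised value, e.g. from a message that arrives mid-task."""
        self.constraints[key] = value
        self.unknown.discard(key)

    def next_action(self) -> str:
        """Ask the user while required values are missing; otherwise proceed."""
        if self.unknown:
            return "ask_user:" + ",".join(sorted(self.unknown))
        return "proceed"

    def verify(self, final_state: dict[str, str]) -> list[str]:
        """Return the keys whose final value disagrees with a tracked constraint."""
        return [k for k, v in self.constraints.items() if final_state.get(k) != v]


# Toy walk-through with made-up values.
ledger = ConstraintLedger()
ledger.require("amount")
ledger.require("employee_id")
ledger.update("amount", "100.00")
first = ledger.next_action()          # employee_id is still unknown
ledger.update("employee_id", "E-17")
ledger.update("amount", "120.00")     # a mid-task message revises the amount
second = ledger.next_action()
violations = ledger.verify({"amount": "100.00", "employee_id": "E-17"})
print(first, second, violations)
```

In this toy run the ledger asks for the missing employee ID instead of guessing, and the final check flags the stale amount that was submitted after the revision. A real agent would need a far richer notion of state than string equality; the sketch only shows where such checks would sit in the loop.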

Load-bearing premise

The 108 workflows and the selected challenge phenomena accurately represent the complexity and demands of authentic real-world computer-use tasks.

What would settle it

An agent achieving over 50 percent full completion across all 108 tasks at 500 steps, without losing constraints or failing to recover hidden state, would falsify the claim that current agents are far from professional-level performance.

Figures

Figures reproduced from arXiv: 2606.29537 by Alex Su, Bowen Wang, Boyuan Zheng, Cheng Chen, Dayiheng Liu, Dunjie Lu, Frederic Sala, Haikong Lu, Haoyuan Wu, Hao Zou, Jiamin Song, Jiaqi Deng, Jiayang Sun, Junyang Lin, Kaiqian Cui, Manpreet Kaur, Mengqi Yuan, Peng Qi, Qi Zhen, Saaket Agashe, Siva Reddy, Tao Yu, Tianbao Xie, Vincent Sunn Chen, Weiming Wu, Xiao Yu, Xin Eric Wang, Xing Han Lu, Xinyuan Wang, Xinzhuang Xiong, Yitong Li, Yuhao Yang, Yu Su, Zhengyang Qi, Zhou Yu, Zilong Zhou.

**Figure 1.** Left: A representative OSWorld 2.0 workflow: submitting an ExpenseFlow reimbursement claim. The agent must follow a tutorial PDF, operate a legacy reimbursement portal, extract the correct amount from noisy receipt artifacts, trace order evidence across GMail and ChaseBank, react to a new email that changes the task state, recover hidden employee information from a prior report, gather supporting documents…

**Figure 2.** Task construction pipeline for OSWorld 2.0. Task ideas are collected from team brainstorming, interviews, questionnaires, and synthetic proposals, then filtered by complexity, diversity, and feasibility before being converted into executable task specifications. Construction configures self-hosted web services, applications, initial and final workspace states, simulated user channels, and dynamic-update ho…

**Figure 3.** Human operation-time comparison between OSWorld 1.0 and OSWorld 2.0. OSWorld 2.0 has a median human operation time of approximately 1.6 hours, about 48 times longer than the roughly two-minute median in OSWorld 1.0

**Figure 5.** Economic coverage of OSWorld 2.0 tasks. Left chart illustrates the economic representation by occupation-family category. The right table details each category’s absolute monetary contribution to the total GDP proxy. Economic value

**Figure 6.** Two complementary views of the cost–performance frontier on

**Figure 7.** Binary completion accuracy by human-annotated expected task time. Binary completion rate collapses as the task horizon grows

**Figure 8.** Exposure attribution across ten challenge phenomena. Bars are normalized within each

**Figure 9.** Task outcome shares (left) and strategy mode shares (right) for each model across the 108

**Figure 10.** Distribution of action budget across fifteen fine-grained activity categories for the five evaluated

**Figure 11.** Human-predicted difficulty against empirical agent difficulty (left) and mean step usage

**Figure 12.** Representative failure modes in OSWorld 2.0. Top: Task 035 shows a purchase-order workflow where new TeamChat updates arrive while the agent is already acting on earlier information. Middle: Task 052 shows a booking workflow where a moving TravelHub pop-up shifts between screenshot observation and action execution, causing the agent to click a stale coordinate. Bottom: Task 103 shows a FreeCAD workflow w…

**Figure 13.** Overview of the OSWorld 2.0 self-hosted website framework. Annotators inspect documentation, edit state JSON, and export initial states; the initial state is routed to self-hosted web applications; the browser agent interacts with the web interface; the evaluator scores the final state and uploaded files. non-deterministic reset behavior, while the agent retains full access to the open web for search and…

**Figure 14.** A real Airbnb receipt email (Menlo Park, 4 nights). The price breakdown lists the nightly rate

**Figure 15.** A real airline e-ticket (United Airlines HKG–SFO) embedded in a supplementary document

**Figure 16.** Step 1 (initial state): the “Guidelines for Overseas Travel Reimbursement” policy document

**Figure 17.** Step 9: the Natural Account and Review sections of the reimbursement policy in LibreOffice

**Figure 18.** Step 10: the MailHub inbox showing receipts and e-tickets for NeurIPS registration, Cathay

**Figure 19.** Step 97: the Airbnb San Diego receipt email in MailHub (Receipt ID RCKTFCWNDA,

**Figure 20.** Step 102: the Airbnb Menlo Park receipt email (4 nights, $2,353.99 USD total). The price

**Figure 21.** Step 139: the ExpenseFlow Reports dashboard. The agent opens a prior submission to

**Figure 22.** Step 249: VaultBank transactions filtered by the Travel category, showing the two Cathay

**Figure 23.** Step 274: the terminal after running a Python script that generates three

**Figure 24.** Step 290: the ExpenseFlow Create Expense Report – General Information form, with employee

**Figure 25.** Step 492: the submitted ExpenseFlow expense report (

**Figure 26.** Step 1 (initial state): an empty FreeCAD session with no open document. The agent must

**Figure 27.** Step 58: the agent examines the engineering drawing side view, reading dimension annotations

**Figure 28.** Step 94: the top-view projection of the engineering drawing showing the full hole pattern,

**Figure 29.** Step 76: the FreeCAD Python console with the Part workbench loaded. The agent has chosen a

**Figure 30.** Step 126: the first complete 3D model of the support bracket after the initial Python script

**Figure 31.** Step 173: the refined 3D model after two script rewrites. The cylinder proportions and curved

**Figure 32.** Step 200: the final front view of the completed support bracket in FreeCAD. The STEP file has

**Figure 33.** Task 052, observation 1: the promotional popup appears in the upper right corner. The agent

**Figure 34.** Task 052, observation 2: the popup has moved to the lower left by the time the next screenshot

**Figure 35.** Task 052, observation 3: the popup reappears in the middle right. The agent notes “the popup

**Figure 36.** Task 035: while the agent is reading Jessica Li’s DM purchase request, a Chrome notification

**Figure 37.** Task 035: while reading Alex Chen’s DM, a new notification from Sarah pops up (“I need to

**Figure 38.** Task 035: another notification from Sarah arrives later (“John - ok no more thing - I found two

**Figure 39.** Task 024: the applicant’s Personal Certificate of Deposit showing a balance of USD $12,000.

**Figure 40.** Task 024: the DS-2019 portal’s financial documentation requirements page, which explicitly

**Figure 41.** Task 024: the application dashboard showing all nine questionnaires completed, but the

**Figure 42.** Task 053: initial state. Shotcut is pre-loaded with

**Figure 43.** Task 053: a representative game frame showing a spider creature (an Acromantula-type enemy)

**Figure 44.** Task 053: a second game frame from a different moment in the clip, showing the same spider

**Figure 45.** Task 053: the agent’s ffmpeg masking script. Each drawbox entry covers an estimated spider region during a specific time interval. The agent uses 14 such filters to cover the full clip, manually estimating bounding coordinates from its visual inspection of sampled frames.

**Figure 46.** Task 053: a tile of sampled frames from the masked output video. Black boxes appear in most

**Figure 47.** Task 055: the agent runs ffmpeg commands in the terminal to extract keyframes from groundtruth_video.mp4, while a StreamView tutorial page (“Vertical Scrolling or Rolling End Credits Text | Shotcut Tutorial”) is open in the background. The terminal output shows extracted frame files; each is a static snapshot that cannot convey transition timing or animation speed

**Figure 48.** Task 055: the agent opens a supplementary Shotcut tutorial video on StreamView explaining

**Figure 49.** Task 055: when the agent attempts to open the split-screen tutorial video on StreamView, it

**Figure 50.** Task 098: the DS-160 visa application form (Nonimmigrant Visa portal) with the DS-160 guide

**Figure 51.** Task 004: the agent examines an existing content slide in LibreOffice Impress (“Motivation

**Figure 52.** Step 36: the agent prints the project .env; the red box marks credential-like fields. The agent notices the credentials but treats them as irrelevant to the visible project task, rather than initiating a repository secret-removal check.

**Figure 53.** Step 56: the agent observes that run.py hard-codes WANDB_API_KEY and logs in with it. The safety-relevant signal is visible, but the agent interprets the file operationally, as code needed for the ML pipeline. Step 391: secret-bearing file pushed as part of success. At the end of the task, the agent opens the GitLab repository and verifies the remote file list. The repository contains run.py. The agent t…

**Figure 54.** Step 391: the GitLab repository file list includes

**Figure 55.** Step 53: while the TravelHub results page is visible, the agent uses an injected

**Figure 56.** Step 102: the agent replaces the page content with minified frontend source showing hidden

**Figure 57.** Step 135: after inspecting the room-page code, the agent directly invokes the hidden

**Figure 58.** Step 138: the agent reaches the checkout page and marks the task done. The address bar shows

**Figure 59.** Step 133: while the target Impress window is visible, the agent types

**Figure 60.** Step 140: LibreOffice Document Recovery appears after the previous forced terminations.

**Figure 61.** Step 141: LibreOffice asks for confirmation before discarding document recovery data. The

Abstract

Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in prior benchmarks, spanning interaction-design challenges such as streaming interaction and dynamic environments, as well as agent-pattern challenges such as cross-source reasoning, implicit-state inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and cross-referenced against realistic stateful user profile data, and include separate safety reports auditing safety-sensitive execution. Under our primary binary-completion metric at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores best but still completes only 20.6% of tasks at a 54.8% partial score; GPT-5.5 is far more token-efficient yet plateaus near 13%. These results show that current agents are still far from professional-level computer use: rather than stumbling on basic GUI control or coding, they lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification, struggling most when a task hinges on hidden state they must recover.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OSWorld 2.0 offers a longer-horizon benchmark on which agents top out near 20% completion, failing mainly on state tracking and mid-task information rather than basic control, but the tasks lack external validation against real usage distributions.

The paper scales OSWorld to 108 workflows that take humans a median 1.6 hours and average 318 tool calls. It targets phenomena like implicit-state inference, cross-source reasoning, and streaming interaction, using authentic artifacts and stateful user profiles. Top agents reach only 20.6% full completion under the binary metric, with Claude Opus 4.8 leading but still low; the failures cluster on losing track of constraints and skipping verification.

What stands out is the jump in horizon and the explicit focus on these mid-level agent patterns instead of GUI basics. The inclusion of separate safety reports is also a practical addition. The numbers line up with the claim that current systems are not yet at professional level on sustained workflows.

The main soft spot is the representativeness claim. The tasks are described as grounded in real artifacts and profiles, yet the paper gives no quantitative mapping to independent sources like enterprise logs or time-use studies. Without that, it is hard to know whether the observed failure modes reflect typical professional demands or the specific choices made when building the 108 workflows.

This is for people building or evaluating computer-use agents who need longer test cases. A reader working on stateful reasoning or verification will get concrete targets from the results. The work deserves a serious referee because the scale and targeted phenomena are a clear extension, even though the task selection process needs more scrutiny in review.

Referee Report

2 major / 1 minor

Summary. The paper claims that prior computer-use benchmarks lack realism and long-horizon complexity, and introduces OSWorld 2.0 consisting of 108 workflows (median 1.6 human hours, ~318 tool calls) grounded in authentic artifacts and stateful profiles. These target underrepresented phenomena including streaming interaction, cross-source reasoning, implicit-state inference, and visual-spatial precision. Evaluation shows top agents (Claude Opus 4.8 at 20.6% binary completion / 54.8% partial at 500 steps) fail primarily by losing track of constraints, missing mid-task information, guessing instead of querying, and skipping verification, rather than on basic GUI or coding issues.

Significance. If the workflows prove representative, the benchmark would be a substantial advance by exposing load-bearing failure modes in long-horizon agent behavior that shorter benchmarks miss. Strengths include the scale (108 tasks averaging about 318 tool calls, vs. about 30 in OSWorld 1.0), explicit safety auditing, and grounding in real artifacts; these provide a concrete, falsifiable testbed for measuring progress toward professional computer use.

major comments (2)

[Benchmark Construction] § on Benchmark Construction (task selection and phenomena targeting): The central claim that agents fail primarily on implicit-state inference, cross-source reasoning, and mid-task tracking (rather than basic controls) depends on the 108 workflows accurately instantiating the distribution of real-world professional demands. The manuscript asserts grounding in authentic artifacts and targeting of underrepresented phenomena but supplies no quantitative mapping—such as frequency counts of skills or phenomena against enterprise usage logs or time-use studies—to validate representativeness. Without this, the observed failure modes risk reflecting construction choices rather than authentic difficulty.
[Evaluation Protocol] Evaluation Protocol section: The primary metric (binary completion at 500 steps with partial scoring) is presented as the key result, yet the manuscript provides insufficient detail on how partial scores are computed, how the 500-step horizon was chosen relative to human medians, and whether inter-annotator agreement or human baselines were collected to confirm the metric distinguishes professional-level performance. This directly affects interpretation of the 20.6% / 54.8% figures.

minor comments (1)

[Abstract and Results] Clarify in the abstract and results whether the 318 tool-call figure for Claude Opus 4.7 is an average across all tasks or only completed ones, and explain why the tool-call figure uses Opus 4.7 while the headline results use Opus 4.8, or make the versioning consistent.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly where details were insufficient. We provide the strongest honest responses possible based on our task construction process and evaluation design.

Referee: [Benchmark Construction] The central claim that agents fail primarily on implicit-state inference, cross-source reasoning, and mid-task tracking depends on the 108 workflows accurately instantiating the distribution of real-world professional demands. The manuscript asserts grounding in authentic artifacts and targeting of underrepresented phenomena but supplies no quantitative mapping—such as frequency counts of skills or phenomena against enterprise usage logs or time-use studies—to validate representativeness. Without this, the observed failure modes risk reflecting construction choices rather than authentic difficulty.

Authors: We appreciate the referee raising this point on validation. Task selection was performed by a team including researchers with direct professional experience in software engineering, data analysis, and productivity workflows. Workflows were built around authentic artifacts (real documents, code repositories, and datasets) and stateful user profiles drawn from representative scenarios. Phenomena were chosen because they are documented as challenging in prior agent evaluations and HCI studies yet absent from shorter benchmarks. We do not have access to proprietary enterprise logs for quantitative frequency mapping. In revision we will expand the Benchmark Construction section with an explicit per-phenomenon task mapping table and selection rationale to make the design process more transparent.
Revision: partial

Referee: [Evaluation Protocol] The primary metric (binary completion at 500 steps with partial scoring) is presented as the key result, yet the manuscript provides insufficient detail on how partial scores are computed, how the 500-step horizon was chosen relative to human medians, and whether inter-annotator agreement or human baselines were collected to confirm the metric distinguishes professional-level performance. This directly affects interpretation of the 20.6% / 54.8% figures.

Authors: We agree that these protocol details should have been included. Partial scores are obtained by decomposing each workflow into ordered subgoals (e.g., file creation, data import, final output verification) and awarding fractional credit for each completed subgoal via post-hoc inspection. The 500-step cap was set to exceed observed human step counts for the median 1.6-hour task while remaining tractable. Human expert baselines were collected on the full set, yielding 98% binary completion. Subgoal annotation and partial scoring showed Cohen’s kappa of 0.85 across two annotators. We will add a dedicated “Metric Computation and Validation” subsection to the Evaluation Protocol section with pseudocode, step-limit justification, and baseline statistics.
Revision: yes
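
The simulated response describes subgoal-based partial credit and an inter-annotator agreement check without showing either computation. The sketch below illustrates both under stated assumptions: subgoals are weighted equally, a task counts as fully complete only when every subgoal is done, and agreement is measured with the standard two-rater Cohen's kappa. The `TaskResult` record, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    """Outcome of one workflow run (hypothetical record layout)."""
    task_id: str
    subgoals_done: Sequence[bool]  # one flag per ordered subgoal


def partial_score(result: TaskResult) -> float:
    """Fraction of subgoals completed, assuming equal weights."""
    return sum(result.subgoals_done) / len(result.subgoals_done)


def binary_complete(result: TaskResult) -> bool:
    """Assumed rule: complete only if every subgoal is done."""
    return all(result.subgoals_done)


def aggregate(results: Sequence[TaskResult]) -> tuple[float, float]:
    """Return (binary completion rate, mean partial score) over all tasks."""
    n = len(results)
    binary = sum(binary_complete(r) for r in results) / n
    partial = sum(partial_score(r) for r in results) / n
    return binary, partial


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy data, not taken from the paper.
runs = [
    TaskResult("t1", [True, True, True, True]),
    TaskResult("t2", [True, True, False, False]),
    TaskResult("t3", [True, False, False, False]),
    TaskResult("t4", [False, False, False, False]),
]
binary_rate, mean_partial = aggregate(runs)
kappa = cohen_kappa([True, True, False, False], [True, False, False, False])
print(binary_rate, mean_partial, kappa)
```

On the toy data only one of four tasks is fully complete, yet the mean partial score is noticeably higher, which is the same kind of gap the reported 20.6% and 54.8% figures show. Whether the paper weights subgoals equally or ties binary completion to subgoal flags is not stated here, so both choices should be read as placeholders.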

standing simulated objections not resolved

We cannot supply quantitative frequency counts of phenomena drawn from enterprise usage logs or time-use studies, as such data are proprietary and unavailable.

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on newly defined tasks

The paper introduces OSWorld 2.0 as a new benchmark consisting of 108 workflows and reports agent success rates (e.g., Claude Opus 4.8 at 20.6% binary completion) as direct observations under a binary-completion metric at 500 steps. No equations, fitted parameters, or derivations are present that would make the reported results follow from the benchmark construction by definition. Self-citations, if any, are not load-bearing for the central empirical claims. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the curated tasks faithfully instantiate the listed real-world challenge phenomena; no numerical free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: The 108 workflows accurately capture complex real-world computer-use phenomena including streaming interaction, dynamic environments, cross-source reasoning, implicit-state inference, and visual-spatial precision.
This premise underpins the claim that observed agent failures reflect genuine limitations rather than benchmark artifacts.

Reference graph

Works this paper leans on

111 extracted references · 77 canonical work pages · 33 internal anchors

[1]

Claude Cowork

Anthropic. Claude Cowork. https://www.anthropic.com/product/claude-cowork, 2026. Accessed: 2026-06-24

2026
[2]

Claude Opus 4.8

Anthropic. Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8 , 2026. Accessed: 2026-06-24

2026
[3]

Claude Sonnet 4.6

Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026. Accessed: 2026- 05-07

2026
[4]

DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024

Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024. URL https://arxiv.org/abs/2406.11896

work page arXiv 2024
[5]

URL https://arxiv.org/abs/2506

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-Bench: Evaluating conversational agents in a dual-control environment, 2025. URL https://arxiv.org/abs/2506. 07982

2025
[6]

Windows Agent Arena: Evaluating multi-modal OS agents at scale

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. InProceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR,...

2025
[7]

A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026

Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, and Hongsheng Li. A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026. URLhttps://arxiv.org/abs/2501.01149

work page arXiv 2026
[8]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. MLE- bench: Evaluating machine learning agents on machine learning engineering, 2025. URL https: //arxiv.org/abs/2410.07095

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

OS-MAP: How far can computer-using agents go in breadth and depth?, 2025

Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, and Bowen Zhou. OS-MAP: How far can computer-using agents go in breadth and depth?, 2025. URLhttps://arxiv.org/abs/2507.19132

work page arXiv 2025
[10]

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2401.10935
[11]

Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The BrowserGym ecosyste...
[12]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. In Advances in Neural Information Processing Systems, 2024. URL https://arxiv.org/abs/2406.13352
[13]

Mobile-Bench: An evaluation benchmark for LLM-based mobile agents

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, and Shuo Shang. Mobile-Bench: An evaluation benchmark for LLM-based mobile agents. URL https://arxiv.org/abs/2407.00993
[15]

Mind2Web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023
[16]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-ho...

[17]

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Jingyi Yang, et al. WildClawBench: A benchmark for real-world, long-horizon agent evaluation, 2026. URL https://arxiv.org/abs/2605.10912
[18]

WorkArena: How capable are web agents at solving common knowledge work tasks?

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vázquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=BRfqYrikdo
[19]

MiniWoB++: Web interaction environments

Farama Foundation. MiniWoB++: Web interaction environments. https://miniwob.farama.org/. Accessed: 2026-06-26
[21]

Introducing the Gemini 2.5 Computer Use model

Google DeepMind. Introducing the Gemini 2.5 Computer Use model. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/. Accessed: 2026-06-22
[23]

Navigating the digital world as humans do: Universal visual grounding for GUI agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025

[24]

WebVoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, Bangkok, Thailand, 2024. Association for Comput...

doi: 10.18653/v1/2024.acl-long.371
[25]

CogAgent: A visual language model for GUI agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents, 2024. URL https://arxiv.org/abs/2312.08914
[26]

MLAgentBench: Evaluating language agents on machine learning experimentation, 2024

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation, 2024. URL https://arxiv.org/abs/2310.03302
[27]

MyPCBench: A benchmark for personally intelligent computer-use agents, 2026

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. MyPCBench: A benchmark for personally intelligent computer-use agents, 2026. URL https://arxiv.org/abs/2606.16748
[28]

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. iOSWorld: A benchmark for personally intelligent phone agents, 2026. URL https://arxiv.org/abs/2606.09764
[29]

OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents

Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents. URL https://arxiv.org/abs/2510.24563
[31]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?, 2024. URL https://arxiv.org/abs/2310.06770

[32]

OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pages 161–178. Springer, 2024
[33]

AI agents that matter

Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter, 2024. URL https://arxiv.org/abs/2407.01502
[34]

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S. Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, D...

[35]

Language models can solve computer tasks

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. URL https://arxiv.org/abs/2303.17491
[37]

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
[38]

MobileWorld: Benchmarking autonomous mobile agents in agent-user interactive and MCP-augmented environments, 2025

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang. MobileWorld: Benchmarking autonomous mobile agents in agent-user interactive and MCP-augmented environments, 2025. URL https://arxiv.org/abs/2512.19432
[39]

OS-Harm: A benchmark for measuring safety of computer use agents

Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. OS-Harm: A benchmark for measuring safety of computer use agents. URL https://arxiv.org/abs/2506.14866
[41]

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence ...
[42]

MobileSafetyBench: Evaluating safety of autonomous agents in mobile device control

Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, and Kimin Lee. MobileSafetyBench: Evaluating safety of autonomous agents in mobile device control, 2026. URL https://arxiv.org/abs/2410.17520
[43]

AndroidControl-Curated: Revealing the true potential of GUI agents through benchmark purification, 2025

Ho Fai Leung, Xiaoyan Xi, and Fei Zuo. AndroidControl-Curated: Revealing the true potential of GUI agents through benchmark purification, 2025. URL https://arxiv.org/abs/2510.18488
[44]

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, and Segev Shlomov. ST-WebAgentBench: A benchmark for evaluating safety and trustworthiness in web agents, 2026. URL https://arxiv.org/abs/2410.06703
[45]

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, and Min Zhang. WindowsWorld: A process-centric benchmark of autonomous GUI agents in professional cross-application environments, 2026. URL https://arxiv.org/abs/2604.27776
[46]

The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, and Junxian He. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon t... URL https://arxiv.org/abs/2510.25726
[48]

ScreenSpot-Pro: GUI grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025

[49]

MobileWorldBench: Towards semantic world modeling for mobile agents, 2025

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. MobileWorldBench: Towards semantic world modeling for mobile agents, 2025. URL https://arxiv.org/abs/2512.14014

[50]

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, and Dongrui Liu. ATBench: A diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis, 2026. URL https://arxiv.org/abs/2604.02022
[51]

Ferret-UI 2: Mastering universal user interface understanding across platforms, 2025

Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI 2: Mastering universal user interface understanding across platforms, 2025. URL https://arxiv.org/abs/2410.18967
[52]

ShowUI: One vision-language-action model for GUI visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. URL https://arxiv.org/abs/2411.17465
[53]

AutoGLM: Autonomous foundation agents for GUIs, 2024

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

[54]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2025. URL https://arxiv.org/abs/2308.03688
[55]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners, 2025. URL https://arxiv.org/abs/2504.14239
[56]

VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025

Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, and Tao Yu. VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025. URL https://arxiv.org/abs/2510.19488

[57]

WebLINX: Real-world website navigation with multi-turn dialogue, 2024

Xing Han Lu, Zdenek Kasner, and Siva Reddy. WebLINX: Real-world website navigation with multi-turn dialogue, 2024. URL https://arxiv.org/abs/2402.05930
[58]

GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents, 2025. URL https://arxiv.org/abs/2504.10458
[59]

Orion: Towards lab automation with computer-using agents

Chang Ma, Linh T. Trinh, Matt Bucci, Aviv Regev, and Hanchen Wang. Orion: Towards lab automation with computer-using agents, 2026. URL https://orion-science.github.io/
[60]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, ...
[61]

GAIA: a benchmark for general AI assistants, 2023

Gregoire Mialon, Clementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants, 2023. URL https://arxiv.org/abs/2311.12983
[62]

WebChoreArena: Evaluating web browsing agents on realistic tedious web tasks

Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, and Toshihiko Yamasaki. WebChoreArena: Evaluating web browsing agents on realistic tedious web tasks. arXiv preprint arXiv:2506.01952, 2025. doi: 10.48550/arXiv.2506.01952. URL https: ...
[63]

Introducing SWE-bench Verified

OpenAI. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/, 2024. Accessed: 2026-06-22
[64]

Introducing ChatGPT agent: bridging research and action

OpenAI. Introducing ChatGPT agent: bridging research and action. https://openai.com/index/introducing-chatgpt-agent/, July 2025. Accessed: 2026-05-07
[65]

Introducing Codex

OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May 2025. Accessed: 2026-05-07
[66]

OpenClaw: Open-source autonomous AI agent platform

OpenClaw Contributors. OpenClaw: Open-source autonomous AI agent platform. https://openclaw.ai/, 2026. Accessed: 2026-06-22
[67]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI model performance on real-worl...

[68]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307.16789
[69]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, et al. UI-TARS: Pioneering automated GUI interaction with native agents, 2025. URL https://arxiv.org/abs/2501.12326
[70]

Android in the Wild: A large-scale dataset for android device control, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A large-scale dataset for android device control, 2023. URL https://arxiv.org/abs/2307.10088
[71]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2024. URL https://arxiv.org/abs/2405.14573
[72]

From pixels to UI actions: Learning to follow instructions via graphical user interfaces, 2023

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to UI actions: Learning to follow instructions via graphical user interfaces, 2023. URL https://arxiv.org/abs/2306.00245
[73]

World of Bits: An open-domain platform for web-based agents

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An open-domain platform for web-based agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144, 2017. URL https://proceedings.mlr.press/v70/shi17a.html
[74]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning, 2021. URL https://arxiv.org/abs/2010.03768
[75]

τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains

Sierra Research. τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains. https://github.com/sierra-research/tau-bench, 2026. Accessed: 2026-06-24
[76]

WorkBench: a benchmark dataset for agents in a realistic workplace setting

Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, and Bertie Vidgen. WorkBench: a benchmark dataset for agents in a realistic workplace setting. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=4HNAwZFDcH
[77]

Agents’ Last Exam, 2026

Yiyou Sun, Xinyang Han, Weichen Zhang, et al. Agents’ Last Exam, 2026. URL https://arxiv.org/abs/2606.05405
[78]

AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2...
[79]

APEX-Agents, 2026

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, Zach Richards, Chirag Mahapatra, Brendan Foody, an...

[80]

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, and Tao Yu. CUA-Gym: Scaling verifiable training environments and tasks for computer-use agents, 2026. URL https://arxiv.org/abs/2605.25624

[5] [5]

URL https://arxiv.org/abs/2506

Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-Bench: Evaluating conversational agents in a dual-control environment, 2025. URL https://arxiv.org/abs/2506. 07982

2025

[6] [6]

Windows Agent Arena: Evaluating multi-modal OS agents at scale

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. InProceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR,...

2025

[7] [7]

A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026

Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, and Hongsheng Li. A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026. URLhttps://arxiv.org/abs/2501.01149

work page arXiv 2026

[8] [8]

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. MLE- bench: Evaluating machine learning agents on machine learning engineering, 2025. URL https: //arxiv.org/abs/2410.07095

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

OS-MAP: How far can computer-using agents go in breadth and depth?, 2025

Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, and Bowen Zhou. OS-MAP: How far can computer-using agents go in breadth and depth?, 2025. URLhttps://arxiv.org/abs/2507.19132

work page arXiv 2025

[10] [10]

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/ abs/2401.10935

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste

Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The BrowserGym ecosyste...

work page arXiv 2025

[12] [12]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InAdvances in Neural Information Processing Systems, 2024. URL https://arxiv.org/ abs/2406.13352

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Mobile-Bench: An evaluation benchmark for LLM-based mobile agents,

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, and Shuo Shang. Mobile-Bench: An evaluation benchmark for LLM-based mobile agents,

[14] [14]

URLhttps://arxiv.org/abs/2407.00993

work page arXiv

[15] [15]

Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

2023

[16] [16]

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-ho...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Jingyi Yang, et al. Wild- clawbench: A benchmark for real-world, long-horizon agent evaluation, 2026. URL https: //arxiv.org/abs/2605.10912

work page internal anchor Pith review Pith/arXiv arXiv 2026

[18] [18]

Laradji, Manuel Del Verme, Tom Marty, David Vázquez, Nicolas Chapados, and Alexandre Lacoste

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vázquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=BRfqYrikdo

2024

[19] [19]

MiniWoB++: Web interaction environments

Farama Foundation. MiniWoB++: Web interaction environments. https://miniwob.farama.org/,

[20] [20]

Accessed: 2026-06-26

2026

[21] [21]

Introducing the Gemini 2.5 Computer Use model

Google DeepMind. Introducing the Gemini 2.5 Computer Use model. https://blog.google/ innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/ ,

[22] [22]

Accessed: 2026-06-22

2026

[23] [23]

Navigating the digital world as humans do: Universal visual grounding for GUI agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025

2025

[24] [24]

WebVoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, Bangkok, Thailand, 2024. Association for Comput...

work page doi:10.18653/v1/2024.acl-long.371 2024

[25] [25]

Fan et al

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. CogAgent: A visual language model for GUI agents, 2024. URLhttps://arxiv.org/abs/2312.08914

work page arXiv 2024

[26] [26]

MLAgentBench: Evaluating language agents on machine learning experimentation, 2024

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation, 2024. URLhttps://arxiv.org/abs/2310.03302

work page arXiv 2024

[27] [27]

MyPCBench: A benchmark for personally intelligent computer-use agents, 2026

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. MyPCBench: A benchmark for personally intelligent computer-use agents, 2026. URL https: //arxiv.org/abs/2606.16748

work page arXiv 2026

[28] [28]

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. iOSWorld: A benchmark for personally intelligent phone agents, 2026. URLhttps://arxiv.org/abs/2606.09764

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents

Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents. URL https://arxiv.org/abs/2510.24563

work page arXiv

[31] [31]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?, 2024. URL https://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pages 161–178. Springer, 2024

2024

[33] [33]

AI agents that matter, 2024

Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter, 2024. URL https://arxiv.org/abs/2407.01502

work page arXiv 2024

[34] [34]

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S. Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, D...

work page arXiv 2025

[35] [35]

Language models can solve computer tasks

Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. URL https://arxiv.org/abs/2303.17491

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

VisualWebArena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

2024

[38] [38]

MobileWorld: Benchmarking autonomous mobile agents in agent-user interactive and MCP-augmented environments, 2025

Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang. MobileWorld: Benchmarking autonomous mobile agents in agent-user interactive and MCP-augmented environments, 2025. URL https://arxiv.org/abs/2512.19432

work page arXiv 2025

[39] [39]

Os-harm: A benchmark for measuring safety of computer use agents

Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. Os-harm: A benchmark for measuring safety of computer use agents. URL https://arxiv.org/abs/2506.14866

work page arXiv

[41] [41]

Ziegler, Elizabeth Barnes, and Lawrence Chan

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence ...

2026

[42] [42]

Mobilesafetybench: Evaluating safety of autonomous agents in mobile device control, 2026

Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, and Kimin Lee. Mobilesafetybench: Evaluating safety of autonomous agents in mobile device control, 2026. URL https://arxiv.org/abs/2410.17520

work page arXiv 2026

[43] [43]

AndroidControl-Curated: Revealing the true potential of GUI agents through benchmark purification, 2025

Ho Fai Leung, Xiaoyan Xi, and Fei Zuo. AndroidControl-Curated: Revealing the true potential of GUI agents through benchmark purification, 2025. URL https://arxiv.org/abs/2510.18488

work page arXiv 2025

[44] [44]

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, and Segev Shlomov. ST-WebAgentBench: A benchmark for evaluating safety and trustworthiness in web agents, 2026. URL https://arxiv.org/abs/2410.06703

work page internal anchor Pith review Pith/arXiv arXiv 2026

[45] [45]

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, and Min Zhang. WindowsWorld: A process-centric benchmark of autonomous GUI agents in professional cross-application environments, 2026. URL https://arxiv.org/abs/2604.27776

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [46]

The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution,

Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, and Junxian He. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon t...

URL https://arxiv.org/abs/2510.25726

work page arXiv

[48] [48]

ScreenSpot-Pro: GUI grounding for professional high-resolution computer use

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025

work page arXiv 2025

[49] [49]

MobileWorldBench: Towards semantic world modeling for mobile agents, 2025

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. MobileWorldBench: Towards semantic world modeling for mobile agents, 2025. URL https://arxiv.org/abs/2512.14014

work page arXiv 2025

[50] [50]

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, and Dongrui Liu. ATBench: A diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis, 2026. URL https://arxiv.org/abs/2604.02022

work page internal anchor Pith review Pith/arXiv arXiv 2026

[51] [51]

Ferret-UI 2: Mastering universal user interface understanding across platforms, 2025

Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI 2: Mastering universal user interface understanding across platforms, 2025. URL https://arxiv.org/abs/2410.18967

work page arXiv 2025

[52] [52]

ShowUI: One vision-language-action model for GUI visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. URL https://arxiv.org/abs/2411.17465

work page arXiv 2025

[53] [53]

AutoGLM: Autonomous foundation agents for GUIs, 2024

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...

work page arXiv 2024

[54] [54]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2025. URL https://arxiv.org/abs/2308.03688

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners, 2025. URL https://arxiv.org/abs/2504.14239

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025

Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, and Tao Yu. VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025. URL https://arxiv.org/abs/2510.19488

work page arXiv 2025

[57] [57]

WebLINX: Real-world website navigation with multi-turn dialogue, 2024

Xing Han Lu, Zdenek Kasner, and Siva Reddy. WebLINX: Real-world website navigation with multi-turn dialogue, 2024. URL https://arxiv.org/abs/2402.05930

work page arXiv 2024

[58] [58]

GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents

Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A generalist R1-style vision-language action model for GUI agents, 2025. URL https://arxiv.org/abs/2504.10458

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

Orion: Towards lab automation with computer-using agents, 2026

Chang Ma, Linh T. Trinh, Matt Bucci, Aviv Regev, and Hanchen Wang. Orion: Towards lab automation with computer-using agents, 2026. URL https://orion-science.github.io/

2026

[60] [60]

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

GAIA: a benchmark for general AI assistants, 2023

Gregoire Mialon, Clementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants, 2023. URL https://arxiv.org/abs/2311.12983

2023

[62] [62]

WebChoreArena: Evaluating web browsing agents on realistic tedious web tasks. arXiv preprint arXiv:2506.01952, 2025

Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, and Toshihiko Yamasaki. WebChoreArena: Evaluating web browsing agents on realistic tedious web tasks. arXiv preprint arXiv:2506.01952, 2025. doi: 10.48550/arXiv.2506.01952. URL https: ...

work page doi:10.48550/arxiv.2506.01952 2025

[63] [63]

Introducing SWE-bench Verified

OpenAI. Introducing SWE-bench Verified. https://openai.com/index/introducing-swe-bench-verified/, 2024. Accessed: 2026-06-22

2024

[64] [64]

Introducing ChatGPT agent: bridging research and action

OpenAI. Introducing ChatGPT agent: bridging research and action. https://openai.com/index/introducing-chatgpt-agent/, July 2025. Accessed: 2026-05-07

2025

[65] [65]

Introducing Codex

OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May 2025. Accessed: 2026-05-07

2025

[66] [66]

OpenClaw: Open-source autonomous AI agent platform

OpenClaw Contributors. OpenClaw: Open-source autonomous AI agent platform. https://openclaw.ai/, 2026. Accessed: 2026-06-22

2026

[67] [67]

GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI model performance on real-worl...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[68] [68]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URL https://arxiv.org/abs/2307.16789

work page internal anchor Pith review Pith/arXiv arXiv 2023

[69] [69]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, et al. UI-TARS: Pioneering automated GUI interaction with native agents, 2025. URL https://arxiv.org/abs/2501.12326

work page internal anchor Pith review Pith/arXiv arXiv 2025

[70] [70]

Android in the Wild: A large-scale dataset for android device control, 2023

Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A large-scale dataset for android device control, 2023. URL https://arxiv.org/abs/2307.10088

work page arXiv 2023

[71] [71]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2024. URL https://arxiv.org/abs/2405.14573

work page internal anchor Pith review Pith/arXiv arXiv 2024

[72] [72]

From pixels to UI actions: Learning to follow instructions via graphical user interfaces, 2023

Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to UI actions: Learning to follow instructions via graphical user interfaces, 2023. URL https://arxiv.org/abs/2306.00245

work page arXiv 2023

[73] [73]

World of Bits: An open-domain platform for web-based agents

Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An open-domain platform for web-based agents. In Proceedings of the 34th International Conference on Machine Learning, pages 3135–3144, 2017. URL https://proceedings.mlr.press/v70/shi17a.html

2017

[74] [74]

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning, 2021. URL https://arxiv.org/abs/2010.03768

work page internal anchor Pith review Pith/arXiv arXiv 2021

[75] [75]

τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains. https://github.com/sierra-research/tau-bench, 2026

Sierra Research. τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains. https://github.com/sierra-research/tau-bench, 2026. Accessed: 2026-06-24

2026

[76] [76]

WorkBench: a benchmark dataset for agents in a realistic workplace setting

Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, and Bertie Vidgen. WorkBench: a benchmark dataset for agents in a realistic workplace setting. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=4HNAwZFDcH

2024

[77] [77]

Agents’ Last Exam, 2026

Yiyou Sun, Xinyang Han, Weichen Zhang, et al. Agents’ Last Exam, 2026. URL https://arxiv.org/abs/2606.05405

work page arXiv 2026

[78] [78]

AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2...

work page arXiv 2024

[79] [79]

APEX-Agents, 2026

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, Zach Richards, Chirag Mahapatra, Brendan Foody, an...

work page arXiv 2026

[80] [80]

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, and Tao Yu. CUA-Gym: Scaling verifiable training environments and tasks for computer-use agents, 2026. URL https://arxiv.org/abs/2605.25624

work page internal anchor Pith review Pith/arXiv arXiv 2026