pith. machine review for the scientific record.

arxiv: 2604.21375 · v2 · submitted 2026-04-23 · 💻 cs.CL · cs.AI · cs.SE

Recognition: unknown

VLAA-GUI: Knowing When to Stop, Recover, and Search. A Modular Framework for GUI Automation

Authors on Pith · no claims yet

Pith reviewed 2026-05-09 22:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.SE
keywords GUI agents · autonomous GUI automation · completeness verifier · loop breaker · search agent · OSWorld · WindowsAgentArena · agentic framework

The pith

VLAA-GUI equips GUI agents with a verifier for task completion, a loop breaker for recovery, and a search agent for new workflows to avoid early false stops and repetitive failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to fix two persistent failures in autonomous GUI agents: they often claim success without visual proof that the task is done, and they repeat the same ineffective actions in loops. VLAA-GUI adds three required pieces to any base agent: a Completeness Verifier that checks UI evidence before accepting a finish, a Loop Breaker that switches interaction modes or strategies after repeated failures or unchanged screens, and a Search Agent that pulls plain-text workflow instructions from online queries when the agent does not know what to do next. The authors show these additions raise success rates on Linux and Windows task suites and let several strong language-model backbones exceed human performance on one benchmark. A reader would care because reliable GUI agents could automate routine computer work without constant human correction or wasted computation. Ablation tests indicate the components help both strong and weaker backbones, with larger gains for weaker ones when step limits are generous.

Core claim

VLAA-GUI is a modular framework built around three mandatory components that tell an agent when to stop, recover, and search. The Completeness Verifier enforces UI-observable success criteria at every proposed finish step and rejects claims lacking direct visual evidence. The Loop Breaker applies multi-tier rules that change interaction mode after repeated failures, force strategy shifts after persistent screen states, and tie reflection to action changes. An on-demand Search Agent queries a capable LLM for unfamiliar workflows and returns results as plain text, while optional Coding and Grounding Agents are invoked only when needed. When added to five top backbones and tested on OSWorld and WindowsAgentArena, the framework achieves top performance on both Linux and Windows task suites.
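The stop-recover-search control flow in that claim can be sketched as a plain loop. Everything here is illustrative: `propose`, `execute`, `verify`, and `search` stand in for the Manager Agent's LLM-driven tools, and a simple gate on the finish claim stands in for the paper's agent-level cross-examination.

```python
def run_episode(propose, execute, verify, search, max_steps=15):
    """Minimal stop-recover-search skeleton (a sketch, not the paper's
    implementation; the real framework wires these tools through a
    Manager Agent with LLM prompts, not plain callables).

    propose(obs) -> action string; 'done()' claims completion.
    execute(action) -> next observation (the screen).
    verify(obs) -> True iff UI evidence confirms task completion.
    search(obs) -> plain-text workflow hint, appended to the observation.
    """
    obs = execute("start")
    for _ in range(max_steps):
        action = propose(obs)
        if action == "done()":
            if verify(obs):              # Completeness Verifier gate
                return "success", obs
            action = "search()"          # recover: do not accept the claim
        if action == "search()":         # Search Agent: plain-text workflow
            obs = obs + " | hint: " + search(obs)
            continue
        obs = execute(action)
    return "failure", obs
```

The key design point this sketch preserves is that a finish claim is never terminal on its own: it must pass the verifier, and a rejected claim is converted into a recovery step rather than an episode end.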

What carries the argument

The VLAA-GUI modular framework whose three core components (Completeness Verifier, Loop Breaker, and Search Agent) enforce stopping rules, recovery from repetition, and external knowledge lookup.

If this is right

  • All three components improve performance on both strong and weaker backbone models.
  • Weaker backbones gain more when the step budget is large enough to use the recovery and search tools.
  • The Loop Breaker cuts wasted steps by nearly half for models that tend to loop.
  • Three of the five tested backbones surpass the reported human performance level on OSWorld in a single pass.
  • The same framework works across both Linux and Windows task environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stop-recover-search pattern could be added to agents that control web browsers or mobile apps rather than desktop GUIs.
  • Making the search agent optional and text-based lowers the barrier for adding external knowledge without retraining the base model.
  • Strong emphasis on visible evidence of success may reduce the need for human review in deployed automation systems.
  • Weaker models paired with these tools might close the gap with stronger models on step-limited tasks.

Load-bearing premise

That reliable UI-observable success criteria exist for most real-world GUI tasks and can be checked by the verifier without rejecting valid completions or becoming overly conservative.

What would settle it

Run the Completeness Verifier on a new collection of GUI tasks whose end states have ambiguous visual signals and measure whether it falsely rejects more than a small fraction of actually completed runs.
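That experiment reduces to a simple metric over labeled runs. The sketch below assumes each run is tagged with a ground-truth completion label and the verifier's verdict; the function name and data shape are hypothetical.

```python
def false_rejection_rate(runs):
    """Fraction of truly completed runs that the verifier rejected.

    runs: list of (actually_completed: bool, verifier_accepted: bool)
    pairs. This is the overly-conservative failure mode the proposed
    experiment would measure (illustrative metric sketch, not from the
    paper).
    """
    completed = [r for r in runs if r[0]]
    if not completed:
        return 0.0
    rejected = sum(1 for done, accepted in completed if not accepted)
    return rejected / len(completed)
```

A low value on tasks with ambiguous end-state visuals would support the load-bearing premise; a high value would mean the verifier trades early stopping for spurious rejections.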

Figures

Figures reproduced from arXiv: 2604.21375 by Alvaro A. Cardenas, Caiming Xiong, Cihang Xie, Haoqin Tu, Haoyue Dai, Huaxiu Yao, Nancy Lau, Qijun Han, Ran Xu, Yiyang Zhou, Yuhui Xu, Yuyin Zhou, Zeyu Zheng, Zijun Wang.

Figure 1: Advantages of VLAA-GUI. (a) VLAA-GUI (w/ Opus 4.6) achieves the best results (77.5%) on OSWorld-Verified [67] and surpasses human performance with one pass. (b) The Completeness Verifier mitigates the early-stopping issue. (c) The proposed Loop Breaker eases the repetitive-looping problem and saves wasted action steps in GUI tasks.
Figure 2: Overview of VLAA-GUI. The Manager Agent decides the overall plan and provides concrete actions to the environment. Two mandatory tools, the Loop Breaker and the Completeness Verifier (Verifier), are called after every action. Three on-demand tools, the Search Agent (Searcher), Coding Agent (Coder), and Grounding Agent (Grounder), are available for the Manager to deploy as needed.
Figure 3: The Completeness Verifier (w/ Verifier) reduces false-completion rates (False Done / Failed, False Done / All) and the Loop Breaker reduces loop incidence (Loop / Failed, Loop / All) and the wasted-steps ratio on OSWorld.
Figure 4: Impact of the Completeness Verifier and Search Agent across step budgets on OSWorld. Sonnet 4.6 benefits consistently at all budgets, while Gemini 3 Flash gains only at relaxed budgets: tool calls consume actions that less efficient models cannot afford under tight constraints.
Figure 5: Case study on OSWorld: changing the slide-number color to red in LibreOffice Impress. The agent encounters multiple failures before final completion. The Completeness Verifier rejects two premature done() calls, and the Search Agent provides critical procedural knowledge about multi-master-slide editing.
read the original abstract

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step -- with an agent-level verifier that cross-examines completion claims with decision rules, rejecting those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand when required. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, 4.6 and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
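The Loop Breaker's multi-tier filtering described in the abstract can be sketched as a small state machine over (action, screen-state) observations. The thresholds, trigger names, and reset behavior below are assumptions for illustration, not the paper's actual rules.

```python
class LoopBreaker:
    """Hedged sketch of multi-tier loop filtering.

    Tier 1: the same action producing the same screen repeatedly
    triggers an interaction-mode switch (e.g. mouse -> keyboard).
    Tier 2: the same screen state recurring many times, even under
    different actions, forces a strategy change. All thresholds are
    illustrative, not the paper's.
    """

    def __init__(self, fail_patience=3, recur_patience=5):
        self.fail_patience = fail_patience
        self.recur_patience = recur_patience
        self.last = None            # previous (action, screen_hash)
        self.repeats = 0
        self.screen_counts = {}

    def observe(self, action, screen_hash):
        # Tier 1: identical action with an unchanged resulting screen.
        if (action, screen_hash) == self.last:
            self.repeats += 1
        else:
            self.repeats = 0
        self.last = (action, screen_hash)

        # Tier 2: persistent screen-state recurrence across actions.
        self.screen_counts[screen_hash] = (
            self.screen_counts.get(screen_hash, 0) + 1
        )

        if self.repeats >= self.fail_patience:
            self.repeats = 0
            return "switch_mode"
        if self.screen_counts[screen_hash] >= self.recur_patience:
            self.screen_counts[screen_hash] = 0
            return "change_strategy"
        return None                 # no intervention this step
```

The paper's third tier, binding reflection signals to strategy shifts, would sit on top of this: a reflection that does not change the proposed action could be escalated to `change_strategy` as well.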

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VLAA-GUI, a modular framework for GUI agents that addresses premature stopping and repetitive loops via three core components: a Completeness Verifier enforcing mandatory UI-observable success criteria with agent-level cross-examination, a Loop Breaker using multi-tier rules for mode switching and strategy changes, and an on-demand Search Agent for querying LLMs on unfamiliar workflows. It also incorporates optional Coding and Grounding Agents. Evaluations across five LLM backbones on OSWorld and WindowsAgentArena benchmarks report top results of 77.5% and 61.0%, with three backbones exceeding the 72.4% human baseline on OSWorld in single-pass settings; ablations indicate consistent gains from all components, with greater benefits for weaker backbones under sufficient step budgets and the Loop Breaker halving wasted steps for loop-prone models.

Significance. If the results hold under rigorous verification, the work provides a practical, backbone-agnostic engineering contribution to reliable GUI automation by directly targeting two common failure modes. Strengths include the multi-backbone evaluation on both Linux and Windows tasks, ablation studies demonstrating component-level improvements, and the explicit reduction in wasted steps, which together offer reproducible evidence for the framework's utility beyond single-model experiments.

major comments (3)
  1. [Abstract] Abstract and Evaluation section: The headline performance claims (77.5% on OSWorld, three backbones >72.4% human) and component benefits are presented without error bars, statistical significance tests, or quantitative failure-case breakdowns, which weakens the ability to assess whether gains are robust or sensitive to implementation choices.
  2. [Completeness Verifier] Completeness Verifier description: The claim that the verifier enforces 'UI-observable success criteria' and 'rejects those lacking direct visual evidence' via agent-level cross-examination is load-bearing for the early-stopping solution, yet the manuscript provides insufficient detail on exact decision rules, prompting templates, or thresholds; without this, it is unclear whether criteria are general or inadvertently benchmark-tuned, directly affecting comparability to human baselines and standard task-completion protocols.
  3. [Ablation studies] Ablation studies and Loop Breaker section: The statement that the Loop Breaker 'nearly halves wasted steps' and that weaker backbones benefit more 'when the step budget is sufficient' requires explicit reporting of the step budgets tested, per-component delta metrics, and the precise multi-tier filtering rules to substantiate the recovery and efficiency claims.
minor comments (2)
  1. [Abstract] Clarify the exact measurement protocol for the 72.4% human baseline on OSWorld to ensure apples-to-apples comparison with the single-pass agent results.
  2. [Search Agent] The integration details for the on-demand Search Agent (how queries are formulated and plain-text results are parsed into actions) could be expanded for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing honest responses and indicating revisions where the manuscript will be updated to improve clarity, reproducibility, and rigor.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Evaluation section: The headline performance claims (77.5% on OSWorld, three backbones >72.4% human) and component benefits are presented without error bars, statistical significance tests, or quantitative failure-case breakdowns, which weakens the ability to assess whether gains are robust or sensitive to implementation choices.

    Authors: We agree that the lack of error bars, statistical tests, and failure breakdowns limits assessment of robustness. In the revised manuscript, we will add error bars from repeated runs where feasible given computational costs, include a quantitative failure-case breakdown in the Evaluation section, and report statistical significance for key comparisons. We note that single-run evaluations are common in this domain due to expense, but the multi-backbone consistency provides supporting evidence; full multi-run statistics across all settings will be added where possible. revision: partial

  2. Referee: [Completeness Verifier] Completeness Verifier description: The claim that the verifier enforces 'UI-observable success criteria' and 'rejects those lacking direct visual evidence' via agent-level cross-examination is load-bearing for the early-stopping solution, yet the manuscript provides insufficient detail on exact decision rules, prompting templates, or thresholds; without this, it is unclear whether criteria are general or inadvertently benchmark-tuned, directly affecting comparability to human baselines and standard task-completion protocols.

    Authors: The referee is correct that insufficient detail on decision rules, prompts, and thresholds reduces clarity and raises questions about generality. We will expand the Completeness Verifier section in the revision to include the exact decision rules, full prompting templates for cross-examination, and thresholds. We will also clarify that the criteria rely on general UI-observable elements rather than benchmark-specific tuning, improving comparability to human baselines. revision: yes

  3. Referee: [Ablation studies] Ablation studies and Loop Breaker section: The statement that the Loop Breaker 'nearly halves wasted steps' and that weaker backbones benefit more 'when the step budget is sufficient' requires explicit reporting of the step budgets tested, per-component delta metrics, and the precise multi-tier filtering rules to substantiate the recovery and efficiency claims.

    Authors: We acknowledge that the current description lacks the required specificity on step budgets, deltas, and rules. In the revised manuscript, we will explicitly report the step budgets tested, add tables with per-component delta metrics, and detail the precise multi-tier filtering rules (including pseudocode or examples) in the Loop Breaker and Ablation sections to substantiate the claims on wasted steps and differential benefits. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical framework on external benchmarks

full rationale

The paper describes a modular empirical framework (VLAA-GUI) with three components for GUI agents, evaluated directly on public external benchmarks (OSWorld, WindowsAgentArena) using standard success rates and human baselines. No equations, derivations, fitted parameters, or self-referential reductions appear; performance numbers (e.g., 77.5% OSWorld) are reported as measured outcomes against independent backbones and tasks rather than constructed from prior author-defined quantities. Ablations compare component contributions on the same external setups without circular fitting. The work is self-contained against external benchmarks with no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The framework rests on the domain assumption that current LLMs can reliably perform reflection, search, and grounding when prompted appropriately; it introduces three new agent components whose effectiveness is demonstrated only within the paper's experiments.

axioms (1)
  • domain assumption LLMs possess sufficient search and reflection capabilities to power the Search Agent and Loop Breaker when given appropriate prompts
    Invoked in the description of the on-demand Search Agent and reflection signals in the Loop Breaker.
invented entities (3)
  • Completeness Verifier no independent evidence
    purpose: Enforce UI-observable success criteria with cross-examination of completion claims
    New mandatory component introduced to address early stopping.
  • Loop Breaker no independent evidence
    purpose: Apply multi-tier filtering to detect and escape repetitive action loops
    New mandatory component introduced to address repetitive failures.
  • Search Agent no independent evidence
    purpose: Query capable LLMs online for unfamiliar workflows and return plain-text results
    New on-demand component introduced for handling novel tasks.

pith-pipeline@v0.9.0 · 5673 in / 1543 out tokens · 63652 ms · 2026-05-09T22:21:40.267092+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

92 extracted references · 51 canonical work pages · 9 internal anchors

  1. [1]

    arXiv preprint arXiv:2410.08164 , year =

    Agashe,S.,Han,J.,Gan,S.,Yang,J.,Li,A.,Wang,X.E.:AgentS:Anopenagentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164 (2024)

  2. [2]

    Agent S2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

    Agashe, S., Wong, K., Tu, V., Yang, J., Li, A., Wang, X.E.: Agent S2: A com- positional generalist-specialist framework for computer use agents. arXiv preprint arXiv:2504.00906 (2025)

  3. [3]

    Let’s think in two steps: Mitigating agreement bias in mllms with self-grounded verification.arXiv preprint arXiv:2507.11662,

    Andrade, T., et al.: Self-grounded verification: Detecting hallucinations and bias in MLLM GUI agents. arXiv preprint arXiv:2507.11662 (2025)

  4. [4]

    anthropic

    Anthropic: Developing computer use.https : / / www . anthropic . com / news / developing-computer-use(2024)

  5. [5]

    Anthropic: Claude opus 4.5.https://www.anthropic.com/news/claude-opus-4- 5(2025)

  6. [6]

    anthropic

    Anthropic: Claude sonnet 4.6.https : / / www . anthropic . com / news / claude - sonnet-4-6(2025)

  7. [7]

    com/news/claude-sonnet-4-5

    Anthropic: Introducing claude sonnet 4.5 (Sep 2025),https://www.anthropic. com/news/claude-sonnet-4-5

  8. [8]

    Anthropic: Claude opus 4.6.https://www.anthropic.com/news/claude-opus-4- 6(2026)

  9. [9]

    Screenai: A vision-language model for ui and infographics understanding

    Baechler, G., Narayanan, S., et al.: ScreenAI: A vision-language model for UI and infographics understanding. arXiv preprint arXiv:2402.04615 (2024)

  10. [10]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

  11. [11]

    Bonatti, D

    Bonatti, R., Zhao, D., Bonacci, F., Dupont, D., Abdali, S., Li, Y., Lu, Y., Wagle, J., Koishida, K., Bucker, A., Jang, L., Hui, Z.: Windows agent arena: Evaluating multi-modal OS agents at scale. arXiv preprint arXiv:2409.08264 (2024)

  12. [12]

    ByteDance Seed Team: Seed1.8 model card: Towards generalized real-world agency (2025)

  13. [13]

    Cao, R., Lei, F., Wu, H., et al.: Spider2-V: How far are multimodal agents from au- tomating data science and engineering workflows? arXiv preprint arXiv:2407.10956 (2024)

  14. [14]

    Why Do Multi-Agent LLM Systems Fail?

    Cemri, M., et al.: Why do multi-agent LLM systems fail? a taxonomy of failure patterns in multi-agent LLM systems. arXiv preprint arXiv:2503.13657 (2025)

  15. [15]

    arXiv preprint arXiv:2501.01149 (2025)

    Chai, Y., et al.: A3: Android agent arena for mobile GUI agents. arXiv preprint arXiv:2501.01149 (2025)

  16. [16]

    Transactions on Machine Learning Research (2025)

    Chen, H., Tu, H., Wang, F., Liu, H., Tang, X., Du, X., Zhou, Y., Xie, C.: Sft or rl? an early investigation into training r1-like reasoning large vision-language models. Transactions on Machine Learning Research (2025)

  17. [17]

    arXiv preprint arXiv:2602.07008 (2026) VLAA-GUI 17

    Chen, R., Sun, S., Guo, X., Zhang, S., Liu, K., Liu, S., Wang, Z., Zhang, Q., Zhang, H., Cao, X.: Where not to learn: Prior-aligned training with subset-based attri- bution constraints for reliable decision-making. arXiv preprint arXiv:2602.07008 (2026) VLAA-GUI 17

  18. [18]

    Seeclick: Harnessing gui grounding for advanced visual gui agents.arXiv preprint arXiv:2401.10935, 2024

    Cheng, K., Sun, Q., Chu, Y., Xu, F., Li, Y., Zhang, J., Wu, Z.: SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935 (2024)

  19. [19]

    The BrowserGym ecosystem for web agent research.arXiv preprint arXiv:2412.05467,

    Chezelles, T.L.S.D., Acero, J., et al.: The BrowserGym ecosystem for web agent research. arXiv preprint arXiv:2412.05467 (2024)

  20. [20]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  21. [21]

    In: UIST

    Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., Kumar, R.: Rico: A mobile app dataset for building data-driven design applica- tions. In: UIST. pp. 845–854 (2017).https://doi.org/10.1145/3126594.3126651

  22. [22]

    In: NeurIPS (2024)

    Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y.: Mind2web: Towards a generalist agent for the web. In: NeurIPS (2024)

  23. [23]

    Workarena++: Towards compositional planning and reasoning-based common knowledge work tasks, 2025

    Drouin, A., Porlier, R., et al.: WorkArena++: Benchmarking web agents with agentic and procedural knowledge. arXiv preprint arXiv:2407.05291 (2024)

  24. [24]

    Drouin, A., Porlier, R., et al.: WorkArena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718 (2024)

  25. [25]

    Mano technical report

    Fu, T., Su, A., Zhao, C., Wang, H., Wu, M., Yu, Z., Hu, F., Shi, M., Dong, W., Wang, J., Chen, Y., Yu, R., Peng, S., Li, M., Huang, N., Wei, H., Yu, J., Xin, Y., Zhao, X., Gu, K., Jiang, P., Zhou, S., Wang, S.: Mano technical report. arXiv preprint arXiv:2509.17336 (2025)

  26. [26]

    The unreasonable effectiveness of scaling agents for computer use.arXiv preprint arXiv:2510.02250, 2025

    Gonzalez-Pumariega, G., Tu, V., Lee, C.L., Yang, J., Li, A., Wang, X.E.: The unreasonable effectiveness of scaling agents for computer use. arXiv preprint arXiv:2510.02250 (2025)

  27. [27]

    com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf(2025)

    Google DeepMind: Gemini 3 flash model card.https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf(2025)

  28. [28]

    Google DeepMind: Gemini 3 pro model card.https://storage.googleapis.com/ deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf(2025)

  29. [29]

    com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf(2026)

    Google DeepMind: Gemini 3.1 pro model card.https://storage.googleapis. com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf(2026)

  30. [30]

    A real-world WebAgent with planning, long context understanding, and program synthesis.arXiv preprint arXiv:2307.12856, 2023

    Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y., Eck, D., Faust, A.: A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856 (2024)

  31. [31]

    HCIII Team, Lenovo: Hippo agent.https://github.com/wadang/muscle- mem- agent(2026)

  32. [32]

    Webvoyager: Building an end-to-end web agent with large multimodal models,

    He, Z., Choi, J., et al.: WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919 (2024)

  33. [33]

    In: CVPR (2024)

    Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., Tang, J.: CogAgent: A visual language model for GUI agents. In: CVPR (2024)

  34. [34]

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov

    Kapoor, R., Lu, K., et al.: OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553 (2024)

  35. [35]

    Siegel, Nitya Nadgir, and Arvind Narayanan

    Kapoor, S., Stroebl, B., Siegel, Z.S., Rajkumar, N., Narayanan, A.: AI agents that matter. arXiv preprint arXiv:2407.01502 (2024)

  36. [36]

    Language models can solve computer tasks

    Kim, G., Petit, P., Toshev, A., Park, T., Zisserman, A., Mao, C., Ibarz, J., Batra, D., Wu, T., Tian, Y.: Language models can solve computer tasks. arXiv preprint arXiv:2303.17491 (2023)

  37. [37]

    Lee, T., Tu, H., Wong, C.H., Zheng, W., Zhou, Y., Mai, Y., Roberts, J.S., Ya- sunaga, M., Yao, H., Xie, C., et al.: Vhelm: A holistic evaluation of vision language 18 Q. Han, H. Tu et al. models. Advances in Neural Information Processing Systems37, 140632–140666 (2024)

  38. [38]

    A Zero-Shot Language Agent for Computer Control with Structured Reflection

    Li, T., Azizi, D., Papusha, I., Angelopoulos, A.N., Yang, D.: A zero-shot lan- guage agent for computer control with structured reflection. arXiv preprint arXiv:2310.08740 (2023)

  39. [39]

    In: ICLR (2024)

    Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., Cobbe, K.: Let’s verify step by step. In: ICLR (2024)

  40. [40]

    doi:10.48550/arXiv.2411.17465 , author =

    Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, W., Wang, L., Shou, M.Z.: ShowUI: One vision-language-action model for GUI visual agent. arXiv preprint arXiv:2411.17465 (2024)

  41. [41]

    Reinforcement learning on web interfaces using workflow-guided exploration.arXiv preprint arXiv:1802.08802, 2018

    Liu, E.Z., Guu, K., Pasupat, P., Shi, T., Liang, P.: Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802 (2018)

  42. [42]

    Autoglm: Autonomous foundation agents for guis.arXiv preprint arXiv:2411.00820, 2024

    Liu, X., Qin, B., Liang, D., et al.: AutoGLM: Autonomous foundation agents for GUIs. arXiv preprint arXiv:2411.00820 (2024)

  43. [43]

    H., Kasner, Z., and Reddy, S

    Lu, X., Meng, C., et al.: WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930 (2024)

  44. [44]

    In: NeurIPS (2023)

    Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K., Welleck, S., Yazdanbakhsh, A., Clark, P.: Self-refine: Iterative refinement with self-feedback. In: NeurIPS (2023)

  45. [45]

    Screenagent: A vision language model-driven computer control agent

    Niu, R., Sharma, Y., Goyal, P., Majumdar, A., et al.: ScreenAgent: A vision language model-driven computer control agent. arXiv preprint arXiv:2402.07945 (2024)

  46. [46]

    arXiv preprint arXiv:1602.02261 (2016)

    Nogueira, R., Cho, K.: End-to-end goal-driven web navigation. arXiv preprint arXiv:1602.02261 (2016)

  47. [47]

    com / index / introducing - operator(2025)

    OpenAI: Introducing operator.https : / / openai . com / index / introducing - operator(2025)

  48. [48]

    In: COLM (2024)

    Pan, J., Zhang, Y., Tomlin, N., Zhou, Y., Levine, S., Suhr, A.: Autonomous eval- uation and refinement of digital agents. In: COLM (2024)

  49. [49]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., Zhong, W., Li, K., Li, J., Li, C., Niu, M., Zhang, C., Zheng, Y., Qin, S., Lu, Q., Tang, R., Rajmohan, S., Lin, Q., Zhang, D.: UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326 (2025)

  50. [50]

    Android in the wild: A large-scale dataset for android device control

    Rawles, C., et al.: Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088 (2023)

  51. [51]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Rawles, C., Clinckemaillie, S., Chang, Y., Walber, J., Toyama, D.: AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573 (2024)

  52. [52]

    In: ICML

    Shi, T., Karpathy, A., Fan, L., Hernandez, J., Liang, P.: World of bits: An open- source platform for web-based agents. In: ICML. pp. 3135–3144 (2017)

  53. [53]

    In: NeurIPS (2023)

    Shinn,N.,Cassano,F.,Gopinath,A.,Narasimhan,K.,Yao,S.:Reflexion:Language agents with verbal reinforcement learning. In: NeurIPS (2023)

  54. [54]

    arXiv preprint arXiv:2401.13649 , year=

    Singh, A., Yang, J., Sriram, A., et al.: VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649 (2024)

  55. [55]

    arXiv preprint arXiv:2503.00yset (2025)

    Song, L., Dai, Y., Prabhu, V., Zhang, J., Shi, T., Li, L., Li, J., Savarese, S., Chen, Z., Zhao, J., Xu, R., Xiong, C.: CoAct-1: Computer-using agents with coding as actions. arXiv preprint arXiv:2503.00yset (2025)

  56. [56]

    Cradle: Empowering foundation agents towards general computer control,

    Tan, W., Zhang, W., et al.: Cradle: Empowering foundation agents towards general computer control. arXiv preprint arXiv:2403.03186 (2024) VLAA-GUI 19

  57. [57]

    Kimi K2.5: Visual Agentic Intelligence

    Team, K., Bai, T., Bai, Y., Bao, Y., Cai, S.H., Cao, Y., Charles, Y., Che, H.S., Chen, C., Chen, G., Chen, H., Chen, J., Chen, J., Chen, J., Chen, J., Chen, K., Chen, L., Chen, R., Chen, Y., Chen, Y., Chen, Y., Chen, Y., Chen, Y., Chen, Y., Chen, Y., Chen, Y., Chen, Z., Chen, Z., Cheng, D., Chu, M., Cui, J., Deng, J., Diao, M., Ding, H., Dong, M., Dong, M...

  58. [58]

    In: European Conference on Computer Vision

    Tu, H., Cui, C., Wang, Z., Zhou, Y., Zhao, B., Han, J., Zhou, W., Yao, H., Xie, C.: How many are in this image a safety evaluation benchmark for vision llms. In: European Conference on Computer Vision. pp. 37–55. Springer (2024)

  59. [59]

    In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

    Tu, H., Feng, W., Chen, H., Liu, H., Tang, X., Xie, C.: Vilbench: A suite for vision-language process reward modeling. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 6775–6790 (2025)

  60. [60]

    UiPath: Uipath screen agent.https://github.com/xlang- ai/OSWorld/tree/ main/mm_agents/uipath(2025)

  61. [61]

    Wang, H., Zou, H., Song, H., Feng, J., Fang, J., Lu, J., Liu, L., Luo, Q., Liang, S., Huang, S., Zhong, W., Ye, Y., Qin, Y., Xiong, Y., Song, Y., Wu, Z., Li, A., Li, B., Dun, C., Liu, C., Zan, D., Leng, F., Wang, H., Yu, H., Chen, H., Guo, H., Su, J., Huang, J., Shen, K., Shi, K., Yan, L., Zhao, P., Liu, P., Ye, Q., Zheng, 20 Q. Han, H. Tu et al. R., Xin,...

  62. [62]

    Wang, J., Mo, H., Ishii, E., et al.: Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158 (2024)

  63. [63]

    Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W.X., Wei, Z., Wen, J.R.: A survey on large language model based autonomous agents. Frontiers of Computer Science (2024)

  64. [64]

    Wang, X., Chen, Y., Yuan, L., Zhang, Y., Li, Y., Peng, H., Ji, H.: Executable code actions elicit better LLM agents. In: ICML (2024)

  65. [65]

    Wang, X., Wang, B., Lu, D., Yang, J., Xie, T., Wang, J., Deng, J., Guo, X., Xu, Y., Wu, C.H., Shen, Z., Li, Z., Li, R., Li, X., Chen, J., Zheng, B., et al.: OpenCUA: Open foundations for computer-use agents. arXiv preprint arXiv:2508.09123 (2025)

  66. [66]

    Wu, Z., Wu, Z., Xu, F., Wang, Y., Sun, Q., Jia, C., Cheng, K., Ding, Z., Chen, L., Liang, P.P., Qiao, Y.: OS-Atlas: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218 (2024)

  67. [67]

    Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., Yu, T.: OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In: NeurIPS (2024)

  68. [68]

    Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., Xiong, C.: Aguvis: Unified pure vision agents for autonomous GUI interaction. arXiv preprint arXiv:2412.04454 (2024)

  69. [69]

    Xue, T., Peng, C., Huang, M., Guo, L., Han, T., Wang, H., Wang, J., Zhang, X., Yang, X., Zhao, D., Ding, J., Ma, X., Xie, Y., Pei, P., Cai, X., Qiu, X.: EvoCUA: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876 (2026)

  70. [70]

    Yang, B., Jin, K., Wu, Z., Liu, Z., Sun, Q., Li, Z., Xie, J., Liu, Z., Xu, F., Cheng, K., Li, Q., Wang, Y., Qiao, Y., Wang, Z., Ding, Z.: OS-Symphony: A holistic framework for robust and generalist computer-using agent. arXiv preprint arXiv:2601.07779 (2026)

  71. [71]

    Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O.: SWE-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793 (2024)

  72. [72]

    Yang, P., Ci, H., Shou, M.Z.: macOSWorld: A multilingual interactive benchmark for GUI agents. arXiv preprint arXiv:2506.04135 (2025)

  73. [73]

    Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., et al.: GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791 (2025)

  74. [74]

    Yang, Y., Wang, Y., Li, D., Luo, Z., Chen, B., Huang, C., Li, J.: Aria-UI: Visual grounding for GUI instructions. arXiv preprint arXiv:2412.16256 (2024)

  75. [75]

    Yao, S., Chen, H., Yang, J., et al.: WebShop: Towards scalable real-world web interaction with grounded language agents. In: NeurIPS (2022)

  76. [76]

    Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. In: NeurIPS (2023)

  77. [77]

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: ICLR (2023)

  78. [78]

    Yoran, O., et al.: AssistantBench: Can web agents solve realistic and time-consuming tasks? arXiv preprint arXiv:2407.15711 (2024)

  79. [79]

    Zhang, C., He, S., Li, L., Li, S., Qian, Y., Wang, Y., Ma, M., Kang, Y., Lin, Q., Rajmohan, S., et al.: Large language model-brained GUI agents: A survey. arXiv preprint arXiv:2411.18279 (2024)

  80. [80]

    Zhang, C., Li, L., He, S., Zhang, X., Qiao, B., Qin, S., Ma, M., Kang, Y., Lin, Q., Rajmohan, S., et al.: UFO: A UI-focused agent for windows OS interaction. arXiv preprint arXiv:2402.07939 (2024)
