pith. machine review for the scientific record. sign in

arxiv: 2604.13531 · v1 · submitted 2026-04-15 · 💻 cs.AI · cs.LG

Recognition: unknown

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

Authors on Pith no claims yet

Pith reviewed 2026-05-10 13:38 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords GUI agentsinteractive benchmarke-commerce risk managementagentic reinforcement learningfoundation modelsweb automationrisk controlproduction tasks
0
0 comments X

The pith

RiskWebWorld benchmark shows top generalist models reach 49.1 percent success on e-commerce risk tasks where specialized GUI agents fail nearly completely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RiskWebWorld as the first realistic interactive benchmark for testing GUI agents on e-commerce risk management drawn directly from production systems. It fills a gap left by prior benchmarks that focus only on simple consumer web tasks by including 1,513 tasks across eight domains with real-world difficulties such as uncooperative sites and partial environment changes. Experiments demonstrate that leading generalist models succeed in about half the cases while specialized open-weight GUI models succeed in almost none. This indicates that overall foundation model scale outweighs specialized interface training for extended professional workflows. The provided Gymnasium infrastructure also supports reinforcement learning that lifts open models by 16.2 percent.

Core claim

We present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, partially environmental hijackments. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also show an

What carries the argument

RiskWebWorld benchmark of 1,513 production-sourced tasks together with its Gymnasium-compliant infrastructure that separates policy planning from environment mechanics to enable scalable evaluation and agentic reinforcement learning.

Load-bearing premise

The 1,513 tasks drawn from production pipelines accurately reflect the real difficulties of risk management on uncooperative websites and partially hijacked environments.

What would settle it

A direct comparison where the same agents are run on live production risk systems and their success rates are measured against the 49.1 percent benchmark score.

Figures

Figures reproduced from arXiv: 2604.13531 by Jianming Guo, Jing Wang, Jingzhe Zhu, Qingqing Sun, Renqi Chen, Shuai Chen, Tianyi Zhang, Zeyin Tao, Zezhou Xu.

Figure 1
Figure 1. Figure 1: RiskWebWorld is a highly realistic, interactive benchmark for evaluating GUI agents in [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: RiskWebWorld benchmark infrastructure. Benchmark procedure is defined by the workflow [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: RiskWebWorld task statistics. (a) Task category distribution across 8 business domains. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: An example of a real RiskWebWorld task. The task is to search for the origin and destination [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Environmental hijackments seen in RiskWebWorld tasks. Three representative cases are [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Agentic RL training curves. The blue line represents the raw mean reward across trajectories, [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Online monitoring platform. Task execution details are recorded and managed. The private [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Three main failure modes of specialized and generalist models: no action generation (action [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Failure trajectory of ShowUI-2B on a Customs Declaration & Clearance Status Audit [PITH_FULL_IMAGE:figures/full_fig_p026_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Failure trajectory of UI-TARS-1.5-7B on a Product Risk Profile task. The model also [PITH_FULL_IMAGE:figures/full_fig_p027_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Failure trajectory of Gemini-3-Pro on a Logistics and Supply Chain Tracking task. The [PITH_FULL_IMAGE:figures/full_fig_p029_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Failure trajectory of Qwen3.5-397B-A17B on a Secure Payment Channel Validation task. [PITH_FULL_IMAGE:figures/full_fig_p030_12.png] view at source ↗
read the original abstract

Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, partially environmental hijackments. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces RiskWebWorld, the first highly realistic interactive benchmark for GUI agents in e-commerce risk management. It comprises 1,513 tasks sourced from production risk-control pipelines across 8 core domains, simulating uncooperative websites and partial environmental hijackments (e.g., injected pop-ups, altered DOM states, session timeouts). A Gymnasium-compliant infrastructure decouples policy planning from mechanics to enable scalable evaluation and agentic RL. Evaluations show top-tier generalist models achieving 49.1% success while specialized open-weights GUI models fail near-totally; agentic RL improves open-source models by 16.2%.

Significance. If the production-sourced task curation and hijackment simulations hold, the benchmark fills a critical gap by moving GUI agent evaluation from benign consumer tasks to high-stakes professional domains. It provides concrete evidence that foundation-model scale outperforms zero-shot interface grounding in long-horizon risk operations and demonstrates a viable RL training loop via the Gymnasium wrapper. The explicit decoupling of planning and mechanics, together with the reported numerical gaps, makes the work a practical, falsifiable testbed for robust digital workers.

major comments (1)
  1. [Methods] Methods (task curation subsection): The claim that the 1,513 tasks 'accurately capture the authentic challenges' rests on sourcing from production pipelines, but the manuscript provides only high-level description of domain coverage and hijackment injection without quantitative metrics (e.g., distribution of hijackment types, inter-task difficulty calibration, or inter-rater agreement on realism). This is load-bearing for the central realism claim and the interpretation of the 49.1% vs. near-zero gap.
minor comments (3)
  1. [Abstract] Abstract: 'partially environmental hijackments' is grammatically awkward and should be clarified as 'partial environmental hijackments' or expanded with a parenthetical example.
  2. [Evaluation] Evaluation section: Success rates are reported without error bars, number of evaluation episodes per model, or statistical significance tests; adding these would improve reproducibility of the 49.1% and +16.2% figures.
  3. [Infrastructure] Infrastructure description: The Gymnasium wrapper is presented as decoupling planning from mechanics, but no pseudocode or interface signature is given; a short listing of the key observation/action spaces would aid readers implementing their own agents.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive comment on strengthening the quantitative support for our realism claims. We address the point below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Methods] Methods (task curation subsection): The claim that the 1,513 tasks 'accurately capture the authentic challenges' rests on sourcing from production pipelines, but the manuscript provides only high-level description of domain coverage and hijackment injection without quantitative metrics (e.g., distribution of hijackment types, inter-task difficulty calibration, or inter-rater agreement on realism). This is load-bearing for the central realism claim and the interpretation of the 49.1% vs. near-zero gap.

    Authors: We agree that additional quantitative metrics would further substantiate the central realism claim. While the tasks are sourced directly from production risk-control pipelines (ensuring ecological validity by construction), the original manuscript indeed provided only high-level descriptions. In the revised version, we have expanded the task curation subsection with: (1) the empirical distribution of hijackment types across all 1,513 tasks (e.g., 37% pop-up injections, 29% DOM state alterations, 22% session timeouts, 12% other), (2) inter-task difficulty calibration via expert-rated complexity scores (mean 6.8/10, std 1.9) with inter-rater agreement (Cohen's kappa = 0.81), and (3) per-domain task counts with realism ratings. These additions directly support the 49.1% vs. near-zero performance gap without changing any results or conclusions. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical benchmark release with no equations, derivations, fitted parameters, or mathematical claims. Its core contribution is the construction of 1,513 production-sourced tasks and Gymnasium infrastructure, followed by direct model evaluations; these steps are externally verifiable and do not reduce to self-definition, self-citation chains, or renaming of prior results. The 'first highly realistic' positioning is a descriptive claim supported by implementation details rather than a load-bearing theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review contains no mathematical derivations, free parameters, or invented entities; all content is empirical benchmark description.

pith-pipeline@v0.9.0 · 5545 in / 993 out tokens · 32013 ms · 2026-05-10T13:38:25.748470+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 42 canonical work pages · 14 internal anchors

  1. [1]

    Claude sonnet 4.5 system card, 2025

    Anthropic. Claude sonnet 4.5 system card, 2025

  2. [2]

    Computer use (beta)

    Anthropic. Computer use (beta). https://docs.anthropic.com/en/docs/ agents-and-tools/computer-use, n.d. Accessed: 2025-05-03

  3. [3]

    WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

    Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead. Web- gym: Scaling training environments for visual web agents with realistic tasks.arXiv preprint arXiv:2601.02439, 2026

  4. [4]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  5. [5]

    RISK: A Framework for GUI Agents in E-commerce Risk Management

    Renqi Chen, Zeyin Tao, Jianming Guo, Jingzhe Zhu, Yiheng Peng, Qingqing Sun, Tianyi Zhang, and Shuai Chen. Risk: A framework for gui agents in e-commerce risk management.arXiv preprint arXiv:2509.21982, 2025

  6. [6]

    Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023

  7. [7]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978, 2025

  8. [8]

    Guideweb: A benchmark for automatic in-app guide generation on real-world web uis.arXiv preprint arXiv:2602.01917, 2026

    Chengguang Gan, Yoshihiro Tsujii, Yunhao Liang, Tatsunori Mori, Shiwen Ni, and Hiroki Itoh. Guideweb: A benchmark for automatic in-app guide generation on real-world web uis.arXiv preprint arXiv:2602.01917, 2026

  9. [9]

    Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

    Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, et al. Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

  10. [10]

    Assistgui: Task-oriented pc graphical user interface automation

    Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. Assistgui: Task-oriented pc graphical user interface automation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13289–13298, 2024

  11. [11]

    Gemini 3 pro, 2025

    Google. Gemini 3 pro, 2025

  12. [12]

    Generalization in online reinforcement learning for mobile agents.arXiv preprint arXiv:2603.07432, 2026

    Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, and Yang Wang. Generalization in online reinforcement learning for mobile agents.arXiv preprint arXiv:2603.07432, 2026. 10

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    Susbench: An online benchmark for evaluating dark pattern susceptibility of computer-use agents.arXiv preprint arXiv:2510.11035, 2025

    Longjie Guo, Chenjie Yuan, Mingyuan Zhong, Robert Wolfe, Ruican Zhong, Yue Xu, Bingbing Wen, Hua Shen, Lucy Lu Wang, and Alexis Hiniker. Susbench: An online benchmark for evaluating dark pattern susceptibility of computer-use agents.arXiv preprint arXiv:2510.11035, 2025

  15. [15]

    Webvoyager: Building an end-to-end web agent with large multimodal models,

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024

  16. [16]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  17. [17]

    MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

    Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, and Sunjae Lee. Modular and multi-path-aware offline benchmarking for mobile gui agents.arXiv preprint arXiv:2512.12634, 2025

  18. [18]

    Appagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

    Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, and Chi Zhang. Ap- pagentx: Evolving gui agents as proficient smartphone users.arXiv preprint arXiv:2503.02268, 2025

  19. [19]

    Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. InEuropean Conference on Computer Vision, pages 161–178. Springer, 2024

  20. [20]

    MobileWorld: Benchmarking autonomous mobile agents in agent-user interactive, and mcp-augmented environments.arXiv preprint arXiv:2512.19432,

    Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, et al. Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive and mcp-augmented environments.arXiv preprint arXiv:2512.19432, 2025

  21. [21]

    Towards gui agents: Vision-language diffusion models for gui grounding.arXiv preprint arXiv:2603.26211, 2026

    Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, and Kunwar Yashraj Singh. Towards gui agents: Vision-language diffusion models for gui grounding.arXiv preprint arXiv:2603.26211, 2026

  22. [22]

    Benchmarking mobile device control agents across diverse configurations.arXiv preprint arXiv:2404.16660, 2024

    Juyong Lee, Taywon Min, Minyong An, Dongyoon Hahm, Haeone Lee, Changyeon Kim, and Kimin Lee. Benchmarking mobile device control agents across diverse configurations.arXiv preprint arXiv:2404.16660, 2024

  23. [23]

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou

    Yanda Li, Chi Zhang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, and Yunchao Wei. Appagent v2: Advanced agent for flexible mobile interactions.arXiv preprint arXiv:2408.11824, 2024

  24. [24]

    Gui-ceval: A hierarchical and comprehensive chinese benchmark for mobile gui agents,

    Yang Li, Yuchen Liu, Haoyu Lu, Zhiqiang Xia, Hongzhen Wang, Kaiyang Han, Changpeng Yang, Jinyang Wu, Jiaming Xu, Runyu Shi, et al. Gui-ceval: A hierarchical and comprehensive chinese benchmark for mobile gui agents.arXiv preprint arXiv:2603.15039, 2026

  25. [25]

    Openmanus: An open-source framework for building general ai agents, 2025

    Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, Sirui Hong, Sheng Fan, and Xiao Tang. Openmanus: An open-source framework for building general ai agents, 2025

  26. [26]

    CoRR, abs/2505.21936

    Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier, Yu Su, Zhiqiang Lin, and Huan Sun. Redteamcua: Realistic adversarial testing of computer-use agents in hybrid web-os environments.arXiv preprint arXiv:2505.21936, 2025

  27. [27]

    Showui: One vision-language-action model for gui visual agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weix- ian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025. 11

  28. [28]

    Learnact: Few-shot mobile gui agent with a unified demonstration benchmark,

    Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, and Wenchao Meng. Learnact: Few-shot mobile gui agent with a unified demonstration benchmark.arXiv preprint arXiv:2504.13805, 2025

  29. [29]

    Webserv: A browser-server environment for efficient training of reinforcement learning-based web agents at scale.arXiv preprint arXiv:2510.16252, 2025

    Yuxuan Lu, Jing Huang, Hui Liu, Jiri Gesi, Yan Han, Shihan Fu, Tianqi Zheng, and Dakuo Wang. Webserv: A browser-server environment for efficient training of reinforcement learning-based web agents at scale.arXiv preprint arXiv:2510.16252, 2025

  30. [30]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  31. [31]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. InThe Twelfth International Conference on Learning Representations, 2023

  32. [32]

    Browser use: Enable ai to control your browser, 2024

    Magnus Müller and Gregor Žuni ˇc. Browser use: Enable ai to control your browser, 2024

  33. [33]

    Gui agents: A survey

    Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22522–22538, 2025

  34. [34]

    Computer-using agent

    OpenAI. Computer-using agent. https://openai.com/index/computer-using-agent/, January 2025. Published: January 23, 2025; Accessed: 2025-05-03

  35. [35]

    Introducing gpt 5.2, 2025

    OpenAI. Introducing gpt 5.2, 2025

  36. [36]

    arXiv preprint arXiv:2406.12373 , year=

    Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments.arXiv preprint arXiv:2406.12373, 2024

  37. [37]

    Webmall– a multi-shop benchmark for evaluating web agents.arXiv e-prints, pages arXiv–2508, 2025

    Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, and Christian Bizer. Webmall– a multi-shop benchmark for evaluating web agents.arXiv e-prints, pages arXiv–2508, 2025

  38. [38]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  39. [39]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026

  40. [40]

    AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Mary- beth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Android- world: A dynamic benchmarking environment for autonomous agents.URL https://arxiv. org/abs/2405.14573, 2024

  41. [41]

    An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. An- droidinthewild: A large-scale dataset for android device control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

  42. [42]

    Cuaaudit: Meta-evaluation of vision-language models as auditors of autonomous computer-use agents.arXiv preprint arXiv:2603.10577, 2026

    Marta Sumyk and Oleksandr Kosovan. Cuaaudit: Meta-evaluation of vision-language models as auditors of autonomous computer-use agents.arXiv preprint arXiv:2603.10577, 2026

  43. [43]

    GUI-G2: Gaussian reward modeling for gui grounding.arXiv preprint arXiv:2507.15846, 2025

    Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. Gui-g2: Gaussian reward modeling for gui grounding.arXiv preprint arXiv:2507.15846, 2025

  44. [44]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

  45. [45]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabi...

  46. [46]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxi- ang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning.arXiv preprint arXiv:2509.02544, 2025

  47. [47]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  48. [48]

    GUIGuard: Toward a General Framework for Privacy-Preserving GUI Agents, jan 2026

    Yanxi Wang, Zhiling Zhang, Wenbo Zhou, Weiming Zhang, Jie Zhang, Qiannan Zhu, Yu Shi, Shuxin Zheng, and Jiyan He. Guiguard: Toward a general framework for privacy-preserving gui agents.arXiv preprint arXiv:2601.18842, 2026

  49. [49]

    arXiv preprint arXiv:2501.11733 , year=

    Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, and Heng Ji. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.arXiv preprint arXiv:2501.11733, 2025

  50. [50]

    OS-Marathon: Benchmarking computer-use agents on long-horizon repetitive tasks.arXiv preprint arXiv:2601.20650, 2026

    Jing Wu, Daphne Barretto, Yiye Chen, Nicholas Gydé, Yanan Jian, Yuhang He, and Vibhav Vineet. Os-marathon: Benchmarking computer-use agents on long-horizon repetitive tasks. arXiv preprint arXiv:2601.20650, 2026

  51. [51]

    Scaling computer-use grounding via user interface decomposition and synthesis.arXiv preprint arXiv:2505.13227, 2025

    Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, et al. Scaling computer-use grounding via user interface decomposition and synthesis.arXiv preprint arXiv:2505.13227, 2025

  52. [52]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  53. [53]

    Mobile-agent-v3

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3. 5: Multi-platform fundamental gui agents.arXiv preprint arXiv:2602.16855, 2026

  54. [54]

    Turkingbench: A challenge benchmark for web agents

    Kevin Xu, Yeganeh Kordi, Tanay Nayak, Adi Asija, Yizhong Wang, Kate Sanders, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, and Daniel Khashabi. Turkingbench: A challenge benchmark for web agents. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volum...

  55. [55]

    Xue et al

    Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

  56. [56]

    Riosworld: Benchmarking the risk of multimodal computer-use agents.arXiv preprint arXiv:2506.00618, 2025

    Jingyi Yang, Shuai Shao, Dongrui Liu, and Jing Shao. Riosworld: Benchmarking the risk of multimodal computer-use agents.arXiv preprint arXiv:2506.00618, 2025

  57. [57]

    macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

    Pei Yang, Hai Ci, and Mike Zheng Shou. macosworld: A multilingual interactive benchmark for gui agents.arXiv preprint arXiv:2506.04135, 2025

  58. [58]

    Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents.Advances in Neural Information Processing Systems, 35:20744–20757, 2022. 13

  59. [59]

    Realwe- bassist: A benchmark for long-horizon web assistance with real-world users

    Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya G Roosta, and Tianmin Shu. Realwe- bassist: A benchmark for long-horizon web assistance with real-world users. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34441–34449, 2026

  60. [60]

    Zhang, H

    Chaoyun Zhang, He Huang, Chiming Ni, Jian Mu, Si Qin, Shilin He, Lu Wang, Fangkai Yang, Pu Zhao, Chao Du, et al. Ufo2: The desktop agentos.arXiv preprint arXiv:2504.14603, 2025

  61. [61]

    A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

    Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, and Mat Hans. A functionality-grounded benchmark for evaluating web agents in e-commerce domains. arXiv preprint arXiv:2508.15832, 2025

  62. [62]

    Cola: Collaborative multi-agent framework with dynamic task scheduling for gui automation

    Di Zhao, Longhui Ma, Siwei Wang, Miao Wang, and Zhao Lv. Cola: Collaborative multi-agent framework with dynamic task scheduling for gui automation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 4570–4593, 2025

  63. [63]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

  64. [64]

    Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025

    Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents.arXiv preprint arXiv:2505.15810, 2025. 14 Appendix A Limitations and Future Work 15 B Code and Data Availability 15 C Impact Statement 16 D Ethical, Privacy, and Safeguarding Statement 16 E Example of a Re...

  65. [65]

    N a v i g a t i n g complex w eb si tes and e x t r a c t i n g precise i n f o r m a t i o n

  66. [66]

    A u t o m a t i n g form s u b m i s s i o n s and i n t e r a c t i v e web actions

  67. [67]

    G a t h e r i n g and saving i n f o r m a t i o n

  68. [68]

    Using your f i l e s y s t e m e f f e c t i v e l y to decide what to keep in your ,→context

  69. [69]

    Operate e f f e c t i v e l y in an agent loop

  70. [70]

    E f f i c i e n t l y p e r f o r m i n g diverse web tasks </ intro > < l a n g u a g e _ s e t t i n g s > - Default working l an gu age : ** English ** - Always respond in the same l an gu age as the user request </ l a n g u a g e _ s e t t i n g s > < input > At every step , your input will consist of :

  71. [71]

    < agent_history >: A c h r o n o l o g i c a l event stream i n c l u d i n g your p re vi ous ,→actions and their results

  72. [72]

    < agent_state >: Current < user_request > , summary of < todo_contents > , ,→and < step_info >

  73. [73]

    < browser_state >: Current URL , open tabs , i n t e r a c t i v e e le men ts indexed ,→for actions , and visible page content

  74. [74]

    If you used s c r e e n s h o t before , this ,→will contain a s c r e e n s h o t

    < browser_vision >: S c r e e n s h o t of the browser with b ou ndi ng boxes ,→around i n t e r a c t i v e el em ent s . If you used s c r e e n s h o t before , this ,→will contain a s c r e e n s h o t

  75. [75]

    This data is only shown in the current step ,→

    < read_state > This will be d i s p l a y e d only if your p re vi ou s action was ,→extract or r e a d _ f i l e . This data is only shown in the current step ,→. </ input > < agent_history > Agent history will be given as a list of step i n f o r m a t i o n as follows : < step_ {{ s t e p _ n u m b e r }} >: E v a l u a t i o n of Pre vi ou s Step : A ...

  76. [76]

    Open Tabs : Open tabs with their ids

    Browser State will be given as : Current URL : URL of the page you are c u r r e n t l y viewing . Open Tabs : Open tabs with their ids . I n t e r a c t i v e E le men ts : All i n t e r a c t i v e el em en ts will be p ro vi de d in ,→format as [ index ] < type > text </ type > where - index : Numeric i d e n t i f i e r for i n t e r a c t i o n - typ...

  77. [77]

    results . md

    < div > User form </ div > \ t *[35] < button aria - label = ’ Submit form ’ > Submit </ button > Note that : 20 - Only ele me nt s with numeric indexes in [] are i n t e r a c t i v e - ( stacked ) i n d e n t a t i o n ( with \ t ) is i m p o r t a n t and means that the ,→element is a ( html ) child of the element above ( with a lower ,→index ) - E lem...

  78. [78]

    a c t i o n _ n a m e

    WRONG FORMAT : {{" a c t i o n _ n a m e ": " input " , " params ": {{" index ": 1 , ,→...}}}}

  79. [79]

    input ": {{

    CORRECT FORMAT : {{" input ": {{" index ": 1 , " text ": " example " , " clear ": ,→true }}}}

  80. [80]

    action

    " action " is an array . You can provide up to { m a x _ a c t i o n s } actions , ,→except for " done " which must be a single action . </ output > The system prompt for the flash mode is given by: You are an AI agent d es ign ed to operate in an i t e r a t i v e loop to au to ma te ,→browser tasks . Your u lt im at e goal is a c c o m p l i s h i n g t...