pith. the verified trust layer for science. sign in

arxiv: 2507.04227 · v2 · submitted 2025-07-06 · 💻 cs.CR · cs.AI

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Pith reviewed 2026-05-19 06:54 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords mobile GUI agentsLLM agentsthird-party contentsecurity threatsapp instrumentationbenchmarkreal-world deploymentmisleading rate
0
0 comments X p. Extension

The pith

Mobile GUI agents are misled by third-party content such as ads and user posts at average rates of 42 percent and 36.1 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLM-based mobile GUI agents can maintain performance when real apps include untrustworthy third-party material. Standard benchmarks rely on clean static screens for consistency across tests, but everyday apps contain advertisements, user posts, and other external content. The authors build an instrumentation framework that inserts targeted modifications into existing commercial apps and evaluate both open-source and commercial agents. In a dynamic test suite of 122 tasks and a static collection of over 3,000 scenarios, all agents show substantial degradation, with the reported misleading rates. This evaluation addresses a missing pre-deployment check for agents intended for everyday device use.

Core claim

By introducing a scalable app content instrumentation framework, the authors enable reproducible insertion of third-party content into commercial applications and demonstrate that every examined GUI agent suffers significant performance loss, reaching an average misleading rate of 42.0 percent in dynamic environments and 36.1 percent in static environments.

What carries the argument

Scalable app content instrumentation framework that performs flexible, targeted modifications inside existing applications to create realistic third-party content states.

Load-bearing premise

The content modifications inside the apps produce screen states that match genuine third-party material without adding detectable artifacts or changing app behavior in ways that would not happen outside the test setup.

What would settle it

Agents completing the 122 dynamic tasks and the 3,000 static scenarios at success rates comparable to clean benchmarks would show that the third-party content does not cause the reported degradation.

Figures

Figures reproduced from arXiv: 2507.04227 by Guohong Liu, Jiacheng Liu, Jialei Ye, Jian Luan, Pengzhi Gao, Wei Liu, Yuanchun Li, Yunxin Liu.

Figure 1
Figure 1. Figure 1: Example of agent being misled by third-party [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the AgentHazard dynamic task execution environment. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example configuration of one target screen, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Task number of each app in the dynamic benchmarking environment. We design attack scenarios in two aspects, namely the complexity level and the misleading action. First, we categorize attacks into three levels of complexity: Simple, Medium, and Complex, designed to systematically evalu￾ate agent robustness against varying degrees of attack com￾plexity and task relevance. Among these levels, Simple and Medi… view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of tasks across different app cat [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of average SR Drop and MR across (a) different types of misleading actions against complex attack content and (b) different complexity levels of misleading content [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: We found that all LLMs have an average misleading [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of misleading rates across dif [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Case study: GUI agent decides to delete user data without requesting confirmation when seeing misleading [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
read the original abstract

Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of these agents on standard benchmarks has raised expectations for large-scale real-world deployment, and there are already several commercial agents released and used by early adopters. However, are we really ready for GUI agents integrated into our daily devices as system building blocks? We argue that an important pre-deployment validation is missing to examine whether the agents can maintain their performance under real-world threats. Specifically, unlike existing common benchmarks that are based on simple static app contents (they have to do so to ensure environment consistency between different tests), real-world apps are filled with contents from untrustworthy third parties, such as advertisement emails, user-generated posts and medias, etc. ... To this end, we introduce a scalable app content instrumentation framework to enable flexible and targeted content modifications within existing applications. Leveraging this framework, we create a test suite comprising both a dynamic task execution environment and a static dataset of challenging GUI states. The dynamic environment encompasses 122 reproducible tasks, and the static dataset consists of over 3,000 scenarios constructed from commercial apps. We perform experiments on both open-source and commercial GUI agents. Our findings reveal that all examined agents can be significantly degraded due to third-party contents, with an average misleading rate of 42.0% and 36.1% in dynamic and static environments respectively. The framework and benchmark has been released at https://agenthazard.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that current mobile GUI agents, both open-source and commercial, are vulnerable to degradation from real-world third-party contents (ads, user-generated posts, media) in apps. It introduces a scalable instrumentation framework for targeted content modifications in existing applications, constructs a benchmark with 122 reproducible dynamic tasks and over 3,000 static GUI scenarios from commercial apps, and reports average misleading rates of 42.0% (dynamic) and 36.1% (static) across evaluated agents, arguing that pre-deployment validation under such threats is missing from standard benchmarks.

Significance. If the results hold under validated conditions, the work provides a timely empirical measurement of robustness gaps in LLM-powered GUI agents, which is significant for the security and deployment of autonomous device-control systems. The released framework and benchmark at agenthazard.github.io represent a concrete contribution that enables future reproducible studies on agent resilience to untrusted content, moving beyond static benchmarks to more realistic threat models.

major comments (2)
  1. [Section 3] Section 3 (instrumentation framework): The headline misleading rates (42.0% dynamic, 36.1% static) rest on the unvalidated assumption that the framework's content modifications faithfully emulate genuine third-party material without introducing layout, visual, or semantic artifacts that would not occur in the wild. No quantitative validation (e.g., visual similarity metrics, element-hierarchy distribution comparisons, or user studies) is reported to confirm that modified screens match the distribution of naturally occurring third-party content in commercial apps; this directly affects whether the measured degradation reflects authentic threats.
  2. [Section 4] Section 4 / experimental protocol: The abstract and results report concrete misleading rates on 122 tasks and >3,000 scenarios, but the manuscript does not detail data-exclusion rules, task selection criteria, or statistical tests (e.g., significance levels or confidence intervals on the rates). Without these, it is not possible to verify whether post-hoc choices influenced the reported averages or whether the degradation is statistically robust across agents.
minor comments (2)
  1. [Section 3] The description of the static dataset construction could clarify how the >3,000 scenarios were sampled from commercial apps to avoid selection bias toward particularly vulnerable states.
  2. [Section 4] Figure or table presenting per-agent misleading rates would benefit from error bars or per-category breakdowns (e.g., ad vs. user-generated content) to improve interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our paper. We address each of the major comments below and outline the revisions we intend to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (instrumentation framework): The headline misleading rates (42.0% dynamic, 36.1% static) rest on the unvalidated assumption that the framework's content modifications faithfully emulate genuine third-party material without introducing layout, visual, or semantic artifacts that would not occur in the wild. No quantitative validation (e.g., visual similarity metrics, element-hierarchy distribution comparisons, or user studies) is reported to confirm that modified screens match the distribution of naturally occurring third-party content in commercial apps; this directly affects whether the measured degradation reflects authentic threats.

    Authors: We appreciate the referee highlighting the importance of validating the realism of our content modifications. The instrumentation framework is designed to perform in-place replacements of third-party content elements (such as ad banners or user posts) within the original app layouts, thereby preserving the structural and visual integrity by construction. However, we acknowledge that the manuscript does not provide quantitative comparisons to naturally occurring content. In the revised version, we will add a validation subsection in Section 3, including visual similarity metrics (e.g., SSIM and LPIPS) between modified and unmodified screens, as well as comparisons of UI element distributions and a qualitative analysis of semantic fidelity using examples from real apps. revision: yes

  2. Referee: [Section 4] Section 4 / experimental protocol: The abstract and results report concrete misleading rates on 122 tasks and >3,000 scenarios, but the manuscript does not detail data-exclusion rules, task selection criteria, or statistical tests (e.g., significance levels or confidence intervals on the rates). Without these, it is not possible to verify whether post-hoc choices influenced the reported averages or whether the degradation is statistically robust across agents.

    Authors: We agree that additional details on the experimental protocol are necessary for full reproducibility and to demonstrate statistical robustness. The 122 dynamic tasks were selected to represent a diverse set of common user interactions across popular commercial apps, with reproducibility ensured through fixed starting states and scripted actions. For the static dataset, scenarios were sampled from over 3,000 GUI states collected from real app executions. In the revision, we will expand Section 4 to explicitly describe the task selection criteria, any data exclusion rules applied (such as filtering out incomplete recordings), and include statistical analyses, including 95% confidence intervals computed via bootstrapping on the misleading rates for each agent. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurement study with direct observation of agent behavior

full rationale

The paper conducts an empirical evaluation by introducing a content instrumentation framework, constructing dynamic and static test suites from commercial apps, and directly measuring misleading rates from agent actions against ground-truth task success. No mathematical derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text or abstract. The reported rates (42.0% dynamic, 36.1% static) are computed outcomes of the experiments rather than reductions to inputs by construction. The work is self-contained as a measurement campaign without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical definition of 'misleading rate' and the assumption that the instrumentation framework faithfully reproduces real-world third-party content. No free parameters are fitted to produce the headline rates; the rates are direct measurements. No new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption LLM-based GUI agents interpret screen content and follow natural-language instructions without built-in provenance checks for third-party elements.
    Implicit in the threat model and in the decision to treat all visible content as equally trustworthy.

pith-pipeline@v0.9.0 · 5834 in / 1291 out tokens · 28887 ms · 2026-05-19T06:54:49.661351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 15 internal anchors

  1. [1]

    Naveed Akhtar and Ajmal Mian. 2018. Threat of Adversar- ial Attacks on Deep Learning in Computer Vision: A Survey. arXiv:1801.00553 [cs.CV] https://arxiv.org/abs/1801.00553

  2. [2]

    Real Attack- ers Don’t Compute Gradients

    Giovanni Apruzzese, Hyrum S. Anderson, Savino Dambra, David Freeman, Fabio Pierazzi, and Kevin A. Roundy. 2022. "Real Attack- ers Don’t Compute Gradients": Bridging the Gap Between Adver- sarial ML Research and Practice. arXiv:2212.14315 [cs.CR] https: //arxiv.org/abs/2212.14315

  3. [3]

    Nicholas Carlini and David Wagner. 2018. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. arXiv:1801.01944 [cs.LG] https: //arxiv.org/abs/1801.01944

  4. [4]

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. arXiv:2401.10935 [cs.HC] https://arxiv.org/abs/2401.10935

  5. [5]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070 [cs.CL] https://arxiv.org/abs/ 2306.06070

  6. [6]

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2024. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. arXiv:2410.05243 [cs.AI] https://arxiv.org/abs/2410.05243

  7. [7]

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hong- ming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. arXiv:2401.13919 [cs.CL] https://arxiv.org/abs/2401.13919

  8. [8]

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, and Jie Tang. 2024. CogAgent: A Visual Language Model for GUI Agents. arXiv:2312.08914 [cs.CV] https: //arxiv.org/abs/2312.08914

  9. [9]

    Joyce, Dev Amlani, Charles Nicholas, and Edward Raff

    Robert J. Joyce, Dev Amlani, Charles Nicholas, and Edward Raff. 2021. MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels. arXiv:2111.15031 [cs.LG] https://arxiv.org/abs/2111.15031

  10. [10]

    Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, and Jie Tang. 2024. AutoWebGLM: A Large Language Model-based Web Navigating Agent. arXiv:2404.03648 [cs.CL] https://arxiv.org/ abs/2404.03648

  11. [11]

    Bradley Knox, and Kimin Lee

    Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, and Kimin Lee. 2024. MobileSafetyBench: Evaluating Safety of Au- tonomous Agents in Mobile Device Control. arXiv:2410.17520 [cs.LG] https://arxiv.org/abs/2410.17520

  12. [12]

    Explore, select, derive, and recall: Augmenting llm with human-like memory for mobile task automation,

    Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steven Y. Ko, Sangeun Oh, and Insik Shin. 2024. Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation. arXiv:2312.03003 [cs.HC] https://arxiv.org/ abs/2312.03003

  13. [13]

    Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge

  14. [14]

    arXiv:2005.03776 [cs.CL] https://arxiv.org/abs/2005.03776

    Mapping Natural Language Instructions to Mobile UI Action Sequences. arXiv:2005.03776 [cs.CL] https://arxiv.org/abs/2005.03776

  15. [15]

    Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhu- osheng Zhang, and Hai Zhao. 2024. Caution for the Environment: Multimodal Agents Are Susceptible to Environmental Distractions. doi:10.48550/arXiv.2408.02544 arXiv:2408.02544

  16. [16]

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large Language Models: A Survey. arXiv:2402.06196 [cs.CL] https://arxiv. org/abs/2402.06196

  17. [17]

    Gui agents: A survey,

    Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zheng- mian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, Xintong Li, Jing Shi, Hongjie Chen, Viet Dac Lai, Zhouhang Xie, Sungchul Kim, Ruiyi Zhang, Tong Yu, Mehrab Tanjim, Nesreen K. Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A. ...

  18. [18]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Conference ac...

  19. [19]

    Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Divya Tyama- gundlu, Timothy Lillicrap, and Oriana Riva. 2024. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. doi:10.48550/arXiv.2405.14573 arX...

  20. [20]

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. Android in the Wild: A Large-Scale Dataset for Android Device Control. arXiv:2307.10088 [cs.LG] https://arxiv. org/abs/2307.10088

  21. [21]

    "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024. "Do Anything Now": Characterizing and Evalu- ating In-The-Wild Jailbreak Prompts on Large Language Models. arXiv:2308.03825 [cs.CR] https://arxiv.org/abs/2308.03825

  22. [22]

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. World of Bits: An Open-Domain Platform for Web- Based Agents. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70) , Doina Precup and Yee Whye Teh (Eds.). PMLR, 3135–3144. https: //proceedings.mlr.press...

  23. [23]

    Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan. 2023. UGIF: UI Grounded Instruction Following. arXiv:2211.07615 [cs.CL] https://arxiv.org/abs/2211.07615

  24. [24]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2024. A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science 18, 6 (Dec. 2024), 186345. doi:10.1007/s11704-024-40231-1 arXiv:2308.11432 [cs]

  25. [25]

    Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu

  26. [26]

    arXiv:2308.15272 [cs.AI] https://arxiv.org/abs/2308.15272

    AutoDroid: LLM-powered Task Automation in Android. arXiv:2308.15272 [cs.AI] https://arxiv.org/abs/2308.15272

  27. [27]

    Hao Wen, Shizuo Tian, Borislav Pavlov, Wenjie Du, Yixuan Li, Ge Chang, Shanhui Zhao, Jiacheng Liu, Yunxin Liu, Ya-Qin Zhang, and Yuanchun Li. 2024. AutoDroid-V2: Boosting SLM-based GUI Agents via Code Generation. arXiv:2412.18116 [cs.AI] https://arxiv.org/abs/ 2412.18116

  28. [29]

    Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, and Aditi Raghunathan. 2025. Dissecting Adversarial Robustness of Multimodal LM Agents. arXiv:2406.12814 [cs.LG] https://arxiv.org/ abs/2406.12814

  29. [30]

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. 2024. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. arXiv:2410.23218 [cs.CL] https://arxiv.org/ abs/2410.23218

  30. [31]

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shi- han Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang,...

  31. [32]

    Mingzhe Xing, Rongkai Zhang, Hui Xue, Qi Chen, Fan Yang, and Zhen Xiao. 2024. Understanding the Weakness of Large Lan- guage Model Agents within a Complex Android Environment. arXiv:2402.06596 [cs.AI] https://arxiv.org/abs/2402.06596

  32. [33]

    Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, and Bo Li. 2024. AdvWeb: Controllable Black- box Attacks on VLM-powered Web Agents. arXiv:2410.17401 [cs.CR] https://arxiv.org/abs/2410.17401

  33. [34]

    Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. 2024. Aria-UI: Visual Grounding for GUI Instructions. arXiv:2412.16256 [cs.HC] https://arxiv.org/abs/2412.16256

  34. [35]

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024. Vision- Language Models for Vision Tasks: A Survey. arXiv:2304.00685 [cs.CV] https://arxiv.org/abs/2304.00685

  35. [36]

    Yanzhe Zhang, Tao Yu, and Diyi Yang. 2024. Attacking Vision- Language Computer Agents via Pop-ups. arXiv:2411.02391 [cs.CL] https://arxiv.org/abs/2411.02391

  36. [37]

    Shuai Zhao, Jinming Wen, Anh Luu, Junbo Zhao, and Jie Fu. 2023. Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models. In Proceedings of the 2023 Conference on Empir- ical Methods in Natural Language Processing , Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 12303–1...

  37. [40]

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su

  38. [41]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    GPT-4V(ision) is a Generalist Web Agent, if Grounded. arXiv:2401.01614 [cs.IR] https://arxiv.org/abs/2401.01614

  39. [42]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Envi- ronment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI] https://arxiv.org/abs/2307.13854