arxiv: 2411.18279 · v12 · pith:7LNTAEUA · submitted 2024-11-27 · 💻 cs.AI · cs.CL · cs.HC

Large Language Model-Brained GUI Agents: A Survey

Pith reviewed 2026-05-19 11:02 UTC · model grok-4.3

classification 💻 cs.AI cs.CL cs.HC
keywords LLM-brained GUI agents · GUI automation · large language models · human-computer interaction · agent frameworks · benchmarks and metrics · roadmap

The pith

A survey of LLM-brained GUI agents organizes frameworks, training data, models, benchmarks, and applications while mapping research gaps and a future roadmap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys the new class of agents that use large language models to control graphical user interfaces through natural language commands. It reviews how these agents perceive screen elements, plan sequences of actions, and carry out tasks across web pages, mobile apps, and desktop software. The authors collect and structure information on existing agent designs, the data used to train them, specialized large action models, and ways to measure performance. By highlighting current limitations in the field, the survey supplies a clear path for building more capable systems that let ordinary users complete complex digital work without manual clicking or coding.

Core claim

LLM-brained GUI agents mark a shift by letting multimodal models read complex interface layouts and autonomously perform multi-step operations from simple spoken or typed instructions. The survey traces their development from earlier rule-based tools to current frameworks that combine visual understanding with reasoning and action selection. It details how training data is gathered and used, how large action models are adapted for GUI work, and which metrics and benchmarks best track progress, while also listing early applications and the main open problems that must be solved next.

What carries the argument

LLM-brained GUI agent: an autonomous system that combines a multimodal large language model with modules for perceiving GUI elements, reasoning about user goals, and outputting sequences of interface actions.
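
The perceive, reason, act cycle in this definition can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the survey: the `Element` and `Action` schemas, the `FakeLoginScreen` toy environment, and `stub_policy`, a fixed script that stands in for the multimodal LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    """One perceived GUI element (hypothetical schema)."""
    element_id: str
    role: str   # e.g. "button" or "textbox"
    label: str


@dataclass
class Action:
    """One interface action (hypothetical schema)."""
    kind: str                     # "click", "type", or "done"
    target: Optional[str] = None  # element_id the action applies to
    text: Optional[str] = None    # text to enter for "type" actions


class FakeLoginScreen:
    """Toy environment standing in for a real GUI."""

    def __init__(self) -> None:
        self.username = ""
        self.submitted = False

    def perceive(self) -> List[Element]:
        # A real agent would parse a screenshot or accessibility tree here.
        return [
            Element("user", "textbox", "Username"),
            Element("submit", "button", "Sign in"),
        ]

    def execute(self, action: Action) -> None:
        if action.kind == "type" and action.target == "user":
            self.username = action.text or ""
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def stub_policy(goal: str, elements: List[Element], history: List[Action]) -> Action:
    """Placeholder for the LLM: a fixed script keyed on the step count.

    It ignores `goal` and `elements`; a real model would condition on both.
    """
    step = len(history)
    if step == 0:
        return Action("type", target="user", text="alice")
    if step == 1:
        return Action("click", target="submit")
    return Action("done")


Policy = Callable[[str, List[Element], List[Action]], Action]


def run_agent(goal: str, env: FakeLoginScreen, policy: Policy,
              max_steps: int = 10) -> List[Action]:
    """Loop: perceive the screen, ask the policy for an action, execute it."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = env.perceive()
        action = policy(goal, elements, history)
        if action.kind == "done":
            break
        env.execute(action)
        history.append(action)
    return history
```

Calling `run_agent("sign in as alice", FakeLoginScreen(), stub_policy)` returns the two executed actions. The `max_steps` cap is a design choice of the sketch that keeps a misbehaving policy from looping forever.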

If this is right

  • Agents become practical for web navigation, mobile app control, and desktop automation when trained on appropriate GUI-specific data.
  • Large action models tailored to interface tasks improve accuracy over general-purpose language models.
  • Standardized benchmarks and metrics make it possible to compare different agent designs directly.
  • Applications in everyday software use will let users finish intricate jobs through conversation instead of manual steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the agents mature, non-technical users could manage complex software without learning menus or shortcuts.
  • Combining these agents with other tools might allow end-to-end automation across several programs at once.
  • Handling frequent changes in interface designs will likely require new techniques for ongoing adaptation.

Load-bearing premise

Current research papers and industry prototypes already cover enough ground that one survey can spot the most important missing pieces and draw a reliable map for what comes next.

What would settle it

Implement the roadmap steps and measure whether new agents reach reliable success rates above 70 percent on a fixed set of multi-step tasks that existing systems still fail, such as cross-app workflows that involve changing screen layouts.
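
One way to make this check concrete is sketched below. It is a sketch under stated assumptions, not the paper's evaluation protocol: it pools every trial across tasks into a single success rate and compares that rate with the 70 percent threshold. The task names and outcomes are invented for the example.

```python
from typing import Dict, List


def success_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of all trials that succeeded, pooled across tasks."""
    total = sum(len(trials) for trials in outcomes.values())
    if total == 0:
        raise ValueError("no trials recorded")
    passed = sum(sum(trials) for trials in outcomes.values())
    return passed / total


def meets_threshold(outcomes: Dict[str, List[bool]], threshold: float = 0.70) -> bool:
    """True when the pooled success rate is strictly above the threshold."""
    return success_rate(outcomes) > threshold


# Hypothetical results: four trials each on two made-up multi-step tasks.
example_outcomes = {
    "cross_app_copy": [True, True, False, True],
    "layout_shift_form": [True, False, True, True],
}
```

With the made-up data above, six of eight trials succeed, so `success_rate(example_outcomes)` is 0.75 and `meets_threshold(example_outcomes)` is `True`. A stricter reading of "reliable" could instead require each task to clear the threshold on its own.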

Original abstract

GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript is a survey on LLM-brained GUI agents. It reviews the historical evolution of GUI automation, core components of such agents, existing frameworks, methods for data collection and utilization, development of large action models, evaluation metrics and benchmarks, emerging applications across web, mobile, and desktop, key research gaps, and a proposed roadmap for future work.

Significance. If the reviewed literature is representative, the survey would provide a useful consolidation of an emerging interdisciplinary area at the intersection of LLMs, HCI, and automation. It could serve as an entry point for researchers by organizing frameworks, benchmarks, and open problems, though its long-term impact depends on how well it captures both academic and industry contributions in a fast-moving domain.

major comments (2)
  1. [Abstract and §1] Abstract and §1 (Introduction): The claim of presenting a 'comprehensive' overview and reliable roadmap rests on the representativeness of the selected works, yet no search protocol, database list, inclusion/exclusion criteria, or literature cutoff date is described. This directly affects the load-bearing claim that the identified gaps are the most important ones.
  2. [Roadmap and Research Gaps section] § on Roadmap and Research Gaps: The proposed future directions are presented as synthesis outcomes without an explicit discussion of how potential omissions (e.g., recent industry systems or non-English GUI work) were mitigated, which weakens the defensibility of the roadmap as a field guide.
minor comments (3)
  1. [Terminology] Ensure consistent terminology for 'large action models' versus standard terms like vision-language-action models throughout the text.
  2. [Frameworks section] Add a table summarizing the main frameworks, their key features, and publication years for easier comparison.
  3. [Benchmarks section] Verify that all cited benchmarks include the most recent versions or follow-up papers, given the rapid progress noted in the abstract.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the transparency and defensibility of our survey. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The claim of presenting a 'comprehensive' overview and reliable roadmap rests on the representativeness of the selected works, yet no search protocol, database list, inclusion/exclusion criteria, or literature cutoff date is described. This directly affects the load-bearing claim that the identified gaps are the most important ones.

    Authors: We agree that an explicit description of the literature selection process is needed to support the 'comprehensive' claim and the identified gaps. In the revised version, we will insert a new subsection (tentatively 'Survey Scope and Methodology' in §1) that details the search protocol. This will specify the primary databases (arXiv, Google Scholar, ACM Digital Library), search keywords and combinations used, inclusion criteria (peer-reviewed or preprint works on LLM-based GUI agents from 2023 onward with empirical components), exclusion criteria (works focused solely on non-GUI agents or lacking technical details), and the cutoff date of October 2024. This addition will clarify the basis for the roadmap without altering the core content.

    Revision: yes

  2. Referee: [Roadmap and Research Gaps section] § on Roadmap and Research Gaps: The proposed future directions are presented as synthesis outcomes without an explicit discussion of how potential omissions (e.g., recent industry systems or non-English GUI work) were mitigated, which weakens the defensibility of the roadmap as a field guide.

    Authors: We concur that explicitly addressing scope limitations and mitigation strategies would make the roadmap more robust. In the revised Roadmap and Research Gaps section, we will add a short paragraph on limitations and mitigation. It will state that the survey emphasizes academic literature and prominent industry systems (e.g., those from OpenAI, Google, and Apple) that were publicly documented by the cutoff date, while noting that very recent preprints or non-English works may be underrepresented due to the field's rapid pace and language accessibility. Mitigation steps included reviewing recent industry reports and cross-checking against related surveys; we will also recommend multilingual and industry-focused extensions as future work.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: survey synthesizes external literature without self-referential derivations or predictions

Full rationale

This is a survey paper whose central claims consist of reviewing existing GUI agent frameworks, data collection methods, large action models, metrics, benchmarks, applications, gaps, and a future roadmap. No mathematical derivations, equations, fitted parameters, or first-principles predictions appear in the abstract or described structure. The synthesis draws from the broader research and industry literature rather than reducing any result to the paper's own inputs by construction. Self-citations, if present, are not load-bearing for the identification of gaps or roadmap, which are framed as analysis of the reviewed field. The paper is self-contained against external benchmarks as a literature overview.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a literature survey the paper introduces no new free parameters, mathematical axioms, or invented entities; it reviews existing frameworks and techniques from the cited literature.

pith-pipeline@v0.9.0 · 5853 in / 1070 out tokens · 29483 ms · 2026-05-19T11:02:56.244139+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

    cs.CV 2026-05 conditional novelty 7.0

    GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.

  2. From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation

    cs.SE 2025-09 unverdicted novelty 7.0

    An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.

  3. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  4. Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.

  5. MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs

    cs.HC 2026-04 unverdicted novelty 6.0

    MAESTRO adds a shared preference memory plus GUI-adaptation and workflow-navigation mechanisms to conversational agents with GUIs and tests them in a 33-person movie-booking study.

  6. Quantifying Trust: Financial Risk Management for Trustworthy AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.

  7. MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

    cs.AI 2025-10 unverdicted novelty 6.0

    MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.

  8. RISK: A Framework for GUI Agents in E-commerce Risk Management

    cs.AI 2025-09 unverdicted novelty 6.0

    RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.

  9. VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

    cs.CL 2025-09 unverdicted novelty 6.0

    VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...

  10. LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

    cs.LG 2025-06 unverdicted novelty 6.0

    LPO optimizes GUI agent positional accuracy by combining information entropy for zone selection with a physical-distance reward inside a Group Relative Preference Optimization framework, claiming SOTA results on bench...

  11. Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

    cs.AI 2026-04 unverdicted novelty 5.0

    General-purpose coding agents achieve highest success on SciVis tasks but at high cost, while domain-specific agents are efficient yet less flexible and computer-use agents struggle with long workflows.

  12. Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

    cs.AI 2026-04 unverdicted novelty 5.0

    General-purpose coding agents achieve highest success on SciVis tasks but cost more compute, while domain-specific agents are efficient yet less flexible and computer-use agents falter on long workflows.

  13. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  14. Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents

    cs.HC 2025-09 unverdicted novelty 5.0

    Industry markets AI agents for orchestration, creation, and insight, but a usability study with 31 participants reveals users face challenges from capability misalignment and lack of meta-cognition in tools like Opera...

  15. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  16. LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents

    cs.CR 2025-07 conditional novelty 5.0

    LaSM is a layer-wise scaling mechanism that amplifies attention and MLP modules in critical layers to defend GUI agents against pop-up attacks by correcting attention misalignment.

  17. Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

    cs.CL 2025-03 unverdicted novelty 5.0

    Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.

  18. Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

    cs.MA 2026-02 unverdicted novelty 4.0

    The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.

  19. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Reference graph

Works this paper leans on

297 extracted references · 297 canonical work pages · cited by 18 Pith papers · 50 internal anchors

  1. [1]

    The graphical user interface,

    B. J. Jansen, “The graphical user interface,” ACM SIGCHI Bull., vol. 30, pp. 22–26, 1998. [Online]. Available: https: //api.semanticscholar.org/CorpusID:18416305

  2. [2]

    Accessibility of command line interfaces,

    H. Sampath, A. Merrick, and A. P . Macvean, “Accessibility of command line interfaces,” Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems , 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID: 233987139

  3. [3]

    The effects of graphical interface design characteristics on human-computer interaction task efficiency

    R. Michalski, J. Grobelny, and W. Karwowski, “The effects of graphical interface design characteristics on human-computer interaction task efficiency,” ArXiv, vol. abs/1211.6712, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID: 14695409

  4. [4]

    Rule-based exploratory testing of graphical user interfaces,

    T. D. Hellmann and F . Maurer, “Rule-based exploratory testing of graphical user interfaces,” in2011 Agile Conference. IEEE, 2011, pp. 107–116

  5. [5]

    jrapture: A capture/replay tool for observation-based testing,

    J. Steven, P . Chandra, B. Fleck, and A. Podgurski, “jrapture: A capture/replay tool for observation-based testing,” SIGSOFT Softw. Eng. Notes, vol. 25, no. 5, p. 158–167, Aug. 2000. [Online]. Available: https://doi.org/10.1145/347636.348993

  6. [6]

    Robotic process automation: systematic literature review,

    L. Ivanˇci´c, D. Suša Vugec, and V. Bosilj Vukši´c, “Robotic process automation: systematic literature review,” in Business Process Management: Blockchain and Central and Eastern Europe Forum: BPM 2019 Blockchain and CEE Forum, Vienna, Austria, Septem- ber 1–6, 2019, Proceedings 17. Springer, 2019, pp. 280–295

  7. [7]

    Large language model — wikipedia, the free encyclopedia,

    W. contributors, “Large language model — wikipedia, the free encyclopedia,” 2024, accessed: 2024-11-25. [Online]. Available: https://en.wikipedia.org/wiki/Large_language_model

  8. [8]

    A Survey of Large Language Models

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,”arXiv preprint arXiv:2303.18223, 2023

  9. [9]

    A Comprehensive Overview of Large Language Models

    H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,”arXiv preprint arXiv:2307.06435, 2023

  10. [10]

    A Survey on Multimodal Large Language Models

    S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023

  11. [11]

    A brief overview of chatgpt: The history, status quo and potential future development,

    T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y . Tang, “A brief overview of chatgpt: The history, status quo and potential future development,”IEEE/CAA Journal of Automatica Sinica , vol. 10, no. 5, pp. 1122–1136, 2023

  12. [12]

    Large Language Model-Based Agents for Software Engineering: A Survey

    J. Liu, K. Wang, Y . Chen, X. Peng, Z. Chen, L. Zhang, and Y . Lou, “Large language model-based agents for software engineering: A survey,”arXiv preprint arXiv:2409.02977, 2024

  13. [13]

    Llm with tools: A survey,

    Z. Shen, “Llm with tools: A survey,”arXiv preprint arXiv:2409.18807, 2024

  14. [14]

    How far are we from agi: Are llms all we need?

    T. Feng, C. Jin, J. Liu, K. Zhu, H. Tu, Z. Cheng, G. Lin, and J. Y ou, “How far are we from agi: Are llms all we need?”Transactions on Machine Learning Research

  15. [15]

    Cogagent: A visual language model for gui agents

    W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y . Wang, Z. Wang, Y . Zhang, J. Li, B. Xu, Y . Dong, M. Ding, and J. Tang, “Cogagent: A visual language model for gui agents,” 2023. [Online]. Available: https://arxiv.org/abs/2312.08914 JOURNAL OF LATEX CLASS FILES, DECEMBER 2024 86

  16. [16]

    Every software as an agent: Blueprint and case study,

    M. Xu, “Every software as an agent: Blueprint and case study,” arXiv preprint arXiv:2502.04747, 2025

  17. [17]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    B. Zheng, B. Gou, J. Kil, H. Sun, and Y . Su, “Gpt-4v(ision) is a generalist web agent, if grounded,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01614

  18. [18]

    AppAgent: Multimodal Agents as Smartphone Users

    C. Zhang, Z. Y ang, J. Liu, Y . Han, X. Chen, Z. Huang, B. Fu, and G. Yu, “Appagent: Multimodal agents as smartphone users,” 2023. [Online]. Available: https://arxiv.org/abs/2312.13771

  19. [19]

    UFO: A UI-Focused Agent for Windows OS Interaction,

    C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y . Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang, “UFO: A UI-Focused Agent for Windows OS Interaction,”arXiv preprint arXiv:2402.07939, 2024

  20. [20]

    Intelligent virtual assistants with llm-based process automation,

    Y . Guan, D. Wang, Z. Chu, S. Wang, F . Ni, R. Song, L. Li, J. Gu, and C. Zhuang, “Intelligent virtual assistants with llm-based process automation,”ArXiv, vol. abs/2312.06677, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266174422

  21. [21]

    Operating system and artificial intelligence: A systematic review,

    Y . Zhang, X. Zhao, J. Yin, L. Zhang, and Z. Chen, “Operating system and artificial intelligence: A systematic review,” arXiv preprint arXiv:2407.14567, 2024

  22. [22]

    Aios: Llm agent operating system,

    K. Mei, Z. Li, S. Xu, R. Y e, Y . Ge, and Y . Zhang, “Aios: Llm agent operating system,”arXiv e-prints, pp. arXiv–2403, 2024

  23. [23]

    Does chatgpt generate accessible code? investigating accessibility challenges in llm-generated source code,

    W. Aljedaani, A. Habib, A. Aljohani, M. M. Eler, and Y . Feng, “Does chatgpt generate accessible code? investigating accessibility challenges in llm-generated source code,” inInternational Cross- Disciplinary Conference on Web Accessibility , 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273550267

  24. [24]

    Human-centered llm-agent user interface: A position paper,

    D. Chin, Y . Wang, and G. G. Xia, “Human-centered llm-agent user interface: A position paper,” ArXiv, vol. abs/2405.13050, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID: 269982753

  25. [25]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    K. Cheng, Q. Sun, Y . Chu, F . Xu, Y . Li, J. Zhang, and Z. Wu, “Seeclick: Harnessing gui grounding for advanced visual gui agents,” 2024. [Online]. Available: https://arxiv.org/abs/2401.10935

  26. [26]

    Agent-as-a-judge: Evaluate agents with agents,

    M. Zhuge, C. Zhao, D. R. Ashley, W. Wang, D. Khizbullin, Y . Xiong, Z. Liu, E. Chang, R. Krishnamoorthi, Y . Tian, Y . Shi, V. Chandra, and J. Schmidhuber, “Agent-as-a-judge: Evaluate agents with agents,” 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273350802

  27. [27]

    Li and M

    K. Li and M. Wu, Effective GUI testing automation: Developing an automated GUI testing tool. John Wiley & Sons, 2006

  28. [28]

    30 years of automated gui testing: a bibliometric analysis,

    O. Rodríguez-Valdés, T. E. Vos, P . Aho, and B. Marín, “30 years of automated gui testing: a bibliometric analysis,” in Quality of Information and Communications Technology: 14th International Conference, QUATIC 2021, Algarve, Portugal, September 8–11, 2021, Proceedings 14. Springer, 2021, pp. 473–488

  29. [29]

    A Systematic Literature Review of Automated Techniques for Functional GUI Testing of Mobile Applications

    Y . L. Arnatovich and L. Wang, “A systematic literature review of automated techniques for functional gui testing of mobile applications,”arXiv preprint arXiv:1812.11470, 2018

  30. [30]

    Gui testing for mobile applications: objectives, approaches and challenges,

    K. S. Said, L. Nie, A. A. Ajibode, and X. Zhou, “Gui testing for mobile applications: objectives, approaches and challenges,” in Proceedings of the 12th Asia-Pacific Symposium on Internetware, 2020, pp. 51–60

  31. [31]

    Gui testing for android applications: a survey,

    X. Li, “Gui testing for android applications: a survey,” in2023 7th International Conference on Computer, Software and Modeling (ICCSM). IEEE, 2023, pp. 6–10

  32. [32]

    Test automation for windows gui application,

    J.-J. Oksanen, “Test automation for windows gui application,” 2023

  33. [33]

    Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,

    P . S. Deshmukh, S. S. Date, P . N. Mahalle, and J. Barot, “Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,” in International Conference on ICT for Sustainable Development. Springer, 2023, pp. 619–628

  34. [34]

    A survey on the use of computer vision to improve software engineering tasks,

    M. Bajammal, A. Stocco, D. Mazinanian, and A. Mesbah, “A survey on the use of computer vision to improve software engineering tasks,”IEEE Transactions on Software Engineering, vol. 48, no. 5, pp. 1722–1742, 2020

  35. [35]

    Vision-based mobile app gui testing: A survey,

    S. Yu, C. Fang, Z. Tuo, Q. Zhang, C. Chen, Z. Chen, and Z. Su, “Vision-based mobile app gui testing: A survey,” arXiv preprint arXiv:2310.13518, 2023

  36. [36]

    Robotic process automation: contemporary themes and challenges,

    R. Syed, S. Suriadi, M. Adams, W. Bandara, S. J. Leemans, C. Ouyang, A. H. Ter Hofstede, I. Van De Weerd, M. T. Wynn, and H. A. Reijers, “Robotic process automation: contemporary themes and challenges,”Computers in Industry, vol. 115, p. 103162, 2020

  37. [37]

    From robotic process automation to intelli- gent process automation: –emerging trends–,

    T. Chakraborti, V. Isahagian, R. Khalaf, Y . Khazaeni, V. Muthusamy, Y . Rizk, and M. Unuvar, “From robotic process automation to intelli- gent process automation: –emerging trends–,” inBusiness Process Management: Blockchain and Robotic Process Automation Forum: BPM 2020 Blockchain and RPA Forum, Seville, Spain, September 13–18, 2020, Proceedings 18. Spr...

  38. [38]

    Robotic process automation: a scientific and industrial systematic mapping study,

    J. G. Enríquez, A. Jiménez-Ramírez, F . J. Domínguez-Mayo, and J. A. García-García, “Robotic process automation: a scientific and industrial systematic mapping study,”IEEE Access, vol. 8, pp. 39 113–39 129, 2020

  39. [39]

    Robotic process automation and artificial intelligence in industry 4.0–a literature review,

    J. Ribeiro, R. Lima, T. Eckhardt, and S. Paiva, “Robotic process automation and artificial intelligence in industry 4.0–a literature review,”Procedia Computer Science, vol. 181, pp. 51–58, 2021

  40. [40]

    Why many challenges with gui test automation (will) remain,

    M. Nass, E. Alégroth, and R. Feldt, “Why many challenges with gui test automation (will) remain,”Information and Software Technology, vol. 138, p. 106625, 2021

  41. [41]

    Research challenges for intelligent robotic process automation,

    S. Agostinelli, A. Marrella, and M. Mecella, “Research challenges for intelligent robotic process automation,” in Business Process Management Workshops: BPM 2019 International Workshops, Vienna, Austria, September 1–6, 2019, Revised Selected Papers

  42. [42]

    Springer, 2019, pp. 12–18

  43. [43]

    Task automation intel- ligent agents: A review,

    A. Wali, S. Mahamad, and S. Sulaiman, “Task automation intel- ligent agents: A review,” Future Internet, vol. 15, no. 6, p. 196, 2023

  44. [44]

    An in-depth survey of large language model-based artificial intelligence agents,

    P . Zhao, Z. Jin, and N. Cheng, “An in-depth survey of large language model-based artificial intelligence agents,”arXiv preprint arXiv:2309.14365, 2023

  45. [45]

    Exploring large language model based intelligent agents: Definitions, methods, and prospects,

    Y . Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F . Yin, J. Zhaoet al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” arXiv preprint arXiv:2401.03428, 2024

  46. [46]

    Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

    Y . Li, H. Wen, W. Wang, X. Li, Y . Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y . Sun et al. , “Personal llm agents: Insights and survey about the capability, efficiency and security,”arXiv preprint arXiv:2401.05459, 2024

  47. [47]

    The Rise and Potential of Large Language Model Based Agents: A Survey

    Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,”arXiv preprint arXiv:2309.07864, 2023

  48. [48]

    A survey on large language model based autonomous agents,

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Y ang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Linet al., “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024

  49. [49]

    Large Language Model based Multi-Agents: A Survey of Progress and Challenges

    T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024

  50. [50]

    LLM Multi-Agent Systems: Challenges and Open Problems

    S. Han, Q. Zhang, Y. Yao, W. Jin, Z. Xu, and C. He, “Llm multi-agent systems: Challenges and open problems,” arXiv preprint arXiv:2402.03578, 2024

  51. [51]

    Llm-based multi-agent reinforcement learning: Current and future directions,

    C. Sun, S. Huang, and D. Pompili, “Llm-based multi-agent reinforcement learning: Current and future directions,” arXiv preprint arXiv:2405.11106, 2024

  52. [52]

    Understanding the planning of LLM agents: A survey

    X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y. Wang, R. Tang, and E. Chen, “Understanding the planning of llm agents: A survey,” arXiv preprint arXiv:2402.02716, 2024

  53. [53]

    A survey on large language models for automated planning,

    M. Aghzal, E. Plaku, G. J. Stein, and Z. Yao, “A survey on large language models for automated planning,” arXiv preprint arXiv:2502.12435, 2025

  54. [54]

    Lifelong learning of large language model based agents: A roadmap,

    J. Zheng, C. Shi, X. Cai, Q. Li, D. Zhang, C. Li, D. Yu, and Q. Ma, “Lifelong learning of large language model based agents: A roadmap,” arXiv preprint arXiv:2501.07278, 2025

  55. [55]

    A Survey on the Memory Mechanism of Large Language Model based Agents

    Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J.-R. Wen, “A survey on the memory mechanism of large language model based agents,” arXiv preprint arXiv:2404.13501, 2024

  56. [56]

    A survey on evaluation of large language models,

    Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024

  57. [57]

    A survey on multimodal benchmarks: In the era of large ai models,

    L. Li, G. Chen, H. Shi, J. Xiao, and L. Chen, “A survey on multimodal benchmarks: In the era of large ai models,” arXiv preprint arXiv:2409.18142, 2024

  58. [58]

    Benchmark evaluations, applications, and challenges of large vision language models: A survey,

    Z. Li, X. Wu, H. Du, H. Nghiem, and G. Shi, “Benchmark evaluations, applications, and challenges of large vision language models: A survey,” arXiv preprint arXiv:2501.02189, 2025

  59. [59]

    A survey on evaluation of multimodal large language models

    J. Huang and J. Zhang, “A survey on evaluation of multimodal large language models,” arXiv preprint arXiv:2408.15769, 2024

  60. [60]

    Large multimodal agents: A survey,

    J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal agents: A survey,” arXiv preprint arXiv:2402.15116, 2024

  61. [61]

    Agent AI: Surveying the Horizons of Multimodal Interaction

    Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar, R. Taori, Y. Noda, D. Terzopoulos, Y. Choi et al., “Agent ai: Surveying the horizons of multimodal interaction,” arXiv preprint arXiv:2401.03568, 2024

  62. [62]

    Foundations and recent trends in multimodal mobile agents: A survey,

    B. Wu, Y. Li, M. Fang, Z. Song, Z. Zhang, Y. Wei, and L. Chen, “Foundations and recent trends in multimodal mobile agents: A survey,” arXiv preprint arXiv:2411.02006, 2024

  63. [63]

    Gui agents with foundation models: A comprehensive survey,

    S. Wang, W. Liu, J. Chen, W. Gan, X. Zeng, S. Yu, X. Hao, K. Shao, Y. Wang, and R. Tang, “Gui agents with foundation models: A comprehensive survey,” 2024. [Online]. Available: https://arxiv.org/abs/2411.04890

  64. [64]

    Generalist virtual agents: A survey on autonomous agents across digital platforms,

    M. Gao, W. Bu, B. Miao, Y. Wu, Y. Li, J. Li, S. Tang, Q. Wu, Y. Zhuang, and M. Wang, “Generalist virtual agents: A survey on autonomous agents across digital platforms,” arXiv preprint arXiv:2411.10943, 2024

  65. [65]

    Gui agents: A survey,

    D. Nguyen, J. Chen, Y. Wang, G. Wu, N. Park, Z. Hu, H. Lyu, J. Wu, R. Aponte, Y. Xia, X. Li, J. Shi, H. Chen, V. D. Lai, Z. Xie, S. Kim, R. Zhang, T. Yu, M. Tanjim, N. K. Ahmed, P. Mathur, S. Yoon, L. Yao, B. Kveton, T. H. Nguyen, T. Bui, T. Zhou, R. A. Rossi, and F. Dernoncourt, “Gui agents: A survey,” 2024. [Online]. Available: https://arxiv.org/a...

  66. [66]

    Llm-powered gui agents in phone automation: Surveying progress and prospects,

    G. Liu, P. Zhao, L. Liu, Y. Guo, H. Xiao, W. Lin, Y. Chai, Y. Han, S. Ren, H. Wang et al., “Llm-powered gui agents in phone automation: Surveying progress and prospects,” arXiv preprint arXiv:2504.19838, 2025

  67. [67]

    Os agents: A survey on mllm-based agents for general computing devices use,

    X. Hu, T. Xiong, B. Yi, Z. Wei, R. Xiao, Y. Chen, J. Ye, M. Tao, X. Zhou, Z. Zhao et al., “Os agents: A survey on mllm-based agents for general computing devices use,” 2024

  68. [68]

    Towards trustworthy gui agents: A survey,

    Y. Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,” arXiv preprint arXiv:2503.23434, 2025

  69. [69]

    A survey of webagents: Towards next-generation ai agents for web automation with large foundation models,

    L. Ning, Z. Liang, Z. Jiang, H. Qu, Y. Ding, W. Fan, X.-y. Wei, S. Lin, H. Liu, P. S. Yu et al., “A survey of webagents: Towards next-generation ai agents for web automation with large foundation models,” arXiv preprint arXiv:2503.23350, 2025

  70. [70]

    A survey on (m)llm-based gui agents,

    F. Tang, H. Xu, H. Zhang, S. Chen, X. Wu, Y. Shen, W. Zhang, G. Hou, Z. Tan, Y. Yan, K. Song, J. Shao, W. Lu, J. Xiao, and Y. Zhuang, “A survey on (m)llm-based gui agents,” 2025. [Online]. Available: https://arxiv.org/abs/2504.13865

  71. [71]

    A summary on gui agents with foundation models enhanced by reinforcement learning,

    J. Li and K. Huang, “A summary on gui agents with foundation models enhanced by reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2504.20464

  72. [72]

    A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

    P. J. Sager, B. Meyer, P. Yan, R. von Wartburg-Kottler, L. Etaiwi, A. Enayati, G. Nobel, A. Abdulkadir, B. F. Grewe, and T. Stadelmann, “Ai agents for computer use: A review of instruction-based computer control, gui automation, and operator assistants,” arXiv preprint arXiv:2501.16150, 2025

  73. [73]

    Cytestion: Automated gui testing for web applications,

    T. S. d. Moura, E. L. Alves, H. F. d. Figueirêdo, and C. d. S. Baptista, “Cytestion: Automated gui testing for web applications,” in Proceedings of the XXXVII Brazilian Symposium on Software Engineering, 2023, pp. 388–397

  74. [74]

    Sikuli: using gui screenshots for search and automation,

    T. Yeh, T.-H. Chang, and R. C. Miller, “Sikuli: using gui screenshots for search and automation,” in Proceedings of the 22nd annual ACM symposium on User interface software and technology, 2009, pp. 183–192

  75. [75]

    Prediction and entropy of printed english,

    C. E. Shannon, “Prediction and entropy of printed english,” Bell system technical journal, vol. 30, no. 1, pp. 50–64, 1951

  76. [76]

    N-gram-based text categorization,

    W. B. Cavnar, J. M. Trenkle et al., “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, vol. 161175. Ann Arbor, Michigan, 1994, p. 14

  77. [77]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014

  78. [78]

    Language Models are Few-Shot Learners

    B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, vol. 1, 2020

  79. [79]

    Finetuned Language Models Are Zero-Shot Learners

    J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021

  80. [80]

    Recurrent neural networks,

    L. R. Medsker, L. Jain et al., “Recurrent neural networks,” Design and Applications, vol. 5, no. 64-67, p. 2, 2001

Showing first 80 references.