Large Language Model-Brained GUI Agents: A Survey

Bowen Li; Chaoyun Zhang; Dongmei Zhang; Guyue Liu; Jiaxu Qian; Liqun Li; Minghua Ma; Qingwei Lin; Qi Zhang; Saravan Rajmohan

arXiv: 2411.18279 · v12 · pith:7LNTAEUA · submitted 2024-11-27 · cs.AI · cs.CL · cs.HC

Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

Reviewed by Pith · 2026-05-19 11:02 UTC · grok-4.3 · pith:7LNTAEUA

classification cs.AI cs.CL cs.HC

keywords LLM-brained GUI agents, GUI automation, large language models, human-computer interaction, agent frameworks, benchmarks and metrics, roadmap

The pith

A survey of LLM-brained GUI agents organizes frameworks, training data, models, benchmarks, and applications while mapping research gaps and a future roadmap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys the new class of agents that use large language models to control graphical user interfaces through natural language commands. It reviews how these agents perceive screen elements, plan sequences of actions, and carry out tasks across web pages, mobile apps, and desktop software. The authors collect and structure information on existing agent designs, the data used to train them, specialized large action models, and ways to measure performance. By highlighting current limitations in the field, the survey supplies a clear path for building more capable systems that let ordinary users complete complex digital work without manual clicking or coding.

Core claim

LLM-brained GUI agents mark a shift by letting multimodal models read complex interface layouts and autonomously perform multi-step operations from simple spoken or typed instructions. The survey traces their development from earlier rule-based tools to current frameworks that combine visual understanding with reasoning and action selection. It details how training data is gathered and used, how large action models are adapted for GUI work, and which metrics and benchmarks best track progress, while also listing early applications and the main open problems that must be solved next.

What carries the argument

LLM-brained GUI agent: an autonomous system that combines a multimodal large language model with modules for perceiving GUI elements, reasoning about user goals, and outputting sequences of interface actions.
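The perceive, reason, act structure in this definition can be sketched as a simple control loop. This is only an illustrative sketch, not the survey's or any framework's implementation: `Element`, `Action`, `run_agent`, `FakeLoginForm`, and `stub_planner` are all hypothetical names, and the rule-based planner stands in for the multimodal LLM call a real agent would make.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    """A GUI control as a perception module might report it (hypothetical schema)."""
    id: str
    role: str
    text: str


@dataclass
class Action:
    kind: str          # "click", "type", or "done"
    target: str = ""   # id of the element to act on
    value: str = ""    # text to enter for "type" actions


def run_agent(goal: str,
              perceive: Callable[[], List[Element]],
              plan: Callable[[str, List[Element], List[Action]], Action],
              act: Callable[[Action], None],
              max_steps: int = 10) -> List[Action]:
    """Repeat perceive -> plan -> act until the planner says "done" or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = perceive()                    # read the current screen state
        action = plan(goal, elements, history)   # an LLM call in a real agent
        history.append(action)
        if action.kind == "done":
            break
        act(action)                              # apply the action to the GUI
    return history


class FakeLoginForm:
    """Toy stand-in for a real GUI, used only to exercise the loop."""

    def __init__(self) -> None:
        self.username = ""
        self.submitted = False

    def perceive(self) -> List[Element]:
        elements = [Element("username", "textbox", self.username),
                    Element("submit", "button", "Sign in")]
        if self.submitted:
            elements.append(Element("status", "label", "Signed in"))
        return elements

    def act(self, action: Action) -> None:
        if action.kind == "type" and action.target == "username":
            self.username = action.value
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def stub_planner(goal: str, elements: List[Element], history: List[Action]) -> Action:
    """Rule-based placeholder for the LLM: fill the field, submit, then stop."""
    by_id = {e.id: e for e in elements}
    if "status" in by_id:
        return Action("done")
    if by_id["username"].text == "":
        return Action("type", "username", "alice")
    return Action("click", "submit")


form = FakeLoginForm()
trace = run_agent("sign in as alice", form.perceive, stub_planner, form.act)
print([a.kind for a in trace])
```

Keeping perception, planning, and execution behind separate callables mirrors the modular description above: the planner can be swapped for a model call without touching the loop.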

If this is right

Agents become practical for web navigation, mobile app control, and desktop automation when trained on appropriate GUI-specific data.
Large action models tailored to interface tasks improve accuracy over general-purpose language models.
Standardized benchmarks and metrics make it possible to compare different agent designs directly.
Applications in everyday software use will let users finish intricate jobs through conversation instead of manual steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the agents mature, non-technical users could manage complex software without learning menus or shortcuts.
Combining these agents with other tools might allow end-to-end automation across several programs at once.
Handling frequent changes in interface designs will likely require new techniques for ongoing adaptation.

Load-bearing premise

Current research papers and industry prototypes already cover enough ground that one survey can spot the most important missing pieces and draw a reliable map for what comes next.

What would settle it

Implement the roadmap steps and measure whether new agents reach reliable success rates above 70 percent on a fixed set of multi-step tasks that existing systems still fail, such as cross-app workflows that involve changing screen layouts.
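One way to make this check concrete is to score a fixed task set and compare the overall success rate with the 70 percent bar. The sketch below assumes one particular reading, a pooled rate over all trials; the function names are hypothetical and the outcome lists are placeholder inputs, not measured results.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of trials that succeeded."""
    if not outcomes:
        raise ValueError("at least one trial outcome is required")
    return sum(outcomes) / len(outcomes)


def clears_bar(trials_by_task: Dict[str, List[bool]], threshold: float = 0.70) -> bool:
    """True if the success rate pooled over every trial is strictly above the threshold."""
    pooled = [ok for outcomes in trials_by_task.values() for ok in outcomes]
    return success_rate(pooled) > threshold


# Placeholder outcomes for two hypothetical cross-app tasks (not real measurements).
example = {
    "copy_table_between_apps": [True, True, False, True],
    "book_and_email_itinerary": [True, True, True, False],
}
print(success_rate(example["copy_table_between_apps"]), clears_bar(example))
```

A stricter variant would require each task to clear the bar individually, which prevents easy tasks from masking consistent failures on hard ones.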

Original abstract

GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript is a survey on LLM-brained GUI agents. It reviews the historical evolution of GUI automation, core components of such agents, existing frameworks, methods for data collection and utilization, development of large action models, evaluation metrics and benchmarks, emerging applications across web, mobile, and desktop, key research gaps, and a proposed roadmap for future work.

Significance. If the reviewed literature is representative, the survey would provide a useful consolidation of an emerging interdisciplinary area at the intersection of LLMs, HCI, and automation. It could serve as an entry point for researchers by organizing frameworks, benchmarks, and open problems, though its long-term impact depends on how well it captures both academic and industry contributions in a fast-moving domain.

major comments (2)

[Abstract and §1 (Introduction)] The claim of presenting a 'comprehensive' overview and reliable roadmap rests on the representativeness of the selected works, yet no search protocol, database list, inclusion/exclusion criteria, or literature cutoff date is described. This directly affects the load-bearing claim that the identified gaps are the most important ones.
[Roadmap and Research Gaps section] The proposed future directions are presented as synthesis outcomes without an explicit discussion of how potential omissions (e.g., recent industry systems or non-English GUI work) were mitigated, which weakens the defensibility of the roadmap as a field guide.

minor comments (3)

[Terminology] Ensure consistent terminology for 'large action models' versus standard terms like vision-language-action models throughout the text.
[Frameworks section] Add a table summarizing the main frameworks, their key features, and publication years for easier comparison.
[Benchmarks section] Verify that all cited benchmarks include the most recent versions or follow-up papers, given the rapid progress noted in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the transparency and defensibility of our survey. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses

Referee: [Abstract and §1 (Introduction)] The claim of presenting a 'comprehensive' overview and reliable roadmap rests on the representativeness of the selected works, yet no search protocol, database list, inclusion/exclusion criteria, or literature cutoff date is described. This directly affects the load-bearing claim that the identified gaps are the most important ones.

Authors: We agree that an explicit description of the literature selection process is needed to support the 'comprehensive' claim and the identified gaps. In the revised version, we will insert a new subsection (tentatively 'Survey Scope and Methodology' in §1) that details the search protocol. This will specify the primary databases (arXiv, Google Scholar, ACM Digital Library), search keywords and combinations used, inclusion criteria (peer-reviewed or preprint works on LLM-based GUI agents from 2023 onward with empirical components), exclusion criteria (works focused solely on non-GUI agents or lacking technical details), and the cutoff date of October 2024. This addition will clarify the basis for the roadmap without altering the core content. Revision: yes.

Referee: [Roadmap and Research Gaps section] The proposed future directions are presented as synthesis outcomes without an explicit discussion of how potential omissions (e.g., recent industry systems or non-English GUI work) were mitigated, which weakens the defensibility of the roadmap as a field guide.

Authors: We concur that explicitly addressing scope limitations and mitigation strategies would make the roadmap more robust. In the revised Roadmap and Research Gaps section, we will add a short paragraph on limitations and mitigation. It will state that the survey emphasizes academic literature and prominent industry systems (e.g., those from OpenAI, Google, and Apple) that were publicly documented by the cutoff date, while noting that very recent preprints or non-English works may be underrepresented due to the field's rapid pace and language accessibility. Mitigation steps included reviewing recent industry reports and cross-checking against related surveys; we will also recommend multilingual and industry-focused extensions as future work. Revision: yes.

Circularity Check

0 steps flagged

No circularity: survey synthesizes external literature without self-referential derivations or predictions

Full rationale

This is a survey paper whose central claims consist of reviewing existing GUI agent frameworks, data collection methods, large action models, metrics, benchmarks, applications, gaps, and a future roadmap. No mathematical derivations, equations, fitted parameters, or first-principles predictions appear in the abstract or described structure. The synthesis draws from the broader research and industry literature rather than reducing any result to the paper's own inputs by construction. Self-citations, if present, are not load-bearing for the identification of gaps or roadmap, which are framed as analysis of the reviewed field. The paper is self-contained against external benchmarks as a literature overview.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a literature survey the paper introduces no new free parameters, mathematical axioms, or invented entities; it reviews existing frameworks and techniques from the cited literature.


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
cs.SE 2026-06 unverdicted novelty 7.0

Proposes COM-as-Action paradigm for deterministic software manipulation, introduces ComCADBench benchmark and ComActor agent that achieves SOTA performance over GUI baselines.
AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions
cs.AI 2026-05 unverdicted novelty 7.0

AutoRPA distills ReAct LLM agents into RPA functions that solve similar GUI tasks with 82-96% lower token usage via translator-builder synthesis and hybrid repair.
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
cs.CV 2026-05 conditional novelty 7.0

GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation
cs.SE 2025-09 unverdicted novelty 7.0

An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.
Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents
cs.CV 2026-06 unverdicted novelty 6.0

Failure-driven self-improvement raises OpenCUA-72B success rate on OSWorld from 42.3% to 48.9% via LLM diagnosis and inference-time code patches, without retraining.
ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation
cs.CL 2026-06 unverdicted novelty 6.0

ARKD uses an RL policy network to adaptively balance FKL and RKL in LLM distillation, claiming gains of 0.4-0.6 points on Rouge-L and BertScore over baselines.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
cs.AI 2026-04 unverdicted novelty 6.0

TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs
cs.HC 2026-04 unverdicted novelty 6.0

MAESTRO adds a shared preference memory plus GUI-adaptation and workflow-navigation mechanisms to conversational agents with GUIs and tests them in a 33-person movie-booking study.
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
cs.AI 2025-10 unverdicted novelty 6.0

MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
RISK: A Framework for GUI Agents in E-commerce Risk Management
cs.AI 2025-09 unverdicted novelty 6.0

RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
cs.CL 2025-09 unverdicted novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
cs.LG 2025-06 unverdicted novelty 6.0

LPO optimizes GUI agent positional accuracy by combining information entropy for zone selection with a physical-distance reward inside a Group Relative Preference Optimization framework, claiming SOTA results on bench...
AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents Advances Understanding of Human Behaviors and Society
cs.SI 2025-02 unverdicted novelty 6.0

AgentSociety is a large-scale LLM agent-based social simulator validated on polarization, UBI, disasters, and sustainability issues with alignment to real experiments.
Understanding How Enterprises Adopt the Model Context Protocol for LLM-Driven Software Engineering
cs.SE 2026-06 unverdicted novelty 5.0

Interviews with 20 practitioners show MCP supports cross-system collaboration and task decoupling in LLM workflows but is limited by ecosystem fragmentation, coordination issues, and state management problems.
Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents
cs.CL 2026-05 unverdicted novelty 5.0

Mobile-Aptus uses supervised fine-tuning followed by semantic similarity retrieval and direct preference optimization to calibrate confidence scores in mobile agents, yielding over 17% average task success improvement...
Exploring LLM Agent Designs and Interaction Modalities for Scientific Visualization
cs.AI 2026-04 unverdicted novelty 5.0

General-purpose coding agents achieve highest success on SciVis tasks but cost more compute, while domain-specific agents are efficient yet less flexible and computer-use agents falter on long workflows.
Exploring LLM Agent Designs and Interaction Modalities for Scientific Visualization
cs.AI 2026-04 unverdicted novelty 5.0

General-purpose coding agents achieve highest success on SciVis tasks but at high cost, while domain-specific agents are efficient yet less flexible and computer-use agents struggle with long workflows.
Exploring LLM Agent Designs and Interaction Modalities for Scientific Visualization
cs.AI 2026-04 unverdicted novelty 5.0

Empirical comparison of domain-specific, computer-use, and general-purpose LLM agents plus CLI/GUI modalities on SciVis tasks reveals general-purpose agents highest in success rate but costliest, domain-specific agent...
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents
cs.HC 2025-09 unverdicted novelty 5.0

Industry markets AI agents for orchestration, creation, and insight, but a usability study with 31 participants reveals users face challenges from capability misalignment and lack of meta-cognition in tools like Opera...
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
cs.CR 2025-07 conditional novelty 5.0

LaSM is a layer-wise scaling mechanism that amplifies attention and MLP modules in critical layers to defend GUI agents against pop-up attacks by correcting attention misalignment.
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
cs.CL 2025-03 unverdicted novelty 5.0

Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
cs.AI 2025-01 unverdicted novelty 5.0

A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
cs.SE 2026-06 unverdicted novelty 4.0

Introduces COM-as-Action paradigm for professional software manipulation, ComCADBench benchmark for CAD, and ComActor agent claiming SOTA performance and long-horizon resilience.
GUI-AC: Enhancing Continual Learning in GUI Agents
cs.CV 2026-06 unverdicted novelty 4.0

GUI-AC stabilizes RFT for non-stationary GUI data by down-weighting noisy advantages and relaxing clipping bounds via a grounding certainty term.
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
cs.MA 2026-02 unverdicted novelty 4.0

The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.
LLM Agents Are the Antidote to Walled Gardens
cs.LG 2025-06 unverdicted novelty 4.0

LLM agents enable universal interoperability by serving as automatic translators and adapters between proprietary digital services.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Reference graph

Works this paper leans on

297 extracted references · 297 canonical work pages · cited by 28 Pith papers · 51 internal anchors

[1]

The graphical user interface,

B. J. Jansen, “The graphical user interface,” ACM SIGCHI Bull., vol. 30, pp. 22–26, 1998. [Online]. Available: https: //api.semanticscholar.org/CorpusID:18416305

work page 1998
[2]

Accessibility of command line interfaces,

H. Sampath, A. Merrick, and A. P . Macvean, “Accessibility of command line interfaces,” Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems , 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID: 233987139

work page 2021
[3]

The effects of graphical interface design characteristics on human-computer interaction task efficiency

R. Michalski, J. Grobelny, and W. Karwowski, “The effects of graphical interface design characteristics on human-computer interaction task efficiency,” ArXiv, vol. abs/1211.6712, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID: 14695409

work page internal anchor Pith review Pith/arXiv arXiv 2006
[4]

Rule-based exploratory testing of graphical user interfaces,

T. D. Hellmann and F . Maurer, “Rule-based exploratory testing of graphical user interfaces,” in2011 Agile Conference. IEEE, 2011, pp. 107–116

work page 2011
[5]

jrapture: A capture/replay tool for observation-based testing,

J. Steven, P . Chandra, B. Fleck, and A. Podgurski, “jrapture: A capture/replay tool for observation-based testing,” SIGSOFT Softw. Eng. Notes, vol. 25, no. 5, p. 158–167, Aug. 2000. [Online]. Available: https://doi.org/10.1145/347636.348993

work page doi:10.1145/347636.348993 2000
[6]

Robotic process automation: systematic literature review,

L. Ivanˇci´c, D. Suša Vugec, and V. Bosilj Vukši´c, “Robotic process automation: systematic literature review,” in Business Process Management: Blockchain and Central and Eastern Europe Forum: BPM 2019 Blockchain and CEE Forum, Vienna, Austria, Septem- ber 1–6, 2019, Proceedings 17. Springer, 2019, pp. 280–295

work page 2019
[7]

Large language model — wikipedia, the free encyclopedia,

W. contributors, “Large language model — wikipedia, the free encyclopedia,” 2024, accessed: 2024-11-25. [Online]. Available: https://en.wikipedia.org/wiki/Large_language_model

work page 2024
[8]

A Survey of Large Language Models

W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,”arXiv preprint arXiv:2303.18223, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

A Comprehensive Overview of Large Language Models

H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,”arXiv preprint arXiv:2307.06435, 2023

work page internal anchor Pith review arXiv 2023
[10]

A Survey on Multimodal Large Language Models

S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

A brief overview of chatgpt: The history, status quo and potential future development,

T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y . Tang, “A brief overview of chatgpt: The history, status quo and potential future development,”IEEE/CAA Journal of Automatica Sinica , vol. 10, no. 5, pp. 1122–1136, 2023

work page 2023
[12]

Large Language Model-Based Agents for Software Engineering: A Survey

J. Liu, K. Wang, Y . Chen, X. Peng, Z. Chen, L. Zhang, and Y . Lou, “Large language model-based agents for software engineering: A survey,”arXiv preprint arXiv:2409.02977, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

LLM With Tools: A Survey

Z. Shen, “Llm with tools: A survey,”arXiv preprint arXiv:2409.18807, 2024

work page Pith review arXiv 2024
[14]

How far are we from agi: Are llms all we need?

T. Feng, C. Jin, J. Liu, K. Zhu, H. Tu, Z. Cheng, G. Lin, and J. Y ou, “How far are we from agi: Are llms all we need?”Transactions on Machine Learning Research

work page
[15]

arXiv preprint arXiv:2312.08914 , year =

W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y . Wang, Z. Wang, Y . Zhang, J. Li, B. Xu, Y . Dong, M. Ding, and J. Tang, “Cogagent: A visual language model for gui agents,” 2023. [Online]. Available: https://arxiv.org/abs/2312.08914 JOURNAL OF LATEX CLASS FILES, DECEMBER 2024 86

work page arXiv 2023
[16]

Every software as an agent: Blueprint and case study,

M. Xu, “Every software as an agent: Blueprint and case study,” arXiv preprint arXiv:2502.04747, 2025

work page arXiv 2025
[17]

GPT-4V(ision) is a Generalist Web Agent, if Grounded

B. Zheng, B. Gou, J. Kil, H. Sun, and Y . Su, “Gpt-4v(ision) is a generalist web agent, if grounded,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01614

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

AppAgent: Multimodal Agents as Smartphone Users

C. Zhang, Z. Y ang, J. Liu, Y . Han, X. Chen, Z. Huang, B. Fu, and G. Yu, “Appagent: Multimodal agents as smartphone users,” 2023. [Online]. Available: https://arxiv.org/abs/2312.13771

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

UFO: A ui-focused agent for windows os interaction.arXiv preprint arXiv:2402.07939, 2024

C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y . Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang, “UFO: A UI-Focused Agent for Windows OS Interaction,”arXiv preprint arXiv:2402.07939, 2024

work page arXiv 2024
[20]

Intelligentvirtualassistantswithllm-basedprocessautomation.URL: https://arxiv.org/abs/2312.06677,arXiv:2312.06677

Y . Guan, D. Wang, Z. Chu, S. Wang, F . Ni, R. Song, L. Li, J. Gu, and C. Zhuang, “Intelligent virtual assistants with llm-based process automation,”ArXiv, vol. abs/2312.06677, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266174422

work page arXiv 2023
[21]

Integrating artificial intelligence into operating systems: A survey on techniques, applications, and future directions

Y . Zhang, X. Zhao, J. Yin, L. Zhang, and Z. Chen, “Operating system and artificial intelligence: A systematic review,” arXiv preprint arXiv:2407.14567, 2024

work page arXiv 2024
[22]

Aios: Llm agent operating system,

K. Mei, Z. Li, S. Xu, R. Y e, Y . Ge, and Y . Zhang, “Aios: Llm agent operating system,”arXiv e-prints, pp. arXiv–2403, 2024

work page 2024
[23]

Does chatgpt generate accessible code? investigating accessibility challenges in llm-generated source code,

W. Aljedaani, A. Habib, A. Aljohani, M. M. Eler, and Y . Feng, “Does chatgpt generate accessible code? investigating accessibility challenges in llm-generated source code,” inInternational Cross- Disciplinary Conference on Web Accessibility , 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273550267

work page 2024
[24]

Human-centered llm-agent user interface: A position paper,

D. Chin, Y . Wang, and G. G. Xia, “Human-centered llm-agent user interface: A position paper,” ArXiv, vol. abs/2405.13050, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID: 269982753

work page arXiv 2024
[25]

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

K. Cheng, Q. Sun, Y . Chu, F . Xu, Y . Li, J. Zhang, and Z. Wu, “Seeclick: Harnessing gui grounding for advanced visual gui agents,” 2024. [Online]. Available: https://arxiv.org/abs/2401.10935

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Agent-as-a-judge: Evaluate agents with agents,

M. Zhuge, C. Zhao, D. R. Ashley, W. Wang, D. Khizbullin, Y . Xiong, Z. Liu, E. Chang, R. Krishnamoorthi, Y . Tian, Y . Shi, V. Chandra, and J. Schmidhuber, “Agent-as-a-judge: Evaluate agents with agents,” 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273350802

work page 2024
[27]

Li and M

K. Li and M. Wu, Effective GUI testing automation: Developing an automated GUI testing tool. John Wiley & Sons, 2006

work page 2006
[28]

30 years of automated gui testing: a bibliometric analysis,

O. Rodríguez-Valdés, T. E. Vos, P . Aho, and B. Marín, “30 years of automated gui testing: a bibliometric analysis,” in Quality of Information and Communications Technology: 14th International Conference, QUATIC 2021, Algarve, Portugal, September 8–11, 2021, Proceedings 14. Springer, 2021, pp. 473–488

work page 2021
[29]

A Systematic Literature Review of Automated Techniques for Functional GUI Testing of Mobile Applications

Y . L. Arnatovich and L. Wang, “A systematic literature review of automated techniques for functional gui testing of mobile applications,”arXiv preprint arXiv:1812.11470, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[30]

Gui testing for mobile applications: objectives, approaches and challenges,

K. S. Said, L. Nie, A. A. Ajibode, and X. Zhou, “Gui testing for mobile applications: objectives, approaches and challenges,” in Proceedings of the 12th Asia-Pacific Symposium on Internetware, 2020, pp. 51–60

work page 2020
[31]

Gui testing for android applications: a survey,

X. Li, “Gui testing for android applications: a survey,” in2023 7th International Conference on Computer, Software and Modeling (ICCSM). IEEE, 2023, pp. 6–10

work page 2023
[32]

Test automation for windows gui application,

J.-J. Oksanen, “Test automation for windows gui application,” 2023

work page 2023
[33]

Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,

P . S. Deshmukh, S. S. Date, P . N. Mahalle, and J. Barot, “Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,” in International Conference on ICT for Sustainable Development. Springer, 2023, pp. 619–628

work page 2023
[34] M. Bajammal, A. Stocco, D. Mazinanian, and A. Mesbah, “A survey on the use of computer vision to improve software engineering tasks,” IEEE Transactions on Software Engineering, vol. 48, no. 5, pp. 1722–1742, 2020.
[35] S. Yu, C. Fang, Z. Tuo, Q. Zhang, C. Chen, Z. Chen, and Z. Su, “Vision-based mobile app gui testing: A survey,” arXiv preprint arXiv:2310.13518, 2023.
[36] R. Syed, S. Suriadi, M. Adams, W. Bandara, S. J. Leemans, C. Ouyang, A. H. Ter Hofstede, I. Van De Weerd, M. T. Wynn, and H. A. Reijers, “Robotic process automation: contemporary themes and challenges,” Computers in Industry, vol. 115, p. 103162, 2020.
[37] T. Chakraborti, V. Isahagian, R. Khalaf, Y. Khazaeni, V. Muthusamy, Y. Rizk, and M. Unuvar, “From robotic process automation to intelligent process automation: –emerging trends–,” in Business Process Management: Blockchain and Robotic Process Automation Forum: BPM 2020 Blockchain and RPA Forum, Seville, Spain, September 13–18, 2020, Proceedings 18. Spr...
[38] J. G. Enríquez, A. Jiménez-Ramírez, F. J. Domínguez-Mayo, and J. A. García-García, “Robotic process automation: a scientific and industrial systematic mapping study,” IEEE Access, vol. 8, pp. 39 113–39 129, 2020.
[39] J. Ribeiro, R. Lima, T. Eckhardt, and S. Paiva, “Robotic process automation and artificial intelligence in industry 4.0–a literature review,” Procedia Computer Science, vol. 181, pp. 51–58, 2021.
[40] M. Nass, E. Alégroth, and R. Feldt, “Why many challenges with gui test automation (will) remain,” Information and Software Technology, vol. 138, p. 106625, 2021.
[41] S. Agostinelli, A. Marrella, and M. Mecella, “Research challenges for intelligent robotic process automation,” in Business Process Management Workshops: BPM 2019 International Workshops, Vienna, Austria, September 1–6, 2019, Revised Selected Papers. Springer, 2019, pp. 12–18.
[43] A. Wali, S. Mahamad, and S. Sulaiman, “Task automation intelligent agents: A review,” Future Internet, vol. 15, no. 6, p. 196, 2023.
[44] P. Zhao, Z. Jin, and N. Cheng, “An in-depth survey of large language model-based artificial intelligence agents,” arXiv preprint arXiv:2309.14365, 2023.
[45] Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao et al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” arXiv preprint arXiv:2401.03428, 2024.
[46] Y. Li, H. Wen, W. Wang, X. Li, Y. Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y. Sun et al., “Personal llm agents: Insights and survey about the capability, efficiency and security,” arXiv preprint arXiv:2401.05459, 2024.
[47] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
[48] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on large language model based autonomous agents,” Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024.
[49] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024.
[50] S. Han, Q. Zhang, Y. Yao, W. Jin, Z. Xu, and C. He, “Llm multi-agent systems: Challenges and open problems,” arXiv preprint arXiv:2402.03578, 2024.
[51] C. Sun, S. Huang, and D. Pompili, “Llm-based multi-agent reinforcement learning: Current and future directions,” arXiv preprint arXiv:2405.11106, 2024.
[52] X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y. Wang, R. Tang, and E. Chen, “Understanding the planning of llm agents: A survey,” arXiv preprint arXiv:2402.02716, 2024.
[53] M. Aghzal, E. Plaku, G. J. Stein, and Z. Yao, “A survey on large language models for automated planning,” arXiv preprint arXiv:2502.12435, 2025.
[54] J. Zheng, C. Shi, X. Cai, Q. Li, D. Zhang, C. Li, D. Yu, and Q. Ma, “Lifelong learning of large language model based agents: A roadmap,” arXiv preprint arXiv:2501.07278, 2025.
[55] Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J.-R. Wen, “A survey on the memory mechanism of large language model based agents,” arXiv preprint arXiv:2404.13501, 2024.
[56] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024.
[57] L. Li, G. Chen, H. Shi, J. Xiao, and L. Chen, “A survey on multimodal benchmarks: In the era of large ai models,” arXiv preprint arXiv:2409.18142, 2024.
[58] Z. Li, X. Wu, H. Du, H. Nghiem, and G. Shi, “Benchmark evaluations, applications, and challenges of large vision language models: A survey,” arXiv preprint arXiv:2501.02189, 2025.
[59] J. Huang and J. Zhang, “A survey on evaluation of multimodal large language models,” arXiv preprint arXiv:2408.15769, 2024.
[60] J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal agents: A survey,” arXiv preprint arXiv:2402.15116, 2024.
[61] Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar, R. Taori, Y. Noda, D. Terzopoulos, Y. Choi et al., “Agent ai: Surveying the horizons of multimodal interaction,” arXiv preprint arXiv:2401.03568, 2024.
[62] B. Wu, Y. Li, M. Fang, Z. Song, Z. Zhang, Y. Wei, and L. Chen, “Foundations and recent trends in multimodal mobile agents: A survey,” arXiv preprint arXiv:2411.02006, 2024.
[63] S. Wang, W. Liu, J. Chen, W. Gan, X. Zeng, S. Yu, X. Hao, K. Shao, Y. Wang, and R. Tang, “Gui agents with foundation models: A comprehensive survey,” 2024. [Online]. Available: https://arxiv.org/abs/2411.04890
[64] M. Gao, W. Bu, B. Miao, Y. Wu, Y. Li, J. Li, S. Tang, Q. Wu, Y. Zhuang, and M. Wang, “Generalist virtual agents: A survey on autonomous agents across digital platforms,” arXiv preprint arXiv:2411.10943, 2024.
[65] D. Nguyen, J. Chen, Y. Wang, G. Wu, N. Park, Z. Hu, H. Lyu, J. Wu, R. Aponte, Y. Xia, X. Li, J. Shi, H. Chen, V. D. Lai, Z. Xie, S. Kim, R. Zhang, T. Yu, M. Tanjim, N. K. Ahmed, P. Mathur, S. Yoon, L. Yao, B. Kveton, T. H. Nguyen, T. Bui, T. Zhou, R. A. Rossi, and F. Dernoncourt, “Gui agents: A survey,” 2024. [Online]. Available: https://arxiv.org/a...
[66] G. Liu, P. Zhao, L. Liu, Y. Guo, H. Xiao, W. Lin, Y. Chai, Y. Han, S. Ren, H. Wang et al., “Llm-powered gui agents in phone automation: Surveying progress and prospects,” arXiv preprint arXiv:2504.19838, 2025.
[67] X. Hu, T. Xiong, B. Yi, Z. Wei, R. Xiao, Y. Chen, J. Ye, M. Tao, X. Zhou, Z. Zhao et al., “Os agents: A survey on mllm-based agents for general computing devices use,” 2024.
[68] Y. Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,” arXiv preprint arXiv:2503.23434, 2025.
[69] L. Ning, Z. Liang, Z. Jiang, H. Qu, Y. Ding, W. Fan, X.-y. Wei, S. Lin, H. Liu, P. S. Yu et al., “A survey of webagents: Towards next-generation ai agents for web automation with large foundation models,” arXiv preprint arXiv:2503.23350, 2025.
[70] F. Tang, H. Xu, H. Zhang, S. Chen, X. Wu, Y. Shen, W. Zhang, G. Hou, Z. Tan, Y. Yan, K. Song, J. Shao, W. Lu, J. Xiao, and Y. Zhuang, “A survey on (m)llm-based gui agents,” 2025. [Online]. Available: https://arxiv.org/abs/2504.13865
[71] J. Li and K. Huang, “A summary on gui agents with foundation models enhanced by reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2504.20464
[72] P. J. Sager, B. Meyer, P. Yan, R. von Wartburg-Kottler, L. Etaiwi, A. Enayati, G. Nobel, A. Abdulkadir, B. F. Grewe, and T. Stadelmann, “Ai agents for computer use: A review of instruction-based computer control, gui automation, and operator assistants,” arXiv preprint arXiv:2501.16150, 2025.
[73] T. S. d. Moura, E. L. Alves, H. F. d. Figueirêdo, and C. d. S. Baptista, “Cytestion: Automated gui testing for web applications,” in Proceedings of the XXXVII Brazilian Symposium on Software Engineering, 2023, pp. 388–397.
[74] T. Yeh, T.-H. Chang, and R. C. Miller, “Sikuli: using gui screenshots for search and automation,” in Proceedings of the 22nd annual ACM symposium on User interface software and technology, 2009, pp. 183–192.
[75] C. E. Shannon, “Prediction and entropy of printed english,” Bell system technical journal, vol. 30, no. 1, pp. 50–64, 1951.
[76] W. B. Cavnar, J. M. Trenkle et al., “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, vol. 161175. Ann Arbor, Michigan, 1994, p. 14.
[77] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[78] B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, vol. 1, 2020.
[79] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021.
[80] L. R. Medsker, L. Jain et al., “Recurrent neural networks,” Design and Applications, vol. 5, no. 64-67, p. 2, 2001.


[1] B. J. Jansen, “The graphical user interface,” ACM SIGCHI Bull., vol. 30, pp. 22–26, 1998. [Online]. Available: https://api.semanticscholar.org/CorpusID:18416305
[2] H. Sampath, A. Merrick, and A. P. Macvean, “Accessibility of command line interfaces,” Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:233987139
[3] R. Michalski, J. Grobelny, and W. Karwowski, “The effects of graphical interface design characteristics on human-computer interaction task efficiency,” ArXiv, vol. abs/1211.6712, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID:14695409
[4] T. D. Hellmann and F. Maurer, “Rule-based exploratory testing of graphical user interfaces,” in 2011 Agile Conference. IEEE, 2011, pp. 107–116.
[5] J. Steven, P. Chandra, B. Fleck, and A. Podgurski, “jrapture: A capture/replay tool for observation-based testing,” SIGSOFT Softw. Eng. Notes, vol. 25, no. 5, pp. 158–167, Aug. 2000. [Online]. Available: https://doi.org/10.1145/347636.348993
[6] L. Ivančić, D. Suša Vugec, and V. Bosilj Vukšić, “Robotic process automation: systematic literature review,” in Business Process Management: Blockchain and Central and Eastern Europe Forum: BPM 2019 Blockchain and CEE Forum, Vienna, Austria, September 1–6, 2019, Proceedings 17. Springer, 2019, pp. 280–295.
[7] W. contributors, “Large language model — wikipedia, the free encyclopedia,” 2024, accessed: 2024-11-25. [Online]. Available: https://en.wikipedia.org/wiki/Large_language_model
[8] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” arXiv preprint arXiv:2303.18223, 2023.
[9] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” arXiv preprint arXiv:2307.06435, 2023.
[10] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023.
[11] T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y. Tang, “A brief overview of chatgpt: The history, status quo and potential future development,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, 2023.
[12] J. Liu, K. Wang, Y. Chen, X. Peng, Z. Chen, L. Zhang, and Y. Lou, “Large language model-based agents for software engineering: A survey,” arXiv preprint arXiv:2409.02977, 2024.
[13] Z. Shen, “Llm with tools: A survey,” arXiv preprint arXiv:2409.18807, 2024.
[14] T. Feng, C. Jin, J. Liu, K. Zhu, H. Tu, Z. Cheng, G. Lin, and J. You, “How far are we from agi: Are llms all we need?” Transactions on Machine Learning Research.
[15] W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Zhang, J. Li, B. Xu, Y. Dong, M. Ding, and J. Tang, “Cogagent: A visual language model for gui agents,” 2023. [Online]. Available: https://arxiv.org/abs/2312.08914
[16] M. Xu, “Every software as an agent: Blueprint and case study,” arXiv preprint arXiv:2502.04747, 2025.
[17] B. Zheng, B. Gou, J. Kil, H. Sun, and Y. Su, “Gpt-4v(ision) is a generalist web agent, if grounded,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01614
[18] C. Zhang, Z. Yang, J. Liu, Y. Han, X. Chen, Z. Huang, B. Fu, and G. Yu, “Appagent: Multimodal agents as smartphone users,” 2023. [Online]. Available: https://arxiv.org/abs/2312.13771
[19] C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y. Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang, “UFO: A UI-Focused Agent for Windows OS Interaction,” arXiv preprint arXiv:2402.07939, 2024.
[20] Y. Guan, D. Wang, Z. Chu, S. Wang, F. Ni, R. Song, L. Li, J. Gu, and C. Zhuang, “Intelligent virtual assistants with llm-based process automation,” ArXiv, vol. abs/2312.06677, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266174422
[21] Y. Zhang, X. Zhao, J. Yin, L. Zhang, and Z. Chen, “Operating system and artificial intelligence: A systematic review,” arXiv preprint arXiv:2407.14567, 2024.
[22] K. Mei, Z. Li, S. Xu, R. Ye, Y. Ge, and Y. Zhang, “Aios: Llm agent operating system,” arXiv e-prints, pp. arXiv–2403, 2024.
[23] W. Aljedaani, A. Habib, A. Aljohani, M. M. Eler, and Y. Feng, “Does chatgpt generate accessible code? investigating accessibility challenges in llm-generated source code,” in International Cross-Disciplinary Conference on Web Accessibility, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273550267
[24] D. Chin, Y. Wang, and G. G. Xia, “Human-centered llm-agent user interface: A position paper,” ArXiv, vol. abs/2405.13050, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:269982753
[25] K. Cheng, Q. Sun, Y. Chu, F. Xu, Y. Li, J. Zhang, and Z. Wu, “Seeclick: Harnessing gui grounding for advanced visual gui agents,” 2024. [Online]. Available: https://arxiv.org/abs/2401.10935
[26] M. Zhuge, C. Zhao, D. R. Ashley, W. Wang, D. Khizbullin, Y. Xiong, Z. Liu, E. Chang, R. Krishnamoorthi, Y. Tian, Y. Shi, V. Chandra, and J. Schmidhuber, “Agent-as-a-judge: Evaluate agents with agents,” 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273350802
[27] K. Li and M. Wu, Effective GUI testing automation: Developing an automated GUI testing tool. John Wiley & Sons, 2006.
[28] [28]

30 years of automated gui testing: a bibliometric analysis,

O. Rodríguez-Valdés, T. E. Vos, P . Aho, and B. Marín, “30 years of automated gui testing: a bibliometric analysis,” in Quality of Information and Communications Technology: 14th International Conference, QUATIC 2021, Algarve, Portugal, September 8–11, 2021, Proceedings 14. Springer, 2021, pp. 473–488

work page 2021

[29] [29]

A Systematic Literature Review of Automated Techniques for Functional GUI Testing of Mobile Applications

Y . L. Arnatovich and L. Wang, “A systematic literature review of automated techniques for functional gui testing of mobile applications,”arXiv preprint arXiv:1812.11470, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[30] [30]

Gui testing for mobile applications: objectives, approaches and challenges,

K. S. Said, L. Nie, A. A. Ajibode, and X. Zhou, “Gui testing for mobile applications: objectives, approaches and challenges,” in Proceedings of the 12th Asia-Pacific Symposium on Internetware, 2020, pp. 51–60

work page 2020

[31] [31]

Gui testing for android applications: a survey,

X. Li, “Gui testing for android applications: a survey,” in2023 7th International Conference on Computer, Software and Modeling (ICCSM). IEEE, 2023, pp. 6–10

work page 2023

[32] [32]

Test automation for windows gui application,

J.-J. Oksanen, “Test automation for windows gui application,” 2023

work page 2023

[33] [33]

Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,

P . S. Deshmukh, S. S. Date, P . N. Mahalle, and J. Barot, “Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,” in International Conference on ICT for Sustainable Development. Springer, 2023, pp. 619–628

work page 2023

[34] [34]

A survey on the use of computer vision to improve software engineering tasks,

M. Bajammal, A. Stocco, D. Mazinanian, and A. Mesbah, “A survey on the use of computer vision to improve software engineering tasks,”IEEE Transactions on Software Engineering, vol. 48, no. 5, pp. 1722–1742, 2020

work page 2020

[35] [35]

2025 , publisher =

S. Yu, C. Fang, Z. Tuo, Q. Zhang, C. Chen, Z. Chen, and Z. Su, “Vision-based mobile app gui testing: A survey,” arXiv preprint arXiv:2310.13518, 2023

work page arXiv 2023

[36] [36]

Robotic process automation: contemporary themes and challenges,

R. Syed, S. Suriadi, M. Adams, W. Bandara, S. J. Leemans, C. Ouyang, A. H. Ter Hofstede, I. Van De Weerd, M. T. Wynn, and H. A. Reijers, “Robotic process automation: contemporary themes and challenges,”Computers in Industry, vol. 115, p. 103162, 2020

work page 2020

[37] [37]

From robotic process automation to intelli- gent process automation: –emerging trends–,

T. Chakraborti, V. Isahagian, R. Khalaf, Y . Khazaeni, V. Muthusamy, Y . Rizk, and M. Unuvar, “From robotic process automation to intelli- gent process automation: –emerging trends–,” inBusiness Process Management: Blockchain and Robotic Process Automation Forum: BPM 2020 Blockchain and RPA Forum, Seville, Spain, September 13–18, 2020, Proceedings 18. Spr...

work page 2020

[38] [38]

Robotic process automation: a scientific and industrial systematic mapping study,

J. G. Enríquez, A. Jiménez-Ramírez, F . J. Domínguez-Mayo, and J. A. García-García, “Robotic process automation: a scientific and industrial systematic mapping study,”IEEE Access, vol. 8, pp. 39 113–39 129, 2020

work page 2020

[39] [39]

Robotic process automation and artificial intelligence in industry 4.0–a literature review,

J. Ribeiro, R. Lima, T. Eckhardt, and S. Paiva, “Robotic process automation and artificial intelligence in industry 4.0–a literature review,”Procedia Computer Science, vol. 181, pp. 51–58, 2021

work page 2021

[40] [40]

Why many challenges with gui test automation (will) remain,

M. Nass, E. Alégroth, and R. Feldt, “Why many challenges with gui test automation (will) remain,”Information and Software Technology, vol. 138, p. 106625, 2021

work page 2021

[41] [41]

Research challenges for intelligent robotic process automation,

S. Agostinelli, A. Marrella, and M. Mecella, “Research challenges for intelligent robotic process automation,” in Business Process Management Workshops: BPM 2019 International Workshops, Vienna, Austria, September 1–6, 2019, Revised Selected Papers

work page 2019

[42] [42]

Springer, 2019, pp. 12–18

work page 2019

[43] [43]

Task automation intel- ligent agents: A review,

A. Wali, S. Mahamad, and S. Sulaiman, “Task automation intel- ligent agents: A review,” Future Internet, vol. 15, no. 6, p. 196, 2023

work page 2023

[44] [44]

arXiv preprint arXiv:2309.14365 , year=

P . Zhao, Z. Jin, and N. Cheng, “An in-depth survey of large language model-based artificial intelligence agents,”arXiv preprint arXiv:2309.14365, 2023

work page arXiv 2023

[45] [45]

arXiv preprint arXiv:2401.03428

Y . Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F . Yin, J. Zhaoet al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” arXiv preprint arXiv:2401.03428, 2024

work page arXiv 2024

[46] [46]

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

Y . Li, H. Wen, W. Wang, X. Li, Y . Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y . Sun et al. , “Personal llm agents: Insights and survey about the capability, efficiency and security,”arXiv preprint arXiv:2401.05459, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

The Rise and Potential of Large Language Model Based Agents: A Survey

Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,”arXiv preprint arXiv:2309.07864, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[48] [48]

A survey on large language model based autonomous agents,

L. Wang, C. Ma, X. Feng, Z. Zhang, H. Y ang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Linet al., “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024

work page 2024

[49] [49]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi- agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[50] [50]

LLM Multi-Agent Systems: Challenges and Open Problems

S. Han, Q. Zhang, Y . Y ao, W. Jin, Z. Xu, and C. He, “Llm multi- agent systems: Challenges and open problems,” arXiv preprint arXiv:2402.03578, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

doi:10.48550/arXiv.2405.11106 , abstract =

C. Sun, S. Huang, and D. Pompili, “Llm-based multi-agent rein- forcement learning: Current and future directions,”arXiv preprint arXiv:2405.11106, 2024

work page arXiv 2024

[52] [52]

Understanding the planning of LLM agents: A survey

X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y . Wang, R. Tang, and E. Chen, “Understanding the planning of llm agents: A survey,”arXiv preprint arXiv:2402.02716, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

A survey on large language models for automated planning,

M. Aghzal, E. Plaku, G. J. Stein, and Z. Y ao, “A survey on large language models for automated planning,” arXiv preprint arXiv:2502.12435, 2025

work page arXiv 2025

[54] [54]

arXiv preprint arXiv:2501.07278 , year=

J. Zheng, C. Shi, X. Cai, Q. Li, D. Zhang, C. Li, D. Yu, and Q. Ma, “Lifelong learning of large language model based agents: A roadmap,”arXiv preprint arXiv:2501.07278, 2025

work page arXiv 2025

[55] [55]

A Survey on the Memory Mechanism of Large Language Model based Agents

Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J.-R. Wen, “A survey on the memory mechanism of large language model based agents,”arXiv preprint arXiv:2404.13501, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56] [56]

A survey on evaluation of large language models,

Y . Chang, X. Wang, J. Wang, Y . Wu, L. Y ang, K. Zhu, H. Chen, X. Yi, C. Wang, Y . Wanget al., “A survey on evaluation of large language models,”ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024

work page 2024

[57] [57]

A survey on multimodal benchmarks: In the era of large ai models

L. Li, G. Chen, H. Shi, J. Xiao, and L. Chen, “A survey on multimodal benchmarks: In the era of large ai models,” arXiv preprint arXiv:2409.18142, 2024

work page arXiv 2024

[58] [58]

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges

Z. Li, X. Wu, H. Du, H. Nghiem, and G. Shi, “Benchmark evaluations, applications, and challenges of large vision language models: A survey,”arXiv preprint arXiv:2501.02189, 2025

work page Pith review arXiv 2025

[59] [59]

A survey on evaluation of multimodal large language models.arXiv preprint arXiv:2408.15769, 2024

J. Huang and J. Zhang, “A survey on evaluation of multimodal large language models,”arXiv preprint arXiv:2408.15769, 2024

work page arXiv 2024

[60] [60]

Large multimodal agents: A survey.arXiv preprint arXiv:2402.15116, 2024

J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal agents: A survey,”arXiv preprint arXiv:2402.15116, 2024

work page arXiv 2024

[61] [61]

Agent AI: Surveying the Horizons of Multimodal Interaction

Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar, R. Taori, Y . Noda, D. Terzopoulos, Y . Choi et al. , “Agent ai: JOURNAL OF LATEX CLASS FILES, DECEMBER 2024 87 Surveying the horizons of multimodal interaction,”arXiv preprint arXiv:2401.03568, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[62] [62]

arXiv preprint arXiv:2411.02006 (2024)

B. Wu, Y . Li, M. Fang, Z. Song, Z. Zhang, Y . Wei, and L. Chen, “Foundations and recent trends in multimodal mobile agents: A survey,”arXiv preprint arXiv:2411.02006, 2024

work page arXiv 2024

[63] [63]

arXiv preprint arXiv:2411.04890 , year=

S. Wang, W. Liu, J. Chen, W. Gan, X. Zeng, S. Yu, X. Hao, K. Shao, Y . Wang, and R. Tang, “Gui agents with foundation models: A comprehensive survey,” 2024. [Online]. Available: https://arxiv.org/abs/2411.04890

work page arXiv 2024

[64] [64]

https://doi.org/10.48550/arXiv.2411.10943

M. Gao, W. Bu, B. Miao, Y . Wu, Y . Li, J. Li, S. Tang, Q. Wu, Y . Zhuang, and M. Wang, “Generalist virtual agents: A survey on autonomous agents across digital platforms,” arXiv preprint arXiv:2411.10943, 2024


[65]

Gui agents: A survey

D. Nguyen, J. Chen, Y. Wang, G. Wu, N. Park, Z. Hu, H. Lyu, J. Wu, R. Aponte, Y. Xia, X. Li, J. Shi, H. Chen, V. D. Lai, Z. Xie, S. Kim, R. Zhang, T. Yu, M. Tanjim, N. K. Ahmed, P. Mathur, S. Yoon, L. Yao, B. Kveton, T. H. Nguyen, T. Bui, T. Zhou, R. A. Rossi, and F. Dernoncourt, “Gui agents: A survey,” 2024. [Online]. Available: https://arxiv.org/a...


[66]

Llm-powered gui agents in phone automation: Surveying progress and prospects

G. Liu, P. Zhao, L. Liu, Y. Guo, H. Xiao, W. Lin, Y. Chai, Y. Han, S. Ren, H. Wang et al., “Llm-powered gui agents in phone automation: Surveying progress and prospects,” arXiv preprint arXiv:2504.19838, 2025


[67]

Os agents: A survey on mllm-based agents for general computing devices use

X. Hu, T. Xiong, B. Yi, Z. Wei, R. Xiao, Y. Chen, J. Ye, M. Tao, X. Zhou, Z. Zhao et al., “Os agents: A survey on mllm-based agents for general computing devices use,” 2024


[68]

Towards trustworthy gui agents: A survey

Y. Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,” arXiv preprint arXiv:2503.23434, 2025


[69]

A survey of webagents: Towards next-generation ai agents for web automation with large foundation models

L. Ning, Z. Liang, Z. Jiang, H. Qu, Y. Ding, W. Fan, X.-y. Wei, S. Lin, H. Liu, P. S. Yu et al., “A survey of webagents: Towards next-generation ai agents for web automation with large foundation models,” arXiv preprint arXiv:2503.23350, 2025


[70]

A survey on (m)llm-based gui agents

F. Tang, H. Xu, H. Zhang, S. Chen, X. Wu, Y. Shen, W. Zhang, G. Hou, Z. Tan, Y. Yan, K. Song, J. Shao, W. Lu, J. Xiao, and Y. Zhuang, “A survey on (m)llm-based gui agents,” 2025. [Online]. Available: https://arxiv.org/abs/2504.13865


[71]

A summary on gui agents with foundation models enhanced by reinforcement learning

J. Li and K. Huang, “A summary on gui agents with foundation models enhanced by reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2504.20464


[72]

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

P. J. Sager, B. Meyer, P. Yan, R. von Wartburg-Kottler, L. Etaiwi, A. Enayati, G. Nobel, A. Abdulkadir, B. F. Grewe, and T. Stadelmann, “Ai agents for computer use: A review of instruction-based computer control, gui automation, and operator assistants,” arXiv preprint arXiv:2501.16150, 2025


[73]

Cytestion: Automated gui testing for web applications

T. S. d. Moura, E. L. Alves, H. F. d. Figueirêdo, and C. d. S. Baptista, “Cytestion: Automated gui testing for web applications,” in Proceedings of the XXXVII Brazilian Symposium on Software Engineering, 2023, pp. 388–397


[74]

Sikuli: using gui screenshots for search and automation

T. Yeh, T.-H. Chang, and R. C. Miller, “Sikuli: using gui screenshots for search and automation,” in Proceedings of the 22nd annual ACM symposium on User interface software and technology, 2009, pp. 183–192


[75]

Prediction and entropy of printed english

C. E. Shannon, “Prediction and entropy of printed english,” Bell system technical journal, vol. 30, no. 1, pp. 50–64, 1951


[76]

N-gram-based text categorization

W. B. Cavnar, J. M. Trenkle et al., “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, vol. 161175. Ann Arbor, Michigan, 1994, p. 14


[77]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014


[78]

Language Models are Few-Shot Learners

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, vol. 1, 2020


[79]

Finetuned Language Models Are Zero-Shot Learners

J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021


[80]

Recurrent neural networks

L. R. Medsker, L. Jain et al., “Recurrent neural networks,” Design and Applications, vol. 5, no. 64-67, p. 2, 2001
