Large Language Model-Brained GUI Agents: A Survey
Pith reviewed 2026-05-19 11:02 UTC · model grok-4.3
pith:7LNTAEUA
The pith
A survey of LLM-brained GUI agents organizes frameworks, training data, models, benchmarks, and applications while mapping research gaps and a future roadmap.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLM-brained GUI agents mark a shift by letting multimodal models read complex interface layouts and autonomously perform multi-step operations from simple natural-language instructions. The survey traces their development from earlier rule-based tools to current frameworks that combine visual understanding with reasoning and action selection. It details how training data is gathered and used, how large action models are adapted for GUI work, and which metrics and benchmarks best track progress, while also listing early applications and the main open problems that must be solved next.
What carries the argument
LLM-brained GUI agent: an autonomous system that combines a multimodal large language model with modules for perceiving GUI elements, reasoning about user goals, and outputting sequences of interface actions.
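The perceive, reason, act cycle in this definition can be sketched as a small loop. This is a minimal illustration, not the survey's or any framework's implementation: `Action`, `run_agent`, `ToyGUI`, and `scripted_policy` are hypothetical names, and the scripted policy merely stands in for the multimodal LLM that a real agent would query.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One interface action (hypothetical schema for illustration)."""
    kind: str          # "type", "click", or "done"
    target: str = ""   # label of the GUI element acted on
    text: str = ""     # text to enter, for "type" actions


def run_agent(
    goal: str,
    perceive: Callable[[], List[str]],
    decide: Callable[[str, List[str], List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat perceive -> decide -> execute until "done" or the step budget ends."""
    history: List[Action] = []
    for _ in range(max_steps):
        elements = perceive()                      # perception module
        action = decide(goal, elements, history)   # reasoning (an LLM in practice)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                            # action module
    return history


class ToyGUI:
    """A two-screen fake interface, assumed purely for this example."""

    def __init__(self) -> None:
        self.screen = "login"
        self.typed = ""

    def perceive(self) -> List[str]:
        if self.screen == "login":
            return ["username_field", "submit_button"]
        return ["welcome_label"]

    def execute(self, action: Action) -> None:
        if action.kind == "type" and action.target == "username_field":
            self.typed = action.text
        elif action.kind == "click" and action.target == "submit_button" and self.typed:
            self.screen = "home"


def scripted_policy(goal: str, elements: List[str], history: List[Action]) -> Action:
    """Stand-in for the model: a fixed rule set, not a learned policy."""
    if "welcome_label" in elements:
        return Action("done")
    if not any(a.kind == "type" for a in history):
        return Action("type", "username_field", "alice")
    return Action("click", "submit_button")


if __name__ == "__main__":
    gui = ToyGUI()
    steps = run_agent("log in as alice", gui.perceive, scripted_policy, gui.execute)
    print([a.kind for a in steps])
```

Passing the three modules in as callables keeps the loop independent of any particular model or platform, which mirrors the modular description above.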
If this is right
- Agents become practical for web navigation, mobile app control, and desktop automation when trained on appropriate GUI-specific data.
- Large action models tailored to interface tasks improve accuracy over general-purpose language models.
- Standardized benchmarks and metrics make it possible to compare different agent designs directly.
- Applications in everyday software use will let users finish intricate jobs through conversation instead of manual steps.
Where Pith is reading between the lines
- If the agents mature, non-technical users could manage complex software without learning menus or shortcuts.
- Combining these agents with other tools might allow end-to-end automation across several programs at once.
- Handling frequent changes in interface designs will likely require new techniques for ongoing adaptation.
Load-bearing premise
Current research papers and industry prototypes already cover enough ground that one survey can spot the most important missing pieces and draw a reliable map for what comes next.
What would settle it
Implement the roadmap steps and measure whether new agents reach reliable success rates above 70 percent on a fixed set of multi-step tasks that existing systems still fail, such as cross-app workflows that involve changing screen layouts.
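One way to make this test concrete is to restrict evaluation to the tasks an existing system fails and check the new agent's success rate against the 70 percent bar. The sketch below assumes per-task boolean outcomes; the function names and the task identifiers are hypothetical placeholders, not results from the paper or any benchmark.

```python
from typing import Dict, List, Tuple


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks completed successfully."""
    if not outcomes:
        raise ValueError("no task outcomes to evaluate")
    return sum(outcomes) / len(outcomes)


def settles_claim(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    threshold: float = 0.70,
) -> Tuple[float, bool]:
    """Score the candidate only on tasks the baseline fails.

    Tasks missing from the candidate's results count as failures.
    Returns the success rate and whether it exceeds the threshold.
    """
    hard_tasks = [task for task, ok in baseline.items() if not ok]
    rate = success_rate([candidate.get(task, False) for task in hard_tasks])
    return rate, rate > threshold


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    baseline = {"t1": False, "t2": False, "t3": False, "t4": False, "t5": True}
    candidate = {"t1": True, "t2": True, "t3": True, "t4": False, "t5": True}
    print(settles_claim(baseline, candidate))
```

Fixing the task set before running the new agent matters here: choosing tasks after seeing results would inflate the measured rate.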
Original abstract
GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is a survey on LLM-brained GUI agents. It reviews the historical evolution of GUI automation, core components of such agents, existing frameworks, methods for data collection and utilization, development of large action models, evaluation metrics and benchmarks, emerging applications across web, mobile, and desktop, key research gaps, and a proposed roadmap for future work.
Significance. If the reviewed literature is representative, the survey would provide a useful consolidation of an emerging interdisciplinary area at the intersection of LLMs, HCI, and automation. It could serve as an entry point for researchers by organizing frameworks, benchmarks, and open problems, though its long-term impact depends on how well it captures both academic and industry contributions in a fast-moving domain.
major comments (2)
- [Abstract and §1] Abstract and §1 (Introduction): The claim of presenting a 'comprehensive' overview and reliable roadmap rests on the representativeness of the selected works, yet no search protocol, database list, inclusion/exclusion criteria, or literature cutoff date is described. This directly affects the load-bearing claim that the identified gaps are the most important ones.
- [Roadmap and Research Gaps section] § on Roadmap and Research Gaps: The proposed future directions are presented as synthesis outcomes without an explicit discussion of how potential omissions (e.g., recent industry systems or non-English GUI work) were mitigated, which weakens the defensibility of the roadmap as a field guide.
minor comments (3)
- [Terminology] Ensure consistent terminology for 'large action models' versus standard terms like vision-language-action models throughout the text.
- [Frameworks section] Add a table summarizing the main frameworks, their key features, and publication years for easier comparison.
- [Benchmarks section] Verify that all cited benchmarks include the most recent versions or follow-up papers, given the rapid progress noted in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the transparency and defensibility of our survey. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract and §1] Abstract and §1 (Introduction): The claim of presenting a 'comprehensive' overview and reliable roadmap rests on the representativeness of the selected works, yet no search protocol, database list, inclusion/exclusion criteria, or literature cutoff date is described. This directly affects the load-bearing claim that the identified gaps are the most important ones.
  Authors: We agree that an explicit description of the literature selection process is needed to support the 'comprehensive' claim and the identified gaps. In the revised version, we will insert a new subsection (tentatively 'Survey Scope and Methodology' in §1) that details the search protocol. This will specify the primary databases (arXiv, Google Scholar, ACM Digital Library), search keywords and combinations used, inclusion criteria (peer-reviewed or preprint works on LLM-based GUI agents from 2023 onward with empirical components), exclusion criteria (works focused solely on non-GUI agents or lacking technical details), and the cutoff date of October 2024. This addition will clarify the basis for the roadmap without altering the core content.
  Revision: yes
- Referee: [Roadmap and Research Gaps section] § on Roadmap and Research Gaps: The proposed future directions are presented as synthesis outcomes without an explicit discussion of how potential omissions (e.g., recent industry systems or non-English GUI work) were mitigated, which weakens the defensibility of the roadmap as a field guide.
  Authors: We concur that explicitly addressing scope limitations and mitigation strategies would make the roadmap more robust. In the revised Roadmap and Research Gaps section, we will add a short paragraph on limitations and mitigation. It will state that the survey emphasizes academic literature and prominent industry systems (e.g., those from OpenAI, Google, and Apple) that were publicly documented by the cutoff date, while noting that very recent preprints or non-English works may be underrepresented due to the field's rapid pace and language accessibility. Mitigation steps included reviewing recent industry reports and cross-checking against related surveys; we will also recommend multilingual and industry-focused extensions as future work.
  Revision: yes
Circularity Check
No circularity: survey synthesizes external literature without self-referential derivations or predictions
Full rationale
This is a survey paper whose central claims consist of reviewing existing GUI agent frameworks, data collection methods, large action models, metrics, benchmarks, applications, gaps, and a future roadmap. No mathematical derivations, equations, fitted parameters, or first-principles predictions appear in the abstract or described structure. The synthesis draws from the broader research and industry literature rather than reducing any result to the paper's own inputs by construction. Self-citations, if present, are not load-bearing for the identification of gaps or roadmap, which are framed as analysis of the reviewed field. The paper is self-contained against external benchmarks as a literature overview.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 19 Pith papers
-
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
-
From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation
An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization
TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.
-
MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs
MAESTRO adds a shared preference memory plus GUI-adaptation and workflow-navigation mechanisms to conversational agents with GUIs and tests them in a 33-person movie-booking study.
-
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
-
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
-
RISK: A Framework for GUI Agents in E-commerce Risk Management
RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
-
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
-
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
LPO optimizes GUI agent positional accuracy by combining information entropy for zone selection with a physical-distance reward inside a Group Relative Preference Optimization framework, claiming SOTA results on bench...
-
Exploring Interaction Paradigms for LLM Agents in Scientific Visualization
General-purpose coding agents achieve highest success on SciVis tasks but at high cost, while domain-specific agents are efficient yet less flexible and computer-use agents struggle with long workflows.
-
Exploring Interaction Paradigms for LLM Agents in Scientific Visualization
General-purpose coding agents achieve highest success on SciVis tasks but cost more compute, while domain-specific agents are efficient yet less flexible and computer-use agents falter on long workflows.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents
Industry markets AI agents for orchestration, creation, and insight, but a usability study with 31 participants reveals users face challenges from capability misalignment and lack of meta-cognition in tools like Opera...
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
LaSM: Layer-wise Scaling Mechanism for Defending Pop-up Attack on GUI Agents
LaSM is a layer-wise scaling mechanism that amplifies attention and MLP modules in critical layers to defend GUI agents against pop-up attacks by correcting attention misalignment.
-
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks
Plan-and-Act trains a dedicated Planner on synthetic plan-annotated trajectories to generate high-level plans that an Executor follows, reaching 57.58% success on WebArena-Lite and 81.36% on WebVoyager.
-
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
The paper surveys agent skills for LLMs across architecture, acquisition, deployment, and security, proposing a four-tier Skill Trust and Lifecycle Governance Framework to address vulnerabilities in community skills.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Reference graph
Works this paper leans on
-
[1]
B. J. Jansen, “The graphical user interface,” ACM SIGCHI Bull., vol. 30, pp. 22–26, 1998. [Online]. Available: https: //api.semanticscholar.org/CorpusID:18416305
work page 1998
-
[2]
Accessibility of command line interfaces,
H. Sampath, A. Merrick, and A. P . Macvean, “Accessibility of command line interfaces,” Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems , 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID: 233987139
work page 2021
-
[3]
R. Michalski, J. Grobelny, and W. Karwowski, “The effects of graphical interface design characteristics on human-computer interaction task efficiency,” ArXiv, vol. abs/1211.6712, 2006. [Online]. Available: https://api.semanticscholar.org/CorpusID: 14695409
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[4]
Rule-based exploratory testing of graphical user interfaces,
T. D. Hellmann and F . Maurer, “Rule-based exploratory testing of graphical user interfaces,” in2011 Agile Conference. IEEE, 2011, pp. 107–116
work page 2011
-
[5]
jrapture: A capture/replay tool for observation-based testing,
J. Steven, P . Chandra, B. Fleck, and A. Podgurski, “jrapture: A capture/replay tool for observation-based testing,” SIGSOFT Softw. Eng. Notes, vol. 25, no. 5, p. 158–167, Aug. 2000. [Online]. Available: https://doi.org/10.1145/347636.348993
-
[6]
Robotic process automation: systematic literature review,
L. Ivanˇci´c, D. Suša Vugec, and V. Bosilj Vukši´c, “Robotic process automation: systematic literature review,” in Business Process Management: Blockchain and Central and Eastern Europe Forum: BPM 2019 Blockchain and CEE Forum, Vienna, Austria, Septem- ber 1–6, 2019, Proceedings 17. Springer, 2019, pp. 280–295
work page 2019
-
[7]
Large language model — wikipedia, the free encyclopedia,
W. contributors, “Large language model — wikipedia, the free encyclopedia,” 2024, accessed: 2024-11-25. [Online]. Available: https://en.wikipedia.org/wiki/Large_language_model
work page 2024
-
[8]
A Survey of Large Language Models
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y . Hou, Y . Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,”arXiv preprint arXiv:2303.18223, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
A comprehensive overview of large language models.arXiv preprint arXiv:2307.06435,
H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,”arXiv preprint arXiv:2307.06435, 2023
-
[10]
A Survey on Multimodal Large Language Models
S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, and E. Chen, “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
A brief overview of chatgpt: The history, status quo and potential future development,
T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y . Tang, “A brief overview of chatgpt: The history, status quo and potential future development,”IEEE/CAA Journal of Automatica Sinica , vol. 10, no. 5, pp. 1122–1136, 2023
work page 2023
-
[12]
Large Language Model-Based Agents for Software Engineering: A Survey
J. Liu, K. Wang, Y . Chen, X. Peng, Z. Chen, L. Zhang, and Y . Lou, “Large language model-based agents for software engineering: A survey,”arXiv preprint arXiv:2409.02977, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Z. Shen, “Llm with tools: A survey,”arXiv preprint arXiv:2409.18807, 2024
-
[14]
How far are we from agi: Are llms all we need?
T. Feng, C. Jin, J. Liu, K. Zhu, H. Tu, Z. Cheng, G. Lin, and J. Y ou, “How far are we from agi: Are llms all we need?”Transactions on Machine Learning Research
-
[15]
Cogagent: A visual language model for gui agents
W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y . Wang, Z. Wang, Y . Zhang, J. Li, B. Xu, Y . Dong, M. Ding, and J. Tang, “Cogagent: A visual language model for gui agents,” 2023. [Online]. Available: https://arxiv.org/abs/2312.08914 JOURNAL OF LATEX CLASS FILES, DECEMBER 2024 86
-
[16]
Every software as an agent: Blueprint and case study,
M. Xu, “Every software as an agent: Blueprint and case study,” arXiv preprint arXiv:2502.04747, 2025
-
[17]
GPT-4V(ision) is a Generalist Web Agent, if Grounded
B. Zheng, B. Gou, J. Kil, H. Sun, and Y . Su, “Gpt-4v(ision) is a generalist web agent, if grounded,” 2024. [Online]. Available: https://arxiv.org/abs/2401.01614
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
AppAgent: Multimodal Agents as Smartphone Users
C. Zhang, Z. Y ang, J. Liu, Y . Han, X. Chen, Z. Huang, B. Fu, and G. Yu, “Appagent: Multimodal agents as smartphone users,” 2023. [Online]. Available: https://arxiv.org/abs/2312.13771
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
UFO: A UI-Focused Agent for Windows OS Interaction,
C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y . Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang, “UFO: A UI-Focused Agent for Windows OS Interaction,”arXiv preprint arXiv:2402.07939, 2024
-
[20]
Intelligent virtual assistants with llm-based process automation,
Y . Guan, D. Wang, Z. Chu, S. Wang, F . Ni, R. Song, L. Li, J. Gu, and C. Zhuang, “Intelligent virtual assistants with llm-based process automation,”ArXiv, vol. abs/2312.06677, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266174422
-
[21]
Operating system and artificial intelligence: A systematic review,
Y . Zhang, X. Zhao, J. Yin, L. Zhang, and Z. Chen, “Operating system and artificial intelligence: A systematic review,” arXiv preprint arXiv:2407.14567, 2024
-
[22]
Aios: Llm agent operating system,
K. Mei, Z. Li, S. Xu, R. Y e, Y . Ge, and Y . Zhang, “Aios: Llm agent operating system,”arXiv e-prints, pp. arXiv–2403, 2024
work page 2024
-
[23]
W. Aljedaani, A. Habib, A. Aljohani, M. M. Eler, and Y . Feng, “Does chatgpt generate accessible code? investigating accessibility challenges in llm-generated source code,” inInternational Cross- Disciplinary Conference on Web Accessibility , 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273550267
work page 2024
-
[24]
Human-centered llm-agent user interface: A position paper,
D. Chin, Y . Wang, and G. G. Xia, “Human-centered llm-agent user interface: A position paper,” ArXiv, vol. abs/2405.13050, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID: 269982753
-
[25]
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
K. Cheng, Q. Sun, Y . Chu, F . Xu, Y . Li, J. Zhang, and Z. Wu, “Seeclick: Harnessing gui grounding for advanced visual gui agents,” 2024. [Online]. Available: https://arxiv.org/abs/2401.10935
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Agent-as-a-judge: Evaluate agents with agents,
M. Zhuge, C. Zhao, D. R. Ashley, W. Wang, D. Khizbullin, Y . Xiong, Z. Liu, E. Chang, R. Krishnamoorthi, Y . Tian, Y . Shi, V. Chandra, and J. Schmidhuber, “Agent-as-a-judge: Evaluate agents with agents,” 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:273350802
work page 2024
- [27]
-
[28]
30 years of automated gui testing: a bibliometric analysis,
O. Rodríguez-Valdés, T. E. Vos, P . Aho, and B. Marín, “30 years of automated gui testing: a bibliometric analysis,” in Quality of Information and Communications Technology: 14th International Conference, QUATIC 2021, Algarve, Portugal, September 8–11, 2021, Proceedings 14. Springer, 2021, pp. 473–488
work page 2021
-
[29]
Y . L. Arnatovich and L. Wang, “A systematic literature review of automated techniques for functional gui testing of mobile applications,”arXiv preprint arXiv:1812.11470, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[30]
Gui testing for mobile applications: objectives, approaches and challenges,
K. S. Said, L. Nie, A. A. Ajibode, and X. Zhou, “Gui testing for mobile applications: objectives, approaches and challenges,” in Proceedings of the 12th Asia-Pacific Symposium on Internetware, 2020, pp. 51–60
work page 2020
-
[31]
Gui testing for android applications: a survey,
X. Li, “Gui testing for android applications: a survey,” in2023 7th International Conference on Computer, Software and Modeling (ICCSM). IEEE, 2023, pp. 6–10
work page 2023
-
[32]
Test automation for windows gui application,
J.-J. Oksanen, “Test automation for windows gui application,” 2023
work page 2023
-
[33]
Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,
P . S. Deshmukh, S. S. Date, P . N. Mahalle, and J. Barot, “Auto- mated gui testing for enhancing user experience (ux): A survey of the state of the art,” in International Conference on ICT for Sustainable Development. Springer, 2023, pp. 619–628
work page 2023
-
[34]
A survey on the use of computer vision to improve software engineering tasks,
M. Bajammal, A. Stocco, D. Mazinanian, and A. Mesbah, “A survey on the use of computer vision to improve software engineering tasks,”IEEE Transactions on Software Engineering, vol. 48, no. 5, pp. 1722–1742, 2020
work page 2020
-
[35]
Vision-based mobile app gui testing: A survey,
S. Yu, C. Fang, Z. Tuo, Q. Zhang, C. Chen, Z. Chen, and Z. Su, “Vision-based mobile app gui testing: A survey,” arXiv preprint arXiv:2310.13518, 2023
-
[36]
Robotic process automation: contemporary themes and challenges,
R. Syed, S. Suriadi, M. Adams, W. Bandara, S. J. Leemans, C. Ouyang, A. H. Ter Hofstede, I. Van De Weerd, M. T. Wynn, and H. A. Reijers, “Robotic process automation: contemporary themes and challenges,”Computers in Industry, vol. 115, p. 103162, 2020
work page 2020
-
[37]
From robotic process automation to intelli- gent process automation: –emerging trends–,
T. Chakraborti, V. Isahagian, R. Khalaf, Y . Khazaeni, V. Muthusamy, Y . Rizk, and M. Unuvar, “From robotic process automation to intelli- gent process automation: –emerging trends–,” inBusiness Process Management: Blockchain and Robotic Process Automation Forum: BPM 2020 Blockchain and RPA Forum, Seville, Spain, September 13–18, 2020, Proceedings 18. Spr...
work page 2020
-
[38]
Robotic process automation: a scientific and industrial systematic mapping study,
J. G. Enríquez, A. Jiménez-Ramírez, F . J. Domínguez-Mayo, and J. A. García-García, “Robotic process automation: a scientific and industrial systematic mapping study,”IEEE Access, vol. 8, pp. 39 113–39 129, 2020
work page 2020
-
[39]
Robotic process automation and artificial intelligence in industry 4.0–a literature review,
J. Ribeiro, R. Lima, T. Eckhardt, and S. Paiva, “Robotic process automation and artificial intelligence in industry 4.0–a literature review,”Procedia Computer Science, vol. 181, pp. 51–58, 2021
work page 2021
-
[40]
Why many challenges with gui test automation (will) remain,
M. Nass, E. Alégroth, and R. Feldt, “Why many challenges with gui test automation (will) remain,”Information and Software Technology, vol. 138, p. 106625, 2021
work page 2021
-
[41]
Research challenges for intelligent robotic process automation,
S. Agostinelli, A. Marrella, and M. Mecella, “Research challenges for intelligent robotic process automation,” in Business Process Management Workshops: BPM 2019 International Workshops, Vienna, Austria, September 1–6, 2019, Revised Selected Papers
work page 2019
-
[42]
Springer, 2019, pp. 12–18
work page 2019
-
[43]
Task automation intel- ligent agents: A review,
A. Wali, S. Mahamad, and S. Sulaiman, “Task automation intel- ligent agents: A review,” Future Internet, vol. 15, no. 6, p. 196, 2023
work page 2023
-
[44]
An in-depth survey of large language model-based artificial intelligence agents,
P . Zhao, Z. Jin, and N. Cheng, “An in-depth survey of large language model-based artificial intelligence agents,”arXiv preprint arXiv:2309.14365, 2023
-
[45]
Exploring large language model based intelligent agents: Definitions, methods, and prospects,
Y . Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F . Yin, J. Zhaoet al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” arXiv preprint arXiv:2401.03428, 2024
-
[46]
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
Y . Li, H. Wen, W. Wang, X. Li, Y . Yuan, G. Liu, J. Liu, W. Xu, X. Wang, Y . Sun et al. , “Personal llm agents: Insights and survey about the capability, efficiency and security,”arXiv preprint arXiv:2401.05459, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[47]
The Rise and Potential of Large Language Model Based Agents: A Survey
Z. Xi, W. Chen, X. Guo, W. He, Y . Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,”arXiv preprint arXiv:2309.07864, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
A survey on large language model based autonomous agents,
L. Wang, C. Ma, X. Feng, Z. Zhang, H. Y ang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Linet al., “A survey on large language model based autonomous agents,”Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024
work page 2024
-
[49]
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
T. Guo, X. Chen, Y . Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi- agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[50]
LLM Multi-Agent Systems: Challenges and Open Problems
S. Han, Q. Zhang, Y . Y ao, W. Jin, Z. Xu, and C. He, “Llm multi- agent systems: Challenges and open problems,” arXiv preprint arXiv:2402.03578, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[51]
Llm-based multi-agent rein- forcement learning: Current and future directions,
C. Sun, S. Huang, and D. Pompili, “Llm-based multi-agent rein- forcement learning: Current and future directions,”arXiv preprint arXiv:2405.11106, 2024
-
[52]
Understanding the planning of LLM agents: A survey
X. Huang, W. Liu, X. Chen, X. Wang, H. Wang, D. Lian, Y . Wang, R. Tang, and E. Chen, “Understanding the planning of llm agents: A survey,”arXiv preprint arXiv:2402.02716, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
A survey on large language models for automated planning,
M. Aghzal, E. Plaku, G. J. Stein, and Z. Y ao, “A survey on large language models for automated planning,” arXiv preprint arXiv:2502.12435, 2025
-
[54]
arXiv preprint arXiv:2501.07278 , year =
J. Zheng, C. Shi, X. Cai, Q. Li, D. Zhang, C. Li, D. Yu, and Q. Ma, “Lifelong learning of large language model based agents: A roadmap,”arXiv preprint arXiv:2501.07278, 2025
-
[55]
A Survey on the Memory Mechanism of Large Language Model based Agents
Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J.-R. Wen, “A survey on the memory mechanism of large language model based agents,”arXiv preprint arXiv:2404.13501, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[56]
A survey on evaluation of large language models,
Y . Chang, X. Wang, J. Wang, Y . Wu, L. Y ang, K. Zhu, H. Chen, X. Yi, C. Wang, Y . Wanget al., “A survey on evaluation of large language models,”ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024
work page 2024
-
[57]
A survey on multimodal benchmarks: In the era of large ai models,
L. Li, G. Chen, H. Shi, J. Xiao, and L. Chen, “A survey on multimodal benchmarks: In the era of large ai models,” arXiv preprint arXiv:2409.18142, 2024
-
[58]
Benchmark evaluations, applications, and challenges of large vision language models: A survey,
Z. Li, X. Wu, H. Du, H. Nghiem, and G. Shi, “Benchmark evaluations, applications, and challenges of large vision language models: A survey,”arXiv preprint arXiv:2501.02189, 2025
-
[59]
A survey on evaluation of multimodal large language models
J. Huang and J. Zhang, “A survey on evaluation of multimodal large language models,”arXiv preprint arXiv:2408.15769, 2024
-
[60]
arXiv preprint arXiv:2402.15116 (2024)
J. Xie, Z. Chen, R. Zhang, X. Wan, and G. Li, “Large multimodal agents: A survey,”arXiv preprint arXiv:2402.15116, 2024
-
[61]
Agent AI: Surveying the Horizons of Multimodal Interaction
Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkar, R. Taori, Y. Noda, D. Terzopoulos, Y. Choi et al., “Agent ai: Surveying the horizons of multimodal interaction,” arXiv preprint arXiv:2401.03568, 2024
-
[62]
Foundations and recent trends in multimodal mobile agents: A survey,
B. Wu, Y. Li, M. Fang, Z. Song, Z. Zhang, Y. Wei, and L. Chen, “Foundations and recent trends in multimodal mobile agents: A survey,” arXiv preprint arXiv:2411.02006, 2024
-
[63]
Gui agents with foundation models: A comprehensive survey,
S. Wang, W. Liu, J. Chen, W. Gan, X. Zeng, S. Yu, X. Hao, K. Shao, Y. Wang, and R. Tang, “Gui agents with foundation models: A comprehensive survey,” 2024. [Online]. Available: https://arxiv.org/abs/2411.04890
-
[64]
Generalist virtual agents: A survey on autonomous agents across digital platforms,
M. Gao, W. Bu, B. Miao, Y. Wu, Y. Li, J. Li, S. Tang, Q. Wu, Y. Zhuang, and M. Wang, “Generalist virtual agents: A survey on autonomous agents across digital platforms,” arXiv preprint arXiv:2411.10943, 2024
-
[65]
D. Nguyen, J. Chen, Y. Wang, G. Wu, N. Park, Z. Hu, H. Lyu, J. Wu, R. Aponte, Y. Xia, X. Li, J. Shi, H. Chen, V. D. Lai, Z. Xie, S. Kim, R. Zhang, T. Yu, M. Tanjim, N. K. Ahmed, P. Mathur, S. Yoon, L. Yao, B. Kveton, T. H. Nguyen, T. Bui, T. Zhou, R. A. Rossi, and F. Dernoncourt, “Gui agents: A survey,” 2024. [Online]. Available: https://arxiv.org/a...
-
[66]
Llm-powered gui agents in phone automation: Surveying progress and prospects,
G. Liu, P. Zhao, L. Liu, Y. Guo, H. Xiao, W. Lin, Y. Chai, Y. Han, S. Ren, H. Wang et al., “Llm-powered gui agents in phone automation: Surveying progress and prospects,” arXiv preprint arXiv:2504.19838, 2025
-
[67]
Os agents: A survey on mllm-based agents for general computing devices use,
X. Hu, T. Xiong, B. Yi, Z. Wei, R. Xiao, Y. Chen, J. Ye, M. Tao, X. Zhou, Z. Zhao et al., “Os agents: A survey on mllm-based agents for general computing devices use,” 2024
-
[68]
Towards trustworthy gui agents: A survey,
Y. Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,” arXiv preprint arXiv:2503.23434, 2025
-
[69]
L. Ning, Z. Liang, Z. Jiang, H. Qu, Y. Ding, W. Fan, X.-y. Wei, S. Lin, H. Liu, P. S. Yu et al., “A survey of webagents: Towards next-generation ai agents for web automation with large foundation models,” arXiv preprint arXiv:2503.23350, 2025
-
[70]
A survey on (m)llm-based gui agents,
F. Tang, H. Xu, H. Zhang, S. Chen, X. Wu, Y. Shen, W. Zhang, G. Hou, Z. Tan, Y. Yan, K. Song, J. Shao, W. Lu, J. Xiao, and Y. Zhuang, “A survey on (m)llm-based gui agents,” 2025. [Online]. Available: https://arxiv.org/abs/2504.13865
-
[71]
A summary on gui agents with foundation models enhanced by reinforcement learning,
J. Li and K. Huang, “A summary on gui agents with foundation models enhanced by reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2504.20464
-
[72]
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
P. J. Sager, B. Meyer, P. Yan, R. von Wartburg-Kottler, L. Etaiwi, A. Enayati, G. Nobel, A. Abdulkadir, B. F. Grewe, and T. Stadelmann, “Ai agents for computer use: A review of instruction-based computer control, gui automation, and operator assistants,” arXiv preprint arXiv:2501.16150, 2025
-
[73]
Cytestion: Automated gui testing for web applications,
T. S. d. Moura, E. L. Alves, H. F. d. Figueirêdo, and C. d. S. Baptista, “Cytestion: Automated gui testing for web applications,” in Proceedings of the XXXVII Brazilian Symposium on Software Engineering, 2023, pp. 388–397
-
[74]
Sikuli: using gui screenshots for search and automation,
T. Yeh, T.-H. Chang, and R. C. Miller, “Sikuli: using gui screenshots for search and automation,” in Proceedings of the 22nd annual ACM symposium on User interface software and technology, 2009, pp. 183–192
-
[75]
Prediction and entropy of printed english,
C. E. Shannon, “Prediction and entropy of printed english,” Bell system technical journal, vol. 30, no. 1, pp. 50–64, 1951
-
[76]
N-gram-based text categorization,
W. B. Cavnar, J. M. Trenkle et al., “N-gram-based text categorization,” in Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, vol. 161175. Ann Arbor, Michigan, 1994, p. 14
-
[77]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014
-
[78]
Language Models are Few-Shot Learners
B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, vol. 1, 2020
-
[79]
Finetuned Language Models Are Zero-Shot Learners
J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” arXiv preprint arXiv:2109.01652, 2021
-
[80]
L. R. Medsker, L. Jain et al., “Recurrent neural networks,” Design and Applications, vol. 5, no. 64-67, p. 2, 2001