arxiv: 2510.24168 · v3 · submitted 2025-10-28 · 💻 cs.AI

MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

Weihua Cheng, Junming Liu, Yifei Sun, Botian Shi, Yirong Chen, Ding Wang

Pith reviewed 2026-05-18 03:36 UTC · model grok-4.3

classification 💻 cs.AI

keywords GUI agents · structured memory · multimodal large language models · state transitions · observation-centric interaction · long-horizon automation · error cascades · OSWorld

The pith

MGA links GUI decisions through compact verified state changes instead of full history logs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes replacing raw sequential histories in multimodal GUI agents with a structured memory of state transitions. According to the authors, current approaches overload the context and add redundant expert modules, which triggers error cascades and high latency over long tasks. MGA first observes the screen without task intent, then compresses each step into verified state deltas that link otherwise independent decisions. This keeps the system simple while still handling open-ended automation. Experiments on OSWorld and real-world applications are reported to show performance competitive with more complex designs.

Core claim

MGA decouples long-horizon trajectories into independent decision steps linked by a structured state memory under an Observe First and Memory Enhancement principle. An Observer module reads screen states without task bias or intent, while Structured Memory distills each step into validated deltas that form a lightweight transition chain, removing irrelevant history and extra components.

What carries the argument

Structured Memory mechanism that distills each interaction step into verified state deltas to form a lightweight state transition chain.
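As a rough illustration of what a chain of verified state deltas could look like, the sketch below stores only the changes between two consecutive screen observations, and only when the observed change contains the change the action was expected to produce. Everything here is an assumption made for illustration: the names `StateDelta`, `StructuredMemory`, `diff_observations`, and `record`, the dictionary representation of a screen, and the validation rule are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class StateDelta:
    """One verified change between two consecutive screen observations."""
    step: int
    action: str
    changed: dict  # element name -> (old value, new value)


def diff_observations(before: dict, after: dict) -> dict:
    """Return element -> (old, new) for every element whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


@dataclass
class StructuredMemory:
    """Hypothetical memory that keeps verified deltas instead of raw history."""
    chain: list = field(default_factory=list)

    def record(self, action: str, before: dict, after: dict, expected: dict) -> bool:
        """Append a delta only if something changed and it matches `expected`."""
        observed = diff_observations(before, after)
        verified = bool(observed) and all(
            k in observed and observed[k][1] == v for k, v in expected.items()
        )
        if verified:
            self.chain.append(StateDelta(len(self.chain) + 1, action, observed))
        return verified

    def render(self) -> str:
        """Compact text form that a later decision step could take as context."""
        return "\n".join(
            f"{d.step}. {d.action}: "
            + ", ".join(f"{k} {old!r} -> {new!r}" for k, (old, new) in d.changed.items())
            for d in self.chain
        )


memory = StructuredMemory()
before = {"dialog": None, "title": "untitled"}
after = {"dialog": "Save As", "title": "untitled"}
memory.record("click Save", before, after, expected={"dialog": "Save As"})
# The screen did not change here, so this step is not added to the chain.
memory.record("type filename", after, after, expected={"filename": "a.txt"})
print(memory.render())
```

In this toy version an unverified step is simply left out of the chain; how the actual system handles failed validation is not described on this page.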

If this is right

Error cascades from concatenated histories decrease because decisions no longer depend on full past trajectories.
System redundancy drops by removing over-engineered expert modules.
Inference latency falls while performance on long open-ended GUI tasks remains competitive.
The design scales more easily to real-world applications without added complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar state-delta linking could reduce context problems in sequential tasks outside GUI, such as robotic planning.
Integrating external verification for the deltas might further strengthen reliability on ambiguous screens.
Independent steps open the possibility of selective re-planning only on failed transitions rather than restarting entire histories.
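The selective re-planning idea in the last point can be sketched as a loop that retries only the step whose transition failed, leaving earlier steps untouched. This is an editorial illustration, not something the paper describes: `run_with_selective_replan`, the `execute` and `replan` callbacks, and the retry limit are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def run_with_selective_replan(
    steps: List[str],
    execute: Callable[[str], bool],
    replan: Callable[[str], str],
    max_retries: int = 2,
) -> Tuple[List[Tuple[int, str, int]], Optional[int]]:
    """Run steps in order; when a transition fails, re-plan only that step.

    Returns the log of completed steps as (index, action used, retries needed)
    and the index of the first step that could not be recovered, or None.
    """
    log = []
    for i, step in enumerate(steps):
        attempt = step
        for retry in range(max_retries + 1):
            if execute(attempt):  # transition verified
                log.append((i, attempt, retry))
                break
            attempt = replan(attempt)  # revise this step only
        else:
            return log, i  # retries exhausted; earlier steps are kept
    return log, None


# Toy callbacks: one misspelled action fails until it is re-planned.
def toy_execute(action: str) -> bool:
    return action != "click Sav"


def toy_replan(action: str) -> str:
    return "click Save"


log, failed_at = run_with_selective_replan(
    ["open menu", "click Sav", "type name"], toy_execute, toy_replan
)
print(log, failed_at)
```

The point of the sketch is that a failure at step `i` never discards the verified steps before it, which is only possible if each step can be judged on its own.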

Load-bearing premise

The Observer module can read screen states without any task intent or bias, preventing hallucinations and perception errors at the source.

What would settle it

Run MGA and baseline agents on the same OSWorld tasks known to trigger visual misreads, then measure whether MGA's error rate stays below the baselines' when the memory validation step is removed.
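A minimal sketch of how such a comparison could be tabulated is shown below. The condition names and the pass/fail lists are placeholders invented for illustration, not measurements from the paper or from OSWorld; `error_rate` and `compare_conditions` are hypothetical helper names.

```python
from typing import Dict, List


def error_rate(outcomes: List[bool]) -> float:
    """Fraction of failed tasks, where each outcome is True on success."""
    if not outcomes:
        raise ValueError("no outcomes recorded for this condition")
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def compare_conditions(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Map each condition name to its error rate on the shared task set."""
    return {name: error_rate(outcomes) for name, outcomes in results.items()}


# Placeholder outcomes only; real runs would use the same task list per condition.
placeholder_results = {
    "baseline": [True, False, False, True],
    "mga_full": [True, True, False, True],
    "mga_without_validation": [True, False, True, True],
}
for name, rate in compare_conditions(placeholder_results).items():
    print(f"{name}: {rate:.2f}")
```

With real data, the quantity of interest would be the gap between the ablated variant and the baselines on the misread-prone tasks.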

Figures

Figures reproduced from arXiv: 2510.24168 by Botian Shi, Ding Wang, Junming Liu, Weihua Cheng, Yifei Sun, Yirong Chen.

**Figure 2.** Detailed workflow of MGA showing internal data flow among the Observer, Memory Agent, Planner, and Grounding

Original abstract

Multimodal Large Language Models (MLLMs) have significantly advanced GUI agents, yet long-horizon automation remains constrained by two critical bottlenecks: context overload from raw sequential trajectory dependence and architectural redundancy from over-engineered expert modules. Prevailing End-to-End and Multi-Agent paradigms struggle with error cascades caused by concatenated visual-textual histories and incur high inference latency due to redundant expert components, limiting their practical deployment. To address these issues, we propose the Memory-Driven GUI Agent (MGA), a minimalist framework that decouples long-horizon trajectories into independent decision steps linked by a structured state memory. MGA operates on an "Observe First and Memory Enhancement" principle, powered by two tightly coupled core mechanisms: (1) an Observer module that acts as a task-agnostic, intent-free screen state reader to eliminate confirmation bias, visual hallucinations, and perception bias at the root; and (2) a Structured Memory mechanism that distills, validates, and compresses each interaction step into verified state deltas, constructing a lightweight state transition chain to avoid irrelevant historical interference and system redundancy. By replacing raw historical aggregation with compact, fact-based memory transitions, MGA drastically reduces cognitive overhead and system complexity. Extensive experiments on OSWorld and real-world applications demonstrate that MGA achieves highly competitive performance in open-ended GUI tasks while maintaining architectural simplicity, offering a scalable and efficient blueprint for next-generation GUI automation (https://github.com/MintyCo0kie/MGA4OSWorld).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes the Memory-Driven GUI Agent (MGA), a minimalist framework for GUI agents that decouples long-horizon trajectories into independent steps via an Observer module (task-agnostic, intent-free screen state reader) and a Structured Memory mechanism that distills interactions into verified state deltas and transition chains. It claims this approach reduces context overload, architectural redundancy, and error cascades compared to end-to-end and multi-agent paradigms, while achieving highly competitive performance on OSWorld and real-world applications.

Significance. If the central mechanisms hold, MGA would provide a scalable blueprint for GUI automation by replacing raw historical aggregation with compact, fact-based memory transitions, lowering cognitive overhead and inference latency. The architectural simplicity and focus on observation-centric interaction represent a potentially useful direction if the performance claims are substantiated with quantitative evidence.

major comments (1)

[Section 3.2] The claim that the Observer module 'eliminates confirmation bias, visual hallucinations, and perception bias at the root' by being task-agnostic and intent-free is load-bearing for the error-cascade reduction argument. However, many GUI states (e.g., context-dependent dialog boxes) are under-specified without task intent; an intent-free reader may therefore produce incomplete or misleading deltas that the subsequent Structured Memory cannot fully compensate for, undermining both the claimed reduction in cognitive overhead and competitive long-horizon performance.

minor comments (1)

[Abstract] The statement that 'extensive experiments... demonstrate that MGA achieves highly competitive performance' would be strengthened by including at least one or two key quantitative metrics (e.g., success rates on OSWorld) and brief baseline comparisons.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. The major comment identifies a potential limitation in the Observer module's design. We respond to it below and have incorporated clarifications into the revised manuscript.

Point-by-point responses

Referee: [Section 3.2] The claim that the Observer module 'eliminates confirmation bias, visual hallucinations, and perception bias at the root' by being task-agnostic and intent-free is load-bearing for the error-cascade reduction argument. However, many GUI states (e.g., context-dependent dialog boxes) are under-specified without task intent; an intent-free reader may therefore produce incomplete or misleading deltas that the subsequent Structured Memory cannot fully compensate for, undermining both the claimed reduction in cognitive overhead and competitive long-horizon performance.

Authors: We agree that purely visual descriptions can leave some GUI elements (such as context-dependent dialogs) under-specified when viewed in isolation. Our design addresses this through explicit separation of concerns rather than by claiming the Observer alone resolves all ambiguity. The Observer produces a neutral, task-agnostic enumeration of visible UI elements, text, and layout. Task intent is supplied only at the subsequent decision step, while the Structured Memory supplies compact, verified state deltas that record how the screen evolved from prior actions. This chaining supplies the missing context without reintroducing confirmation bias into perception. Ablation results in the paper show that removing the intent-free constraint increases error rates on long-horizon tasks, supporting the claimed reduction in cascades. To address the referee's point directly, we have added a new paragraph in Section 3.2 that discusses ambiguous states, provides concrete dialog-box examples from OSWorld, and explains how memory transitions mitigate incompleteness. We also include additional per-task breakdowns in the appendix for dialog-heavy scenarios.

revision: partial

Circularity Check

0 steps flagged

No circularity: design claims rest on independent mechanisms and empirical results

full rationale

The paper presents MGA as a framework that decouples trajectories via an Observer (defined as task-agnostic and intent-free) and Structured Memory (for distilling state deltas). These are architectural choices whose benefits for reducing context overload and error cascades are asserted from the design and then evaluated empirically on OSWorld and real-world tasks. No equations, fitted parameters renamed as predictions, or self-citation chains are present that would make the performance or overhead-reduction claims equivalent to their inputs by construction. The derivation chain is therefore self-contained and externally testable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The framework introduces two new mechanisms built on existing MLLM technology; no explicit free parameters are described, and the approach relies on domain assumptions about model perception rather than new postulates.

axioms (1)

domain assumption: Multimodal large language models can serve as effective task-agnostic screen state readers when prompted without task context.
This underpins the Observer module's ability to eliminate biases as stated in the abstract.

invented entities (2)

Observer module (no independent evidence)
purpose: Task-agnostic screen state reader to remove confirmation and perception biases
New component introduced to address root causes of error cascades.

Structured Memory mechanism (no independent evidence)
purpose: Distills interactions into verified state deltas for lightweight transition chains
New component to replace raw history aggregation.

pith-pipeline@v0.9.0 · 5809 in / 1278 out tokens · 40043 ms · 2026-05-18T03:36:10.861616+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Observer module that acts as a task-agnostic, intent-free screen state reader to eliminate confirmation bias... Structured Memory mechanism that distills, validates, and compresses each interaction step into verified state deltas"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
