pith. sign in

arxiv: 2606.10522 · v1 · pith:F6FCUIURnew · submitted 2026-06-09 · 💻 cs.CV

GUI-AC: Enhancing Continual Learning in GUI Agents

Pith reviewed 2026-06-27 13:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords GUI agentscontinual learningreinforcement fine-tuninggrounding certaintyadaptive advantagedynamic clippingdistribution shift
0
0 comments X

The pith

GUI-AC stabilizes reinforcement fine-tuning for GUI agents by using grounding certainty for adaptive advantage and dynamic clipping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reinforcement fine-tuning of GUI agents suffers from instabilities including sharp reward discontinuities, high-variance oscillations, imbalanced rollouts that cause policy overconfidence, and fixed clipping that limits adaptation to new interfaces. It introduces grounding certainty as a quantity that enables two fixes: adaptive advantage estimation to down-weight noisy signals and dynamic clipping to relax bounds and restore exploration. If these mechanisms succeed, agents can continue learning across the non-stationary stream of new GUI domains and resolutions without collapsing. This matters because GUIs remain the primary human-computer interface and current agents cannot handle the continual emergence of unseen screens or applications. Experiments are presented showing the combined mechanisms produce higher performance than prior baselines.

Core claim

GUI-AC introduces grounding certainty to support Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence, and Dynamic Clipping, which relaxes the clipping bound to encourage exploration range. These mechanisms jointly address instabilities in reinforcement fine-tuning and enable GUI agents to handle persistent distribution shifts from novel interface instances.

What carries the argument

Grounding certainty, which supplies the signal for adaptive advantage estimation and dynamic clipping bounds during reinforcement fine-tuning of GUI agents.

If this is right

  • Policy overconfidence from imbalanced rollout outcomes is reduced during adaptation to new interfaces.
  • Exploration capacity is preserved because the clipping bound can increase when grounding certainty is high.
  • Overall task performance exceeds that of prior state-of-the-art reinforcement fine-tuning methods on continual GUI benchmarks.
  • Agents maintain grounding capability across repeated distribution shifts without requiring full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same certainty signal might be applied to other reinforcement-learning settings that exhibit high-variance advantage estimates and fixed clipping.
  • If the computation of grounding certainty generalizes, it could reduce reliance on manual hyperparameter schedules for clipping and advantage normalization.
  • The approach may extend naturally to agents operating in other non-stationary interactive environments such as web browsers or mobile applications.

Load-bearing premise

The listed instabilities are the dominant causes of failure in reinforcement fine-tuning for GUI continual learning, and grounding certainty can be computed reliably enough to correct them without creating new failure modes.

What would settle it

Training the method on a held-out collection of previously unseen GUI domains and resolutions and observing either no performance gain over baselines or the reappearance of the same reward discontinuities and exploration collapse would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.10522 by Can Lin, Dan Zhang, Hangjie Yuan, Tao Feng, Yifan Zhu, Zhonghong Ou.

Figure 1
Figure 1. Figure 1: (a) The RFT-based method initially yields steady rewards. However, when transitioning to [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of GUI- [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Hyperparameter sensitivity analysis for amin and k. Performance peaks at (amin = 0.2, k = 3.0) on ScreenSpot-V1, ScreenSpot-V2 and ScreenSpot-Pro benchmarks [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Histograms of grounding certainty under n = 4 sampled candidates per instruction. distributional widening after shifts, and preserves the bulk of the reward mass. Overall, these results support that GUI-AC enhances training stability for continual learning of GUI agents. Visualization Comparison [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Visualizations of interaction region heatmaps [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: GUI-AC maintains consistently smooth, high-reward trajectories and recovers rapidly after [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
read the original abstract

Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to encourage exploration range. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at https://anonymous.4open.science/r/GUI-AC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes GUI-AC to improve continual learning for GUI agents under non-stationary interface distributions. It diagnoses four instabilities in reinforcement fine-tuning (sharp reward discontinuities, high-variance oscillations, imbalanced rollout outcomes causing noisy advantage estimates, and fixed clipping that limits exploration) and introduces a grounding-certainty signal to drive two mechanisms: Adaptive Advantage (down-weighting noisy advantages to avoid overconfidence) and Dynamic Clipping (relaxing the clipping bound to restore exploration capacity). The abstract asserts that these mechanisms jointly yield performance gains that surpass state-of-the-art baselines, with code released anonymously.

Significance. If the experimental claims hold and the grounding-certainty signal proves robust across distribution shifts, the work would address a practically important obstacle in deploying GUI agents. The explicit targeting of RFT instabilities and the provision of anonymous code are positive features that could support reproducibility and follow-on research in continual learning for interface agents.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines' is unsupported by any referenced metrics, tables, ablation studies, or error analysis. This absence is load-bearing for the primary contribution.
  2. [Abstract] Abstract: no equations, pseudocode, or formal definitions are supplied for grounding certainty, the Adaptive Advantage weighting function, or the Dynamic Clipping schedule. Without these, it is impossible to evaluate whether the certainty signal is independent of the very policy errors (reward discontinuities, high-variance rollouts) it is intended to correct, leaving the skeptic concern about circularity unaddressed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and constructive feedback. We address each major comment on the abstract below, proposing targeted revisions where appropriate to strengthen the presentation of our claims and method details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines' is unsupported by any referenced metrics, tables, ablation studies, or error analysis. This absence is load-bearing for the primary contribution.

    Authors: We agree that referencing specific empirical support would make the abstract's central claim more robust. In the revised version, we will incorporate concise mentions of key quantitative results (such as average success rate improvements and comparisons to baselines) along with explicit references to the primary results table and ablation studies presented in the experimental section. This revision will directly substantiate the claim within the abstract while adhering to length limits. revision: yes

  2. Referee: [Abstract] Abstract: no equations, pseudocode, or formal definitions are supplied for grounding certainty, the Adaptive Advantage weighting function, or the Dynamic Clipping schedule. Without these, it is impossible to evaluate whether the certainty signal is independent of the very policy errors (reward discontinuities, high-variance rollouts) it is intended to correct, leaving the skeptic concern about circularity unaddressed.

    Authors: Abstracts conventionally omit equations and pseudocode to maintain brevity; these are fully provided in the main manuscript (formal definition and weighting function for grounding certainty and Adaptive Advantage in Section 3.2, Dynamic Clipping schedule in Section 3.3, and pseudocode in Algorithm 1). On the circularity concern, grounding certainty is computed via a dedicated grounding evaluation on localization accuracy for interface elements, which operates independently of the rollout-derived advantage estimates and reward signals. We will add a brief clarifying phrase to the revised abstract and expand the independence discussion in Section 3 to explicitly address this point. revision: partial

Circularity Check

0 steps flagged

No derivation chain or equations; claims rest on experimental results only.

full rationale

The provided abstract and description contain no equations, derivations, fitted parameters, or self-citations of prior uniqueness theorems. The method is introduced at the level of high-level mechanisms (Adaptive Advantage, Dynamic Clipping) justified by addressing listed instabilities, with performance claims supported solely by 'extensive experiments.' No load-bearing step reduces by construction to its own inputs, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, parameters, or axioms are described in the abstract; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5793 in / 955 out tokens · 17714 ms · 2026-06-27T13:54:22.469287+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 15 linked inside Pith

  1. [1]

    GUI Agents with Foundation Models: A Comprehensive Survey.arXiv preprint arXiv:2411.04890, 2024

    Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. GUI Agents with Foundation Models: A Comprehensive Survey.arXiv preprint arXiv:2411.04890, 2024

  2. [2]

    A Survey on (M)LLM-Based GUI Agents.arXiv preprint arXiv:2504.13865, 2025

    Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, et al. A Survey on (M)LLM-Based GUI Agents.arXiv preprint arXiv:2504.13865, 2025

  3. [3]

    CogAgent: A Visual Language Model for GUI Agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A Visual Language Model for GUI Agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

  4. [4]

    KG-RAG: Enhancing GUI Agent Decision- Making via Knowledge Graph-Driven Retrieval-Augmented Generation

    Ziyi Guan, Jason Chun Lok Li, Zhijian Hou, Pingping Zhang, Donglai Xu, Yuzhi Zhao, Mengyang Wu, Jinpeng Chen, Thanh-Toan Nguyen, Pengfei Xian, et al. KG-RAG: Enhancing GUI Agent Decision- Making via Knowledge Graph-Driven Retrieval-Augmented Generation. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5396–5405, 2025

  5. [5]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 9313–9332, 2024

  6. [6]

    ShowUI: One Vision-Language-Action Model for GUI Visual Agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One Vision-Language-Action Model for GUI Visual Agent. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

  7. [7]

    Continual GUI Agents.arXiv, 2026

    Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, and Tao Feng. Continual GUI Agents.arXiv, 2026

  8. [8]

    Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation

    Tao Feng, Mang Wang, and Hangjie Yuan. Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9427–9436, 2022

  9. [9]

    Test-Time Reinforcement Learning for GUI Grounding via Region Consistency.arXiv preprint arXiv:2508.05615, 2025

    Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, and Yongliang Shen. Test-Time Reinforcement Learning for GUI Grounding via Region Consistency.arXiv preprint arXiv:2508.05615, 2025

  10. [10]

    GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents.arXiv preprint arXiv:2509.15532, 2025

    Xianhang Ye, Yiqing Li, Wei Dai, Miancan Liu, Ziyuan Chen, Zhangye Han, Hongbo Min, Jinkui Ren, Xiantao Zhang, Wen Yang, et al. GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents.arXiv preprint arXiv:2509.15532, 2025. 10

  11. [11]

    UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding.arXiv preprint arXiv:2507.22025, 2025

    Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, and Hui Li. UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding.arXiv preprint arXiv:2507.22025, 2025

  12. [12]

    Visual-RFT: Visual Reinforcement Fine-Tuning.arXiv preprint arXiv:2503.01785, 2025

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual Reinforcement Fine-Tuning.arXiv preprint arXiv:2503.01785, 2025

  13. [13]

    Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective.arXiv preprint arXiv:2506.23508, 2025

    Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, et al. Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective.arXiv preprint arXiv:2506.23508, 2025

  14. [14]

    ICPO: Intrinsic Confidence- Driven Group Relative Preference Optimization for Efficient Reinforcement Learning.arXiv preprint arXiv:2511.21005, 2025

    Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, and Jian Luan. ICPO: Intrinsic Confidence- Driven Group Relative Preference Optimization for Efficient Reinforcement Learning.arXiv preprint arXiv:2511.21005, 2025

  15. [15]

    DCPO: Dynamic Clipping Policy Optimization.arXiv preprint arXiv:2509.02333, 2025

    Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, and Rihui Xin. DCPO: Dynamic Clipping Policy Optimization.arXiv preprint arXiv:2509.02333, 2025

  16. [16]

    An Empirical Study on Eliciting and Improving R1-like Reasoning Models.arXiv preprint arXiv:2503.04548, 2025

    Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, et al. An Empirical Study on Eliciting and Improving R1-like Reasoning Models.arXiv preprint arXiv:2503.04548, 2025

  17. [17]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476, 2025

  18. [18]

    OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web. InEuropean Conference on Computer Vision, pages 161–178. Springer, 2024

  19. [19]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents.arXiv preprint arXiv:2410.23218, 2024

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents.arXiv preprint arXiv:2410.23218, 2024

  20. [20]

    ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778–8786, 2025

  21. [21]

    InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. arXiv preprint arXiv:2504.14239, 2025

  22. [22]

    Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning.arXiv preprint arXiv:2505.12370, 2025

    Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning.arXiv preprint arXiv:2505.12370, 2025

  23. [23]

    GUI-G 2: Gaussian Reward Modeling for GUI Grounding

    Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G 2: Gaussian Reward Modeling for GUI Grounding. arXiv preprint arXiv:2507.15846, 2025

  24. [24]

    Qwen2.5-VL Technical Report.arXiv preprint arXiv:2502.13923, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report.arXiv preprint arXiv:2502.13923, 2025

  25. [25]

    Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents.arXiv preprint arXiv:2410.05243, 2024

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents.arXiv preprint arXiv:2410.05243, 2024

  26. [26]

    Large Language Model-Brained GUI Agents: A Survey.arXiv preprint arXiv:2411.18279, 2024

    Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large Language Model-Brained GUI Agents: A Survey.arXiv preprint arXiv:2411.18279, 2024

  27. [27]

    OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

    Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 5555–5579, 2025. 11

  28. [28]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face.Advances in Neural Information Processing Systems, 36:38154–38180, 2023

  29. [29]

    Aria-UI: Visual Grounding for GUI Instructions

    Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-UI: Visual Grounding for GUI Instructions. InFindings of the Association for Computational Linguistics: ACL 2025, pages 22418–22433, 2025

  30. [30]

    UIShift: Enhancing VLM-based GUI Agents through Self- supervised Reinforcement Learning.arXiv preprint arXiv:2505.12493, 2025

    Longxi Gao, Li Zhang, and Mengwei Xu. UIShift: Enhancing VLM-based GUI Agents through Self- supervised Reinforcement Learning.arXiv preprint arXiv:2505.12493, 2025

  31. [31]

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception.arXiv preprint arXiv:2401.16158, 2024

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception.arXiv preprint arXiv:2401.16158, 2024

  32. [32]

    UFO: A UI-Focused Agent for Windows OS Interaction

    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. UFO: A UI-Focused Agent for Windows OS Interaction. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V olume 1: Long Papers), pag...

  33. [33]

    AutoGLM: Autonomous Foundation Agents for GUIs.arXiv preprint arXiv:2411.00820, 2024

    Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. AutoGLM: Autonomous Foundation Agents for GUIs.arXiv preprint arXiv:2411.00820, 2024

  34. [34]

    Android in the Wild: A Large-Scale Dataset for Android Device Control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A Large-Scale Dataset for Android Device Control.Advances in Neural Information Processing Systems, 36:59708–59728, 2023

  35. [35]

    AppAgent: Multimodal Agents as Smartphone Users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal Agents as Smartphone Users. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

  36. [36]

    PP-OCR: A Practical Ultra Lightweight OCR System.arXiv preprint arXiv:2009.09941, 2020

    Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. PP-OCR: A Practical Ultra Lightweight OCR System.arXiv preprint arXiv:2009.09941, 2020

  37. [37]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  38. [38]

    OmniParser for Pure Vision Based GUI Agent.arXiv preprint arXiv:2408.00203, 2024

    Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. OmniParser for Pure Vision Based GUI Agent.arXiv preprint arXiv:2408.00203, 2024

  39. [39]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326, 2025

  40. [40]

    ScreenAI: A Vision-Language Model for UI and Infographics Understanding.arXiv preprint arXiv:2402.04615, 2024

    Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor C˘arbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. ScreenAI: A Vision-Language Model for UI and Infographics Understanding.arXiv preprint arXiv:2402.04615, 2024

  41. [41]

    GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents.arXiv preprint arXiv:2504.10458, 2025

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents.arXiv preprint arXiv:2504.10458, 2025

  42. [42]

    Continual Learning for Generative AI: From LLMs to MLLMs and Beyond, 2025

    Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Continual Learning for Generative AI: From LLMs to MLLMs and Beyond, 2025

  43. [43]

    LLaV A-c: Continual Improved Visual Instruction Tuning.arXiv preprint arXiv:2506.08666, 2025

    Wenzhuo Liu, Fei Zhu, Haiyang Guo, Longhui Wei, and Cheng-Lin Liu. LLaV A-c: Continual Improved Visual Instruction Tuning.arXiv preprint arXiv:2506.08666, 2025

  44. [44]

    RL’s Razor: Why Online Reinforcement Learning Forgets Less.arXiv preprint arXiv:2509.04259, 2025

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s Razor: Why Online Reinforcement Learning Forgets Less.arXiv preprint arXiv:2509.04259, 2025

  45. [45]

    RL Fine-Tuning Heals OOD Forgetting in SFT.arXiv preprint arXiv:2509.12235, 2025

    Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa. RL Fine-Tuning Heals OOD Forgetting in SFT.arXiv preprint arXiv:2509.12235, 2025. 12

  46. [46]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948, 2025

  47. [47]

    UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning.arXiv preprint arXiv:2503.21620, 2025

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, and Hongsheng Li. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning.arXiv preprint arXiv:2503.21620, 2025

  48. [48]

    GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents.arXiv preprint arXiv:2505.15810, 2025

    Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents.arXiv preprint arXiv:2505.15810, 2025. 13 A Appendix A.1 Reward Function and Its Interaction with the Method. In our implementation, GUI-AC uses the same task format as Continual GUI Agents [7]. This ensur...