GUI-AC: Enhancing Continual Learning in GUI Agents

Can Lin; Dan Zhang; Hangjie Yuan; Tao Feng; Yifan Zhu; Zhonghong Ou

arxiv: 2606.10522 · v1 · pith:F6FCUIURnew · submitted 2026-06-09 · 💻 cs.CV

Pith reviewed 2026-06-27 13:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords GUI agents · continual learning · reinforcement fine-tuning · grounding certainty · adaptive advantage · dynamic clipping · distribution shift

The pith

GUI-AC stabilizes reinforcement fine-tuning for GUI agents by using grounding certainty for adaptive advantage and dynamic clipping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that reinforcement fine-tuning of GUI agents suffers from instabilities including sharp reward discontinuities, high-variance oscillations, imbalanced rollouts that cause policy overconfidence, and fixed clipping that limits adaptation to new interfaces. It introduces grounding certainty as a quantity that enables two fixes: adaptive advantage estimation to down-weight noisy signals and dynamic clipping to relax bounds and restore exploration. If these mechanisms succeed, agents can continue learning across the non-stationary stream of new GUI domains and resolutions without collapsing. This matters because GUIs remain the primary human-computer interface and current agents cannot handle the continual emergence of unseen screens or applications. Experiments are presented showing the combined mechanisms produce higher performance than prior baselines.

Core claim

GUI-AC introduces grounding certainty to support Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence, and Dynamic Clipping, which relaxes the clipping bound to widen the exploration range. These mechanisms jointly address instabilities in reinforcement fine-tuning and enable GUI agents to handle persistent distribution shifts from novel interface instances.
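For orientation, the two mechanisms can be located in a standard clipped surrogate objective. The first equation below is the familiar PPO-style form and is given only as background. The second line is an assumption about where a certainty signal $c$ could enter; the weight $w(c)$ and the bound $\epsilon(c)$ are hypothetical symbols, not definitions taken from the paper.

```latex
% Background: standard clipped surrogate (PPO-style), not quoted from the paper.
\[
\mathcal{L}(\theta)
  = \mathbb{E}_i\!\left[
      \min\!\Big(\rho_i(\theta)\,\hat{A}_i,\;
      \operatorname{clip}\big(\rho_i(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_i\Big)
    \right],
\qquad
\rho_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
\]

% Assumed placement of the two mechanisms (hypothetical w and epsilon(c)):
\[
\hat{A}_i \;\longrightarrow\; w(c)\,\hat{A}_i \quad (0 < w(c) \le 1),
\qquad
\epsilon \;\longrightarrow\; \epsilon(c) \quad (\epsilon(c) \ge \epsilon)
\]
```

Read this way, Adaptive Advantage rescales the signal that multiplies the probability ratio, while Dynamic Clipping changes how far the ratio may move before the gradient is cut off.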

What carries the argument

Grounding certainty, which supplies the signal for adaptive advantage estimation and dynamic clipping bounds during reinforcement fine-tuning of GUI agents.
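The role of the certainty signal can be made concrete with a small sketch. Everything in it is an assumption made for illustration: certainty is taken to be the fraction of sampled click points that agree with the group median, the advantage weight is a linear ramp starting at `a_min`, and the upper clip bound grows linearly with certainty (the direction stated under "If this is right"). The name `a_min` echoes the hyperparameter amin in Figure 3, but how it enters here is a guess, not the paper's definition.

```python
from math import hypot
from statistics import mean, median, pstdev


def grounding_certainty(points, radius=0.05):
    """Hypothetical certainty: share of sampled click points (normalized
    coordinates) lying within `radius` of the group's median point."""
    cx = median(x for x, _ in points)
    cy = median(y for _, y in points)
    near = [hypot(x - cx, y - cy) <= radius for x, y in points]
    return sum(near) / len(points)


def adaptive_advantage(rewards, certainty, a_min=0.2):
    """Group-relative advantages scaled by an assumed certainty weight
    that ramps linearly from a_min (certainty 0) to 1 (certainty 1)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    weight = a_min + (1.0 - a_min) * certainty
    return [weight * (r - mu) / (sigma + 1e-6) for r in rewards]


def dynamic_clip_upper(certainty, eps=0.2, eps_max=0.4):
    """Assumed schedule: the upper clip bound widens from eps to eps_max
    as certainty rises."""
    return eps + (eps_max - eps) * certainty


# Four sampled predictions for one instruction: three agree, one is an outlier.
points = [(0.50, 0.50), (0.51, 0.50), (0.50, 0.49), (0.90, 0.10)]
rewards = [1.0, 1.0, 1.0, 0.0]

c = grounding_certainty(points)
adv = adaptive_advantage(rewards, c)
eps_hi = dynamic_clip_upper(c)
```

With three of four samples in agreement, the certainty is 0.75, so the advantages are scaled by 0.8 and the upper clip bound sits at 0.35 under these assumed schedules.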

If this is right

Policy overconfidence from imbalanced rollout outcomes is reduced during adaptation to new interfaces.
Exploration capacity is preserved because the clipping bound can increase when grounding certainty is high.
Overall task performance exceeds that of prior state-of-the-art reinforcement fine-tuning methods on continual GUI benchmarks.
Agents maintain grounding capability across repeated distribution shifts without requiring full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same certainty signal might be applied to other reinforcement-learning settings that exhibit high-variance advantage estimates and fixed clipping.
If the computation of grounding certainty generalizes, it could reduce reliance on manual hyperparameter schedules for clipping and advantage normalization.
The approach may extend naturally to agents operating in other non-stationary interactive environments such as web browsers or mobile applications.

Load-bearing premise

The listed instabilities are the dominant causes of failure in reinforcement fine-tuning for GUI continual learning, and grounding certainty can be computed reliably enough to correct them without creating new failure modes.

What would settle it

Training the method on a held-out collection of previously unseen GUI domains and resolutions and observing either no performance gain over baselines or the reappearance of the same reward discontinuities and exploration collapse would falsify the central claim.
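One way to operationalize this check is to scan a logged reward curve for the two symptoms named above. The sketch below is an assumption about how such a test might be scored: the function name, the window size, and the drop threshold are hypothetical choices, and it expects a non-empty list of per-step mean rewards.

```python
from statistics import pvariance


def stability_report(rewards, window=5, drop_threshold=0.3):
    """Summarize a reward trace: the largest single-step drop, the largest
    variance over any sliding window, and whether the drop crosses the
    (assumed) discontinuity threshold."""
    drops = [prev - cur for prev, cur in zip(rewards, rewards[1:])]
    largest_drop = max(drops + [0.0])
    if len(rewards) >= window:
        variances = [
            pvariance(rewards[i:i + window])
            for i in range(len(rewards) - window + 1)
        ]
    else:
        variances = [pvariance(rewards)]
    return {
        "largest_drop": largest_drop,
        "max_rolling_var": max(variances),
        "discontinuity": largest_drop >= drop_threshold,
    }


# Illustrative traces only; these are not results from the paper.
smooth = [0.50, 0.55, 0.60, 0.62, 0.65, 0.70]
collapse = [0.70, 0.72, 0.20, 0.25, 0.30, 0.35]

smooth_report = stability_report(smooth)
collapse_report = stability_report(collapse)
```

Running the same report on a baseline and on GUI-AC after each domain or resolution shift would show whether the discontinuities reappear on held-out interfaces.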

Figures

Figures reproduced from arXiv: 2606.10522 by Can Lin, Dan Zhang, Hangjie Yuan, Tao Feng, Yifan Zhu, Zhonghong Ou.

**Figure 2.** Illustration of GUI-

**Figure 3.** Hyperparameter sensitivity analysis for amin and k. Performance peaks at (amin = 0.2, k = 3.0) on the ScreenSpot-V1, ScreenSpot-V2, and ScreenSpot-Pro benchmarks.

**Figure 4.** Histograms of grounding certainty under n = 4 sampled candidates per instruction.

**Figure 6.** Visualizations of interaction region heatmaps.

**Figure 7.** GUI-AC maintains consistently smooth, high-reward trajectories and recovers rapidly after

Original abstract

Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to encourage exploration range. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at https://anonymous.4open.science/r/GUI-AC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper names real RFT instabilities for non-stationary GUI data and proposes grounding certainty to drive adaptive advantage plus dynamic clipping, but supplies no equations, ablations, or numbers to show the fixes work.

The core idea is straightforward: standard reinforcement fine-tuning struggles with GUI continual learning because new interfaces create reward discontinuities, high-variance rollouts, and noisy advantage estimates, while fixed clipping prevents the policy from adapting. The authors add a grounding certainty signal to down-weight bad advantages and relax the clip bound when needed.

That pairing is new for this setting. The paper does a clean job listing the concrete failure modes that come from imbalanced outcomes and distribution shift, and the two mechanisms follow logically from those problems.

The gaps are straightforward. The abstract states the problems and the proposed fixes but gives no derivation for how certainty is computed, no sensitivity checks, and no experimental results at all. The claim that the method surpasses baselines therefore sits on nothing visible. The stress-test worry lands: if certainty is estimated from the same unstable policy, it can easily reinforce the very overconfidence or under-exploration it is meant to correct, especially across novel domains. Nothing in the text shows the signal stays informative under shift.
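The worry about an uninformative signal can be phrased as a measurable diagnostic: correlate per-instruction certainty with whether the prediction was actually correct, before and after a shift. The sketch below is a hypothetical check, not something the paper reports; the function name and the toy numbers are illustrative assumptions.

```python
from math import sqrt


def certainty_informativeness(certainty, correct, tol=1e-12):
    """Pearson correlation between certainty scores and 0/1 correctness.
    Returns 0.0 when either input has no spread, since a constant signal
    cannot distinguish right from wrong predictions."""
    n = len(certainty)
    mc = sum(certainty) / n
    my = sum(correct) / n
    cov = sum((c - mc) * (y - my) for c, y in zip(certainty, correct))
    var_c = sum((c - mc) ** 2 for c in certainty)
    var_y = sum((y - my) ** 2 for y in correct)
    if var_c < tol or var_y < tol:
        return 0.0
    return cov / sqrt(var_c * var_y)


# Toy data: before the shift, certainty tracks correctness.
before = certainty_informativeness([1.0, 0.75, 0.25, 0.0], [1, 1, 0, 0])

# After the shift, the policy agrees with itself on every instruction,
# including the ones it gets wrong, so certainty carries no information.
after = certainty_informativeness([1.0, 1.0, 1.0, 1.0], [1, 0, 0, 1])
```

A score that stays clearly positive across shifts would answer the objection; a score that collapses toward zero would confirm it.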

This is for groups already running RFT on GUI agents who need practical stabilization tricks. A reader who wants a fully specified method with reproducible gains will find the abstract thin. The work deserves a serious referee because the problem is practical and the proposed changes are testable; a review can check whether the experiments actually close the loop on the instabilities they diagnose.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes GUI-AC to improve continual learning for GUI agents under non-stationary interface distributions. It diagnoses four instabilities in reinforcement fine-tuning (sharp reward discontinuities, high-variance oscillations, imbalanced rollout outcomes causing noisy advantage estimates, and fixed clipping that limits exploration) and introduces a grounding-certainty signal to drive two mechanisms: Adaptive Advantage (down-weighting noisy advantages to avoid overconfidence) and Dynamic Clipping (relaxing the clipping bound to restore exploration capacity). The abstract asserts that these mechanisms jointly yield performance gains that surpass state-of-the-art baselines, with code released anonymously.

Significance. If the experimental claims hold and the grounding-certainty signal proves robust across distribution shifts, the work would address a practically important obstacle in deploying GUI agents. The explicit targeting of RFT instabilities and the provision of anonymous code are positive features that could support reproducibility and follow-on research in continual learning for interface agents.

major comments (2)

[Abstract] The central claim that 'extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines' is unsupported by any referenced metrics, tables, ablation studies, or error analysis. This absence is load-bearing for the primary contribution.
[Abstract] No equations, pseudocode, or formal definitions are supplied for grounding certainty, the Adaptive Advantage weighting function, or the Dynamic Clipping schedule. Without these, it is impossible to evaluate whether the certainty signal is independent of the very policy errors (reward discontinuities, high-variance rollouts) it is intended to correct, leaving the skeptic's concern about circularity unaddressed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your thorough review and constructive feedback. We address each major comment on the abstract below, proposing targeted revisions where appropriate to strengthen the presentation of our claims and method details.

Referee: [Abstract] The central claim that 'extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines' is unsupported by any referenced metrics, tables, ablation studies, or error analysis. This absence is load-bearing for the primary contribution.

Authors: We agree that referencing specific empirical support would make the abstract's central claim more robust. In the revised version, we will incorporate concise mentions of key quantitative results (such as average success rate improvements and comparisons to baselines) along with explicit references to the primary results table and ablation studies presented in the experimental section. This revision will directly substantiate the claim within the abstract while adhering to length limits.

Revision: yes

Referee: [Abstract] No equations, pseudocode, or formal definitions are supplied for grounding certainty, the Adaptive Advantage weighting function, or the Dynamic Clipping schedule. Without these, it is impossible to evaluate whether the certainty signal is independent of the very policy errors (reward discontinuities, high-variance rollouts) it is intended to correct, leaving the skeptic's concern about circularity unaddressed.

Authors: Abstracts conventionally omit equations and pseudocode to maintain brevity; these are fully provided in the main manuscript (formal definition and weighting function for grounding certainty and Adaptive Advantage in Section 3.2, Dynamic Clipping schedule in Section 3.3, and pseudocode in Algorithm 1). On the circularity concern, grounding certainty is computed via a dedicated grounding evaluation on localization accuracy for interface elements, which operates independently of the rollout-derived advantage estimates and reward signals. We will add a brief clarifying phrase to the revised abstract and expand the independence discussion in Section 3 to explicitly address this point.

Revision: partial

Circularity Check

0 steps flagged

No derivation chain or equations; claims rest on experimental results only.

The provided abstract and description contain no equations, derivations, fitted parameters, or self-citations of prior uniqueness theorems. The method is introduced at the level of high-level mechanisms (Adaptive Advantage, Dynamic Clipping) justified by addressing listed instabilities, with performance claims supported solely by 'extensive experiments.' No load-bearing step reduces by construction to its own inputs, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, parameters, or axioms are described in the abstract; the ledger is therefore empty.

pith-pipeline@v0.9.1-grok · 5793 in / 955 out tokens · 17714 ms · 2026-06-27T13:54:22.469287+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 15 linked inside Pith

[1] Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. GUI Agents with Foundation Models: A Comprehensive Survey. arXiv preprint arXiv:2411.04890, 2024.
[2] Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, et al. A Survey on (M)LLM-Based GUI Agents. arXiv preprint arXiv:2504.13865, 2025.
[3] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A Visual Language Model for GUI Agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024.
[4] Ziyi Guan, Jason Chun Lok Li, Zhijian Hou, Pingping Zhang, Donglai Xu, Yuzhi Zhao, Mengyang Wu, Jinpeng Chen, Thanh-Toan Nguyen, Pengfei Xian, et al. KG-RAG: Enhancing GUI Agent Decision-Making via Knowledge Graph-Driven Retrieval-Augmented Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5396–5405, 2025.
[5] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024.
[6] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One Vision-Language-Action Model for GUI Visual Agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025.
[7] Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, and Tao Feng. Continual GUI Agents. arXiv, 2026.
[8] Tao Feng, Mang Wang, and Hangjie Yuan. Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9427–9436, 2022.
[9] Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, and Yongliang Shen. Test-Time Reinforcement Learning for GUI Grounding via Region Consistency. arXiv preprint arXiv:2508.05615, 2025.
[10] Xianhang Ye, Yiqing Li, Wei Dai, Miancan Liu, Ziyuan Chen, Zhangye Han, Hongbo Min, Jinkui Ren, Xiantao Zhang, Wen Yang, et al. GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents. arXiv preprint arXiv:2509.15532, 2025.
[11] Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, and Hui Li. UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding. arXiv preprint arXiv:2507.22025, 2025.
[12] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual Reinforcement Fine-Tuning. arXiv preprint arXiv:2503.01785, 2025.
[13] Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, et al. Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective. arXiv preprint arXiv:2506.23508, 2025.
[14] Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, and Jian Luan. ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning. arXiv preprint arXiv:2511.21005, 2025.
[15] Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, and Rihui Xin. DCPO: Dynamic Clipping Policy Optimization. arXiv preprint arXiv:2509.02333, 2025.
[16] Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, et al. An Empirical Study on Eliciting and Improving R1-like Reasoning Models. arXiv preprint arXiv:2503.04548, 2025.
[17] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476, 2025.
[18] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web. In European Conference on Computer Vision, pages 161–178. Springer, 2024.
[19] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. arXiv preprint arXiv:2410.23218, 2024.
[20] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 8778–8786, 2025.
[21] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. arXiv preprint arXiv:2504.14239, 2025.
[22] Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning. arXiv preprint arXiv:2505.12370, 2025.
[23] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G²: Gaussian Reward Modeling for GUI Grounding. arXiv preprint arXiv:2507.15846, 2025.
[24] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025.
[25] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. arXiv preprint arXiv:2410.05243, 2024.
[26] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large Language Model-Brained GUI Agents: A Survey. arXiv preprint arXiv:2411.18279, 2024.
[27] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5555–5579, 2025.
[28] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023.
[29] Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-UI: Visual Grounding for GUI Instructions. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22418–22433, 2025.
[30] Longxi Gao, Li Zhang, and Mengwei Xu. UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning. arXiv preprint arXiv:2505.12493, 2025.
[31] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. arXiv preprint arXiv:2401.16158, 2024.
[32] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. UFO: A UI-Focused Agent for Windows OS Interaction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pag...
[33] Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. AutoGLM: Autonomous Foundation Agents for GUIs. arXiv preprint arXiv:2411.00820, 2024.
[34] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A Large-Scale Dataset for Android Device Control. Advances in Neural Information Processing Systems, 36:59708–59728, 2023.
[35] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal Agents as Smartphone Users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025.
[36] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. PP-OCR: A Practical Ultra Lightweight OCR System. arXiv preprint arXiv:2009.09941, 2020.
[37] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
[38] Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. OmniParser for Pure Vision Based GUI Agent. arXiv preprint arXiv:2408.00203, 2024.
[39] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326, 2025.
[40] Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. ScreenAI: A Vision-Language Model for UI and Infographics Understanding. arXiv preprint arXiv:2402.04615, 2024.
[41] Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv preprint arXiv:2504.10458, 2025.
[42] Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Continual Learning for Generative AI: From LLMs to MLLMs and Beyond, 2025.
[43] Wenzhuo Liu, Fei Zhu, Haiyang Guo, Longhui Wei, and Cheng-Lin Liu. LLaVA-c: Continual Improved Visual Instruction Tuning. arXiv preprint arXiv:2506.08666, 2025.
[44] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s Razor: Why Online Reinforcement Learning Forgets Less. arXiv preprint arXiv:2509.04259, 2025.
[45] Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa. RL Fine-Tuning Heals OOD Forgetting in SFT. arXiv preprint arXiv:2509.12235, 2025.
[46] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025.
[47] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, and Hongsheng Li. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning. arXiv preprint arXiv:2503.21620, 2025.
[48] Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents. arXiv preprint arXiv:2505.15810, 2025.

Appendix A.1 Reward Function and Its Interaction with the Method. In our implementation, GUI-AC uses the same task format as Continual GUI Agents [7]. This ensur...

[1] [1]

GUI Agents with Foundation Models: A Comprehensive Survey.arXiv preprint arXiv:2411.04890, 2024

Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. GUI Agents with Foundation Models: A Comprehensive Survey.arXiv preprint arXiv:2411.04890, 2024

arXiv 2024

[2] [2]

A Survey on (M)LLM-Based GUI Agents.arXiv preprint arXiv:2504.13865, 2025

Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, et al. A Survey on (M)LLM-Based GUI Agents.arXiv preprint arXiv:2504.13865, 2025

arXiv 2025

[3] [3]

CogAgent: A Visual Language Model for GUI Agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. CogAgent: A Visual Language Model for GUI Agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

2024

[4] Ziyi Guan, Jason Chun Lok Li, Zhijian Hou, Pingping Zhang, Donglai Xu, Yuzhi Zhao, Mengyang Wu, Jinpeng Chen, Thanh-Toan Nguyen, Pengfei Xian, et al. KG-RAG: Enhancing GUI Agent Decision-Making via Knowledge Graph-Driven Retrieval-Augmented Generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5396–5405, 2025

[5] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024

[6] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One Vision-Language-Action Model for GUI Visual Agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19498–19508, 2025

[7] Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li, Yifan Zhu, and Tao Feng. Continual GUI Agents. arXiv, 2026

[8] Tao Feng, Mang Wang, and Hangjie Yuan. Overcoming Catastrophic Forgetting in Incremental Object Detection via Elastic Response Distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9427–9436, 2022

[9] Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, and Yongliang Shen. Test-Time Reinforcement Learning for GUI Grounding via Region Consistency. arXiv preprint arXiv:2508.05615, 2025

[10] Xianhang Ye, Yiqing Li, Wei Dai, Miancan Liu, Ziyuan Chen, Zhangye Han, Hongbo Min, Jinkui Ren, Xiantao Zhang, Wen Yang, et al. GUI-ARP: Enhancing Grounding with Adaptive Region Perception for GUI Agents. arXiv preprint arXiv:2509.15532, 2025

[11] Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bingqi Chen, Xiawu Zheng, and Hui Li. UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding. arXiv preprint arXiv:2507.22025, 2025

[12] Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-RFT: Visual Reinforcement Fine-Tuning. arXiv preprint arXiv:2503.01785, 2025

[13] Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, et al. Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective. arXiv preprint arXiv:2506.23508, 2025

[14] Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, and Jian Luan. ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning. arXiv preprint arXiv:2511.21005, 2025

[15] Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, and Rihui Xin. DCPO: Dynamic Clipping Policy Optimization. arXiv preprint arXiv:2509.02333, 2025

[16] Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, et al. An Empirical Study on Eliciting and Improving R1-like Reasoning Models. arXiv preprint arXiv:2503.04548, 2025

[17] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476, 2025

[18] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web. In European Conference on Computer Vision, pages 161–178. Springer, 2024

[19] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents. arXiv preprint arXiv:2410.23218, 2024

[20] Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 8778–8786, 2025

[21] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners. arXiv preprint arXiv:2504.14239, 2025

[22] Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning. arXiv preprint arXiv:2505.12370, 2025

[23] Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. GUI-G²: Gaussian Reward Modeling for GUI Grounding. arXiv preprint arXiv:2507.15846, 2025

[24] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL Technical Report. arXiv preprint arXiv:2502.13923, 2025

[25] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. arXiv preprint arXiv:2410.05243, 2024

[26] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large Language Model-Brained GUI Agents: A Survey. arXiv preprint arXiv:2411.18279, 2024

[27] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5555–5579, 2025

[28] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. Advances in Neural Information Processing Systems, 36:38154–38180, 2023

[29] Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-UI: Visual Grounding for GUI Instructions. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22418–22433, 2025

[30] Longxi Gao, Li Zhang, and Mengwei Xu. UIShift: Enhancing VLM-based GUI Agents through Self-supervised Reinforcement Learning. arXiv preprint arXiv:2505.12493, 2025

[31] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception. arXiv preprint arXiv:2401.16158, 2024

[32] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. UFO: A UI-Focused Agent for Windows OS Interaction. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pag...

[33] Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, et al. AutoGLM: Autonomous Foundation Agents for GUIs. arXiv preprint arXiv:2411.00820, 2024

[34] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A Large-Scale Dataset for Android Device Control. Advances in Neural Information Processing Systems, 36:59708–59728, 2023

[35] Chi Zhang, Zhao Yang, Jiaxuan Liu, Yanda Li, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. AppAgent: Multimodal Agents as Smartphone Users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

[36] Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, et al. PP-OCR: A Practical Ultra Lightweight OCR System. arXiv preprint arXiv:2009.09941, 2020

[37] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023

[38] Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. OmniParser for Pure Vision Based GUI Agent. arXiv preprint arXiv:2408.00203, 2024

[39] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326, 2025

[40] Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. ScreenAI: A Vision-Language Model for UI and Infographics Understanding. arXiv preprint arXiv:2402.04615, 2024

[41] Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv preprint arXiv:2504.10458, 2025

[42] Haiyang Guo, Fanhu Zeng, Fei Zhu, Jiayi Wang, Xukai Wang, Jingang Zhou, Hongbo Zhao, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang, and Cheng-Lin Liu. Continual Learning for Generative AI: From LLMs to MLLMs and Beyond, 2025

[43] Wenzhuo Liu, Fei Zhu, Haiyang Guo, Longhui Wei, and Cheng-Lin Liu. LLaVA-c: Continual Improved Visual Instruction Tuning. arXiv preprint arXiv:2506.08666, 2025

[44] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL’s Razor: Why Online Reinforcement Learning Forgets Less. arXiv preprint arXiv:2509.04259, 2025

[45] Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, and Mohammad Hamdaqa. RL Fine-Tuning Heals OOD Forgetting in SFT. arXiv preprint arXiv:2509.12235, 2025

[46] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025

[47] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Guanjing Xiong, and Hongsheng Li. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning. arXiv preprint arXiv:2503.21620, 2025

[48] Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents. arXiv preprint arXiv:2505.15810, 2025

A Appendix

A.1 Reward Function and Its Interaction with the Method

In our implementation, GUI-AC uses the same task format as Continual GUI Agents [7]. This ensur...