arxiv: 2504.14239 · v1 · pith:YOB6WEPEnew · submitted 2025-04-19 · 💻 cs.AI · cs.CL

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, Fei Wu

Pith reviewed 2026-05-18 13:48 UTC · model grok-4.3

keywords GUI agents · multimodal large language models · spatial reasoning distillation · reinforcement learning · sub-goal guidance · error recovery · deliberative reasoning · Actor2Reasoner

The pith

InfiGUI-R1 trains GUI agents to reason explicitly about layouts and sub-goals before acting through a two-stage Actor2Reasoner process.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to advance GUI agents beyond immediate reactions by first injecting spatial reasoning skills from teacher models and then refining them into deliberate planning via reinforcement learning. This matters because reactive agents often struggle with tasks that need forward planning or recovery from slips on real interfaces. The approach uses trajectories that spell out reasoning steps to link screen visuals with logic, followed by rewards for correct sub-goals and specially built failure-recovery examples. If successful, agents become better at grounding elements accurately and completing full task sequences.

Core claim

InfiGUI-R1 develops an MLLM-based GUI agent via the Actor2Reasoner framework. Reasoning Injection distills cross-modal spatial reasoning from teacher models into the MLLM using trajectories that include explicit reasoning steps, so the model connects visual-spatial GUI information with logical steps before generating actions. Deliberation Enhancement then applies reinforcement learning with Sub-goal Guidance to reward accurate intermediate sub-goals and Error Recovery Scenario Construction to generate training cases from prone-to-error steps, evolving the agent from a Reactive Actor into a Deliberative Reasoner that shows strong results on GUI grounding and trajectory tasks.
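
To make the Reasoning Injection idea concrete, here is a minimal sketch of how one distilled step might be serialized so that explicit reasoning always precedes the action in the training target. The `DistilledStep` schema, the `<think>`/`<action>` tags, and the `click(x, y)` action syntax are illustrative assumptions; the text above does not specify the paper's actual data format.

```python
from dataclasses import dataclass


@dataclass
class DistilledStep:
    """One teacher-produced step (hypothetical schema, not the paper's)."""
    instruction: str  # task the agent is asked to perform
    reasoning: str    # teacher's explicit spatial reasoning about the screen
    action: str       # action emitted only after the reasoning


def to_sft_target(step: DistilledStep) -> str:
    """Serialize so that reasoning tokens come before action tokens."""
    return f"<think>{step.reasoning}</think>\n<action>{step.action}</action>"


step = DistilledStep(
    instruction="Open the settings page",
    reasoning="The gear icon sits in the top-right corner of the toolbar.",
    action="click(1180, 42)",
)
target = to_sft_target(step)
```

Ordering the target this way means a student trained with ordinary next-token supervision has to produce the spatial reasoning before it commits to an action, which is the behavior the first stage is described as instilling.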

What carries the argument

The Actor2Reasoner two-stage training framework, which first distills explicit spatial reasoning through trajectories and then strengthens deliberation with reinforcement learning that rewards sub-goals and constructs error-recovery scenarios.
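
A rough sketch of what a sub-goal-aware reward could look like follows. It blends an outcome term with a score for how closely the predicted sub-goal matches a reference. The token-overlap F1 metric, the function names, and the `subgoal_weight` value are all assumptions made for illustration; the summary above does not state how the paper scores sub-goals or weights the terms.

```python
def subgoal_reward(predicted: str, reference: str) -> float:
    """Token-overlap F1 between predicted and reference sub-goal (assumed metric)."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    remaining = list(ref)
    common = 0
    for tok in pred:
        if tok in remaining:
            remaining.remove(tok)  # count each reference token at most once
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)


def step_reward(action_correct: bool, predicted_subgoal: str,
                reference_subgoal: str, subgoal_weight: float = 0.3) -> float:
    """Blend an outcome reward with a sub-goal reward; the weight is hypothetical."""
    outcome = 1.0 if action_correct else 0.0
    shaped = subgoal_reward(predicted_subgoal, reference_subgoal)
    return (1.0 - subgoal_weight) * outcome + subgoal_weight * shaped
```

With a shaping term like this, a rollout that names the right intermediate goal but misses the click still receives partial credit, which is one plausible way to reward "accurate intermediate sub-goals" separately from final action correctness.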

If this is right

Agents produce explicit intermediate reasoning before each action instead of implicit reactions.
Performance improves on both precise element grounding and complete multi-step task trajectories.
Training incorporates deliberate sub-goal setting and recovery from likely failure points.
Cross-modal spatial information becomes explicitly linked to logical decision steps in the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation-plus-RL pattern might transfer to agents that operate in other visual environments such as web pages or mobile apps.
Explicit reasoning traces could make it easier to inspect and correct agent mistakes after deployment.
Synthetic error scenarios may help agents cope with the long tail of unusual screen states that appear in real use.

Load-bearing premise

That distilling spatial reasoning from teacher trajectories and then applying sub-goal rewards plus error-recovery reinforcement will produce reasoning robust and adaptive enough for complex GUI environments.

What would settle it

A head-to-head test in which InfiGUI-R1 shows no measurable gain over reactive baselines on long-horizon tasks that require planning several steps ahead or recovering from an early mistake.
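
One way such a head-to-head test could be scored is a paired bootstrap over per-task outcomes, sketched below. The function names, the 0/1 outcome encoding, and the 95% percentile interval are illustrative choices, not a protocol taken from the paper.

```python
import random


def success_rate(outcomes):
    """Fraction of tasks completed; outcomes are 0/1 integers."""
    return sum(outcomes) / len(outcomes)


def paired_bootstrap_gain(agent, baseline, n_resamples=2000, seed=0):
    """Return (observed gain, lower, upper) using a 95% percentile interval.

    `agent` and `baseline` hold outcomes for the same tasks in the same order.
    """
    if len(agent) != len(baseline) or not agent:
        raise ValueError("need equal-length, non-empty outcome lists")
    rng = random.Random(seed)
    n = len(agent)
    diffs = []
    for _ in range(n_resamples):
        # Resample task indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(agent[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    observed = success_rate(agent) - success_rate(baseline)
    return observed, lower, upper
```

If the interval computed on long-horizon or error-recovery tasks contains zero, that would correspond to the "no measurable gain" outcome described above.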

Original abstract

Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The Actor2Reasoner two-stage setup gives GUI agents a plausible path from reactive behavior to sub-goal planning and error recovery, but the abstract's performance claims rest on no visible numbers or comparisons.

The core contribution is a concrete training pipeline that first distills explicit spatial reasoning trajectories from teacher models into an MLLM, then applies RL with sub-goal rewards and constructed failure-recovery episodes. This directly tackles the problem that most current GUI agents stay implicit and brittle on complex interfaces. The spatial distillation step is a reasonable way to make visual grounding feed into logical steps before action, and the error-recovery construction targets a practical pain point in real deployments where screens change or clicks miss. Those pieces feel like genuine engineering progress over purely template-based reasoning or plain imitation learning. The paper is honest about starting from reactive actors and trying to evolve them, which aligns with the literature on agent robustness.

What is missing is any quantitative anchor. The abstract states strong results on grounding and trajectory tasks but supplies no scores, no baselines, no ablation on the two stages, and no error analysis. Without those, it is impossible to separate gains from denser supervision versus actual gains in adaptive deliberation. The stress-test worry about overfitting to constructed prone-to-error steps is live: if the failure scenarios come from the same model family or limited trajectory pool, the policy could simply memorize recovery patterns rather than learn general replanning. That risk is worth checking against the full experiments.

This work is for researchers building multimodal agents for desktop or mobile automation. A reader already working on GUI grounding or RL for agents will find the framework description useful as a starting point, even if they end up modifying the RL stage. It deserves a serious referee because the problem is well-posed and the method is reproducible in principle; the current draft just needs the missing metrics and controls to become evaluable.

Referee Report

3 major / 2 minor

Summary. The paper introduces InfiGUI-R1, an MLLM-based GUI agent developed via the Actor2Reasoner two-stage training framework. The first stage (Reasoning Injection) uses Spatial Reasoning Distillation to transfer cross-modal spatial reasoning from teacher models through trajectories containing explicit reasoning steps. The second stage (Deliberation Enhancement) applies reinforcement learning with sub-goal guidance rewards and constructed error-recovery scenarios from prone-to-error steps. The central claim is that this process transforms reactive GUI agents into deliberative reasoners that achieve strong performance on GUI grounding and trajectory tasks.

Significance. If the empirical claims hold with detailed validation, the Actor2Reasoner framework offers a structured alternative to manual reasoning templates or purely reactive policies, potentially improving robustness in planning and error recovery for dynamic GUI environments. The combination of distillation for spatial reasoning and targeted RL for deliberation is a coherent incremental advance in multimodal agent training.

major comments (3)

[Abstract / Experimental Results] Abstract and Experimental Results section: the claim that InfiGUI-R1 'achieves strong performance in GUI grounding and trajectory tasks' is presented without any quantitative metrics, baselines, ablation studies, or error analysis in the visible text. This absence is load-bearing for the central claim that the two-stage process produces genuinely deliberative rather than merely better-supervised behavior.
[Deliberation Enhancement] Deliberation Enhancement stage: the Error Recovery Scenario Construction relies on identifying 'prone-to-error steps' and building failure-and-recovery trajectories. If these steps are derived from the same model family or limited trajectory data, the resulting policy may overfit to the constructed distribution rather than learn general error detection and replanning; no out-of-distribution GUI change tests or ablation against standard RL are described to rule out this risk.
[Reasoning Injection] Reasoning Injection stage: Spatial Reasoning Distillation assumes teacher trajectories are both correct and transferable without visual encoder mismatch. The manuscript provides no analysis of domain gaps between teacher and student visual encoders or verification that the distilled reasoning steps remain valid under GUI variations, which directly affects whether the first stage establishes a reliable basic reasoner.

minor comments (2)

[Deliberation Enhancement] Define 'prone-to-error steps' and the procedure for identifying them more precisely, including any heuristics or model-based detection method used.
[Experimental Results] Add a short comparison table or paragraph contrasting the proposed sub-goal and error-recovery rewards against standard outcome-only RL baselines to clarify the incremental contribution.
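
To illustrate the kind of definition the minor comments ask for, the sketch below shows one possible heuristic for flagging "prone-to-error steps" and turning one into a failure-and-recovery training case. Everything here is an assumption for illustration: the failure-rate threshold, the `min_samples` guard, the function names, and the scenario layout are hypothetical, and the paper's actual identification procedure is not described in the visible text.

```python
from collections import defaultdict


def find_error_prone_steps(rollouts, threshold=0.5, min_samples=2):
    """Flag steps whose empirical failure rate is at least `threshold`.

    rollouts: iterable of (step_id, succeeded) pairs from sampled attempts.
    Steps with fewer than `min_samples` attempts are ignored as too noisy.
    """
    counts = defaultdict(lambda: [0, 0])  # step_id -> [failures, total]
    for step_id, succeeded in rollouts:
        counts[step_id][1] += 1
        if not succeeded:
            counts[step_id][0] += 1
    return sorted(
        step_id
        for step_id, (fails, total) in counts.items()
        if total >= min_samples and fails / total >= threshold
    )


def build_recovery_scenario(history, wrong_action, recovery_action, correct_action):
    """Place a failure in the context and ask for recovery plus the correct step."""
    return {
        "context": list(history) + [wrong_action],
        "target": [recovery_action, correct_action],
    }


rollouts = [("s1", True), ("s1", True), ("s2", False), ("s2", True), ("s3", False)]
flagged = find_error_prone_steps(rollouts)
scenario = build_recovery_scenario(
    ["open_app"], "click_wrong", "press_back", "click_right"
)
```

A heuristic of this shape also makes the referee's overfitting concern easy to see: if the rollouts used for flagging come from a narrow trajectory pool, the flagged steps and the resulting recovery cases inherit that narrowness.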

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with clarifications from the full paper and indicate planned revisions to strengthen the empirical presentation and analysis.

Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the claim that InfiGUI-R1 'achieves strong performance in GUI grounding and trajectory tasks' is presented without any quantitative metrics, baselines, ablation studies, or error analysis in the visible text. This absence is load-bearing for the central claim that the two-stage process produces genuinely deliberative rather than merely better-supervised behavior.

Authors: We agree that the abstract would be strengthened by including specific quantitative metrics. The full Experimental Results section contains detailed tables with performance numbers on GUI grounding and trajectory benchmarks, direct comparisons to baselines, ablation studies isolating the contributions of Reasoning Injection and Deliberation Enhancement, and error analysis breaking down failure modes. To address the concern, we will revise the abstract to report key metrics (e.g., success rates and improvements over baselines) and explicitly reference the ablations and analyses in the main text. revision: yes
Referee: [Deliberation Enhancement] Deliberation Enhancement stage: the Error Recovery Scenario Construction relies on identifying 'prone-to-error steps' and building failure-and-recovery trajectories. If these steps are derived from the same model family or limited trajectory data, the resulting policy may overfit to the constructed distribution rather than learn general error detection and replanning; no out-of-distribution GUI change tests or ablation against standard RL are described to rule out this risk.

Authors: We appreciate this valid concern about potential overfitting. The prone-to-error steps are identified via systematic analysis across diverse trajectory datasets from multiple sources and GUI environments, not restricted to a single model family. The paper includes ablations comparing the full Deliberation Enhancement (sub-goal guidance plus error recovery) against standard RL without these elements, showing gains in robustness and recovery. However, explicit out-of-distribution GUI change tests are not present. We will add a discussion of the construction method's generality and include additional analysis or experiments addressing generalization to unseen GUI variations in the revised manuscript. revision: partial
Referee: [Reasoning Injection] Reasoning Injection stage: Spatial Reasoning Distillation assumes teacher trajectories are both correct and transferable without visual encoder mismatch. The manuscript provides no analysis of domain gaps between teacher and student visual encoders or verification that the distilled reasoning steps remain valid under GUI variations, which directly affects whether the first stage establishes a reliable basic reasoner.

Authors: We acknowledge that the current manuscript lacks a dedicated analysis of domain gaps and variation robustness in the Reasoning Injection stage. Teacher models were chosen for strong GUI task performance and we performed manual verification on a subset of distilled trajectories for correctness. To strengthen this, we will add quantitative analysis of visual encoder feature similarities, domain gap measurements, and validation of reasoning step validity under GUI variations in the revised paper. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training pipeline with independent experimental validation

The paper describes a two-stage empirical training procedure (Spatial Reasoning Distillation followed by RL with sub-goal rewards and constructed error-recovery scenarios) whose performance claims rest on experimental results rather than any mathematical derivation or equation. No load-bearing step reduces a reported metric to a fitted parameter, self-referential definition, or self-citation chain by construction. The framework uses standard distillation and RL techniques whose outputs are evaluated on held-out GUI grounding and trajectory benchmarks, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that MLLMs can integrate visual-spatial GUI information with logical reasoning steps and that RL rewards on sub-goals and error scenarios will generalize.

axioms (1)

domain assumption: MLLMs can integrate GUI visual-spatial information with logical reasoning before action generation
Invoked in the Reasoning Injection stage description.

pith-pipeline@v0.9.0 · 5858 in / 1207 out tokens · 43438 ms · 2026-05-18T13:48:59.898272+00:00 · methodology

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG 2026-05 unverdicted novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV 2026-05 unverdicted novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 accept novelty 7.0

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 unverdicted novelty 7.0

GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
cs.CL 2026-04 unverdicted novelty 7.0

OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation
cs.SE 2025-09 unverdicted novelty 7.0

An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.
How Mobile World Model Guides GUI Agents?
cs.AI 2026-05 unverdicted novelty 6.0

Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
BAMI: Training-Free Bias Mitigation in GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
cs.DC 2026-05 unverdicted novelty 6.0

ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
cs.CR 2026-04 unverdicted novelty 6.0

Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
cs.AI 2025-10 unverdicted novelty 6.0

MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
RISK: A Framework for GUI Agents in E-commerce Risk Management
cs.AI 2025-09 unverdicted novelty 6.0

RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
cs.CL 2025-09 unverdicted novelty 6.0

VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
GTA1: GUI Test-time Scaling Agent
cs.AI 2025-07 unverdicted novelty 6.0

GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
cs.CV 2026-04 unverdicted novelty 5.0

Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI 2025-07 accept novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
A Brief Overview: Agentic Reinforcement Learning In Large Language Models
cs.AI 2026-04 unverdicted novelty 2.0

This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 19 Pith papers · 23 internal anchors

[1]

Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang

work page
[2]

arXiv:2504.00906 [cs.AI] https://arxiv.org/abs/2504.00906

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents. arXiv:2504.00906 [cs.AI] https://arxiv.org/abs/2504.00906

work page arXiv
[3]

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, R...

work page 2022
[5]

Anthropic. 2024. Developing a computer use model. https://www.anthropic.com/ news/developing-computer-use. Accessed: 2025-04-12

work page 2024
[6]

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenhang Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, K. Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.16609 2023
[8]

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-VL: A Frontier Large Vision- Language Model with Versatile Abilities. ArXiv (2023). https://doi.org/10.48550/ arXiv.2308.12966

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al . 2025. Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al

work page
[11]

arXiv preprint arXiv:2409.08264 (2024)

Windows agent arena: Evaluating multi-modal os agents at scale. arXiv preprint arXiv:2409.08264 (2024)

work page arXiv 2024
[12]

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. 2024. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Google DeepMind. 2024. Gemini-2.0 (Project Mariner). https://deepmind.google/ technologies/project-mariner. Accessed: 2025-04-12

work page 2024
[14]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR ...

work page 2021
[15]

Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30 (2020), 681–694

work page 2020
[16]

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2024. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2024. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14281–14290

work page 2024
[18]

Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shawn Wang, Xinchen Xu, Shuofei Qiao, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou, Guoyin Wang, Keting Yin, Zhou Zhao, Hongxia Yang, Fan Wu, Shengyu Zhang, and Fei Wu. 2024....

work page doi:10.20944/preprints202412.2294.v1 2024
[19]

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Qianli Ma, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Jiwei Li, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. 2024. InfiAgent-DABench: Evalu- ating Agents on Data Analysis Tasks. arXiv preprint arXiv:2401.05507 (2024)

work page arXiv 2024
[20]

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Yue Jiang, Eldon Schoop, Amanda Swearngin, and Jeffrey Nichols. 2023. ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations. arXiv preprint arXiv:2310.04869 (2023)

work page arXiv 2023
[23]

Marko Jurmu, Sebastian Boring, and Jukka Riekki. 2008. ScreenSpot: Multi- dimensional resource discovery for distributed applications in smart spaces. In Proceedings of the 5th Annual International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services. 1–9

work page 2008
[24]

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. 2024. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning. PMLR, 19730–19742

work page 2023
[26]

Kaixin Li, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, Tat-Seng Chua, et al. 2025. Screenspot-pro: Gui grounding for professional high- resolution computer use. In Workshop on Reasoning and Planning for Large Language Models

work page 2025
[27]

Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, and Hongxia Yang. 2024. InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models. arXiv preprint arXiv:2404.07940 (2024)

work page arXiv 2024
[28]

Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. 2024. On the effects of data scale on computer control agents. arXiv e-prints (2024), arXiv–2406

work page 2024
[29]

Yang Li, Luheng Li, Gangaand He, Jingjie Zheng, Hong Li, and Zhiwei Guan

work page
[30]

arXiv preprint arXiv:2010.04295 (2020)

Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements. arXiv preprint arXiv:2010.04295 (2020)

work page arXiv 2010
[31]

Zijing Liang, Yanjie Xu, Yifan Hong, Penghui Shang, Qi Wang, Qiang Fu, and Ke Liu. 2024. A Survey of Multimodel Large Language Models. InProceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering. 405–409

work page 2024
[32]

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. 2024. Showui: One vision-language- action model for generalist gui agent. InNeurIPS 2024 Workshopon Open-World Agents

work page 2024
[33]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13. Springer, 740– 755

work page 2014
[34]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Vi- sual Instruction Tuning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, Al- ice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey ...

work page 2023
[35]

Haogeng Liu, Quanzeng You, Yiqi Wang, Xiaotian Han, Bohan Zhai, Yongfei Liu, Wentao Chen, Yiren Jian, Yunzhe Tao, Jianbo Yuan, Ran He, and Hongxia Yang. InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model. In Annual Meeting of the Association for Computational Linguistics

[37]

Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. 2025. InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection. arXiv preprint arXiv:2501.04575 (2025)

[38]

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11976–11986

[39]

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. 2025. UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning. arXiv preprint arXiv:2503.21620 (2025)

[40]

OpenAI. 2023. GPT-4V(ision) System Card. https://cdn.openai.com/papers/GPTV_System_Card.pdf

[41]

OpenAI. 2024. GPT-4o. https://openai.com/index/hello-gpt-4o/ Accessed: 2025-01-03

[42]

Zhenyu Pan, Haozheng Luo, Manling Li, and Han Liu. 2024. Chain-of-action: Faithful and multimodal question answering through large language models. arXiv preprint arXiv:2403.17359 (2024)

[43]

Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

[44]

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. 2025. UI-TARS: Pioneering Automated GUI Interaction with Native Agents. arXiv preprint arXiv:2501.12326 (2025)

[45]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763

[46]

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. 2024. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573 (2024)

[47]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2024. HybridFlow: A Flexible and Efficient RLHF Framework. arXiv preprint arXiv:2409.19256 (2024)

[48]

Yuchen Sun, Shanhui Zhao, Tao Yu, Hao Wen, Samith Va, Mengwei Xu, Yuanchun Li, and Chongyang Zhang. 2025. GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration. arXiv preprint arXiv:2503.17709 (2025)

[49]

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, et al. 2025. Kimi-VL Technical Report. arXiv preprint arXiv:2504.07491 (2025)

[50]

Qwen Team. 2025. QwQ-32B: Embracing the Power of Reinforcement Learning. https://qwenlm.github.io/blog/qwq-32b/

[51]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. ArXiv (2023). https://doi.org/10.48550/arXiv.2302.13971

[52]

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

[53]

Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079 (2023)

[54]

Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023. AutoDroid: LLM-powered Task Automation in Android. arXiv preprint arXiv:2308.15272 (2023)

[55]

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. 2024. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218 (2024)

[56]

Xiaobo Xia and Run Luo. 2025. GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents. arXiv preprint arXiv:2504.10458 (2025)

[57]

Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, Cunchao Tu, and Maosong Sun. 2021. Lawformer: A pre-trained language model for chinese legal long documents. AI Open 2 (2021), 79–84

[58]

Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. 2023. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441 (2023)

[60]

Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. 2024. Aria-UI: Visual Grounding for GUI Instructions. arXiv preprint arXiv:2412.16256 (2024)

[61]

Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. 2025. EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework. https://github.com/hiyouga/EasyR1

[62]

Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan. 2025. Ferret-ui: Grounded mobile ui understanding with multimodal llms. In European Conference on Computer Vision. Springer, 240–255

[63]

Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771 (2023)
