InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Pith reviewed 2026-05-18 13:48 UTC · model grok-4.3
The pith
InfiGUI-R1 trains GUI agents to reason explicitly about layouts and sub-goals before acting through a two-stage Actor2Reasoner process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InfiGUI-R1 develops an MLLM-based GUI agent via the Actor2Reasoner framework. Reasoning Injection distills cross-modal spatial reasoning from teacher models into the MLLM using trajectories that include explicit reasoning steps, so the model connects visual-spatial GUI information with logical steps before generating actions. Deliberation Enhancement then applies reinforcement learning with Sub-goal Guidance to reward accurate intermediate sub-goals and Error Recovery Scenario Construction to generate training cases from prone-to-error steps, evolving the agent from a Reactive Actor into a Deliberative Reasoner that shows strong results on GUI grounding and trajectory tasks.
What carries the argument
The Actor2Reasoner two-stage training framework, which first distills explicit spatial reasoning through trajectories and then strengthens deliberation with reinforcement learning that rewards sub-goals and constructs error-recovery scenarios.
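To make the sub-goal reward idea concrete, here is a minimal sketch of how a per-step reward could combine an action-correctness term with a sub-goal term. Everything named here is an assumption made for illustration: the `StepOutput` schema, `token_f1`, `step_reward`, the weights, and the use of token-overlap F1 as the sub-goal match score are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StepOutput:
    """One model response for a single GUI step (hypothetical schema)."""
    reasoning: str  # free-form reasoning text emitted before the action
    sub_goal: str   # intermediate sub-goal the model states for this step
    action: str     # serialized action, e.g. "click(120, 48)"


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between two short strings (assumed match score)."""
    pred = predicted.lower().split()
    remaining = reference.lower().split()
    ref_len = len(remaining)
    if not pred or ref_len == 0:
        return 0.0
    overlap = 0
    for tok in pred:
        if tok in remaining:
            remaining.remove(tok)  # each reference token matches at most once
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


def step_reward(out: StepOutput, ref_sub_goal: str, ref_action: str,
                w_sub_goal: float = 0.3, w_action: float = 0.7) -> float:
    """Weighted sum of sub-goal match and exact action match.

    Responses with no reasoning text get zero reward, so the policy
    cannot skip deliberation (an assumed design choice).
    """
    if not out.reasoning.strip():
        return 0.0
    sub_goal_score = token_f1(out.sub_goal, ref_sub_goal)
    action_score = 1.0 if out.action == ref_action else 0.0
    return w_sub_goal * sub_goal_score + w_action * action_score
```

With these assumed weights, a step that states the right sub-goal but emits the wrong action still earns partial credit, which is the kind of intermediate signal an outcome-only reward would not provide.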
If this is right
- Agents produce explicit intermediate reasoning before each action instead of implicit reactions.
- Performance improves on both precise element grounding and complete multi-step task trajectories.
- Training incorporates deliberate sub-goal setting and recovery from likely failure points.
- Cross-modal spatial information becomes explicitly linked to logical decision steps in the model.
Where Pith is reading between the lines
- The same distillation-plus-RL pattern might transfer to agents that operate in other visual environments such as web pages or mobile apps.
- Explicit reasoning traces could make it easier to inspect and correct agent mistakes after deployment.
- Synthetic error scenarios may help agents cope with the long tail of unusual screen states that appear in real use.
Load-bearing premise
That distilling spatial reasoning from teacher trajectories and then applying sub-goal rewards plus error-recovery reinforcement will produce reasoning robust and adaptive enough for complex GUI environments.
What would settle it
A head-to-head test in which InfiGUI-R1 shows no measurable gain over reactive baselines on long-horizon tasks that require planning several steps ahead or recovering from an early mistake.
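One way to run such a head-to-head test is a paired bootstrap over per-task outcomes. The sketch below assumes both agents were run on the same long-horizon tasks and that each outcome is recorded as 1 (success) or 0 (failure). The function name, the 95% interval, and the resample count are illustrative choices, not something the paper specifies.

```python
import random
from typing import List, Tuple


def paired_bootstrap_gain(reasoner: List[int], baseline: List[int],
                          n_resamples: int = 2000,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (observed gain, lower bound, upper bound) for the difference
    in success rate between two agents evaluated on the same tasks."""
    if len(reasoner) != len(baseline) or not reasoner:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    n = len(reasoner)
    # Per-task difference: +1 only the reasoner succeeded, -1 only the baseline.
    diffs = [r - b for r, b in zip(reasoner, baseline)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper
```

If the interval for the gain includes zero on tasks that need multi-step planning or recovery from an early mistake, the evidence would not support a measurable advantage over the reactive baseline.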
Original abstract
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.
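The abstract does not spell out how failure-and-recovery scenarios are built. As a rough illustration only, the sketch below takes a recorded trajectory, injects a wrong action at a step flagged as prone to error, and emits a training sample whose target is a recovery action. The `Step` and `RecoverySample` schemas, the `"navigate_back"` recovery action, and the way the mistaken screen is supplied are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    screen_id: str  # identifier of the screenshot observed before acting
    action: str     # ground-truth action taken from this screen


@dataclass(frozen=True)
class RecoverySample:
    goal: str
    history: List[str]   # actions already executed, ending with the mistake
    screen_id: str       # screen the agent sees after the mistake
    target_action: str   # action that undoes or corrects the mistake


def build_recovery_sample(goal: str, trajectory: List[Step], error_index: int,
                          wrong_action: str, wrong_screen_id: str,
                          recovery_action: str = "navigate_back") -> RecoverySample:
    """Construct one failure-and-recovery sample from a correct trajectory.

    The correct prefix is kept, the ground-truth action at `error_index`
    is swapped for `wrong_action`, and the model is asked to recover.
    """
    if not 0 <= error_index < len(trajectory):
        raise IndexError("error_index is outside the trajectory")
    if wrong_action == trajectory[error_index].action:
        raise ValueError("wrong_action must differ from the ground-truth action")
    history = [step.action for step in trajectory[:error_index]] + [wrong_action]
    return RecoverySample(goal=goal, history=history,
                          screen_id=wrong_screen_id,
                          target_action=recovery_action)
```

A second sample type, in which the agent has already recovered and must now pick the originally correct action, could be built the same way by appending the recovery action to the history and using the ground-truth action as the target.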
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces InfiGUI-R1, an MLLM-based GUI agent developed via the Actor2Reasoner two-stage training framework. The first stage (Reasoning Injection) uses Spatial Reasoning Distillation to transfer cross-modal spatial reasoning from teacher models through trajectories containing explicit reasoning steps. The second stage (Deliberation Enhancement) applies reinforcement learning with sub-goal guidance rewards and constructed error-recovery scenarios from prone-to-error steps. The central claim is that this process transforms reactive GUI agents into deliberative reasoners that achieve strong performance on GUI grounding and trajectory tasks.
Significance. If the empirical claims hold with detailed validation, the Actor2Reasoner framework offers a structured alternative to manual reasoning templates or purely reactive policies, potentially improving robustness in planning and error recovery for dynamic GUI environments. The combination of distillation for spatial reasoning and targeted RL for deliberation is a coherent incremental advance in multimodal agent training.
Major comments (3)
- [Abstract / Experimental Results] Abstract and Experimental Results section: the claim that InfiGUI-R1 'achieves strong performance in GUI grounding and trajectory tasks' is presented without any quantitative metrics, baselines, ablation studies, or error analysis in the visible text. This absence is load-bearing for the central claim that the two-stage process produces genuinely deliberative rather than merely better-supervised behavior.
- [Deliberation Enhancement] Deliberation Enhancement stage: the Error Recovery Scenario Construction relies on identifying 'prone-to-error steps' and building failure-and-recovery trajectories. If these steps are derived from the same model family or limited trajectory data, the resulting policy may overfit to the constructed distribution rather than learn general error detection and replanning; no out-of-distribution GUI change tests or ablation against standard RL are described to rule out this risk.
- [Reasoning Injection] Reasoning Injection stage: Spatial Reasoning Distillation assumes teacher trajectories are both correct and transferable without visual encoder mismatch. The manuscript provides no analysis of domain gaps between teacher and student visual encoders or verification that the distilled reasoning steps remain valid under GUI variations, which directly affects whether the first stage establishes a reliable basic reasoner.
Minor comments (2)
- [Deliberation Enhancement] Define 'prone-to-error steps' and the procedure for identifying them more precisely, including any heuristics or model-based detection method used.
- [Experimental Results] Add a short comparison table or paragraph contrasting the proposed sub-goal and error-recovery rewards against standard outcome-only RL baselines to clarify the incremental contribution.
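On the first minor comment, one possible reading of "prone-to-error steps", offered here purely as an assumption, is a step where the base model sometimes picks the correct action and sometimes does not when sampled repeatedly. The sketch below flags such steps from per-step sampling results. The function name, the thresholds, and the minimum sample count are hypothetical and do not describe the authors' procedure.

```python
from typing import Dict, List


def find_error_prone_steps(samples: Dict[int, List[bool]],
                           low: float = 0.0, high: float = 1.0,
                           min_samples: int = 4) -> List[int]:
    """Return indices of steps whose sampled success rate is strictly
    between `low` and `high`.

    `samples` maps a step index to the correctness of each sampled action
    at that step. Steps the model always gets right (rate 1.0) or always
    gets wrong (rate 0.0) are skipped under the default thresholds.
    """
    flagged = []
    for step_index in sorted(samples):
        outcomes = samples[step_index]
        if len(outcomes) < min_samples:
            continue  # too few samples to estimate a rate
        rate = sum(outcomes) / len(outcomes)
        if low < rate < high:
            flagged.append(step_index)
    return flagged
```

A definition of this kind would also make the referee's overfitting concern testable, since the flagged set could be recomputed with a different sampling model and compared.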
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with clarifications from the full paper and indicate planned revisions to strengthen the empirical presentation and analysis.
Point-by-point responses
- Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: the claim that InfiGUI-R1 'achieves strong performance in GUI grounding and trajectory tasks' is presented without any quantitative metrics, baselines, ablation studies, or error analysis in the visible text. This absence is load-bearing for the central claim that the two-stage process produces genuinely deliberative rather than merely better-supervised behavior.
  Authors: We agree that the abstract would be strengthened by including specific quantitative metrics. The full Experimental Results section contains detailed tables with performance numbers on GUI grounding and trajectory benchmarks, direct comparisons to baselines, ablation studies isolating the contributions of Reasoning Injection and Deliberation Enhancement, and error analysis breaking down failure modes. To address the concern, we will revise the abstract to report key metrics (e.g., success rates and improvements over baselines) and explicitly reference the ablations and analyses in the main text.
  Revision: yes
- Referee: [Deliberation Enhancement] Deliberation Enhancement stage: the Error Recovery Scenario Construction relies on identifying 'prone-to-error steps' and building failure-and-recovery trajectories. If these steps are derived from the same model family or limited trajectory data, the resulting policy may overfit to the constructed distribution rather than learn general error detection and replanning; no out-of-distribution GUI change tests or ablation against standard RL are described to rule out this risk.
  Authors: We appreciate this valid concern about potential overfitting. The prone-to-error steps are identified via systematic analysis across diverse trajectory datasets from multiple sources and GUI environments, not restricted to a single model family. The paper includes ablations comparing the full Deliberation Enhancement (sub-goal guidance plus error recovery) against standard RL without these elements, showing gains in robustness and recovery. However, explicit out-of-distribution GUI change tests are not present. We will add a discussion of the construction method's generality and include additional analysis or experiments addressing generalization to unseen GUI variations in the revised manuscript.
  Revision: partial
- Referee: [Reasoning Injection] Reasoning Injection stage: Spatial Reasoning Distillation assumes teacher trajectories are both correct and transferable without visual encoder mismatch. The manuscript provides no analysis of domain gaps between teacher and student visual encoders or verification that the distilled reasoning steps remain valid under GUI variations, which directly affects whether the first stage establishes a reliable basic reasoner.
  Authors: We acknowledge that the current manuscript lacks a dedicated analysis of domain gaps and variation robustness in the Reasoning Injection stage. Teacher models were chosen for strong GUI task performance and we performed manual verification on a subset of distilled trajectories for correctness. To strengthen this, we will add quantitative analysis of visual encoder feature similarities, domain gap measurements, and validation of reasoning step validity under GUI variations in the revised paper.
  Revision: yes
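For the promised analysis of visual encoder feature similarity, one commonly used measure is linear centered kernel alignment (CKA), which compares two sets of features extracted from the same inputs even when their dimensions differ. The sketch below is a dependency-free illustration under that assumption. The rebuttal does not say which similarity measure the authors intend to use, and `linear_cka` and its helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def _centered_gram(features: Matrix) -> Matrix:
    """Gram matrix of the rows of `features`, double-centered (H K H)."""
    n = len(features)
    gram = [[sum(a * b for a, b in zip(features[i], features[j]))
             for j in range(n)] for i in range(n)]
    # The Gram matrix is symmetric, so row means equal column means.
    row_mean = [sum(row) / n for row in gram]
    total_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def _inner(a: Matrix, b: Matrix) -> float:
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def linear_cka(teacher: Matrix, student: Matrix) -> float:
    """Linear CKA between two feature sets computed on the same inputs.

    Row i of each matrix must come from the same screenshot. The value is
    1.0 for features identical up to scaling and approaches 0.0 for
    unrelated ones; 0.0 is returned if either set has no variance.
    """
    if len(teacher) != len(student) or len(teacher) < 2:
        raise ValueError("need at least two paired feature rows")
    k = _centered_gram(teacher)
    l = _centered_gram(student)
    denom = math.sqrt(_inner(k, k) * _inner(l, l))
    if denom == 0.0:
        return 0.0
    return _inner(k, l) / denom
```

Because the Gram matrices are n by n, this form is only practical for a modest number of paired screenshots; a larger study would more likely use a numerical library.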
Circularity Check
No circularity: empirical training pipeline with independent experimental validation
Full rationale
The paper describes a two-stage empirical training procedure (Spatial Reasoning Distillation followed by RL with sub-goal rewards and constructed error-recovery scenarios) whose performance claims rest on experimental results rather than any mathematical derivation or equation. No load-bearing step reduces a reported metric to a fitted parameter, self-referential definition, or self-citation chain by construction. The framework uses standard distillation and RL techniques whose outputs are evaluated on held-out GUI grounding and trajectory benchmarks, rendering the derivation chain self-contained.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: MLLMs can integrate GUI visual-spatial information with logical reasoning before action generation.
Forward citations
Cited by 21 Pith papers
- Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
  BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
  Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
  GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.
- Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
  GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.
- OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
  OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
- From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation
  An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.
- How Mobile World Model Guides GUI Agents?
  Mobile world models in text, image, and code modalities reach state-of-the-art on their benchmarks and improve downstream GUI agent performance, with code best for in-distribution accuracy and text more robust for out...
- LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
  LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.
- BAMI: Training-Free Bias Mitigation in GUI Grounding
  BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
- ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL
  ROSE delivers 1.2-3.3x higher end-to-end throughput for agentic RL by safely co-using underutilized serving GPUs for rollouts while meeting serving SLOs.
- AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
  AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
- SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
  SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
- Are GUI Agents Focused Enough? Automated Distraction via Semantic-level UI Element Injection
  Semantic-level UI Element Injection distracts GUI agents by overlaying safety-aligned UI elements, achieving up to 4.4x higher attack success rates that transfer across models and create persistent attractors.
- MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
  MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
- RISK: A Framework for GUI Agents in E-commerce Risk Management
  RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
- VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
  VeriOS-Agent is an OS agent that proactively queries humans in untrustworthy scenarios via a query-driven framework and three-stage training, achieving 19.72% higher step-wise success rate over baselines while preserv...
- GTA1: GUI Test-time Scaling Agent
  GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
- See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback
  Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.
- A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
  The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
  The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...
- A Brief Overview: Agentic Reinforcement Learning In Large Language Models
  This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.
Preprint, Under review, April 2025. Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu.