arxiv: 2412.04454 · v2 · pith:SS3JRCD3new · submitted 2024-12-05 · 💻 cs.CL

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong

Pith reviewed 2026-05-18 04:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords GUI agents · vision-based agents · autonomous GUI interaction · multimodal grounding · inner monologue reasoning · two-stage training · screen image agents

The pith

A pure vision framework lets AI agents control any GUI by processing screen images directly and reasoning internally without text or closed-source models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Aguvis as a unified vision-based system for autonomous GUI agents. It works by feeding raw screen images into the model, standardizing actions across platforms, and using an inner monologue to break down tasks into reasoned steps. To make this work, the authors built the Aguvis Data Collection dataset containing screen images annotated for both element grounding and reasoning chains, then trained models in two stages that first master visual grounding before adding planning. A sympathetic reader would care because current GUI automation usually depends on brittle text representations and platform-specific rules that break across apps and operating systems. If the approach holds, it opens the door to agents that interact with computers the way humans do, by looking at the screen and thinking visually.

Core claim

Aguvis is a unified pure vision framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. It is enabled by the Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show state-of-the-art performance across offline and real-world online benchmarks, establishing the first fully autonomous vision-based GUI agent that operates without closed-source models.

What carries the argument

The two-stage training pipeline that first teaches visual GUI grounding on annotated screen images before adding planning and reasoning, supported by the Aguvis Data Collection dataset of multimodal annotations.
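
As a rough illustration of the grounding-then-planning ordering described above, the sketch below splits a mixed pool of examples into two sequential stages. The `Example` fields, the `kind` labels, and the stage names are hypothetical and chosen only for illustration; the paper's actual data format and training code are not shown on this page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Example:
    screenshot: str   # hypothetical path to a screen image
    instruction: str  # natural-language task or element description
    target: str       # supervised output text
    kind: str         # hypothetical label: "grounding" or "planning"


def build_schedule(examples: List[Example]) -> List[Tuple[str, List[Example]]]:
    """Order training data so grounding examples come before planning ones."""
    grounding = [e for e in examples if e.kind == "grounding"]
    planning = [e for e in examples if e.kind == "planning"]
    return [("stage1_grounding", grounding), ("stage2_planning_reasoning", planning)]


# Placeholder examples, not taken from the Aguvis Data Collection.
pool = [
    Example("s1.png", "Open settings", "Thought: the gear icon opens settings. Action: click", "planning"),
    Example("s2.png", "the Save button", "click(0.91, 0.05)", "grounding"),
]
schedule = build_schedule(pool)
```

The point of the ordering is that the second stage can assume the model already locates elements on screen, so its examples can concentrate on multi-step reasoning.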

If this is right

GUI agents no longer require textual screen descriptions or platform-specific action definitions to function.
The same trained model can be applied to different operating systems and applications through standardized visual interaction.
Reasoning about tasks occurs internally via inner monologue rather than relying on external parsers or closed models.
Fully open models can now match or exceed performance previously tied to proprietary systems on GUI benchmarks.
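
One way to picture a standardized visual action space paired with an inner monologue is sketched below. It assumes clicks are stored as fractions of the screen size so the same action can be replayed at any resolution. The `Click`, `Step`, and `to_pixels` names are hypothetical, and the page does not say which concrete action format the authors use.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Click:
    x: float  # horizontal position as a fraction of screen width, 0.0 to 1.0
    y: float  # vertical position as a fraction of screen height, 0.0 to 1.0


@dataclass(frozen=True)
class Step:
    thought: str   # inner-monologue text produced before acting
    action: Click  # platform-independent action


def to_pixels(action: Click, width: int, height: int) -> Tuple[int, int]:
    """Map a normalized click onto a concrete screen resolution."""
    if not (0.0 <= action.x <= 1.0 and 0.0 <= action.y <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    return round(action.x * width), round(action.y * height)


step = Step(thought="The search box is at the top center.", action=Click(0.5, 0.25))
desktop_xy = to_pixels(step.action, 1920, 1080)  # (960, 270)
phone_xy = to_pixels(step.action, 1080, 2400)    # (540, 600)
```

Keeping the thought next to the action in one record makes the reasoning inspectable without any external parser.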

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The grounding-then-reasoning separation may transfer to training agents for other visual control domains such as robotics or web navigation.
If visual grounding proves robust, future work could test whether adding real-time feedback loops improves handling of changing interface states.
Scaling the dataset size or diversity could be directly measured by tracking how cross-platform success rates change with additional annotated examples.

Load-bearing premise

The new dataset of screen images with grounding and reasoning annotations is representative enough of real GUI variations to support cross-platform generalization after the two-stage training.

What would settle it

Running the trained Aguvis model on a new operating system or application family absent from the training dataset and checking whether success rates on representative tasks fall sharply below the reported benchmark levels.
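
The held-out test described above could be scored roughly as follows: compute a success rate per platform, then compare the average over platforms seen in training with the average over held-out ones. The function names and the sample records are hypothetical placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def success_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of successful task attempts for each platform."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for platform, succeeded in results:
        totals[platform] += 1
        wins[platform] += int(succeeded)
    return {p: wins[p] / totals[p] for p in totals}


def generalization_gap(results: Iterable[Tuple[str, bool]], held_out: Set[str]) -> float:
    """Mean success on seen platforms minus mean success on held-out platforms."""
    rates = success_rates(results)
    seen = [r for p, r in rates.items() if p not in held_out]
    unseen = [r for p, r in rates.items() if p in held_out]
    if not seen or not unseen:
        raise ValueError("need at least one seen and one held-out platform")
    return sum(seen) / len(seen) - sum(unseen) / len(unseen)


# Made-up outcomes for illustration only.
records = [
    ("android", True), ("android", False),
    ("web", True), ("web", True),
    ("new_os", False), ("new_os", True), ("new_os", False), ("new_os", False),
]
gap = generalization_gap(records, held_out={"new_os"})  # 0.75 - 0.25 = 0.5
```

A gap near zero would support the generalization claim, while a large positive gap would suggest the dataset does not cover the new platform's interface conventions.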

Original abstract

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Aguvis, a unified pure-vision framework for autonomous GUI agents that processes raw screen images directly, standardizes cross-platform action spaces, and incorporates structured reasoning via inner monologue. It constructs the Aguvis Data Collection dataset containing multimodal grounding and reasoning annotations and trains models with a two-stage pipeline that first learns grounding then planning/reasoning. Experiments report state-of-the-art results on offline benchmarks and real-world online tasks, positioning Aguvis as the first fully autonomous vision-only GUI agent that does not rely on closed-source models; all datasets, models, and recipes are released.

Significance. If the reported gains are robust, the work would advance the field by showing that carefully annotated vision-only data can support cross-platform GUI autonomy without textual interfaces or proprietary LLMs. The open release of the full pipeline and data is a concrete strength that supports reproducibility.

major comments (2)

[§3] §3 (Aguvis Data Collection): The manuscript provides no quantitative metrics on platform balance (e.g., fraction of Android vs. web vs. desktop trajectories), GUI-element diversity, or inter-annotator agreement for the multimodal grounding and reasoning labels. Because the central SOTA claim rests on the two-stage pipeline generalizing from this dataset, the absence of these statistics leaves the representativeness assumption unverified and load-bearing for the cross-platform results.
[§5] §5 (Experiments): The offline and online benchmark tables report point estimates without error bars, multiple random seeds, or statistical significance tests against the strongest baselines. Given that the soundness assessment notes missing experimental details, these omissions prevent assessment of whether the claimed improvements are stable or could be explained by dataset-specific fitting.
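
For the inter-annotator agreement requested in the §3 comment, one common statistic for two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal standard-library version; the label values are invented for illustration and do not come from the dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["button", "button", "link", "link"]
annotator_2 = ["button", "link", "link", "link"]
kappa = cohens_kappa(annotator_1, annotator_2)  # (0.75 - 0.5) / (1 - 0.5) = 0.5
```

Kappa applies directly to categorical labels such as element types. Free-form reasoning annotations would need a different agreement measure or a prior categorization step.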

minor comments (2)

[§4.2] The inner-monologue prompting template in §4.2 could be moved to an appendix or figure for easier reference during replication.
[Figure 3] Figure 3 (qualitative examples) would benefit from explicit annotation of which stage (grounding vs. reasoning) produced each output token.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate the suggested improvements in the revised version to strengthen the presentation of the dataset and experimental results.

Point-by-point responses

Referee: [§3] §3 (Aguvis Data Collection): The manuscript provides no quantitative metrics on platform balance (e.g., fraction of Android vs. web vs. desktop trajectories), GUI-element diversity, or inter-annotator agreement for the multimodal grounding and reasoning labels. Because the central SOTA claim rests on the two-stage pipeline generalizing from this dataset, the absence of these statistics leaves the representativeness assumption unverified and load-bearing for the cross-platform results.

Authors: We acknowledge that the current version of the manuscript does not include these quantitative metrics. In the revision, we will add a new table and accompanying text in §3 reporting: (1) the exact fractions of trajectories across Android, web, and desktop platforms; (2) statistics on GUI-element diversity, including counts and distributions of element types; and (3) inter-annotator agreement scores (e.g., Cohen’s kappa) for both grounding and reasoning annotations. These additions will directly support the representativeness of the dataset and the generalization claims of the two-stage pipeline.

revision: yes

Referee: [§5] §5 (Experiments): The offline and online benchmark tables report point estimates without error bars, multiple random seeds, or statistical significance tests against the strongest baselines. Given that the soundness assessment notes missing experimental details, these omissions prevent assessment of whether the claimed improvements are stable or could be explained by dataset-specific fitting.

Authors: We agree that reporting only point estimates limits the ability to assess result stability. In the revised §5, we will rerun the key experiments with multiple random seeds (reporting means and standard deviations as error bars in the tables) and include statistical significance tests (e.g., paired t-tests) against the strongest baselines. This will provide evidence that the reported gains are robust rather than attributable to variance or dataset-specific effects.

revision: yes
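
The seed-level reporting proposed in this response could look roughly like the sketch below: a mean and sample standard deviation per system, plus a paired t statistic over per-seed differences. It assumes both systems were run with the same seeds so that scores pair up. All numbers are placeholders, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for paired per-seed differences (degrees of freedom = n - 1)."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two paired runs")
    diffs = [o - b for o, b in zip(ours, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; t statistic is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder scores for three seeds.
ours_scores = [3.0, 5.0, 7.0]
baseline_scores = [1.0, 2.0, 6.0]
mean_ours, std_ours = summarize(ours_scores)                  # 5.0, 2.0
t_value = paired_t_statistic(ours_scores, baseline_scores)    # 2 * sqrt(3)
```

Turning the statistic into a p-value requires the t distribution with n - 1 degrees of freedom; a library routine such as `scipy.stats.ttest_rel` returns both values in one call.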

Circularity Check

0 steps flagged

No circularity: empirical pipeline with benchmark results

full rationale

The paper describes an empirical system: construction of the Aguvis Data Collection dataset with multimodal annotations, followed by a two-stage training pipeline (grounding then planning/reasoning) and evaluation on offline and real-world online benchmarks. No equations, first-principles derivations, or fitted parameters are presented whose outputs reduce to the inputs by construction. Claims of SOTA performance rest on reported experimental results rather than any self-referential prediction or self-citation chain that would force the outcome. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of a large-scale vision-language model trained on a custom dataset; no explicit mathematical axioms or invented physical entities are introduced. Hyperparameters of the training pipeline and model architecture choices function as free parameters.

free parameters (1)

two-stage training hyperparameters
Learning rates, batch sizes, and stage-transition criteria are chosen to make the grounding-then-reasoning pipeline work.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
cs.AI 2025-12 accept novelty 8.0

MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG 2026-05 unverdicted novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
cs.CV 2026-05 unverdicted novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
cs.CR 2025-10 unverdicted novelty 7.0

SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
BAMI: Training-Free Bias Mitigation in GUI Grounding
cs.CV 2026-05 unverdicted novelty 6.0

BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 6.0

SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
cs.CL 2026-04 conditional novelty 6.0

VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
cs.CV 2026-04 unverdicted novelty 6.0

UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

UI-in-the-Loop makes multimodal models explicitly learn UI element locations, meanings, and uses in a cyclic screen-element-action loop, delivering better UI comprehension and GUI reasoning on a new 26K-sample benchmark.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
cs.AI 2025-12 unverdicted novelty 6.0

EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
cs.AI 2025-10 unverdicted novelty 6.0

MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
RISK: A Framework for GUI Agents in E-commerce Risk Management
cs.AI 2025-09 unverdicted novelty 6.0

RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
GTA1: GUI Test-time Scaling Agent
cs.AI 2025-07 unverdicted novelty 6.0

GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
cs.AI 2025-03 accept novelty 6.0

UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
cs.AI 2026-04 unverdicted novelty 5.0

LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

Reference graph

Works this paper leans on

120 extracted references · 120 canonical work pages · cited by 19 Pith papers · 15 internal anchors

[1]

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence , title =

Chongyang Bai and Xiaoxue Zang and Ying Xu and Srinivas Sunkara and Abhinav Rastogi and Jindong Chen and Blaise Ag. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence , title =

work page
[2]

Bucker and Lawrence Jang and Zack Hui , journal =

Rogerio Bonatti and Dan Zhao and Francesco Bonacci and Dillon Dupont and Sara Abdali and Yinheng Li and Yadong Lu and Justin Wagle and Kazuhito Koishida and Arthur Fender C. Bucker and Lawrence Jang and Zack Hui , journal =. Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale , url =

work page
[3]

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? , url =

Ruisheng Cao and Fangyu Lei and Haoyuan Wu and Jixuan Chen and Yeqiao Fu and Hongcheng Gao and Xinzhuang Xiong and Hanchong Zhang and Yuchen Mao and Wenjing Hu and Tianbao Xie and Hongshen Xu and Danyang Zhang and Sida Wang and Ruoxi Sun and Pengcheng Yin and Caiming Xiong and Ansong Ni and Qian Liu and Victor Zhong and Lu Chen and Kai Yu and Tao Yu , jou...

work page
[4]

AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents , url =

Chai, Yuxiang and Huang, Siyuan and Niu, Yazhe and Xiao, Han and Liu, Liang and Zhang, Dingyu and Gao, Peng and Ren, Shuai and Li, Hongsheng , journal =. AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents , url =

work page
[5]

Rajat Chawla and Adarsh Jha and Muskaan Kumar and Mukunda NS and Ishaan Bhola , journal =

work page
[6]

Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models , year =

Zehui Chen and Kuikun Liu and Qiuchen Wang and Wenwei Zhang and Jiangning Liu and Dahua Lin and Kai Chen and Feng Zhao , booktitle =. Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models , year =

work page
[7]

GUICourse: From General Vision Language Models to Versatile GUI Agents , url =

Chen, Wentong and Cui, Junbo and Hu, Jinyi and Qin, Yujia and Fang, Junjie and Zhao, Yue and Wang, Chongyi and Liu, Jun and Chen, Guirong and Huo, Yupeng and others , journal =. GUICourse: From General Vision Language Models to Versatile GUI Agents , url =

work page
[8]

SeeClick: Harnessing

Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu , booktitle =. SeeClick: Harnessing

work page
[9]

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution , url =

Mostafa Dehghani and Basil Mustafa and Josip Djolonga and Jonathan Heek and Matthias Minderer and Mathilde Caron and Andreas Steiner and Joan Puigcerver and Robert Geirhos and Ibrahim Alabdulmohsin and Avital Oliver and Piotr Padlewski and Alexey Gritsenko and Mario Lučić and Neil Houlsby , journal =. Patch n' Pack: NaViT, a Vision Transformer for any Asp...

work page
[10]

Biplab Deka and Zifeng Huang and Chad Franzen and Joshua Hibschman and Daniel Afergan and Yang Li and Jeffrey Nichols and Ranjitha Kumar , booktitle =. Rico:

work page
[11]

Mind2Web: Towards a Generalist Agent for the Web , year =

Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samual Stevens and Boshi Wang and Huan Sun and Yu Su , booktitle =. Mind2Web: Towards a Generalist Agent for the Web , year =

work page
[12]

Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste , journal =

Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste , journal =. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? , url =

work page
[13]

CoRR , volume =

Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su , title =. CoRR , volume =. 2024 , url =

work page 2024
[14]

Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , booktitle =

Izzeddin Gur and Hiroki Furuta and Austin V. Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , booktitle =. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis , year =

work page
[15]

Cogagent: A visual language model for gui agents , year =

Hong, Wenyi and Wang, Weihan and Lv, Qingsong and Xu, Jiazheng and Yu, Wenmeng and Ji, Junhui and Wang, Yan and Wang, Zihan and Dong, Yuxiao and Ding, Ming and others , booktitle =. Cogagent: A visual language model for gui agents , year =

work page
[16]

Inner monologue: Embodied reasoning through planning with language models , url =

Huang, Wenlong and Xia, Fei and Xiao, Ted and Chan, Harris and Liang, Jacky and Florence, Pete and Zeng, Andy and Tompson, Jonathan and Mordatch, Igor and Chebotar, Yevgen and others , journal =. Inner monologue: Embodied reasoning through planning with language models , url =

work page
[17]

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web , url =

Kapoor, Raghav and Butala, Yash Parag and Russak, Melisa and Koh, Jing Yu and Kamble, Kiran and Alshikh, Waseem and Salakhutdinov, Ruslan , journal =. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web , url =

work page
[18]

Language Models can Solve Computer Tasks , year =

Geunwoo Kim and Pierre Baldi and Stephen McAleer , booktitle =. Language Models can Solve Computer Tasks , year =

work page
[19]

Tree Search for Language Model Agents , url =

Koh, Jing Yu and McAleer, Stephen and Fried, Daniel and Salakhutdinov, Ruslan , journal =. Tree Search for Language Model Agents , url =

work page
[20]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , title =

Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , title =

work page
[21]

ArXiv preprint , title =

Xing Han L. ArXiv preprint , title =

work page
[22]

AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent , url =

Lai, Hanyu and Liu, Xiao and Iong, Iat Long and Yao, Shuntian and Chen, Yuxuan and Shen, Pengbo and Yu, Hao and Zhang, Hanchen and Zhang, Xiaohan and Dong, Yuxiao and others , journal =. AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent , url =

work page
[23]

Mapping Natural Language Instructions to Mobile

Li, Yang and He, Jiacong and Zhou, Xin and Zhang, Yuan and Baldridge, Jason , booktitle =. Mapping Natural Language Instructions to Mobile

work page
[24]

Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements , url =

Li, Yang and Li, Gang and He, Luheng and Zheng, Jingjie and Li, Hong and Guan, Zhiwei , booktitle =. Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements , url =

work page
[25]

On the Effects of Data Scale on Computer Control Agents , url =

Wei Li and William Bishop and Alice Li and Chris Rawles and Folawiyo Campbell-Ajala and Divya Tyamagundlu and Oriana Riva , journal =. On the Effects of Data Scale on Computer Control Agents , url =

work page
[26]

Llava-onevision: Easy visual task transfer , url =

Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan , journal =. Llava-onevision: Easy visual task transfer , url =

work page
[27]

Decoupled Weight Decay Regularization , url =

Ilya Loshchilov and Frank Hutter , biburl =. Decoupled Weight Decay Regularization , url =. 7th International Conference on Learning Representations,

work page
[28]

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices , url =

Lu, Quanfeng and Shao, Wenqi and Liu, Zitao and Meng, Fanqing and Li, Boxuan and Chen, Botong and Huang, Siyuan and Zhang, Kaipeng and Qiao, Yu and Luo, Ping , journal =. GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices , url =

work page
[29]

OmniParser for Pure Vision Based GUI Agent , url =

Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah , journal =. OmniParser for Pure Vision Based GUI Agent , url =

work page
[30]

Webgpt: Browser-assisted question-answering with human feedback , url =

Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others , journal =. Webgpt: Browser-assisted question-answering with human feedback , url =

work page
[31]

ScreenAgent: A Vision Language Model-driven Computer Control Agent , url =

Runliang Niu and Jindong Li and Shiqi Wang and Yali Fu and Xiyu Hu and Xueyuan Leng and He Kong and Yi Chang and Qi Wang , journal =. ScreenAgent: A Vision Language Model-driven Computer Control Agent , url =

work page
[32]

WebCanvas: Benchmarking Web Agents in Online Environments , url =

Pan, Yichen and Kong, Dehan and Zhou, Sida and Cui, Cheng and Leng, Yifei and Jiang, Bing and Liu, Hangyu and Shang, Yanyi and Zhou, Shuyan and Wu, Tongshuang and others , journal =. WebCanvas: Benchmarking Web Agents in Online Environments , url =

work page
[33]

Playwright for Python Documentation , year =

work page
[34]

Agent q: Advanced reasoning and learning for autonomous ai agents , url =

Putta, Pranav and Mills, Edmund and Garg, Naman and Motwani, Sumeet and Finn, Chelsea and Garg, Divyansh and Rafailov, Rafael , journal =. Agent q: Advanced reasoning and learning for autonomous ai agents , url =

work page
[35]

Pranav Putta and Edmund Mills and Naman Garg and Sumeet Motwani and Chelsea Finn and Divyansh Garg and Rafael Rafailov , journal =. Agent

work page
[36]

PyTorch: An Imperative Style, High-Performance Deep Learning Library , url =

Adam Paszke and Sam Gross and Francisco Massa and Adam Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Zeming Lin and Natalia Gimelshein and Luca Antiga and Alban Desmaison and Andreas K. PyTorch: An Imperative Style, High-Performance Deep Learning Library , url =. Advances in Neural Information Processing Systems 32: Annual Conference ...

work page 2019
[37]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , url =

Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang , journal =. Qwen2-VL: Enhancing Vision-Language Model's P...

work page
[38]

ZeRO: memory optimizations toward training trillion parameter models , year =

Samyam Rajbhandari and Jeff Rasley and Olatunji Ruwase and Yuxiong He , booktitle =. ZeRO: memory optimizations toward training trillion parameter models , year =

work page
[39]

Androidinthewild: A large-scale dataset for android device control , year =

Rawles, Christopher and Li, Alice and Rodriguez, Daniel and Riva, Oriana and Lillicrap, Timothy , journal =. Androidinthewild: A large-scale dataset for android device control , year =

work page
[40]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents , url =

Christopher Rawles and Sarah Clinckemaillie and Yifan Chang and Jonathan Waltz and Gabrielle Lau and Marybeth Fair and Alice Li and William Bishop and Wei Li and Folawiyo Campbell-Ajala and Daniel Toyama and Robert Berry and Divya Tyamagundlu and Timothy Lillicrap and Oriana Riva , journal =. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous...

work page
[41]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , url =

Team, Gemini and Georgiev, Petko and Lei, Ving Ian and Burnell, Ryan and Bai, Libin and Gulati, Anmol and Tanzer, Garrett and Vincent, Damien and Pan, Zhufeng and Wang, Shibo and others , journal =. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , url =

work page
[42]

Roformer: Enhanced transformer with rotary position embedding , year =

Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng , journal =. Roformer: Enhanced transformer with rotary position embedding , year =

work page
[43]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , url =

Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and others , journal =. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , url =

work page
[44]

ArXiv preprint , title =

Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R. ArXiv preprint , title =

work page
[45]

Michael Wornow and Avanika Narayan and Ben T Viggiano and Ishan S. Khare and Tathagat Verma and Tibor Thompson and Miguel Angel Fuentes Hernandez and Sudharsan Sundar and Chloe Trujillo and Krrish Chawla and Rongfei Lu and Justin Shen and Divya Nagaraj and Joshua Martinez and Vardhan Agrawal and Althea Hudson and Nigam H. Shah and Christopher Re , journal...

work page
[46]

WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics , year =

Jason Wu and Siyan Wang and Siman Shen and Yi-Hao Peng and Jeffrey Nichols and Jeffrey Bigham , journal =. WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics , year =

work page
[47]

Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and others. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.
[48]

Tianqi Xu and Linyao Chen and Dai-Jie Wu and Yanjun Chen and Zecheng Zhang and Xiang Yao and Zhiqiang Xie and Yongchao Chen and Shilong Liu and Bochen Qian and Philip H. S. Torr and Bernard Ghanem and G. Li. CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents.
[49]

Yiheng Xu and Hongjin Su and Chen Xing and Boyu Mi and Qian Liu and Weijia Shi and Binyuan Hui and Fan Zhou and Yitao Liu and Tianbao Xie and Zhoujun Cheng and Siheng Zhao and Lingpeng Kong and Bailin Wang and Caiming Xiong and Tao Yu. Lemur: Harmonizing Natural Language and Code for Language Agents.
[50]

Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chun-yue Li and Jianfeng Gao. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V.
[51]

Da Yin and Faeze Brahman and Abhilasha Ravichander and Khyathi Raghavi Chandu and Kai. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
[52]

Aohan Zeng and Mingdao Liu and Rui Lu and Bowen Wang and Xiao Liu and Yuxiao Dong and Jie Tang. AgentTuning: Enabling Generalized Agent Abilities for LLMs.
[53]

Xiaoyan Zhang and Zhao Yang and Jiaxuan Liu and Yucheng Han and Xin Chen and Zebiao Huang and Bin Fu and Gang Yu. AppAgent: Multimodal Agents as Smartphone Users.
[54]

Zhang, Jiwen and Wu, Jihao and Teng, Yihua and Liao, Minghui and Xu, Nuo and Xiao, Xiao and Wei, Zhongyu and Tang, Duyu. Android in the zoo: Chain-of-action-thought for gui agents.
[55]

xLAM: A Family of Large Action Models to Empower AI Agent Systems

Jianguo Zhang and Tian Lan and Ming Zhu and Zuxin Liu and Thai Hoang and Shirley Kokane and Weiran Yao and Juntao Tan and Akshara Prabhakar and Haolin Chen and Zhiwei Liu and Yihao Feng and Tulika Awalgaonkar and Rithesh Murthy and Eric Hu and Zeyuan Chen and Ran Xu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Silvio Savarese and Caiming ...

[56]

Zhuosheng Zhang and Aston Zhang. You Only Look at Screens: Multimodal Chain-of-Action Agents.
[57]

Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su. GPT-4V(ision) is a Generalist Web Agent, if Grounded.
[58]

Longtao Zheng and Rundong Wang and Xinrun Wang and Bo An. Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control.
[59]

Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig. WebArena:
[62]

Shuai Bai and Keqin Chen and Xuejing Liu and Jialin Wang and Wenbin Ge and Sibo Song and Kai Dang and Peng Wang and Shijie Wang and Jun Tang and Humen Zhong and Yuanzhi Zhu and Ming. Qwen2.5-VL Technical Report.
[63]

Devil's Advocate: Anticipatory Reflection for LLM Agents. 2024.
[64]

A Zero-Shot Language Agent for Computer Control with Structured Reflection. 2023.
[66]

Tao Li and Gang Li and Jingjie Zheng and Purple Wang and Yang Li. Findings of the Association for Computational Linguistics. 2024.
[67]

Bai, C., Zang, X., Xu, Y., Sunkara, S., Rastogi, A., Chen, J., and y Arcas, B. A. Uibert: Learning generic multimodal representations for UI understanding. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021

[68]


Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., and Lin, J. Qwen2.5-vl technical report. CoRR, abs/2502.13923, 2025

[69]

Bonatti, R., Zhao, D., Bonacci, F., Dupont, D., Abdali, S., Li, Y., Lu, Y., Wagle, J., Koishida, K., Bucker, A. F. C., Jang, L., and Hui, Z. Windows agent arena: Evaluating multi-modal os agents at scale. ArXiv preprint, 2024. URL https://api.semanticscholar.org/CorpusID:272600411

[70]


Cao, R., Lei, F., Wu, H., Chen, J., Fu, Y., Gao, H., Xiong, X., Zhang, H., Mao, Y., Hu, W., Xie, T., Xu, H., Zhang, D., Wang, S., Sun, R., Yin, P., Xiong, C., Ni, A., Liu, Q., Zhong, V., Chen, L., Yu, K., and Yu, T. Spider2-v: How far are multimodal agents from automating data science and engineering workflows? ArXiv preprint, 2024. URL https://arxiv.org/...

[71]


Chai, Y., Huang, S., Niu, Y., Xiao, H., Liu, L., Zhang, D., Gao, P., Ren, S., and Li, H. Amex: Android multi-annotation expo dataset for mobile gui agents. ArXiv preprint, 2024. URL https://arxiv.org/abs/2407.17490

[72]

Chen, W., Cui, J., Hu, J., Qin, Y., Fang, J., Zhao, Y., Wang, C., Liu, J., Chen, G., Huo, Y., et al. Guicourse: From general vision language models to versatile gui agents. ArXiv preprint, 2024a. URL https://arxiv.org/abs/2406.11317
[73]


Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, T., Wu, Z., Deng, H., Ge, J., Chen, K., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D., Qiao, Y., Dai, J., a...

doi:10.48550/arxiv.2412.05271
[74]


Cheng, K., Sun, Q., Chu, Y., Xu, F., Li, Y., Zhang, J., and Wu, Z. Seeclick: Harnessing GUI grounding for advanced visual GUI agents. In Ku, L., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024 , 2024. URL h...

doi:10.18653/v1/2024.acl-long.505
[75]

Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J. S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., Lu, J., Anderson, T., Bransom, E., Ehsani, K., Ngo, H., Chen, Y., Patel, A., Yatskar, M., Callison - Burch, C., Head, A., Hendrix, R., Bastani, F., VanderBilt, E., Lambert, N., Chou, Y., Chheda, A., Sparks, J., Skjonsberg, S., Schmitz, M...

doi:10.48550/arxiv.2409.17146
[76]

Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., and Kumar, R. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 2017.
[77]

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems, 2023.
[78]

Drouin, A., Gasse, M., Caccia, M., Laradji, I. H., Verme, M. D., Marty, T., Boisvert, L., Thakkar, M., Cappart, Q., Vazquez, D., Chapados, N., and Lacoste, A. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718
[79]

Gou, B., Wang, R., Zheng, B., Xie, Y., Chang, C., Shu, Y., Sun, H., and Su, Y. Navigating the digital world as humans do: Universal visual grounding for GUI agents. CoRR, abs/2410.05243, 2024. URL https://doi.org/10.48550/arXiv.2410.05243
[80]

Gur, I., Furuta, H., Huang, A. V., Safdari, M., Matsuo, Y., Eck, D., and Faust, A. A real-world webagent with planning, long context understanding, and program synthesis. In International Conference on Learning Representations, 2024.
[81]

Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[82]

Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y., et al. Inner monologue: Embodied reasoning through planning with language models. ArXiv preprint, 2022. URL https://arxiv.org/abs/2207.05608
[83]

Kapoor, R., Butala, Y. P., Russak, M., Koh, J. Y., Kamble, K., Alshikh, W., and Salakhutdinov, R. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. ArXiv preprint, 2024. URL https://arxiv.org/abs/2402.17553
