Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Pith reviewed 2026-05-18 04:03 UTC · model grok-4.3
The pith
A pure vision framework lets AI agents control GUIs across platforms by processing screen images directly and reasoning internally, without textual screen representations or closed-source models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aguvis is a unified pure vision framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions, and incorporates structured reasoning via inner monologue. It is enabled by the Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show state-of-the-art performance across offline and real-world online benchmarks, establishing the first fully autonomous vision-based GUI agent that operates without closed-source models.
What carries the argument
The two-stage training pipeline that first teaches visual GUI grounding on annotated screen images before adding planning and reasoning, supported by the Aguvis Data Collection dataset of multimodal annotations.
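To make the ordering concrete, here is a minimal Python sketch of a two-stage schedule. It assumes each training sample is tagged with whether it carries reasoning annotations. The `Sample` type, the `has_reasoning` flag, and the helper names are hypothetical and chosen for illustration; the paper's actual data format and training code may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """Hypothetical training sample: a screenshot path plus a supervision target."""
    screenshot: str
    instruction: str
    target: str
    has_reasoning: bool  # True if the target includes planning/reasoning text


def split_into_stages(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Stage 1 gets grounding-only samples; stage 2 gets reasoning-annotated ones."""
    stage1 = [s for s in samples if not s.has_reasoning]
    stage2 = [s for s in samples if s.has_reasoning]
    return stage1, stage2


def run_schedule(samples: List[Sample]) -> List[str]:
    """Return the order in which samples are seen: all of stage 1, then stage 2."""
    stage1, stage2 = split_into_stages(samples)
    return [f"stage1:{s.instruction}" for s in stage1] + [
        f"stage2:{s.instruction}" for s in stage2
    ]
```

The point of the sketch is only the ordering: grounding supervision is exhausted before any reasoning-annotated sample is seen.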
If this is right
- GUI agents no longer require textual screen descriptions or platform-specific action definitions to function.
- The same trained model can be applied to different operating systems and applications through standardized visual interaction.
- Reasoning about tasks occurs internally via inner monologue rather than relying on external parsers or closed models.
- Fully open models can now match or exceed performance previously tied to proprietary systems on GUI benchmarks.
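One way to picture a standardized visual action space: actions carry coordinates normalized to [0, 1], and only a thin adapter converts them to pixels for a given screen. The sketch below is illustrative only. `Action`, `Step`, and `to_pixels` are hypothetical names, and the paper's actual action format may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Action:
    """Hypothetical platform-agnostic action with normalized coordinates."""
    kind: str  # e.g. "click" or "type"
    x: float = 0.0
    y: float = 0.0
    text: str = ""


@dataclass(frozen=True)
class Step:
    """One agent step: an inner-monologue thought followed by an action."""
    thought: str
    action: Action


def to_pixels(action: Action, width: int, height: int) -> Tuple[int, int]:
    """Map normalized (x, y) in [0, 1] to pixel coordinates on a given screen."""
    if not (0.0 <= action.x <= 1.0 and 0.0 <= action.y <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # Clamp so that a coordinate of exactly 1.0 maps to the last valid pixel.
    px = min(int(action.x * width), width - 1)
    py = min(int(action.y * height), height - 1)
    return px, py


step = Step(
    thought="The target button is in the upper middle of the screen.",
    action=Action(kind="click", x=0.5, y=0.25),
)
print(to_pixels(step.action, 1920, 1080))  # (960, 270) on a desktop-sized screen
print(to_pixels(step.action, 1080, 2400))  # (540, 600) on a phone-sized screen
```

The same `Step` resolves to different pixels on each screen, which is the sense in which one model output could serve several platforms.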
Where Pith is reading between the lines
- The grounding-then-reasoning separation may transfer to training agents for other visual control domains such as robotics or web navigation.
- If visual grounding proves robust, future work could test whether adding real-time feedback loops improves handling of changing interface states.
- The effect of scaling dataset size or diversity could be measured directly by tracking how cross-platform success rates change as annotated examples are added.
Load-bearing premise
The new dataset of screen images with grounding and reasoning annotations is representative enough of real GUI variations to support cross-platform generalization after the two-stage training.
What would settle it
Running the trained Aguvis model on a new operating system or application family absent from the training dataset and checking whether success rates on representative tasks fall sharply below the reported benchmark levels.
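The held-out test described above can be scored with a few lines of Python. This is a sketch under stated assumptions: results arrive as `(platform, success)` pairs, and the set of platforms seen in training is known. The function names and the way the gap is defined are illustrative choices, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def success_by_platform(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the task success rate for each platform."""
    counts = defaultdict(lambda: [0, 0])  # platform -> [successes, total]
    for platform, ok in results:
        counts[platform][0] += int(ok)
        counts[platform][1] += 1
    return {p: s / n for p, (s, n) in counts.items()}


def generalization_gap(results: Iterable[Tuple[str, bool]], seen: Set[str]) -> float:
    """Mean success rate on seen platforms minus mean rate on unseen platforms."""
    rates = success_by_platform(results)
    seen_rates = [r for p, r in rates.items() if p in seen]
    unseen_rates = [r for p, r in rates.items() if p not in seen]
    if not seen_rates or not unseen_rates:
        raise ValueError("need at least one seen and one unseen platform")
    return sum(seen_rates) / len(seen_rates) - sum(unseen_rates) / len(unseen_rates)
```

A large positive gap on a platform family absent from training would count against the representativeness premise; a gap near zero would support it.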
Original abstract
Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Aguvis, a unified pure-vision framework for autonomous GUI agents that processes raw screen images directly, standardizes cross-platform action spaces, and incorporates structured reasoning via inner monologue. It constructs the Aguvis Data Collection dataset containing multimodal grounding and reasoning annotations and trains models with a two-stage pipeline that first learns grounding then planning/reasoning. Experiments report state-of-the-art results on offline benchmarks and real-world online tasks, positioning Aguvis as the first fully autonomous vision-only GUI agent that does not rely on closed-source models; all datasets, models, and recipes are released.
Significance. If the reported gains are robust, the work would advance the field by showing that carefully annotated vision-only data can support cross-platform GUI autonomy without textual interfaces or proprietary LLMs. The open release of the full pipeline and data is a concrete strength that supports reproducibility.
major comments (2)
- [§3] §3 (Aguvis Data Collection): The manuscript provides no quantitative metrics on platform balance (e.g., fraction of Android vs. web vs. desktop trajectories), GUI-element diversity, or inter-annotator agreement for the multimodal grounding and reasoning labels. Because the central SOTA claim rests on the two-stage pipeline generalizing from this dataset, the absence of these statistics leaves the representativeness assumption unverified and load-bearing for the cross-platform results.
- [§5] §5 (Experiments): The offline and online benchmark tables report point estimates without error bars, multiple random seeds, or statistical significance tests against the strongest baselines. Given that the soundness assessment notes missing experimental details, these omissions prevent assessment of whether the claimed improvements are stable or could be explained by dataset-specific fitting.
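For the inter-annotator agreement requested in the first major comment, Cohen's kappa is one common statistic for two annotators assigning categorical labels. The sketch below is a plain-Python version under that assumption; the function name and the handling of the degenerate case are illustrative choices, and nothing here reflects how the paper's annotations were actually produced.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        # Kappa is undefined here; this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, so a value near 0 means the labels agree no better than chance even when raw agreement looks high.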
minor comments (2)
- [§4.2] The inner-monologue prompting template in §4.2 could be moved to an appendix or figure for easier reference during replication.
- [Figure 3] Figure 3 (qualitative examples) would benefit from explicit annotation of which stage (grounding vs. reasoning) produced each output token.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will incorporate the suggested improvements in the revised version to strengthen the presentation of the dataset and experimental results.
Point-by-point responses
- Referee: [§3] §3 (Aguvis Data Collection): The manuscript provides no quantitative metrics on platform balance (e.g., fraction of Android vs. web vs. desktop trajectories), GUI-element diversity, or inter-annotator agreement for the multimodal grounding and reasoning labels. Because the central SOTA claim rests on the two-stage pipeline generalizing from this dataset, the absence of these statistics leaves the representativeness assumption unverified and load-bearing for the cross-platform results.
  Authors: We acknowledge that the current version of the manuscript does not include these quantitative metrics. In the revision, we will add a new table and accompanying text in §3 reporting: (1) the exact fractions of trajectories across Android, web, and desktop platforms; (2) statistics on GUI-element diversity, including counts and distributions of element types; and (3) inter-annotator agreement scores (e.g., Cohen’s kappa) for both grounding and reasoning annotations. These additions will directly support the representativeness of the dataset and the generalization claims of the two-stage pipeline.
  Revision: yes
- Referee: [§5] §5 (Experiments): The offline and online benchmark tables report point estimates without error bars, multiple random seeds, or statistical significance tests against the strongest baselines. Given that the soundness assessment notes missing experimental details, these omissions prevent assessment of whether the claimed improvements are stable or could be explained by dataset-specific fitting.
  Authors: We agree that reporting only point estimates limits the ability to assess result stability. In the revised §5, we will rerun the key experiments with multiple random seeds (reporting means and standard deviations as error bars in the tables) and include statistical significance tests (e.g., paired t-tests) against the strongest baselines. This will provide evidence that the reported gains are robust rather than attributable to variance or dataset-specific effects.
  Revision: yes
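The seed-level reporting and paired test discussed in the second response can be sketched with the Python standard library. The assumptions here are that each system has one score per seed and that scores are paired by seed; the function names are hypothetical. The sketch returns only the t statistic and degrees of freedom; turning those into a p-value needs a t distribution, for which a library routine such as `scipy.stats.ttest_rel` could be used instead.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs >= 2 scores)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(
    ours: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two paired scores of equal length")
    diffs = [o - b for o, b in zip(ours, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    # t = mean difference divided by its standard error.
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only a handful of seeds the degrees of freedom are small, so even a sizeable t statistic may not reach significance; reporting the per-seed scores alongside the test keeps that visible.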
Circularity Check
No circularity: empirical pipeline with benchmark results
Full rationale
The paper describes an empirical system: construction of the Aguvis Data Collection dataset with multimodal annotations, followed by a two-stage training pipeline (grounding then planning/reasoning) and evaluation on offline and real-world online benchmarks. No equations, first-principles derivations, or fitted parameters are presented whose outputs reduce to the inputs by construction. Claims of SOTA performance rest on reported experimental results rather than any self-referential prediction or self-citation chain that would force the outcome. The claims are therefore checked against external benchmarks rather than against the paper's own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- two-stage training hyperparameters
Forward citations
Cited by 20 Pith papers
- MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
  MobiBench is the first modular multi-path offline benchmark for mobile GUI agents, achieving 94.72% agreement with human evaluators while allowing component-level analysis.
- Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
  BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
  Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- Benchmarking and Improving GUI Agents in High-Dynamic Environments
  DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
- Benchmarking and Improving GUI Agents in High-Dynamic Environments
  DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
- SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents
  SecureWebArena is a new benchmark suite for holistic security evaluation of LVLM-based web agents using diverse simulated environments, attack taxonomies, and multi-layered failure analysis across reasoning, behavior,...
- BAMI: Training-Free Bias Mitigation in GUI Grounding
  BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.
- SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
  SOLAR-RL assigns dense step-level rewards from static trajectory data by detecting first failure points and applying target-aligned shaping to improve long-horizon GUI task completion without full online interactions.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
  VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding
  UI-Zoomer uses uncertainty quantification to trigger and size adaptive zoom-ins only on uncertain GUI grounding predictions, yielding up to 13.4% gains on benchmarks with no training.
- What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
  UI-in-the-Loop makes multimodal models explicitly learn UI element locations, meanings, and uses in a cyclic screen-element-action loop, delivering better UI comprehension and GUI reasoning on a new 26K-sample benchmark.
- EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
  EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
- MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
  MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.
- RISK: A Framework for GUI Agents in E-commerce Risk Management
  RISK introduces a dataset, benchmark, and R1-style RL fine-tuning for GUI agents that achieve 6.8-8.8% offline gains and 70.5% online task success in e-commerce risk management using 7.2% of baseline parameters.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- GTA1: GUI Test-time Scaling Agent
  GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.
- UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
  UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
- Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
  A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
- Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
  LAMO uses role-oriented data synthesis and two-stage training (perplexity-weighted supervised fine-tuning plus reinforcement learning) to create scalable lightweight GUI agents that support both single-model and multi...
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
  UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
Reference graph
Works this paper leans on
-
[1]
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence , title =
Chongyang Bai and Xiaoxue Zang and Ying Xu and Srinivas Sunkara and Abhinav Rastogi and Jindong Chen and Blaise Ag. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence , title =
-
[2]
Bucker and Lawrence Jang and Zack Hui , journal =
Rogerio Bonatti and Dan Zhao and Francesco Bonacci and Dillon Dupont and Sara Abdali and Yinheng Li and Yadong Lu and Justin Wagle and Kazuhito Koishida and Arthur Fender C. Bucker and Lawrence Jang and Zack Hui , journal =. Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale , url =
-
[3]
Ruisheng Cao and Fangyu Lei and Haoyuan Wu and Jixuan Chen and Yeqiao Fu and Hongcheng Gao and Xinzhuang Xiong and Hanchong Zhang and Yuchen Mao and Wenjing Hu and Tianbao Xie and Hongshen Xu and Danyang Zhang and Sida Wang and Ruoxi Sun and Pengcheng Yin and Caiming Xiong and Ansong Ni and Qian Liu and Victor Zhong and Lu Chen and Kai Yu and Tao Yu , jou...
-
[4]
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents , url =
Chai, Yuxiang and Huang, Siyuan and Niu, Yazhe and Xiao, Han and Liu, Liang and Zhang, Dingyu and Gao, Peng and Ren, Shuai and Li, Hongsheng , journal =. AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents , url =
-
[5]
Rajat Chawla and Adarsh Jha and Muskaan Kumar and Mukunda NS and Ishaan Bhola , journal =
-
[6]
Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models , year =
Zehui Chen and Kuikun Liu and Qiuchen Wang and Wenwei Zhang and Jiangning Liu and Dahua Lin and Kai Chen and Feng Zhao , booktitle =. Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models , year =
-
[7]
GUICourse: From General Vision Language Models to Versatile GUI Agents , url =
Chen, Wentong and Cui, Junbo and Hu, Jinyi and Qin, Yujia and Fang, Junjie and Zhao, Yue and Wang, Chongyi and Liu, Jun and Chen, Guirong and Huo, Yupeng and others , journal =. GUICourse: From General Vision Language Models to Versatile GUI Agents , url =
-
[8]
Kanzhi Cheng and Qiushi Sun and Yougang Chu and Fangzhi Xu and Yantao Li and Jianbing Zhang and Zhiyong Wu , booktitle =. SeeClick: Harnessing
-
[9]
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution , url =
Mostafa Dehghani and Basil Mustafa and Josip Djolonga and Jonathan Heek and Matthias Minderer and Mathilde Caron and Andreas Steiner and Joan Puigcerver and Robert Geirhos and Ibrahim Alabdulmohsin and Avital Oliver and Piotr Padlewski and Alexey Gritsenko and Mario Lučić and Neil Houlsby , journal =. Patch n' Pack: NaViT, a Vision Transformer for any Asp...
-
[10]
Biplab Deka and Zifeng Huang and Chad Franzen and Joshua Hibschman and Daniel Afergan and Yang Li and Jeffrey Nichols and Ranjitha Kumar , booktitle =. Rico:
-
[11]
Mind2Web: Towards a Generalist Agent for the Web , year =
Xiang Deng and Yu Gu and Boyuan Zheng and Shijie Chen and Samual Stevens and Boshi Wang and Huan Sun and Yu Su , booktitle =. Mind2Web: Towards a Generalist Agent for the Web , year =
-
[12]
Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste , journal =. WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? , url =
-
[13]
Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su , title =. CoRR , volume =. 2024 , url =
work page 2024
-
[14]
Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , booktitle =
Izzeddin Gur and Hiroki Furuta and Austin V. Huang and Mustafa Safdari and Yutaka Matsuo and Douglas Eck and Aleksandra Faust , booktitle =. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis , year =
-
[15]
Cogagent: A visual language model for gui agents , year =
Hong, Wenyi and Wang, Weihan and Lv, Qingsong and Xu, Jiazheng and Yu, Wenmeng and Ji, Junhui and Wang, Yan and Wang, Zihan and Dong, Yuxiao and Ding, Ming and others , booktitle =. Cogagent: A visual language model for gui agents , year =
-
[16]
Inner monologue: Embodied reasoning through planning with language models , url =
Huang, Wenlong and Xia, Fei and Xiao, Ted and Chan, Harris and Liang, Jacky and Florence, Pete and Zeng, Andy and Tompson, Jonathan and Mordatch, Igor and Chebotar, Yevgen and others , journal =. Inner monologue: Embodied reasoning through planning with language models , url =
-
[17]
Kapoor, Raghav and Butala, Yash Parag and Russak, Melisa and Koh, Jing Yu and Kamble, Kiran and Alshikh, Waseem and Salakhutdinov, Ruslan , journal =. OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web , url =
-
[18]
Language Models can Solve Computer Tasks , year =
Geunwoo Kim and Pierre Baldi and Stephen McAleer , booktitle =. Language Models can Solve Computer Tasks , year =
-
[19]
Tree Search for Language Model Agents , url =
Koh, Jing Yu and McAleer, Stephen and Fried, Daniel and Salakhutdinov, Ruslan , journal =. Tree Search for Language Model Agents , url =
-
[20]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , title =
Jing Yu Koh and Robert Lo and Lawrence Jang and Vikram Duvvur and Ming Chong Lim and Po. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , title =
- [21]
-
[22]
AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent , url =
Lai, Hanyu and Liu, Xiao and Iong, Iat Long and Yao, Shuntian and Chen, Yuxuan and Shen, Pengbo and Yu, Hao and Zhang, Hanchen and Zhang, Xiaohan and Dong, Yuxiao and others , journal =. AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent , url =
-
[23]
Mapping Natural Language Instructions to Mobile
Li, Yang and He, Jiacong and Zhou, Xin and Zhang, Yuan and Baldridge, Jason , booktitle =. Mapping Natural Language Instructions to Mobile
-
[24]
Li, Yang and Li, Gang and He, Luheng and Zheng, Jingjie and Li, Hong and Guan, Zhiwei , booktitle =. Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements , url =
-
[25]
On the Effects of Data Scale on Computer Control Agents , url =
Wei Li and William Bishop and Alice Li and Chris Rawles and Folawiyo Campbell-Ajala and Divya Tyamagundlu and Oriana Riva , journal =. On the Effects of Data Scale on Computer Control Agents , url =
-
[26]
Llava-onevision: Easy visual task transfer , url =
Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan , journal =. Llava-onevision: Easy visual task transfer , url =
-
[27]
Decoupled Weight Decay Regularization , url =
Ilya Loshchilov and Frank Hutter , biburl =. Decoupled Weight Decay Regularization , url =. 7th International Conference on Learning Representations,
-
[28]
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices , url =
Lu, Quanfeng and Shao, Wenqi and Liu, Zitao and Meng, Fanqing and Li, Boxuan and Chen, Botong and Huang, Siyuan and Zhang, Kaipeng and Qiao, Yu and Luo, Ping , journal =. GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices , url =
-
[29]
OmniParser for Pure Vision Based GUI Agent , url =
Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah , journal =. OmniParser for Pure Vision Based GUI Agent , url =
-
[30]
Webgpt: Browser-assisted question-answering with human feedback , url =
Nakano, Reiichiro and Hilton, Jacob and Balaji, Suchir and Wu, Jeff and Ouyang, Long and Kim, Christina and Hesse, Christopher and Jain, Shantanu and Kosaraju, Vineet and Saunders, William and others , journal =. Webgpt: Browser-assisted question-answering with human feedback , url =
-
[31]
ScreenAgent: A Vision Language Model-driven Computer Control Agent , url =
Runliang Niu and Jindong Li and Shiqi Wang and Yali Fu and Xiyu Hu and Xueyuan Leng and He Kong and Yi Chang and Qi Wang , journal =. ScreenAgent: A Vision Language Model-driven Computer Control Agent , url =
-
[32]
WebCanvas: Benchmarking Web Agents in Online Environments , url =
Pan, Yichen and Kong, Dehan and Zhou, Sida and Cui, Cheng and Leng, Yifei and Jiang, Bing and Liu, Hangyu and Shang, Yanyi and Zhou, Shuyan and Wu, Tongshuang and others , journal =. WebCanvas: Benchmarking Web Agents in Online Environments , url =
-
[33]
Playwright for Python Documentation , year =
-
[34]
Agent q: Advanced reasoning and learning for autonomous ai agents , url =
Putta, Pranav and Mills, Edmund and Garg, Naman and Motwani, Sumeet and Finn, Chelsea and Garg, Divyansh and Rafailov, Rafael , journal =. Agent q: Advanced reasoning and learning for autonomous ai agents , url =
-
[35]
Pranav Putta and Edmund Mills and Naman Garg and Sumeet Motwani and Chelsea Finn and Divyansh Garg and Rafael Rafailov , journal =. Agent
-
[36]
PyTorch: An Imperative Style, High-Performance Deep Learning Library , url =
Adam Paszke and Sam Gross and Francisco Massa and Adam Lerer and James Bradbury and Gregory Chanan and Trevor Killeen and Zeming Lin and Natalia Gimelshein and Luca Antiga and Alban Desmaison and Andreas K. PyTorch: An Imperative Style, High-Performance Deep Learning Library , url =. Advances in Neural Information Processing Systems 32: Annual Conference ...
work page 2019
-
[37]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , url =
Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang , journal =. Qwen2-VL: Enhancing Vision-Language Model's P...
-
[38]
ZeRO: memory optimizations toward training trillion parameter models , year =
Samyam Rajbhandari and Jeff Rasley and Olatunji Ruwase and Yuxiong He , booktitle =. ZeRO: memory optimizations toward training trillion parameter models , year =
-
[39]
Androidinthewild: A large-scale dataset for android device control , year =
Rawles, Christopher and Li, Alice and Rodriguez, Daniel and Riva, Oriana and Lillicrap, Timothy , journal =. Androidinthewild: A large-scale dataset for android device control , year =
-
[40]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents , url =
Christopher Rawles and Sarah Clinckemaillie and Yifan Chang and Jonathan Waltz and Gabrielle Lau and Marybeth Fair and Alice Li and William Bishop and Wei Li and Folawiyo Campbell-Ajala and Daniel Toyama and Robert Berry and Divya Tyamagundlu and Timothy Lillicrap and Oriana Riva , journal =. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous...
-
[41]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , url =
Team, Gemini and Georgiev, Petko and Lei, Ving Ian and Burnell, Ryan and Bai, Libin and Gulati, Anmol and Tanzer, Garrett and Vincent, Damien and Pan, Zhufeng and Wang, Shibo and others , journal =. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , url =
-
[42]
Roformer: Enhanced transformer with rotary position embedding , year =
Su, Jianlin and Ahmed, Murtadha and Lu, Yu and Pan, Shengfeng and Bo, Wen and Liu, Yunfeng , journal =. Roformer: Enhanced transformer with rotary position embedding , year =
-
[43]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , url =
Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and others , journal =. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution , url =
-
[44]
Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and R. ArXiv preprint , title =
-
[45]
Michael Wornow and Avanika Narayan and Ben T Viggiano and Ishan S. Khare and Tathagat Verma and Tibor Thompson and Miguel Angel Fuentes Hernandez and Sudharsan Sundar and Chloe Trujillo and Krrish Chawla and Rongfei Lu and Justin Shen and Divya Nagaraj and Joshua Martinez and Vardhan Agrawal and Althea Hudson and Nigam H. Shah and Christopher Re , journal...
-
[46]
WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics , year =
Jason Wu and Siyan Wang and Siman Shen and Yi-Hao Peng and Jeffrey Nichols and Jeffrey Bigham , journal =. WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics , year =
-
[47]
Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , url =
Xie, Tianbao and Zhang, Danyang and Chen, Jixuan and Li, Xiaochuan and Zhao, Siheng and Cao, Ruisheng and Hua, Toh Jing and Cheng, Zhoujun and Shin, Dongchan and Lei, Fangyu and others , journal =. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments , url =
-
[48]
Tianqi Xu and Linyao Chen and Dai-Jie Wu and Yanjun Chen and Zecheng Zhang and Xiang Yao and Zhiqiang Xie and Yongchao Chen and Shilong Liu and Bochen Qian and Philip H. S. Torr and Bernard Ghanem and G. Li , journal =. CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents , url =
-
[49]
Lemur: Harmonizing Natural Language and Code for Language Agents , year =
Yiheng Xu and Hongjin Su and Chen Xing and Boyu Mi and Qian Liu and Weijia Shi and Binyuan Hui and Fan Zhou and Yitao Liu and Tianbao Xie and Zhoujun Cheng and Siheng Zhao and Lingpeng Kong and Bailin Wang and Caiming Xiong and Tao Yu , booktitle =. Lemur: Harmonizing Natural Language and Code for Language Agents , year =
-
[50]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V , url =
Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chun-yue Li and Jianfeng Gao , journal =. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V , url =
-
[51]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , title =
Da Yin and Faeze Brahman and Abhilasha Ravichander and Khyathi Raghavi Chandu and Kai. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics , title =
-
[52]
AgentTuning: Enabling Generalized Agent Abilities for LLMs , year =
Aohan Zeng and Mingdao Liu and Rui Lu and Bowen Wang and Xiao Liu and Yuxiao Dong and Jie Tang , booktitle =. AgentTuning: Enabling Generalized Agent Abilities for LLMs , year =
-
[53]
China. Xiaoyan Zhang and Zhao Yang and Jiaxuan Liu and Yucheng Han and Xin Chen and Zebiao Huang and Bin Fu and Gang Yu , journal =. AppAgent: Multimodal Agents as Smartphone Users , url =
-
[54]
Android in the zoo: Chain-of-action-thought for gui agents , url =
Zhang, Jiwen and Wu, Jihao and Teng, Yihua and Liao, Minghui and Xu, Nuo and Xiao, Xiao and Wei, Zhongyu and Tang, Duyu , journal =. Android in the zoo: Chain-of-action-thought for gui agents , url =
-
[55]
xLAM: A Family of Large Action Models to Empower AI Agent Systems , url =
Jianguo Zhang and Tian Lan and Ming Zhu and Zuxin Liu and Thai Hoang and Shirley Kokane and Weiran Yao and Juntao Tan and Akshara Prabhakar and Haolin Chen and Zhiwei Liu and Yihao Feng and Tulika Awalgaonkar and Rithesh Murthy and Eric Hu and Zeyuan Chen and Ran Xu and Juan Carlos Niebles and Shelby Heinecke and Huan Wang and Silvio Savarese and Caiming ...
-
[56]
You Only Look at Screens: Multimodal Chain-of-Action Agents , year =
Zhuosheng Zhang and Aston Zhang , booktitle =. You Only Look at Screens: Multimodal Chain-of-Action Agents , year =
-
[57]
GPT-4V(ision) is a Generalist Web Agent, if Grounded , year =
Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su , booktitle =. GPT-4V(ision) is a Generalist Web Agent, if Grounded , year =
-
[58]
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control , year =
Longtao Zheng and Rundong Wang and Xinrun Wang and Bo An , booktitle =. Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control , year =
-
[59]
Shuyan Zhou and Frank F. Xu and Hao Zhu and Xuhui Zhou and Robert Lo and Abishek Sridhar and Xianyi Cheng and Tianyue Ou and Yonatan Bisk and Daniel Fried and Uri Alon and Graham Neubig , booktitle =. WebArena:
-
[62]
Qwen2.5-VL Technical Report , journal =
Shuai Bai and Keqin Chen and Xuejing Liu and Jialin Wang and Wenbin Ge and Sibo Song and Kai Dang and Peng Wang and Shijie Wang and Jun Tang and Humen Zhong and Yuanzhi Zhu and Ming. Qwen2.5-VL Technical Report , journal =
-
[63]
Devil's Advocate: Anticipatory Reflection for LLM Agents , author=. 2024 , eprint=
work page 2024
-
[64]
A Zero-Shot Language Agent for Computer Control with Structured Reflection , author=. 2023 , eprint=
work page 2023
-
[66]
Findings of the Association for Computational Linguistics:
Tao Li and Gang Li and Jingjie Zheng and Purple Wang and Yang Li , title =. Findings of the Association for Computational Linguistics:. 2024 , url =
work page 2024
-
[67]
Bai, C., Zang, X., Xu, Y., Sunkara, S., Rastogi, A., Chen, J., and y Arcas, B. A. Uibert: Learning generic multimodal representations for UI understanding. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 2021
-
[68]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., and Lin, J. Qwen2.5-vl technical report. CoRR, abs/2502.13923, 2025
-
[69]
Bonatti, R., Zhao, D., Bonacci, F., Dupont, D., Abdali, S., Li, Y., Lu, Y., Wagle, J., Koishida, K., Bucker, A. F. C., Jang, L., and Hui, Z. Windows agent arena: Evaluating multi-modal os agents at scale. ArXiv preprint, 2024. URL https://api.semanticscholar.org/CorpusID:272600411
-
[70]
Cao, R., Lei, F., Wu, H., Chen, J., Fu, Y., Gao, H., Xiong, X., Zhang, H., Mao, Y., Hu, W., Xie, T., Xu, H., Zhang, D., Wang, S., Sun, R., Yin, P., Xiong, C., Ni, A., Liu, Q., Zhong, V., Chen, L., Yu, K., and Yu, T. Spider2-v: How far are multimodal agents from automating data science and engineering workflows? ArXiv preprint, 2024. URL https://arxiv.org/...
-
[71]
Chai, Y., Huang, S., Niu, Y., Xiao, H., Liu, L., Zhang, D., Gao, P., Ren, S., and Li, H. Amex: Android multi-annotation expo dataset for mobile gui agents. ArXiv preprint, 2024. URL https://arxiv.org/abs/2407.17490
-
[72]
Chen, W., Cui, J., Hu, J., Qin, Y., Fang, J., Zhao, Y., Wang, C., Liu, J., Chen, G., Huo, Y., et al. Guicourse: From general vision language models to versatile gui agents. ArXiv preprint, 2024a. URL https://arxiv.org/abs/2406.11317
-
[73]
Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, T., Wu, Z., Deng, H., Ge, J., Chen, K., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D., Qiao, Y., Dai, J., a...
doi:10.48550/arxiv.2412.05271, 2024
-
[74]
Cheng, K., Sun, Q., Chu, Y., Xu, F., Li, Y., Zhang, J., and Wu, Z. Seeclick: Harnessing GUI grounding for advanced visual GUI agents. In Ku, L., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, 2024. URL h...
-
[75]
Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J. S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., Lu, J., Anderson, T., Bransom, E., Ehsani, K., Ngo, H., Chen, Y., Patel, A., Yatskar, M., Callison-Burch, C., Head, A., Hendrix, R., Bastani, F., VanderBilt, E., Lambert, N., Chou, Y., Chheda, A., Sparks, J., Skjonsberg, S., Schmitz, M...
doi:10.48550/arxiv.2409.17146, 2024
-
[76]
Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., and Kumar, R. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, 2017
-
[77]
Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems, 2023
-
[78]
Drouin, A., Gasse, M., Caccia, M., Laradji, I. H., Verme, M. D., Marty, T., Boisvert, L., Thakkar, M., Cappart, Q., Vazquez, D., Chapados, N., and Lacoste, A. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. URL https://arxiv.org/abs/2403.07718
-
[79]
Gou, B., Wang, R., Zheng, B., Xie, Y., Chang, C., Shu, Y., Sun, H., and Su, Y. Navigating the digital world as humans do: Universal visual grounding for GUI agents. CoRR, abs/2410.05243, 2024. URL https://doi.org/10.48550/arXiv.2410.05243
-
[80]
Gur, I., Furuta, H., Huang, A. V., Safdari, M., Matsuo, Y., Eck, D., and Faust, A. A real-world webagent with planning, long context understanding, and program synthesis. In International Conference on Learning Representations, 2024
-
[81]
Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
-
[82]
Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y., et al. Inner monologue: Embodied reasoning through planning with language models. ArXiv preprint, 2022. URL https://arxiv.org/abs/2207.05608
-
[83]
Kapoor, R., Butala, Y. P., Russak, M., Koh, J. Y., Kamble, K., Alshikh, W., and Salakhutdinov, R. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. ArXiv preprint, 2024. URL https://arxiv.org/abs/2402.17553