pith. machine review for the scientific record.

arxiv: 2401.10935 · v2 · pith:CVVQ5U6Znew · submitted 2024-01-17 · 💻 cs.HC · cs.AI

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Pith reviewed 2026-05-17 10:04 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords: GUI agents, visual grounding, screenshot-based agents, GUI grounding benchmark, pre-training, ScreenSpot, task automation, visual interfaces

The pith

Advancements in GUI grounding directly improve the performance of visual agents that automate tasks from screenshots alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SeeClick, a visual GUI agent that completes complex tasks on phones, desktops, and web browsers using only screenshots rather than structured data such as HTML. It identifies accurate localization of on-screen elements from instructions as the central bottleneck and solves it by pre-training on large amounts of automatically generated grounding examples. A new benchmark called ScreenSpot measures grounding across realistic mobile, desktop, and web interfaces, and three standard agent benchmarks show consistent gains once grounding improves. A sympathetic reader would care because this removes the need for accessible structured data and suggests that grounding skill is a transferable foundation for reliable visual automation.

Core claim

After GUI-grounding pre-training on automatically curated screenshot-instruction pairs, SeeClick achieves large gains on the ScreenSpot benchmark and the improvements transfer to higher success rates on downstream GUI-agent tasks in mobile, desktop, and web settings, establishing a direct correlation between grounding accuracy and overall agent performance.

What carries the argument

GUI grounding—the capacity to locate screen elements from instructions—which is strengthened by pre-training on automatically curated data and then transferred to full task sequences.
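
To make the notion of grounding concrete, here is a minimal sketch of how a grounding prediction might be scored. It assumes a point-in-box criterion: the model outputs a click point, and the prediction counts as correct when that point lies inside the target element's bounding box. The `Box` type, the `grounding_accuracy` function, and the normalized coordinates are hypothetical names and conventions chosen for illustration; this page does not state the exact scoring rule the paper uses.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in normalized screen coordinates (assumed)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border is treated as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Sequence[Tuple[float, float]],
    targets: Sequence[Box],
) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# Hypothetical example values, not taken from the paper.
example_predictions = [(0.50, 0.50), (0.90, 0.90)]
example_targets = [Box(0.40, 0.40, 0.60, 0.60), Box(0.00, 0.00, 0.10, 0.10)]
example_score = grounding_accuracy(example_predictions, example_targets)
```

Scoring a click point rather than a predicted box keeps the metric aligned with what an agent actually does on screen, but an overlap-based score would be an equally reasonable choice under different assumptions.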

If this is right

  • Visual agents can operate without relying on extractable structured data such as HTML or accessibility trees.
  • Pre-training focused on element localization produces measurable gains on standard GUI-agent benchmarks.
  • Performance scales with grounding quality across mobile, desktop, and web platforms.
  • Automatic curation of grounding data provides a scalable route to better agents without manual annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar grounding pre-training could be applied to other screenshot-based agents outside the GUI domain.
  • The same curation pipeline might be extended to generate even larger or more diverse grounding datasets for further gains.
  • If grounding remains the bottleneck, future agent work could prioritize localization objectives over end-to-end policy learning.

Load-bearing premise

The automatically generated GUI grounding examples are high-quality and representative enough to transfer to real agent tasks across different device environments.

What would settle it

Measure grounding accuracy and downstream task success on a new set of environments or tasks; if the correlation between the two disappears or if pre-training yields no transfer gain, the central claim is falsified.
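
The test described above can be sketched as a correlation check across model checkpoints or variants. The snippet below computes a Pearson coefficient between grounding accuracy and downstream task success. The `pearson` helper and all numbers are hypothetical placeholders for illustration, not results from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-checkpoint measurements (placeholders, not paper results).
grounding_acc = [0.30, 0.45, 0.55, 0.60]
task_success = [0.20, 0.28, 0.35, 0.41]
r = pearson(grounding_acc, task_success)
```

A coefficient near zero on new environments would count against the central claim. A high coefficient alone would not establish causation, which is why the referee report below asks for ablations and compute-matched baselines.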

Original abstract

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be notably lengthy (e.g., HTML) and occasionally inaccessible (e.g., on desktops). To alleviate this issue, we propose a novel visual GUI agent -- SeeClick, which only relies on screenshots for task automation. In our preliminary study, we have discovered a key challenge in developing visual GUI agents: GUI grounding -- the capacity to accurately locate screen elements based on instructions. To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data. Along with the efforts above, we have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments. After pre-training, SeeClick demonstrates significant improvement in ScreenSpot over various baselines. Moreover, comprehensive evaluations on three widely used benchmarks consistently support our finding that advancements in GUI grounding directly correlate with enhanced performance in downstream GUI agent tasks. The model, data and code are available at https://github.com/njucckevin/SeeClick.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SeeClick, a visual GUI agent that automates tasks from screenshots alone rather than from structured data such as HTML. It identifies GUI grounding as the central challenge, introduces an automated method to curate grounding data for pre-training, and releases ScreenSpot, a new benchmark spanning mobile, desktop, and web environments. The authors report that pre-training yields gains on ScreenSpot relative to baselines and that these grounding improvements correlate with higher success rates on three downstream GUI agent benchmarks. Model, data, and code are publicly released.

Significance. If the empirical claims hold after validation, the work is significant because it supplies the first realistic multi-environment GUI grounding benchmark, demonstrates a practical link between grounding accuracy and agent task performance, and releases reproducible artifacts that can accelerate research on screenshot-based agents.

major comments (2)
  1. [§3] §3 (data curation): The automated curation procedure is presented without any quantitative validation metrics such as precision/recall against human labels, inter-annotator agreement, or distribution-shift statistics between curated and real user instructions. This validation is load-bearing for the central claim that pre-training on the curated data produces genuine grounding gains that transfer to agent tasks; without it, observed correlations on the three benchmarks could arise from label noise or selection bias rather than improved grounding.
  2. [§5] §5 (downstream evaluations): The reported correlation between ScreenSpot scores and agent-task success is presented without controls such as ablation of the grounding head, error analysis of failure modes, or comparison against agents that receive equivalent compute but no grounding pre-training. These controls are needed to establish that the grounding improvements are causally responsible for the downstream gains rather than incidental.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'significant improvement' is used without accompanying numbers or baseline identifiers; adding the key deltas (e.g., ScreenSpot accuracy lift) would make the summary self-contained.
  2. [§2] Notation: The term 'GUI grounding' is introduced without an explicit formal definition or equation; a short mathematical statement (e.g., mapping instruction to bounding-box coordinates) would aid clarity.
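
One possible form for the formal statement requested in the second minor comment is sketched below. The symbols are assumptions introduced here, not the paper's notation: $\mathcal{S}$ is the set of screenshots, $\mathcal{X}$ the set of natural-language instructions, and $g_\theta$ a grounding model that returns a bounding box in normalized coordinates.

```latex
\[
g_\theta : \mathcal{S} \times \mathcal{X} \to [0,1]^4,
\qquad
(s, x) \mapsto (l, t, r, b),
\quad l \le r,\; t \le b .
\]
```

Here $(l, t)$ and $(r, b)$ are the top-left and bottom-right corners of the predicted element. A model that outputs a click point instead would map into $[0,1]^2$.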

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional quantitative validation for the data curation and stronger controls for the downstream evaluations will help substantiate the central claims. We respond to each major comment below and will incorporate the suggested revisions in the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (data curation): The automated curation procedure is presented without any quantitative validation metrics such as precision/recall against human labels, inter-annotator agreement, or distribution-shift statistics between curated and real user instructions. This validation is load-bearing for the central claim that pre-training on the curated data produces genuine grounding gains that transfer to agent tasks; without it, observed correlations on the three benchmarks could arise from label noise or selection bias rather than improved grounding.

    Authors: We agree that quantitative validation of the curation process is important to rule out label noise or selection bias. In the revised manuscript we will add a dedicated subsection to §3 reporting a manual validation study: a random sample of 500 curated examples will be independently labeled by two human annotators, with precision, recall, and inter-annotator agreement reported. We will also include distribution statistics (e.g., instruction length, element type frequencies) comparing the curated data to ScreenSpot and the three downstream benchmarks to quantify any shift. revision: yes

  2. Referee: [§5] §5 (downstream evaluations): The reported correlation between ScreenSpot scores and agent-task success is presented without controls such as ablation of the grounding head, error analysis of failure modes, or comparison against agents that receive equivalent compute but no grounding pre-training. These controls are needed to establish that the grounding improvements are causally responsible for the downstream gains rather than incidental.

    Authors: We concur that additional controls are needed to strengthen the causal interpretation. We will revise §5 to include: (i) an ablation that removes the grounding pre-training stage while keeping total training compute comparable by extending the subsequent fine-tuning; (ii) a categorized error analysis of failure modes on the three downstream benchmarks, explicitly linking errors to grounding inaccuracies; and (iii) a compute-matched baseline agent trained with a generic vision-language pre-training objective instead of GUI grounding pre-training. These results will be presented alongside the existing correlations. revision: yes
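
The validation study promised in the first response could be scored along the following lines. This is a sketch under stated assumptions: each annotator gives a binary label per curated example (True when the instruction matches the labeled element), agreement is measured with Cohen's kappa, and precision and recall are computed against a reference labeling. The rebuttal does not specify how these quantities would be defined, and `cohens_kappa` and `precision_recall` are hypothetical helper names.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used one identical label throughout.
    return (observed - expected) / (1.0 - expected)


def precision_recall(
    predicted: Sequence[bool], reference: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall of binary predictions against reference labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical annotations for four curated examples (placeholders only).
annotator_1 = [True, True, False, False]
annotator_2 = [True, False, True, False]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa is used instead of raw agreement because, when most curated examples are correct, two annotators would agree often by chance alone.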

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained

full rationale

The paper's chain consists of proposing a visual GUI agent, identifying GUI grounding as a challenge via preliminary study, automatically curating grounding data, pre-training SeeClick, releasing the ScreenSpot benchmark, and reporting empirical gains on ScreenSpot plus three downstream agent benchmarks. The central claim of correlation between grounding improvements and agent performance rests on these new evaluations and the released benchmark rather than any equation, fitted parameter renamed as prediction, or self-citation that reduces the result to its own inputs by construction. No load-bearing step exhibits self-definitional, fitted-input, or uniqueness-imported circularity; the work is externally falsifiable through the public model, data, and code.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that GUI grounding is the primary bottleneck and on standard machine-learning training procedures whose specific hyperparameters are not detailed in the abstract.

free parameters (1)
  • pre-training hyperparameters and data curation thresholds
    Standard deep-learning choices required to produce the reported improvements but not enumerated in the abstract.
axioms (1)
  • domain assumption GUI grounding is the key challenge limiting visual GUI agents
    Identified via preliminary study and used to motivate the pre-training approach.

pith-pipeline@v0.9.0 · 5530 in / 1112 out tokens · 88814 ms · 2026-05-17T10:04:24.837735+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced

    Relation between the paper passage and the cited Recognition theorem: unclear

    SeeClick demonstrates significant improvement in ScreenSpot over various baselines... comprehensive evaluations on three widely used benchmarks consistently support our finding

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  2. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  3. GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    cs.CV 2025-04 unverdicted novelty 7.0

    GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...

  4. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  5. BAMI: Training-Free Bias Mitigation in GUI Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    BAMI mitigates precision and ambiguity biases in GUI grounding via coarse-to-fine focus and candidate selection, raising accuracy on ScreenSpot-Pro without training.

  6. UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?

    cs.HC 2026-04 accept novelty 6.0

    VLMs achieve moderate alignment with human gaze on UIs that improves with longer viewing durations and varies by UI type, capturing exploratory rather than initial fixation patterns.

  7. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  8. AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

    cs.AI 2025-12 conditional novelty 6.0

    AgentProg reframes interaction history as a program with variables and control flow, plus a belief state for partial observability, achieving SOTA success rates on long-horizon GUI benchmarks while baselines degrade.

  9. MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

    cs.AI 2025-10 unverdicted novelty 6.0

    MGA is a memory-driven GUI agent that uses an observer for bias-free screen reading and structured memory for compact state transitions to enable efficient long-horizon automation.

  10. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  11. GTA1: GUI Test-time Scaling Agent

    cs.AI 2025-07 unverdicted novelty 6.0

    GTA1 combines test-time scaling for action plan selection with RL-based grounding to achieve SOTA results on GUI agent benchmarks.

  12. InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    cs.AI 2025-04 unverdicted novelty 6.0

    InfiGUI-R1 uses Reasoning Injection via spatial distillation followed by Deliberation Enhancement via RL to evolve GUI agents from reactive actors to deliberative reasoners, reporting strong performance on grounding a...

  13. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  14. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  15. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  16. See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

    cs.CV 2026-04 unverdicted novelty 5.0

    Multi-turn visual feedback refinement outperforms single-shot coordinate prediction for pixel-precise GUI grounding in complex coding environments.

  17. Less Detail, Better Answers: Degradation-Driven Prompting for VQA

    cs.CV 2026-04 unverdicted novelty 5.0

    Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.

  18. UI-Oceanus: Scaling GUI Agents with Synthetic Environmental Dynamics

    cs.LG 2026-02 unverdicted novelty 5.0

    UI-Oceanus shows that continual pre-training on forward dynamics predictions from synthetic GUI exploration improves agent success rates by 7% offline and 16.8% online, with gains scaling by data volume.

  19. UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    cs.AI 2025-09 conditional novelty 5.0

    UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

  20. Agentic Reasoning for Large Language Models

    cs.AI 2026-01 unverdicted novelty 4.0

    The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applicat...

  21. Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

    cs.HC 2024-01 unverdicted novelty 3.0

    This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 21 Pith papers · 24 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    arXiv preprint arXiv:2311.11797 , year=

    Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents , author=. arXiv preprint arXiv:2311.11797 , year=

  9. [9]

    International Conference on Machine Learning , pages=

    World of bits: An open-domain platform for web-based agents , author=. International Conference on Machine Learning , pages=. 2017 , url=

  10. [10]

    International Conference on Learning Representations , year=

    Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration , author=. International Conference on Learning Representations , year=

  11. [11]

    International Conference on Learning Representations , year=

    Learning to Navigate the Web , author=. International Conference on Learning Representations , year=

  12. [13]

    NeurIPS 2023 Foundation Models for Decision Making Workshop , year=

    Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control , author=. NeurIPS 2023 Foundation Models for Decision Making Workshop , year=

  13. [16]

    ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models , year=

    Understanding HTML with Large Language Models , author=. ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models , year=

  14. [20]

    AppAgent: Multimodal Agents as Smartphone Users

    AppAgent: Multimodal Agents as Smartphone Users , author=. arXiv preprint arXiv:2312.13771 , year=

  15. [21]

    Advances in Neural Information Processing Systems , year=

    From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces , author=. Advances in Neural Information Processing Systems , year=

  16. [23]

    Neural Information Processing Systems , year =

    Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , title =. Neural Information Processing Systems , year =

  17. [26]

    InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

    Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition , author=. arXiv preprint arXiv:2309.15112 , year=

  18. [28]

    International Conference on Learning Representations , year=

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , author=. International Conference on Learning Representations , year=

  19. [34]

    CogVLM: Visual Expert for Pretrained Language Models

    Cogvlm: Visual expert for pretrained language models , author=. arXiv preprint arXiv:2311.03079 , year=

  20. [36]

    The 34th Annual ACM Symposium on User Interface Software and Technology , pages=

    Screen2words: Automatic mobile UI summarization with multimodal learning , author=. The 34th Annual ACM Symposium on User Interface Software and Technology , pages=. 2021 , url=

  21. [37]

    Proceedings of the AAAI Conference on Artificial Intelligence , pages=

    Actionbert: Leveraging user actions for semantic understanding of user interfaces , author=. Proceedings of the AAAI Conference on Artificial Intelligence , pages=. 2021 , url=

  22. [38]

    arXiv preprint arXiv:2107.13731 , year=

    Uibert: Learning generic multimodal representations for ui understanding , author=. arXiv preprint arXiv:2107.13731 , year=

  23. [40]

    Proceedings of the 29th International Conference on Computational Linguistics , pages=

    Towards Better Semantic Understanding of Mobile Interfaces , author=. Proceedings of the 29th International Conference on Computational Linguistics , pages=. 2022 , url=

  24. [41]

    proceedings of the 28th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages=

    Object detection for graphical user interface: Old fashioned or deep learning or a combination? , author=. proceedings of the 28th ACM joint meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages=. 2020 , url=

  25. [43]

    European Conference on Computer Vision , pages=

    A dataset for interactive vision-language navigation with unknown command feasibility , author=. European Conference on Computer Vision , pages=. 2022 , organization=

  26. [44]

    The Eleventh International Conference on Learning Representations , year=

    Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus , author=. The Eleventh International Conference on Learning Representations , year=

  27. [46]

    Proceedings of the 30th annual ACM symposium on user interface software and technology , pages=

    Rico: A mobile app dataset for building data-driven design applications , author=. Proceedings of the 30th annual ACM symposium on user interface software and technology , pages=. 2017 , url=

  28. [49]

    International Conference on Learning Representations , year=

    Pix2seq: A Language Modeling Framework for Object Detection , author=. International Conference on Learning Representations , year=

  29. [50]

    International Conference on Machine Learning , pages=

    Pix2struct: Screenshot parsing as pretraining for visual language understanding , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  30. [51]

    The 2023 Conference on Empirical Methods in Natural Language Processing , year=

    UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model , author=. The 2023 Conference on Empirical Methods in Natural Language Processing , year=

  31. [52]

    International Conference on Learning Representations , year=

    LoRA: Low-Rank Adaptation of Large Language Models , author=. International Conference on Learning Representations , year=

  32. [53]

    Introducing our Multimodal Models , url =

    Bavishi, Rohan and Elsen, Erich and Hawthorne, Curtis and Nye, Maxwell and Odena, Augustus and Somani, Arushi and Ta. Introducing our Multimodal Models , url =

  33. [58]

    International Journal of Computer Vision , volume=

    Top-down neural attention by excitation backprop , author=. International Journal of Computer Vision , volume=. 2018 , publisher=

  34. [59]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Grounded language-image pre-training , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=. 2022 , url=

  35. [60]

    Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=

    Referitgame: Referring to objects in photographs of natural scenes , author=. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) , pages=. 2014 , url=

  36. [66]

    2024 , eprint=

    OS-Copilot: Towards Generalist Computer Agents with Self-Improvement , author=. 2024 , eprint=

  37. [67]

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. https://arxiv.org/pdf/2308.12966 Qwen-vl: A frontier large vision-language model with versatile abilities . arXiv preprint arXiv:2308.12966

  38. [68]

    Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sa g nak Ta s rlar. 2023. https://www.adept.ai/blog/fuyu-8b Introducing our multimodal models

  39. [69]

    Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A Plummer. 2022. https://arxiv.org/pdf/2202.02312 A dataset for interactive vision-language navigation with unknown command feasibility . In European Conference on Computer Vision, pages 312--328. Springer

  40. [70]

    Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023 a . https://arxiv.org/pdf/2310.09478 Minigpt-v2: large language model as a unified interface for vision-language multi-task learning . arXiv preprint arXiv:2310.09478

  41. [71]

    Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023 b . https://arxiv.org/pdf/2306.15195 Shikra: Unleashing multimodal llm's referential dialogue magic . arXiv preprint arXiv:2306.15195

  42. [72]

    Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. 2021. https://arxiv.org/pdf/2109.10852 Pix2seq: A language modeling framework for object detection . In International Conference on Learning Representations

  43. [73]

    Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. https://dl.acm.org/doi/pdf/10.1145/3126594.3126651 Rico: A mobile app dataset for building data-driven design applications . In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845--854

  44. [74]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. https://arxiv.org/pdf/2306.06070 Mind2web: Towards a generalist agent for the web . arXiv preprint arXiv:2306.06070

  45. [75]

    Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. 2023. https://arxiv.org/pdf/2305.11854 Multimodal web navigation with instruction-finetuned foundation models . arXiv preprint arXiv:2305.11854

  46. [77]

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2023. https://arxiv.org/pdf/2307.12856 A real-world webagent with planning, long context understanding, and program synthesis . arXiv preprint arXiv:2307.12856

  47. [78]

    Izzeddin Gur, Ulrich Rueckert, Aleksandra Faust, and Dilek Hakkani-Tur. 2018. https://arxiv.org/pdf/1812.09195 Learning to navigate the web . In International Conference on Learning Representations

  48. [79]

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2023. https://arxiv.org/pdf/2312.08914 Cogagent: A visual language model for gui agents . arXiv preprint arXiv:2312.08914

  49. [80]

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. https://arxiv.org/pdf/2106.09685.pdf In International Conference on Learning Representations

  50. [81]

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2023. https://arxiv.org/pdf/2303.17491 Language models can solve computer tasks . arXiv preprint arXiv:2303.17491

  51. [82]

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023. https://arxiv.org/pdf/2305.03726 Otter: A multi-modal model with in-context instruction tuning . arXiv preprint arXiv:2305.03726

  52. [83]

    Gang Li and Yang Li. 2022. https://arxiv.org/pdf/2209.14927 Spotlight: Mobile ui understanding using vision-language models with a focus . In The Eleventh International Conference on Learning Representations

  53. [84]

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2022. http://openaccess.thecvf.com/content/CVPR2022/papers/Li_Grounded_Language-Image_Pre-Training_CVPR_2022_paper.pdf Grounded language-image pre-training . In Proceedings of the IEEE/CVF Conference on Compute...

  54. [85]

    Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020a. https://arxiv.org/pdf/2005.03776 Mapping natural language instructions to mobile ui action sequences . arXiv preprint arXiv:2005.03776

  55. [86]

    Yang Li, Gang Li, Luheng He, Jingjie Zheng, Hong Li, and Zhiwei Guan. 2020b. https://arxiv.org/pdf/2010.04295 Widget captioning: Generating natural language description for mobile user interface elements . arXiv preprint arXiv:2010.04295

  56. [87]

    Yang Li, Gang Li, Xin Zhou, Mostafa Dehghani, and Alexey Gritsenko. 2021. https://arxiv.org/pdf/2112.05692 Vut: Versatile ui transformer for multi-modal multi-task user interface modeling . arXiv preprint arXiv:2112.05692

  57. [88]

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. https://arxiv.org/pdf/1802.08802 Reinforcement learning on web interfaces using workflow-guided exploration . In International Conference on Learning Representations

  58. [89]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023a. https://arxiv.org/pdf/2304.08485 Visual instruction tuning . In Neural Information Processing Systems

  59. [90]

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023b. https://arxiv.org/pdf/2307.06281 Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281

  60. [91]

    OpenAI. 2023. http://arxiv.org/abs/2303.08774 GPT-4 technical report

  61. [92]

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. https://arxiv.org/pdf/2306.14824 Kosmos-2: Grounding multimodal large language models to the world . arXiv preprint arXiv:2306.14824

  62. [93]

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2023. https://arxiv.org/pdf/2307.10088 Android in the wild: A large-scale dataset for android device control . arXiv preprint arXiv:2307.10088

  63. [94]

    Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. 2023. https://arxiv.org/abs/2306.00245 From pixels to ui actions: Learning to follow instructions via graphical user interfaces . In Advances in Neural Information Processing Systems

  64. [95]

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. https://proceedings.mlr.press/v70/shi17a.html World of bits: An open-domain platform for web-based agents . In International Conference on Machine Learning, pages 3135--3144. PMLR

  65. [96]

    Qiushi Sun, Zhangyue Yin, Xiang Li, Zhiyong Wu, Xipeng Qiu, and Lingpeng Kong. 2023. https://arxiv.org/pdf/2310.00280 Corex: Pushing the boundaries of complex reasoning through multi-model collaboration . arXiv preprint arXiv:2310.00280

  66. [97]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. https://arxiv.org/pdf/2302.13971 Llama: Open and efficient foundation language models . arXiv preprint arXiv:2302.13971

  67. [98]

    Bryan Wang, Gang Li, Xin Zhou, Zhourong Chen, Tovi Grossman, and Yang Li. 2021. https://dl.acm.org/doi/pdf/10.1145/3472749.3474765 Screen2words: Automatic mobile ui summarization with multimodal learning . In The 34th Annual ACM Symposium on User Interface Software and Technology, pages 498--510

  68. [99]

    Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. 2023. https://arxiv.org/pdf/2305.11175 Visionllm: Large language model is also an open-ended decoder for vision-centric tasks . arXiv preprint arXiv:2305.11175

  69. [100]

    Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. 2024. http://arxiv.org/abs/2402.07456 Os-copilot: Towards generalist computer agents with self-improvement

  70. [101]

    Fangzhi Xu, Zhiyong Wu, Qiushi Sun, Siyu Ren, Fei Yuan, Shuai Yuan, Qika Lin, Yu Qiao, and Jun Liu. 2023. https://arxiv.org/pdf/2311.09278 Symbol-llm: Towards foundational symbol-centric interface for large language models . arXiv preprint arXiv:2311.09278

  71. [102]

    An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. https://arxiv.org/pdf/2311.07562 Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation . arXiv preprint arXiv:2311.07562

  72. [103]

    Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023a. https://arxiv.org/pdf/2312.13771 Appagent: Multimodal agents as smartphone users . arXiv preprint arXiv:2312.13771

  73. [104]

    Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023b. https://www.stableaiprompts.com/wp-content/uploads/2023/10/Chatgpt-Updates.pdf The dawn of lmms: Preliminary explorations with gpt-4v(ision) . arXiv preprint arXiv:2309.17421

  74. [105]

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. https://arxiv.org/pdf/2304.14178.pdf mplug-owl: Modularization empowers large language models with multimodality . arXiv preprint arXiv:2304.14178

  75. [106]

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. https://arxiv.org/pdf/2308.02490 Mm-vet: Evaluating large multimodal models for integrated capabilities . arXiv preprint arXiv:2308.02490

  76. [107]

    Zhuosheng Zhang and Aston Zhang. 2023. https://arxiv.org/pdf/2309.11436 You only look at screens: Multimodal chain-of-action agents . arXiv preprint arXiv:2309.11436

  77. [108]

    Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, and Yan Lu. 2023. https://arxiv.org/pdf/2310.04716 Reinforced ui instruction grounding: Towards a generic ui task automation api . arXiv preprint arXiv:2310.04716

  78. [109]

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. 2024. https://arxiv.org/abs/2401.01614 Gpt-4v(ision) is a generalist web agent, if grounded . arXiv preprint arXiv:2401.01614

  79. [110]

    Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An. 2023. https://ltzheng.github.io/Synapse/static/Synapse.pdf Synapse: Trajectory-as-exemplar prompting with memory for computer control . In NeurIPS 2023 Foundation Models for Decision Making Workshop

  80. [111]

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. 2023. https://arxiv.org/pdf/2307.13854 Webarena: A realistic web environment for building autonomous agents . arXiv preprint arXiv:2307.13854
