pith. machine review for the scientific record.

arxiv: 2401.01614 · v2 · submitted 2024-01-03 · 💻 cs.IR · cs.AI · cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

GPT-4V(ision) is a Generalist Web Agent, if Grounded

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:37 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL · cs.CV
keywords web agents · GPT-4V · multimodal models · grounding strategies · MIND2WEB benchmark · live websites · online evaluation · natural language instructions
0 comments

The pith

GPT-4V completes 51.1 percent of tasks on live websites when its textual plans are manually grounded into actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large multimodal models like GPT-4V can act as generalist web agents that follow natural language instructions to finish tasks on arbitrary websites. It introduces SEEACT and evaluates the model on the MIND2WEB benchmark in both cached and live online settings. GPT-4V reaches 51.1 percent success with manual grounding of its plans, which beats text-only GPT-4 and smaller fine-tuned models, yet automatic grounding remains the main bottleneck and falls short of oracle performance.
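A minimal sketch of the plan-then-ground loop described above, written by the editor under assumptions rather than taken from the SeeAct code: observe, plan, ground, and execute are hypothetical callables standing in for the browser, GPT-4V, the grounding strategy, and the action executor. In the 51.1 percent setting, ground is a human; in the automatic settings it is one of the strategies discussed later on this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    element_id: str      # page element to act on
    operation: str       # e.g. "CLICK", "TYPE", "SELECT"
    value: str = ""      # text to type or option to select

def run_task(
    task: str,
    observe: Callable[[], tuple],                            # -> (screenshot_bytes, html_text)
    plan: Callable[[str, bytes, str], Optional[str]],        # GPT-4V: next step in words, None = done
    ground: Callable[[str, bytes, str], Optional[Action]],   # textual plan -> concrete action
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> bool:
    """Generic plan-then-ground loop; returns True if the planner declares the task done."""
    for _ in range(max_steps):
        screenshot, html = observe()
        step = plan(task, screenshot, html)
        if step is None:                          # planner signals completion
            return True
        action = ground(step, screenshot, html)   # manual in the 51.1% setting, automatic otherwise
        if action is None:                        # grounding failed to produce an executable action
            return False
        execute(action)
    return False
```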

Core claim

GPT-4V shows great potential for web agents: it can successfully complete 51.1% of the tasks on live websites if its textual plans are manually grounded into actions on those websites. This substantially outperforms text-only LLMs like GPT-4 and smaller models specifically fine-tuned for web agents. However, grounding remains a major challenge: existing strategies such as set-of-mark prompting prove ineffective, and the best approach developed in the paper, which leverages both HTML structure and visuals, still leaves a substantial gap to oracle grounding.

What carries the argument

Manual grounding of GPT-4V textual plans into website actions, which serves as an upper-bound proxy for evaluating the model's integrated planning and visual reasoning inside the SEEACT agent.
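An editorial formalization of that premise, not an equation from the paper: holding GPT-4V's textual plans fixed and varying only the grounding function, a manual grounder executes the intended action whenever the plan identifies one, so any automatic grounder applied to the same plans can at best match it.

```latex
% Editorial formalization; \pi is the fixed GPT-4V planner, T the task set,
% and g the grounding function that maps textual plans to executable actions.
\mathrm{SR}(\pi, g_{\mathrm{auto}}; T) \;\le\; \mathrm{SR}(\pi, g_{\mathrm{manual}}; T),
\qquad \mathrm{SR}(\pi, g_{\mathrm{manual}}; T_{\mathrm{live}}) = 51.1\% .
```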

If this is right

  • GPT-4V outperforms GPT-4 and fine-tuned models such as FLAN-T5 and BLIP-2 on web agent tasks.
  • Set-of-mark prompting fails as a grounding strategy for web agents.
  • The strongest current grounding method combines HTML structure with visual information (a sketch contrasting these grounding strategies follows this list).
  • A large performance gap persists between the best automatic grounding and oracle grounding.
  • A practical tool now exists for running and evaluating web agents directly on live websites.
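As referenced in the list above, the grounding strategies differ only in how a textual plan is resolved to a concrete page element. A minimal sketch under editorial assumptions; the element representation, model interfaces, and function names are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Element:
    element_id: str
    tag: str      # e.g. "button", "input", "a"
    text: str     # visible label or alt text
    bbox: tuple   # (x, y, w, h) in the screenshot

# Set-of-mark style: number candidate elements on the screenshot and ask the model for an index.
# The paper reports this image-annotation route is ineffective for web agents.
def ground_via_image_marks(pick_mark: Callable[[bytes], int],
                           marked_screenshot: bytes,
                           candidates: List[Element]) -> Optional[Element]:
    idx = pick_mark(marked_screenshot)
    return candidates[idx] if 0 <= idx < len(candidates) else None

# HTML + visuals (the strongest strategy reported): show the screenshot together with a short
# textual list of candidate elements built from HTML attributes, and ask for one choice.
def ground_via_element_choices(pick_choice: Callable[[bytes, List[str]], int],
                               screenshot: bytes,
                               candidates: List[Element]) -> Optional[Element]:
    choices = [f"<{c.tag}> {c.text}" for c in candidates]
    idx = pick_choice(screenshot, choices)
    return candidates[idx] if 0 <= idx < len(candidates) else None

# Oracle grounding: the gold element is known, so a correct plan always lands on the right target.
# The gap between the choice-based strategy and this oracle is the headroom noted above.
def ground_oracle(gold_element_id: str, candidates: List[Element]) -> Optional[Element]:
    return next((c for c in candidates if c.element_id == gold_element_id), None)
```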

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automatic grounding methods that close the gap to oracle levels would remove the need for human intervention and enable fully autonomous web agents.
  • Multimodal models may generalize across previously unseen websites more readily than agents trained on narrow task distributions.
  • Directly embedding grounding inside the model could eliminate the separate manual translation step.
  • The same visual-plus-structure approach might transfer to other interactive digital environments such as desktop applications or mobile interfaces.

Load-bearing premise

Manual grounding of the model's textual plans supplies a valid upper-bound proxy for its planning and reasoning capability.

What would settle it

A controlled test in which GPT-4V still fails most tasks even when given perfect manual groundings on live sites, or an automatic grounding method that matches the 51.1 percent success rate.

read the original abstract

The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1 of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding still remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turns out to be not effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SEEACT, a generalist web agent that uses GPT-4V to integrate visual understanding from screenshots with HTML structure for planning and executing natural-language web tasks. It reports results on the MIND2WEB benchmark in both offline (cached) and online (live-website) settings, with the headline finding that GPT-4V achieves 51.1% task success on live sites when its textual plans are manually grounded into actions; this outperforms text-only GPT-4 and smaller fine-tuned models (FLAN-T5, BLIP-2). The work identifies grounding as the primary remaining bottleneck, shows that set-of-mark prompting is ineffective, and proposes an HTML+visual strategy that still trails oracle grounding. All code, data, and the live-evaluation tool are released.

Significance. If the manual-grounding protocol can be documented reproducibly, the result supplies a concrete upper-bound demonstration that large multimodal models possess substantial planning and reasoning capacity for generalist web agents, while the new online evaluation framework and open-source release provide immediate value for follow-on work on automatic grounding. The direct empirical measurements on live sites strengthen the claim relative to purely offline benchmarks.

major comments (2)
  1. [§4.2] §4.2 (online evaluation protocol): the manual grounding procedure that converts GPT-4V textual plans into executable actions is described only at a high level; without an explicit, reproducible protocol (e.g., how element selection, coordinate resolution, or recovery from ambiguous plans is performed), the 51.1% live-site figure cannot be confidently interpreted as a stable upper-bound proxy for the model's planning capability (a hypothetical per-step protocol record is sketched after these comments).
  2. [Table 3] Table 3 (live-website results): the performance gap between the authors' best automatic grounding method and oracle grounding is large, yet no per-task failure-mode breakdown or error analysis is provided to indicate whether the shortfall is primarily in visual localization, HTML parsing, or action sequencing; this information is load-bearing for the claim that 'grounding still remains a major challenge.'
minor comments (2)
  1. [Abstract] Abstract: '51.1 of the tasks' should read '51.1% of the tasks' for precision.
  2. [§5.1] §5.1: the notation for the three grounding variants (set-of-mark, HTML-only, HTML+visual) is introduced without a compact summary table, making cross-references in the results section harder to follow.
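One hypothetical shape the protocol record requested in major comment 1 could take, sketched here by the editor rather than taken from the paper; the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ManualGroundingRecord:
    """Hypothetical per-step log entry for a reproducible manual grounding protocol."""
    task_id: str
    step: int
    plan_text: str                  # GPT-4V's textual plan for this step
    selected_element: str           # e.g. CSS selector or backend node id chosen by the annotator
    selection_basis: str            # "exact text match", "coordinate on screenshot", ...
    operation: str                  # CLICK / TYPE / SELECT
    value: str = ""                 # typed text or selected option, if any
    plan_was_ambiguous: bool = False
    ambiguity_resolution: str = ""  # how the annotator disambiguated, or "skipped"
    notes: str = ""
```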

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and will revise the manuscript to improve reproducibility and strengthen the supporting analysis.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (online evaluation protocol): the manual grounding procedure that converts GPT-4V textual plans into executable actions is described only at a high level; without an explicit, reproducible protocol (e.g., how element selection, coordinate resolution, or recovery from ambiguous plans is performed), the 51.1% live-site figure cannot be confidently interpreted as a stable upper-bound proxy for the model's planning capability.

    Authors: We agree that the current description in §4.2 is high-level and that an explicit, reproducible protocol is needed to allow confident interpretation of the 51.1% figure as an upper bound on planning capability. In the revised manuscript we will expand §4.2 with a detailed step-by-step protocol that specifies element selection from the rendered HTML, coordinate resolution from screenshots, handling of ambiguous or incomplete plans, and recovery strategies employed during manual execution. revision: yes

  2. Referee: [Table 3] Table 3 (live-website results): the performance gap between the authors' best automatic grounding method and oracle grounding is large, yet no per-task failure-mode breakdown or error analysis is provided to indicate whether the shortfall is primarily in visual localization, HTML parsing, or action sequencing; this information is load-bearing for the claim that 'grounding still remains a major challenge.'

    Authors: We acknowledge that a per-task failure-mode breakdown would strengthen the claim that grounding remains the primary bottleneck. We will add a new subsection to the results section that provides a quantitative error analysis of the best automatic grounding method versus oracle grounding on the live-website tasks. Failures will be manually categorized into visual localization, HTML parsing, action sequencing, and other categories, with counts, percentages, and representative examples. revision: yes
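A minimal sketch of the failure-mode tally promised in response 2, using the categories named in the referee's comment; the function and labels are hypothetical, and any counts fed to it would be placeholders rather than results from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical failure categories from the referee's comment; each failed live-site task
# would be manually assigned one label after comparing automatic vs. oracle grounding.
CATEGORIES = ("visual_localization", "html_parsing", "action_sequencing", "other")

def summarize_failures(labels: List[str]) -> Dict[str, Tuple[int, float]]:
    """Return per-category counts and percentages over all labeled failures."""
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {c: (counts[c], 100.0 * counts[c] / total) for c in CATEGORIES}

# Usage with made-up labels (placeholders only):
# summarize_failures(["html_parsing", "visual_localization", "visual_localization", "other"])
```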

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper reports direct empirical results from running GPT-4V on the external MIND2WEB benchmark under manual grounding, achieving 51.1% task completion on live websites. No equations, fitted parameters, predictions, or self-citations are used to derive the central performance figure; the result is obtained by straightforward evaluation against an independent benchmark and task set. The work explicitly notes the gap to automatic grounding and presents the manual figure as an upper-bound proxy rather than a derived claim, keeping the derivation chain free of self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical evaluation of an existing model on a new task domain; no new free parameters, axioms beyond standard LMM capabilities, or invented entities are introduced.

axioms (1)
  • domain assumption GPT-4V can jointly reason over web screenshots and HTML structure to produce actionable plans
    Invoked throughout the agent design and evaluation sections.

pith-pipeline@v0.9.0 · 5608 in / 1294 out tokens · 46143 ms · 2026-05-15T19:37:24.478166+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  2. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  3. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  4. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  5. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  6. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 7.0

    ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...

  7. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  8. WAAA! Web Adversaries Against Agentic Browsers

    cs.CR 2026-05 unverdicted novelty 7.0

    Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.

  9. Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

    cs.HC 2026-04 unverdicted novelty 7.0

    VLMs detect primitive motion in UI animations reliably but show inconsistent high-level interpretation of purposes and meanings, with large gaps relative to human performance.

  10. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  11. ClawBench: Can AI Agents Complete Everyday Online Tasks?

    cs.CL 2026-04 unverdicted novelty 7.0

    ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

  12. GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

    cs.AI 2026-04 unverdicted novelty 7.0

    GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.

  13. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  14. Web Agents Should Adopt the Plan-Then-Execute Paradigm

    cs.CR 2026-05 unverdicted novelty 6.0

    Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.

  15. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  16. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 6.0

    ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...

  17. PageGuide: Browser extension to assist users in navigating a webpage and locating information

    cs.HC 2026-04 accept novelty 6.0

    PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user study.

  18. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  19. Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.

  20. WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

    cs.AI 2026-03 unverdicted novelty 6.0

    WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.

  21. ClawMobile: Rethinking Smartphone-Native Agentic Systems

    cs.MA 2026-02 unverdicted novelty 4.0

    ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 19 Pith papers · 10 internal anchors
