pith. machine review for the scientific record.

arxiv: 2401.01614 · v2 · submitted 2024-01-03 · 💻 cs.IR · cs.AI · cs.CL · cs.CV

Recognition: 2 theorem links · Lean Theorem

GPT-4V(ision) is a Generalist Web Agent, if Grounded

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:37 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL · cs.CV
keywords web agents · GPT-4V · multimodal models · grounding strategies · MIND2WEB benchmark · live websites · online evaluation · natural language instructions
0 comments

The pith

GPT-4V completes 51.1 percent of tasks on live websites when its textual plans are manually grounded into actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large multimodal models like GPT-4V can act as generalist web agents that follow natural language instructions to finish tasks on arbitrary websites. It introduces SEEACT and evaluates the model on the MIND2WEB benchmark in both cached and live online settings. GPT-4V reaches 51.1 percent success with manual grounding of its plans, which beats text-only GPT-4 and smaller fine-tuned models, yet automatic grounding remains the main bottleneck and falls short of oracle performance.
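A minimal sketch of the plan-then-ground loop described above, written by the editor under assumptions rather than taken from the SeeAct code: observe, plan, ground, and execute are hypothetical callables standing in for the browser, GPT-4V, the grounding strategy, and the action executor. In the 51.1 percent setting, ground is a human; in the automatic settings it is one of the strategies discussed later on this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    element_id: str      # page element to act on
    operation: str       # e.g. "CLICK", "TYPE", "SELECT"
    value: str = ""      # text to type or option to select

def run_task(
    task: str,
    observe: Callable[[], tuple],                            # -> (screenshot_bytes, html_text)
    plan: Callable[[str, bytes, str], Optional[str]],        # GPT-4V: next step in words, None = done
    ground: Callable[[str, bytes, str], Optional[Action]],   # textual plan -> concrete action
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> bool:
    """Generic plan-then-ground loop; returns True if the planner declares the task done."""
    for _ in range(max_steps):
        screenshot, html = observe()
        step = plan(task, screenshot, html)
        if step is None:                          # planner signals completion
            return True
        action = ground(step, screenshot, html)   # manual in the 51.1% setting, automatic otherwise
        if action is None:                        # grounding failed to produce an executable action
            return False
        execute(action)
    return False
```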

Core claim

GPT-4V shows great potential for web agents: it can successfully complete 51.1% of the tasks on live websites if its textual plans are manually grounded into actions on those websites. This substantially outperforms text-only LLMs like GPT-4 and smaller models specifically fine-tuned for web agents. However, grounding remains a major challenge: existing strategies such as set-of-mark prompting prove ineffective, and the best approach developed in the paper, which leverages both HTML structure and visuals, still leaves a substantial gap to oracle grounding.

What carries the argument

Manual grounding of GPT-4V textual plans into website actions, which serves as an upper-bound proxy for evaluating the model's integrated planning and visual reasoning inside the SEEACT agent.
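An editorial formalization of that premise, not an equation from the paper: holding GPT-4V's textual plans fixed and varying only the grounding function, a manual grounder executes the intended action whenever the plan identifies one, so any automatic grounder applied to the same plans can at best match it.

```latex
% Editorial formalization; \pi is the fixed GPT-4V planner, T the task set,
% and g the grounding function that maps textual plans to executable actions.
\mathrm{SR}(\pi, g_{\mathrm{auto}}; T) \;\le\; \mathrm{SR}(\pi, g_{\mathrm{manual}}; T),
\qquad \mathrm{SR}(\pi, g_{\mathrm{manual}}; T_{\mathrm{live}}) = 51.1\% .
```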

If this is right

  • GPT-4V outperforms GPT-4 and fine-tuned models such as FLAN-T5 and BLIP-2 on web agent tasks.
  • Set-of-mark prompting fails as a grounding strategy for web agents.
  • The strongest current grounding method combines HTML structure with visual information (a sketch contrasting these grounding strategies follows this list).
  • A large performance gap persists between the best automatic grounding and oracle grounding.
  • A practical tool now exists for running and evaluating web agents directly on live websites.
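As referenced in the list above, the grounding strategies differ only in how a textual plan is resolved to a concrete page element. A minimal sketch under editorial assumptions; the element representation, model interfaces, and function names are hypothetical, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Element:
    element_id: str
    tag: str      # e.g. "button", "input", "a"
    text: str     # visible label or alt text
    bbox: tuple   # (x, y, w, h) in the screenshot

# Set-of-mark style: number candidate elements on the screenshot and ask the model for an index.
# The paper reports this image-annotation route is ineffective for web agents.
def ground_via_image_marks(pick_mark: Callable[[bytes], int],
                           marked_screenshot: bytes,
                           candidates: List[Element]) -> Optional[Element]:
    idx = pick_mark(marked_screenshot)
    return candidates[idx] if 0 <= idx < len(candidates) else None

# HTML + visuals (the strongest strategy reported): show the screenshot together with a short
# textual list of candidate elements built from HTML attributes, and ask for one choice.
def ground_via_element_choices(pick_choice: Callable[[bytes, List[str]], int],
                               screenshot: bytes,
                               candidates: List[Element]) -> Optional[Element]:
    choices = [f"<{c.tag}> {c.text}" for c in candidates]
    idx = pick_choice(screenshot, choices)
    return candidates[idx] if 0 <= idx < len(candidates) else None

# Oracle grounding: the gold element is known, so a correct plan always lands on the right target.
# The gap between the choice-based strategy and this oracle is the headroom noted above.
def ground_oracle(gold_element_id: str, candidates: List[Element]) -> Optional[Element]:
    return next((c for c in candidates if c.element_id == gold_element_id), None)
```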

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automatic grounding methods that close the gap to oracle levels would remove the need for human intervention and enable fully autonomous web agents.
  • Multimodal models may generalize across previously unseen websites more readily than agents trained on narrow task distributions.
  • Directly embedding grounding inside the model could eliminate the separate manual translation step.
  • The same visual-plus-structure approach might transfer to other interactive digital environments such as desktop applications or mobile interfaces.

Load-bearing premise

Manual grounding of the model's textual plans supplies a valid upper-bound proxy for its planning and reasoning capability.

What would settle it

A controlled test in which GPT-4V still fails most tasks even when given perfect manual groundings on live sites, or an automatic grounding method that matches the 51.1 percent success rate.

read the original abstract

The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1 of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding still remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turns out to be not effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SEEACT, a generalist web agent that uses GPT-4V to integrate visual understanding from screenshots with HTML structure for planning and executing natural-language web tasks. It reports results on the MIND2WEB benchmark in both offline (cached) and online (live-website) settings, with the headline finding that GPT-4V achieves 51.1% task success on live sites when its textual plans are manually grounded into actions; this outperforms text-only GPT-4 and smaller fine-tuned models (FLAN-T5, BLIP-2). The work identifies grounding as the primary remaining bottleneck, shows that set-of-mark prompting is ineffective, and proposes an HTML+visual strategy that still trails oracle grounding. All code, data, and the live-evaluation tool are released.

Significance. If the manual-grounding protocol can be documented reproducibly, the result supplies a concrete upper-bound demonstration that large multimodal models possess substantial planning and reasoning capacity for generalist web agents, while the new online evaluation framework and open-source release provide immediate value for follow-on work on automatic grounding. The direct empirical measurements on live sites strengthen the claim relative to purely offline benchmarks.

major comments (2)
  1. [§4.2] §4.2 (online evaluation protocol): the manual grounding procedure that converts GPT-4V textual plans into executable actions is described only at a high level; without an explicit, reproducible protocol (e.g., how element selection, coordinate resolution, or recovery from ambiguous plans is performed), the 51.1% live-site figure cannot be confidently interpreted as a stable upper-bound proxy for the model's planning capability (a hypothetical per-step protocol record is sketched after these comments).
  2. [Table 3] Table 3 (live-website results): the performance gap between the authors' best automatic grounding method and oracle grounding is large, yet no per-task failure-mode breakdown or error analysis is provided to indicate whether the shortfall is primarily in visual localization, HTML parsing, or action sequencing; this information is load-bearing for the claim that 'grounding still remains a major challenge.'
minor comments (2)
  1. [Abstract] Abstract: '51.1 of the tasks' should read '51.1% of the tasks' for precision.
  2. [§5.1] §5.1: the notation for the three grounding variants (set-of-mark, HTML-only, HTML+visual) is introduced without a compact summary table, making cross-references in the results section harder to follow.
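One hypothetical shape the protocol record requested in major comment 1 could take, sketched here by the editor rather than taken from the paper; the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ManualGroundingRecord:
    """Hypothetical per-step log entry for a reproducible manual grounding protocol."""
    task_id: str
    step: int
    plan_text: str                  # GPT-4V's textual plan for this step
    selected_element: str           # e.g. CSS selector or backend node id chosen by the annotator
    selection_basis: str            # "exact text match", "coordinate on screenshot", ...
    operation: str                  # CLICK / TYPE / SELECT
    value: str = ""                 # typed text or selected option, if any
    plan_was_ambiguous: bool = False
    ambiguity_resolution: str = ""  # how the annotator disambiguated, or "skipped"
    notes: str = ""
```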

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and will revise the manuscript to improve reproducibility and strengthen the supporting analysis.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (online evaluation protocol): the manual grounding procedure that converts GPT-4V textual plans into executable actions is described only at a high level; without an explicit, reproducible protocol (e.g., how element selection, coordinate resolution, or recovery from ambiguous plans is performed), the 51.1% live-site figure cannot be confidently interpreted as a stable upper-bound proxy for the model's planning capability.

    Authors: We agree that the current description in §4.2 is high-level and that an explicit, reproducible protocol is needed to allow confident interpretation of the 51.1% figure as an upper bound on planning capability. In the revised manuscript we will expand §4.2 with a detailed step-by-step protocol that specifies element selection from the rendered HTML, coordinate resolution from screenshots, handling of ambiguous or incomplete plans, and recovery strategies employed during manual execution. revision: yes

  2. Referee: [Table 3] Table 3 (live-website results): the performance gap between the authors' best automatic grounding method and oracle grounding is large, yet no per-task failure-mode breakdown or error analysis is provided to indicate whether the shortfall is primarily in visual localization, HTML parsing, or action sequencing; this information is load-bearing for the claim that 'grounding still remains a major challenge.'

    Authors: We acknowledge that a per-task failure-mode breakdown would strengthen the claim that grounding remains the primary bottleneck. We will add a new subsection to the results section that provides a quantitative error analysis of the best automatic grounding method versus oracle grounding on the live-website tasks. Failures will be manually categorized into visual localization, HTML parsing, action sequencing, and other categories, with counts, percentages, and representative examples. revision: yes
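A minimal sketch of the failure-mode tally promised in response 2, using the categories named in the referee's comment; the function and labels are hypothetical, and any counts fed to it would be placeholders rather than results from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical failure categories from the referee's comment; each failed live-site task
# would be manually assigned one label after comparing automatic vs. oracle grounding.
CATEGORIES = ("visual_localization", "html_parsing", "action_sequencing", "other")

def summarize_failures(labels: List[str]) -> Dict[str, Tuple[int, float]]:
    """Return per-category counts and percentages over all labeled failures."""
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {c: (counts[c], 100.0 * counts[c] / total) for c in CATEGORIES}

# Usage with made-up labels (placeholders only):
# summarize_failures(["html_parsing", "visual_localization", "visual_localization", "other"])
```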

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper reports direct empirical results from running GPT-4V on the external MIND2WEB benchmark under manual grounding, achieving 51.1% task completion on live websites. No equations, fitted parameters, predictions, or self-citations are used to derive the central performance figure; the result is obtained by straightforward evaluation against an independent benchmark and task set. The work explicitly notes the gap to automatic grounding and presents the manual figure as an upper-bound proxy rather than a derived claim, keeping the derivation chain free of self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is empirical evaluation of an existing model on a new task domain; no new free parameters, axioms beyond standard LMM capabilities, or invented entities are introduced.

axioms (1)
  • domain assumption GPT-4V can jointly reason over web screenshots and HTML structure to produce actionable plans
    Invoked throughout the agent design and evaluation sections.

pith-pipeline@v0.9.0 · 5608 in / 1294 out tokens · 46143 ms · 2026-05-15T19:37:24.478166+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  2. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  3. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.

  4. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  5. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

    cs.CV 2026-05 unverdicted novelty 7.0

    Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

  6. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 7.0

    ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...

  7. Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...

  8. WAAA! Web Adversaries Against Agentic Browsers

    cs.CR 2026-05 unverdicted novelty 7.0

    Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.

  9. Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

    cs.HC 2026-04 unverdicted novelty 7.0

    VLMs detect primitive motion in UI animations reliably but show inconsistent high-level interpretation of purposes and meanings, with large gaps relative to human performance.

  10. UIPress: Bringing Optical Token Compression to UI-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...

  11. ClawBench: Can AI Agents Complete Everyday Online Tasks?

    cs.CL 2026-04 unverdicted novelty 7.0

    ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

  12. GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis

    cs.AI 2026-04 unverdicted novelty 7.0

    GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.

  13. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  14. Web Agents Should Adopt the Plan-Then-Execute Paradigm

    cs.CR 2026-05 unverdicted novelty 6.0

    Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.

  15. MMSkills: Towards Multimodal Skills for General Visual Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.

  16. ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

    cs.CL 2026-05 unverdicted novelty 6.0

    ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...

  17. PageGuide: Browser extension to assist users in navigating a webpage and locating information

    cs.HC 2026-04 accept novelty 6.0

    PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user study.

  18. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  19. Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.

  20. WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

    cs.AI 2026-03 unverdicted novelty 6.0

    WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.

  21. ClawMobile: Rethinking Smartphone-Native Agentic Systems

    cs.MA 2026-02 unverdicted novelty 4.0

    ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 19 Pith papers · 10 internal anchors
