GPT-4V(ision) is a Generalist Web Agent, if Grounded
Pith reviewed 2026-05-15 19:37 UTC · model grok-4.3
The pith
GPT-4V completes 51.1 percent of tasks on live websites when its textual plans are manually grounded into actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-4V shows great potential as a web agent: it can successfully complete 51.1% of the tasks on live websites if its textual plans are manually grounded into actions on those websites. This substantially outperforms text-only LLMs like GPT-4 and smaller models specifically fine-tuned for web agents. However, grounding remains a major challenge: existing strategies like set-of-mark prompting prove ineffective, and the best approach developed in the paper, which leverages both HTML structure and visuals, still leaves a substantial gap to oracle grounding.
What carries the argument
Manual grounding of GPT-4V's textual plans into website actions, which serves as an upper-bound proxy for evaluating the model's integrated planning and visual reasoning inside the SEEACT agent.
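The carrier described above is a plan-then-ground loop: the LMM proposes a free-form textual action, and a grounding step (manual in the headline experiment) resolves it to a concrete page action. A minimal sketch under that reading, where `run_episode`, `query_lmm`, `ground`, and `ScriptedBrowser` are all illustrative stand-ins, not the SeeAct API:

```python
# Minimal sketch of the plan-then-ground loop. The LMM call and the grounding
# function are injected, so "manual grounding" is just one choice of `ground`;
# every name here is illustrative, not the actual SeeAct implementation.
def run_episode(task, browser, query_lmm, ground, max_steps=10):
    """Alternate LMM planning with grounding until the plan says DONE."""
    for _ in range(max_steps):
        screenshot, html = browser.observe()
        plan = query_lmm(task, screenshot, html)  # free-form textual action
        if plan == "DONE":
            return True                           # task declared complete
        action = ground(plan, html)               # plan -> element + operation
        browser.execute(action)
    return False                                  # step budget exhausted

class ScriptedBrowser:
    """Toy browser that records executed actions (illustration only)."""
    def __init__(self):
        self.log = []
    def observe(self):
        return ("screenshot-bytes", "<html>...</html>")
    def execute(self, action):
        self.log.append(action)

plans = iter(["CLICK the search button", "DONE"])
browser = ScriptedBrowser()
done = run_episode("find flights", browser,
                   query_lmm=lambda *args: next(plans),
                   ground=lambda plan, html: plan.lower())
print(done, browser.log)
```

The point of the factoring is that swapping `ground` between a human operator and an automatic method is what separates the 51.1% upper bound from the weaker automatic-grounding results.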
If this is right
- GPT-4V outperforms GPT-4 and fine-tuned models such as FLAN-T5 and BLIP-2 on web agent tasks.
- Set-of-mark prompting fails as a grounding strategy for web agents.
- The strongest current grounding method combines HTML structure with visual information.
- A large performance gap persists between the best automatic grounding and oracle grounding.
- A practical tool now exists for running and evaluating web agents directly on live websites.
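The strongest grounding strategy above (HTML structure plus visuals) is commonly realized as textual multiple choice over DOM candidates: interactable elements are ranked, rendered as lettered options, and the LMM picks one. A minimal sketch, assuming hypothetical `Element` and `build_choices` helpers rather than anything from the paper's codebase:

```python
# Hypothetical sketch of grounding via textual choices: take ranked DOM
# candidates and turn them into lettered multiple-choice options for the LMM.
# Element and build_choices are illustrative stand-ins, not the SeeAct API.
from dataclasses import dataclass

@dataclass
class Element:
    tag: str       # HTML tag name
    text: str      # visible text of the element
    attr_id: str   # id attribute, used to execute the chosen action

def build_choices(elements, top_k=3):
    """Render the top-k candidate elements as lettered options."""
    letters = "ABCDEFGH"
    return [f"{letters[i]}. <{e.tag} id={e.attr_id!r}> {e.text}"
            for i, e in enumerate(elements[:top_k])]

elements = [
    Element("button", "Search flights", "search-btn"),
    Element("input", "From city", "origin"),
    Element("a", "Sign in", "login-link"),
]
for choice in build_choices(elements):
    print(choice)
```

The letter the model returns maps deterministically back to an element id, which is what makes this variant more reliable than asking the model for raw coordinates.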
Where Pith is reading between the lines
- Automatic grounding methods that close the gap to oracle levels would remove the need for human intervention and enable fully autonomous web agents.
- Multimodal models may generalize across previously unseen websites more readily than agents trained on narrow task distributions.
- Directly embedding grounding inside the model could eliminate the separate manual translation step.
- The same visual-plus-structure approach might transfer to other interactive digital environments such as desktop applications or mobile interfaces.
Load-bearing premise
Manual grounding of the model's textual plans supplies a valid upper-bound proxy for its planning and reasoning capability.
What would settle it
A controlled test in which GPT-4V still fails most tasks even when given perfect manual groundings on live sites, or an automatic grounding method that matches the 51.1 percent success rate.
read the original abstract
The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In addition to standard offline evaluation on cached websites, we enable a new online evaluation setting by developing a tool that allows running web agents on live websites. We show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1 of the tasks on live websites if we manually ground its textual plans into actions on the websites. This substantially outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2) specifically fine-tuned for web agents. However, grounding still remains a major challenge. Existing LMM grounding strategies like set-of-mark prompting turns out to be not effective for web agents, and the best grounding strategy we develop in this paper leverages both the HTML structure and visuals. Yet, there is still a substantial gap with oracle grounding, leaving ample room for further improvement. All code, data, and evaluation tools are available at https://github.com/OSU-NLP-Group/SeeAct.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SEEACT, a generalist web agent that uses GPT-4V to integrate visual understanding from screenshots with HTML structure for planning and executing natural-language web tasks. It reports results on the MIND2WEB benchmark in both offline (cached) and online (live-website) settings, with the headline finding that GPT-4V achieves 51.1% task success on live sites when its textual plans are manually grounded into actions; this outperforms text-only GPT-4 and smaller fine-tuned models (FLAN-T5, BLIP-2). The work identifies grounding as the primary remaining bottleneck, shows that set-of-mark prompting is ineffective, and proposes an HTML+visual strategy that still trails oracle grounding. All code, data, and the live-evaluation tool are released.
Significance. If the manual-grounding protocol can be documented reproducibly, the result supplies a concrete upper-bound demonstration that large multimodal models possess substantial planning and reasoning capacity for generalist web agents, while the new online evaluation framework and open-source release provide immediate value for follow-on work on automatic grounding. The direct empirical measurements on live sites strengthen the claim relative to purely offline benchmarks.
major comments (2)
- [§4.2] §4.2 (online evaluation protocol): the manual grounding procedure that converts GPT-4V textual plans into executable actions is described only at a high level; without an explicit, reproducible protocol (e.g., how element selection, coordinate resolution, or recovery from ambiguous plans is performed), the 51.1% live-site figure cannot be confidently interpreted as a stable upper-bound proxy for the model's planning capability.
- [Table 3] Table 3 (live-website results): the performance gap between the authors' best automatic grounding method and oracle grounding is large, yet no per-task failure-mode breakdown or error analysis is provided to indicate whether the shortfall is primarily in visual localization, HTML parsing, or action sequencing; this information is load-bearing for the claim that 'grounding still remains a major challenge.'
minor comments (2)
- [Abstract] Abstract: '51.1 of the tasks' should read '51.1% of the tasks' for precision.
- [§5.1] §5.1: the notation for the three grounding variants (set-of-mark, HTML-only, HTML+visual) is introduced without a compact summary table, making cross-references in the results section harder to follow.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation for minor revision. We address each major comment below and will revise the manuscript to improve reproducibility and strengthen the supporting analysis.
read point-by-point responses
- Referee: [§4.2] §4.2 (online evaluation protocol): the manual grounding procedure that converts GPT-4V textual plans into executable actions is described only at a high level; without an explicit, reproducible protocol (e.g., how element selection, coordinate resolution, or recovery from ambiguous plans is performed), the 51.1% live-site figure cannot be confidently interpreted as a stable upper-bound proxy for the model's planning capability.
  Authors: We agree that the current description in §4.2 is high-level and that an explicit, reproducible protocol is needed to allow confident interpretation of the 51.1% figure as an upper bound on planning capability. In the revised manuscript we will expand §4.2 with a detailed step-by-step protocol that specifies element selection from the rendered HTML, coordinate resolution from screenshots, handling of ambiguous or incomplete plans, and recovery strategies employed during manual execution. revision: yes
- Referee: [Table 3] Table 3 (live-website results): the performance gap between the authors' best automatic grounding method and oracle grounding is large, yet no per-task failure-mode breakdown or error analysis is provided to indicate whether the shortfall is primarily in visual localization, HTML parsing, or action sequencing; this information is load-bearing for the claim that 'grounding still remains a major challenge.'
  Authors: We acknowledge that a per-task failure-mode breakdown would strengthen the claim that grounding remains the primary bottleneck. We will add a new subsection to the results section that provides a quantitative error analysis of the best automatic grounding method versus oracle grounding on the live-website tasks. Failures will be manually categorized into visual localization, HTML parsing, action sequencing, and other categories, with counts, percentages, and representative examples. revision: yes
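The promised error analysis reduces to tallying per-episode failure labels into the four categories and reporting percentages. A minimal sketch, with made-up counts purely for illustration:

```python
# Illustrative tally for the proposed failure-mode breakdown: category counts
# over failed episodes, reported as percentages. The labels and counts below
# are hypothetical examples, not results from the paper.
from collections import Counter

def failure_breakdown(labels):
    """Return {category: percentage} over per-episode failure labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}

labels = (["visual localization"] * 9 + ["HTML parsing"] * 6
          + ["action sequencing"] * 4 + ["other"] * 1)
print(failure_breakdown(labels))
```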
Circularity Check
No significant circularity; empirical evaluation is self-contained
full rationale
The paper reports direct empirical results from running GPT-4V on the external MIND2WEB benchmark under manual grounding, achieving 51.1% task completion on live websites. No equations, fitted parameters, predictions, or self-citations are used to derive the central performance figure; the result is obtained by straightforward evaluation against an independent benchmark and task set. The work explicitly notes the gap to automatic grounding and presents the manual figure as an upper-bound proxy rather than a derived claim, keeping the derivation chain free of self-referential reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GPT-4V can jointly reason over web screenshots and HTML structure to produce actionable plans
Forward citations
Cited by 21 Pith papers
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
  OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
- Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
  BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
- MMSkills: Towards Multimodal Skills for General Visual Agents
  MMSkills creates compact multimodal skill packages from trajectories and uses a branch-loaded agent to improve visual decision-making on GUI and game benchmarks.
- State-Centric Decision Process
  SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
- Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
  Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual token usage by 46% on average in agent trajectories via a learned patch selector and improves success rates by 3% on three benchmarks, showing that history saturation stems from inefficient rep...
- Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
  Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
- WAAA! Web Adversaries Against Agentic Browsers
  Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
- Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations
  VLMs detect primitive motion in UI animations reliably but show inconsistent high-level interpretation of purposes and meanings, with large gaps relative to human performance.
- UIPress: Bringing Optical Token Compression to UI-to-Code Generation
  UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
- ClawBench: Can AI Agents Complete Everyday Online Tasks?
  ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.
- GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
  GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.
- Group-in-Group Policy Optimization for LLM Agent Training
  GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
- Web Agents Should Adopt the Plan-Then-Execute Paradigm
  Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
- MMSkills: Towards Multimodal Skills for General Visual Agents
  MMSkills turns public interaction trajectories into compact multimodal skill packages that visual agents can consult at runtime to improve decision-making on benchmarks.
- ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
  ReVision reduces visual tokens in computer-use agent histories by 46% on average and raises success rates by 3% by learning to drop redundant patches across screenshots, allowing longer histories to keep improving per...
- PageGuide: Browser extension to assist users in navigating a webpage and locating information
  PageGuide grounds LLM answers in webpage DOM elements using visual overlays for find, guide, and hide modes, yielding measurable gains in a 94-user study.
- VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
  VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
- Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents
  Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.
- WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
  WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.
- ClawMobile: Rethinking Smartphone-Native Agentic Systems
  ClawMobile proposes a hierarchical system separating probabilistic LLM planning from structured deterministic execution to improve stability and reproducibility of agentic systems on real smartphones.