{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:YUTF4YCITC3EGOBXRLV65Q3MUC","short_pith_number":"pith:YUTF4YCI","schema_version":"1.0","canonical_sha256":"c5265e604898b64338378aebeec36ca0a9bd6641f715b6691a2cec878dad0d8f","source":{"kind":"arxiv","id":"2401.01614","version":2},"attestation_state":"computed","paper":{"title":"GPT-4V(ision) is a Generalist Web Agent, if Grounded","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GPT-4V completes 51.1 percent of tasks on live websites when its textual plans are manually grounded into actions.","cross_cats":["cs.AI","cs.CL","cs.CV"],"primary_cat":"cs.IR","authors_text":"Boyuan Zheng, Boyu Gou, Huan Sun, Jihyung Kil, Yu Su","submitted_at":"2024-01-03T08:33:09Z","abstract_excerpt":"The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2401.01614","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.IR","submitted_at":"2024-01-03T08:33:09Z","cross_cats_sorted":["cs.AI","cs.CL","cs.CV"],"title_canon_sha256":"226fa896a9db28a6cfee31311a098a43f3414f63a0fab3c5023cb7fce7453933","abstract_canon_sha256":"c5f8fd01b2b3b4d5ef5e685f92a621b6282414be9d2e0bdc8ea8d15f6eb155eb"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.386346Z","signature_b64":"XB6YPGyWMPXx7YKvHhX1DneBUxLpUDx3DCvz5ZlbxyWQvbJB35nZOrYmF+TRvjYhJ58CseOIncYCCaxY3gbQCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c5265e604898b64338378aebeec36ca0a9bd6641f715b6691a2cec878dad0d8f","last_reissued_at":"2026-05-17T23:38:50.385915Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.385915Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"GPT-4V(ision) is a Generalist Web Agent, if Grounded","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GPT-4V completes 51.1 percent of tasks on live websites when its textual plans are manually grounded into actions.","cross_cats":["cs.AI","cs.CL","cs.CV"],"primary_cat":"cs.IR","authors_text":"Boyuan Zheng, Boyu Gou, Huan Sun, Jihyung Kil, Yu Su","submitted_at":"2024-01-03T08:33:09Z","abstract_excerpt":"The recent development on large multimodal models (LMMs), especially GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries of multimodal models beyond traditional tasks like image captioning and visual question answering. In this work, we explore the potential of LMMs like GPT-4V as a generalist web agent that can follow natural language instructions to complete tasks on any given website. We propose SEEACT, a generalist web agent that harnesses the power of LMMs for integrated visual understanding and acting on the web. We evaluate on the recent MIND2WEB benchmark. In"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1 of the tasks on live websites if we manually ground its textual plans into actions on the websites.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That manual grounding of the model's textual plans provides a valid upper-bound proxy for evaluating the agent's planning and reasoning capability, while automatic grounding methods remain underdeveloped.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GPT-4V completes 51.1 percent of tasks on live websites when its textual plans are manually grounded into actions.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f21015be4d03b1371becfa7055fb740a1eea13576647e352e660594c367d329c"},"source":{"id":"2401.01614","kind":"arxiv","version":2},"verdict":{"id":"453bd831-211d-4e90-b218-340e0ca1b4d7","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T19:37:24.478166Z","strongest_claim":"we show that GPT-4V presents a great potential for web agents -- it can successfully complete 51.1 of the tasks on live websites if we manually ground its textual plans into actions on the websites.","one_line_summary":"GPT-4V achieves 51.1% success on live web tasks as a generalist agent when plans are manually grounded, outperforming text-only models, but automatic grounding lags far behind oracle performance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That manual grounding of the model's textual plans provides a valid upper-bound proxy for evaluating the agent's planning and reasoning capability, while automatic grounding methods remain underdeveloped.","pith_extraction_headline":"GPT-4V completes 51.1 percent of tasks on live websites when its textual plans are manually grounded into actions."},"references":{"count":42,"sample":[{"doi":"","year":null,"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","ref_index":1,"cited_arxiv_id":"2204.14198","is_internal_anchor":true},{"doi":"","year":null,"title":"org/CorpusID:248476411","work_id":"c474190e-eb6d-4bb6-b3c1-6316440acd57","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic","work_id":"44525076-312a-4259-b79c-134cd7eeb297","ref_index":3,"cited_arxiv_id":"2306.15195","is_internal_anchor":true},{"doi":"","year":null,"title":"org/CorpusID:259262082","work_id":"3a62ab25-76b6-49dd-9ad6-bc5a137343a9","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Scaling Instruction-Finetuned Language Models","work_id":"8405abb1-7558-4fdf-af24-f4c52fa77a06","ref_index":5,"cited_arxiv_id":"2210.11416","is_internal_anchor":true}],"resolved_work":42,"snapshot_sha256":"e2f7c88029e9af74c8c3cf54814800bba9b325275bae538d70b4b2e7f7a71ae8","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"26ff05c78d1f57e1a39b54c923415ee22bbbcac73d1fd0114e64422de5386f5d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2401.01614","created_at":"2026-05-17T23:38:50.385978+00:00"},{"alias_kind":"arxiv_version","alias_value":"2401.01614v2","created_at":"2026-05-17T23:38:50.385978+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2401.01614","created_at":"2026-05-17T23:38:50.385978+00:00"},{"alias_kind":"pith_short_12","alias_value":"YUTF4YCITC3E","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"YUTF4YCITC3EGOBX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"YUTF4YCI","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2605.18758","citing_title":"OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16565","citing_title":"Skim: Speculative Execution for Fast and Efficient Web Agents","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18652","citing_title":"MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2406.12373","citing_title":"WebCanvas: Benchmarking Web Agents in Online Environments","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":130,"is_internal_anchor":true},{"citing_arxiv_id":"2506.02387","citing_title":"VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments","ref_index":88,"is_internal_anchor":true},{"citing_arxiv_id":"2411.18279","citing_title":"Large Language Model-Brained GUI Agents: A Survey","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2506.12382","citing_title":"Exploring the Secondary Risks of Large Language Models","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04227","citing_title":"Mobile GUI Agents under Real-world Threats: Are We There Yet?","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2508.15832","citing_title":"A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10935","citing_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","ref_index":109,"is_internal_anchor":true},{"citing_arxiv_id":"2401.05459","citing_title":"Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security","ref_index":108,"is_internal_anchor":true},{"citing_arxiv_id":"2401.16158","citing_title":"Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2602.22942","citing_title":"ClawMobile: Rethinking Smartphone-Native Agentic Systems","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2603.05295","citing_title":"WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13527","citing_title":"MMSkills: Towards Multimodal Skills for General Visual Agents","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14290","citing_title":"Web Agents Should Adopt the Plan-Then-Execute Paradigm","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":130,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11212","citing_title":"ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12755","citing_title":"State-Centric Decision Process","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13527","citing_title":"MMSkills: Towards Multimodal Skills for General Visual Agents","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12501","citing_title":"Covering Human Action Space for Computer Use: Data Synthesis and Benchmark","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11212","citing_title":"ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2404.07972","citing_title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26148","citing_title":"Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations","ref_index":7,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC","json":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC.json","graph_json":"https://pith.science/api/pith-number/YUTF4YCITC3EGOBXRLV65Q3MUC/graph.json","events_json":"https://pith.science/api/pith-number/YUTF4YCITC3EGOBXRLV65Q3MUC/events.json","paper":"https://pith.science/paper/YUTF4YCI"},"agent_actions":{"view_html":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC","download_json":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC.json","view_paper":"https://pith.science/paper/YUTF4YCI","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2401.01614&json=true","fetch_graph":"https://pith.science/api/pith-number/YUTF4YCITC3EGOBXRLV65Q3MUC/graph.json","fetch_events":"https://pith.science/api/pith-number/YUTF4YCITC3EGOBXRLV65Q3MUC/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC/action/timestamp_anchor","attest_storage":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC/action/storage_attestation","attest_author":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC/action/author_attestation","sign_citation":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC/action/citation_signature","submit_replication":"https://pith.science/pith/YUTF4YCITC3EGOBXRLV65Q3MUC/action/replication_record"}},"created_at":"2026-05-17T23:38:50.385978+00:00","updated_at":"2026-05-17T23:38:50.385978+00:00"}