{"work":{"id":"7058ffd2-a339-4102-89eb-248eeb074652","openalex_id":null,"doi":null,"arxiv_id":"2307.13854","raw_key":null,"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","authors":null,"authors_text":"Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar","year":2023,"venue":"cs.AI","abstract":"With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software development, and content management. Our environment is enriched with tools (e.g., a map) and external knowledge bases (e.g., user manuals) to encourage human-like task-solving. Building upon our environment, we release a set of benchmark tasks focusing on evaluating the functional correctness of task completions. The tasks in our benchmark are diverse, long-horizon, and designed to emulate tasks that humans routinely perform on the internet. We experiment with several baseline agents, integrating recent techniques such as reasoning before acting. The results demonstrate that solving complex tasks is challenging: our best GPT-4-based agent only achieves an end-to-end task success rate of 14.41%, significantly lower than the human performance of 78.24%. These results highlight the need for further development of robust agents, that current state-of-the-art large language models are far from perfect performance in these real-life tasks, and that WebArena can be used to measure such progress.","external_url":"https://arxiv.org/abs/2307.13854","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-14T23:33:16.340901+00:00","pith_arxiv_id":"2307.13854","created_at":"2026-05-08T20:34:10.390428+00:00","updated_at":"2026-05-14T23:33:16.340901+00:00","title_quality_ok":true,"display_title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","render_title":"WebArena: A Realistic Web Environment for Building Autonomous Agents"},"hub":{"state":{"work_id":"7058ffd2-a339-4102-89eb-248eeb074652","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":108,"external_cited_by_count":null,"distinct_field_count":13,"first_pith_cited_at":"2023-03-31T01:09:00+00:00","last_pith_cited_at":"2026-05-13T14:52:40+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-15T00:16:14.142883+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"dataset","n":1}],"polarity_counts":[{"context_polarity":"use_dataset","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","claims":[{"claim_text":"With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software develop","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks WebArena: A Realistic Web Environment for Building Autonomous Agents because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:56:15.678238+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"0a20ea58-cc67-4bb5-b3c7-36e138a49dca","orcid":null,"display_name":"Shuyan Zhou"},{"id":"6cb2fd15-51ef-4235-b5bd-7bdf58e8dd06","orcid":null,"display_name":"Frank F. Xu"},{"id":"35cdf57e-39b5-4d9c-96eb-90dbf3012810","orcid":null,"display_name":"Hao Zhu"},{"id":"81c080f1-19fd-4144-bd77-1de3c2a6981e","orcid":null,"display_name":"Xuhui Zhou"},{"id":"3398e952-0b0b-46a5-a03e-dc6c2f136d10","orcid":null,"display_name":"Robert Lo"},{"id":"fc4b7b2f-1397-4433-962e-17af7a15d51f","orcid":null,"display_name":"Abishek Sridhar"}]},"error":null,"updated_at":"2026-05-14T18:56:15.667383+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T05:36:49.871443+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":28},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":28},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":21},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":20},{"title":"ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs","work_id":"3c555b48-a4d9-42dd-9fdd-0f6018fbe9cb","shared_citers":17},{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","work_id":"6a8d8dc4-0cc0-4052-8109-abbcdcd4a962","shared_citers":16},{"title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","work_id":"793d9419-734d-45fe-9f51-d4c5a3a57cf8","shared_citers":14},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":14},{"title":"Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718","work_id":"5ac27d9e-4522-46f8-985e-0e4f73130803","shared_citers":14},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"Identifying the Risks of LM Agents with an LM-Emulated Sandbox","work_id":"3d4c3b66-d749-4939-b1bc-62b10b2ebbb6","shared_citers":10},{"title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","work_id":"c5116d19-d3d3-40fd-9620-f7489812a9ba","shared_citers":9},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":9},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":9},{"title":"OpenHands: An Open Platform for AI Software Developers as Generalist Agents","work_id":"f1762ea0-e382-4f38-a28c-adc643789859","shared_citers":9},{"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","work_id":"0bbcf263-a46d-4525-a438-11fce3316568","shared_citers":9},{"title":"WebGPT: Browser-assisted question-answering with human feedback","work_id":"e25ef3e1-4848-4cb9-bf28-67a420591165","shared_citers":9},{"title":"Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z","work_id":"8ba3cce8-4fc7-4286-9bae-513243ed4e6e","shared_citers":9},{"title":"BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents","work_id":"25adb508-d97c-49d6-ae43-7a70c2478a34","shared_citers":8},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"GAIA: a benchmark for General AI Assistants","work_id":"cf222b33-f7a3-4044-a570-ecfe25edb3f8","shared_citers":8},{"title":"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V","work_id":"ddc8878e-ed0c-4a66-952a-254dce1c622a","shared_citers":8}],"time_series":[{"n":1,"year":2023},{"n":5,"year":2024},{"n":3,"year":2025},{"n":89,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T05:36:47.121396+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T05:36:55.578024+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","claims":[{"claim_text":"With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software develop","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks WebArena: A Realistic Web Environment for Building Autonomous Agents because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:56:15.674161+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","claims":[{"claim_text":"With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software develop","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks WebArena: A Realistic Web Environment for Building Autonomous Agents because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T05:36:47.128525+00:00"}},"summary":{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","claims":[{"claim_text":"With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a disconnect with real-world scenarios. In this paper, we build an environment for language-guided agents that is highly realistic and reproducible. Specifically, we focus on agents that perform tasks on the web, and create an environment with fully functional websites from four common domains: e-commerce, social forum discussions, collaborative software develop","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks WebArena: A Realistic Web Environment for Building Autonomous Agents because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":28},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":28},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":21},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":20},{"title":"ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs","work_id":"3c555b48-a4d9-42dd-9fdd-0f6018fbe9cb","shared_citers":17},{"title":"$\\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains","work_id":"6a8d8dc4-0cc0-4052-8109-abbcdcd4a962","shared_citers":16},{"title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","work_id":"793d9419-734d-45fe-9f51-d4c5a3a57cf8","shared_citers":14},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":14},{"title":"Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718","work_id":"5ac27d9e-4522-46f8-985e-0e4f73130803","shared_citers":14},{"title":"Toolformer: Language Models Can Teach Themselves to Use Tools","work_id":"9bce40c8-cfd7-4983-80e0-c3bd4402322a","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"Identifying the Risks of LM Agents with an LM-Emulated Sandbox","work_id":"3d4c3b66-d749-4939-b1bc-62b10b2ebbb6","shared_citers":10},{"title":"AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents","work_id":"c5116d19-d3d3-40fd-9620-f7489812a9ba","shared_citers":9},{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":9},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":9},{"title":"OpenHands: An Open Platform for AI Software Developers as Generalist Agents","work_id":"f1762ea0-e382-4f38-a28c-adc643789859","shared_citers":9},{"title":"UI-TARS: Pioneering Automated GUI Interaction with Native Agents","work_id":"0bbcf263-a46d-4525-a438-11fce3316568","shared_citers":9},{"title":"WebGPT: Browser-assisted question-answering with human feedback","work_id":"e25ef3e1-4848-4cb9-bf28-67a420591165","shared_citers":9},{"title":"Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z","work_id":"8ba3cce8-4fc7-4286-9bae-513243ed4e6e","shared_citers":9},{"title":"BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents","work_id":"25adb508-d97c-49d6-ae43-7a70c2478a34","shared_citers":8},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"GAIA: a benchmark for General AI Assistants","work_id":"cf222b33-f7a3-4044-a570-ecfe25edb3f8","shared_citers":8},{"title":"Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V","work_id":"ddc8878e-ed0c-4a66-952a-254dce1c622a","shared_citers":8}],"time_series":[{"n":1,"year":2023},{"n":5,"year":2024},{"n":3,"year":2025},{"n":89,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"fc4b7b2f-1397-4433-962e-17af7a15d51f","orcid":null,"display_name":"Abishek Sridhar","source":"manual","import_confidence":0.72},{"id":"6cb2fd15-51ef-4235-b5bd-7bdf58e8dd06","orcid":null,"display_name":"Frank F. Xu","source":"manual","import_confidence":0.72},{"id":"35cdf57e-39b5-4d9c-96eb-90dbf3012810","orcid":null,"display_name":"Hao Zhu","source":"manual","import_confidence":0.72},{"id":"3398e952-0b0b-46a5-a03e-dc6c2f136d10","orcid":null,"display_name":"Robert Lo","source":"manual","import_confidence":0.72},{"id":"0a20ea58-cc67-4bb5-b3c7-36e138a49dca","orcid":null,"display_name":"Shuyan Zhou","source":"manual","import_confidence":0.72},{"id":"81c080f1-19fd-4144-bd77-1de3c2a6981e","orcid":null,"display_name":"Xuhui Zhou","source":"manual","import_confidence":0.72}]}}