WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

· 2025 · cs.CL · arXiv 2508.13024

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

open full Pith review browse 4 citing papers arXiv PDF

abstract

LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the users needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, the latter allowing for the exact reproduction of the experimental setup. While DeepShop and ShoppingComp provide online benchmarks that require agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, and Mind2Web cover only comparatively simple e-commerce tasks performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex retrieval tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced searches for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with the best-performing agents achieving task completion rates below 65% in the task categories cheapest product search and vague product search.

representative citing papers

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

cs.AI · 2026-05-15 · conditional · novelty 7.0

ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of agent performance between synthetic and live shops.

RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents

cs.IR · 2026-05-11 · unverdicted · novelty 7.0

RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.

MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

cs.CL · 2025-11-28 · accept · novelty 7.0

RAG, MCP, and NLWeb interfaces let LLM web agents achieve higher F1 scores (0.75-0.77 vs 0.67) and much lower token usage and runtime than HTML in controlled e-commerce tasks.

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.

citing papers explorer

Showing 4 of 4 citing papers.

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents cs.AI · 2026-05-15 · conditional · none · ref 13 · internal anchor
ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of agent performance between synthetic and live shops.
RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents cs.IR · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.
MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report) cs.CL · 2025-11-28 · accept · none · ref 8 · internal anchor
RAG, MCP, and NLWeb interfaces let LLM web agents achieve higher F1 scores (0.75-0.77 vs 0.67) and much lower token usage and runtime than HTML in controlled e-commerce tasks.
SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 13 · internal anchor
SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

fields

years

verdicts

representative citing papers

citing papers explorer