
arxiv: 2605.16116 · v1 · pith:F4YLIHERnew · submitted 2026-05-15 · 💻 cs.AI

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Pith reviewed 2026-05-20 17:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords e-commerce web agents · simulation environments · benchmark tasks · synthetic shops · agent evaluation · ShopArena · reproducible benchmarks

The pith

ShopGym turns live e-commerce sites into controllable sandbox shops that keep the same structural properties and produce matching agent performance signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ShopGym as a framework to build simulation environments and benchmark tasks for e-commerce web agents. Its ShopArena component converts real storefronts into self-contained sandbox versions using anonymized specifications and a staged generation process. ShopGuru then creates tasks grounded in each shop's catalog, navigation, policies, and interaction patterns. This setup aims to combine the realism of live sites with the control, reproducibility, and scalability that hand-built tests lack. Validation on six shops and 224 tasks shows that synthetic versions preserve key structures and that agent results on them correlate positively with results on the original live sites.

Core claim

ShopGym produces self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. ShopArena converts live seed storefronts into sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these, ShopGuru synthesizes benchmark tasks across seven skill categories while grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Results from graph-based structural analysis and agent-based behavioral evaluation confirm that the synthetic shops maintain key properties of live storefronts and that agent performance on synthetic shops is positively correlated with performance on live storefronts.

What carries the argument

ShopArena, the simulation layer that converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process.

If this is right

  • Evaluation of web agents can use many diverse shops while remaining fully reproducible and inspectable.
  • Benchmark tasks stay grounded in real catalog, navigation, and policy details rather than abstract templates.
  • Agent performance measured in the synthetic environments tracks performance on the live versions they derive from.
  • The same seed storefronts can generate multiple controlled variants for systematic comparison across skill categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support iterative agent development by allowing rapid resets and targeted variations without hitting live-site rate limits or non-stationarity.
  • If the correlation between synthetic and live performance holds for a wider range of agent architectures, the framework might serve as a pre-deployment filter before live testing.
  • Similar staged-generation methods could be applied to other interactive web domains such as travel booking or news reading to create analogous controlled environments.

Load-bearing premise

The anonymized shop specifications and staged generation process in ShopArena capture the essential navigation structure, catalog, policies, and interaction affordances of original live storefronts without introducing systematic biases that alter agent behavior or evaluation signals.

What would settle it

A controlled experiment in which the same agents are run on both the generated sandbox shops and their corresponding live storefronts, showing that success rates, error patterns, or performance rankings fail to correlate.
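One way to make the "performance rankings" part of this test concrete is a rank correlation between agents' success rates in the two settings. The following is a minimal sketch, not the paper's analysis: the agent names, the success rates, and the function names are all hypothetical placeholders chosen for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder success rates for four hypothetical agents (NOT paper data).
sandbox = {"agent_a": 0.62, "agent_b": 0.48, "agent_c": 0.71, "agent_d": 0.30}
live = {"agent_a": 0.45, "agent_b": 0.50, "agent_c": 0.60, "agent_d": 0.21}

agents = sorted(sandbox)
rho = spearman_rho([sandbox[a] for a in agents], [live[a] for a in agents])
print(f"Spearman rho across {len(agents)} agents: {rho:.2f}")
```

A rho near zero or below it on real runs would count against the framework's claim. A high rho on only a handful of agents would still be weak evidence, since few distinct rankings are possible at small n.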

Figures

Figures reproduced from arXiv: 2605.16116 by Alberto Castelo, Chinmay Savadikar, Han Li, Lingyun Wang, Mingyu Zhao, Shuang Xie, Tianfu Wu, Yuanzheng Zhu.

Figure 1: ShopGym comprises two components. ShopArena provides a simulation environment populated with synthetic sandbox shops, along with a scalable pipeline that generates new sandbox shops from one or more live seed storefronts through specification synthesis followed by data and code generation. ShopGuru then consumes the resulting catalog, collections, pages, and shop statistics to generate both short-horizon t…
Figure 2: The exploration phase fetches key pages, and the public product catalogue. A planner agent …
Figure 3: The generation phase. A staged sequence of feature-scoped steps is each driven by an …
Figure 4: The ShopGuru task generation pipeline. Deterministic generators emit short-horizon tasks …
Figure 5: (a) An example directed graph visualization of the website; (b) the State Transition statistics; …
Figure 6: Success rates on real storefronts vs. ShopArena twins.
Figure 7: Success rates on the synthetic sandbox shops.
Figure 8: Screenshots of Sandbox Shop Homepages
Figure 9 shows more detailed interaction surfaces within the generated sandbox shops. Built from specifications, the examples include a collection page with faceted filters, a homepage with a promotional popup, and a product detail page with search suggestions and purchasing controls. These screenshots highlight that the generated shops contain not only realistic visual layouts, but also functional e-commerce eleme…
Figure 10: LLM-authored E2E polish loop. The same validator that audits deterministic generators …
Original abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces ShopGym, a framework for realistic simulation and scalable benchmarking of e-commerce web agents. Its core components are ShopArena, which converts live seed storefronts into self-contained sandbox shops via anonymized specifications and a staged, validated generation process, and ShopGuru, which synthesizes benchmark tasks across seven skill categories grounded in each shop's catalog, navigation structure, policies, and interaction affordances. The authors validate the approach through graph-based structural analysis and behavioral evaluation on 224 tasks across six sandbox shops (three synthetic, three real), claiming that the synthetic shops preserve key structural properties of live storefronts and that agent performance on synthetic shops is positively correlated with performance on live storefronts.

Significance. If the central claims hold, ShopGym would address a key methodological bottleneck in e-commerce agent research by providing environments that are simultaneously realistic (grounded in live data), controllable, inspectable, and reproducible. The dual validation strategy combining structural graph metrics with behavioral agent runs on tasks derived from real storefront properties is a strength that could support more standardized and scalable evaluation protocols.
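To illustrate what "structural graph metrics" can look like in practice, the sketch below computes three standard ones on a page-link graph: average shortest path length, out-degree distribution, and average clustering coefficient. It is an assumption-laden illustration, not the paper's code: the toy `site` graph, the function names, and the conventions (directed distances over reachable pairs, clustering on the undirected version, nodes with fewer than two neighbours counted as 0) are all choices made here.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Directed hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path_length(adj):
    """Mean distance over ordered pairs (u, v), u != v, with v reachable from u."""
    total = pairs = 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs if pairs else float("nan")


def out_degree_distribution(adj):
    """Map out-degree -> number of pages with that out-degree."""
    return Counter(len(set(targets)) for targets in adj.values())


def average_clustering(adj):
    """Average local clustering coefficient on the undirected version of adj."""
    und = {u: set() for u in adj}
    for u, targets in adj.items():
        for v in targets:
            if u != v:
                und.setdefault(u, set()).add(v)
                und.setdefault(v, set()).add(u)
    coeffs = []
    for nbrs in und.values():
        nb = list(nbrs)
        k = len(nb)
        if k < 2:
            coeffs.append(0.0)  # convention: too few neighbours counts as 0
            continue
        links = sum(
            1 for i in range(k) for j in range(i + 1, k) if nb[j] in und[nb[i]]
        )
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)


# Toy storefront link graph (hypothetical; every page appears as a key).
site = {
    "home": ["collection", "search", "policy"],
    "collection": ["home", "product"],
    "search": ["home", "product"],
    "product": ["home", "cart"],
    "policy": ["home"],
    "cart": ["home"],
}

print("average shortest path length:", average_shortest_path_length(site))
print("out-degree distribution:", dict(out_degree_distribution(site)))
print("average clustering:", round(average_clustering(site), 3))
```

Comparing such numbers between a live storefront's crawl graph and its sandbox twin is one plausible reading of "structural preservation". Which metrics and tolerances count as preserved is a design decision the paper itself would have to specify.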

major comments (1)
  1. The headline claim that synthetic shops preserve structural properties and yield positively correlated agent performance (abstract) rests on behavioral evaluation across only six shops total. No correlation coefficient, p-value, exact performance metric (e.g., success rate or steps), or controls for task difficulty/shop size are reported, which is load-bearing for the external-validity argument that ShopArena faithfully captures navigation, catalog, and policy affordances without systematic biases.
minor comments (2)
  1. The abstract states the correlation exists but supplies limited detail on exact metrics, exclusion criteria, or statistical controls; adding these would improve clarity without altering the core contribution.
  2. Clarify how the 224 tasks are distributed across the six shops and whether aggregation was performed before computing correlations.
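The aggregation question in the second minor comment matters because 224 tasks cannot be split evenly across six shops (224 is not divisible by 6), so a pooled success rate and a per-shop average can differ. The sketch below shows the difference on made-up data. The shop identifiers, task counts, and outcomes are placeholders, not figures from the paper, and the function names are hypothetical.

```python
from collections import defaultdict


def per_shop_rates(records):
    """records: iterable of (shop_id, success) pairs with success in {0, 1}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for shop, success in records:
        wins[shop] += int(success)
        totals[shop] += 1
    return {shop: wins[shop] / totals[shop] for shop in totals}


def pooled_rate(records):
    """Micro average: every task counts equally, regardless of shop."""
    records = list(records)
    return sum(int(success) for _, success in records) / len(records)


def macro_rate(records):
    """Macro average: every shop counts equally, regardless of task count."""
    rates = per_shop_rates(records)
    return sum(rates.values()) / len(rates)


# Placeholder outcomes (NOT paper data): one large easy shop, one small hard shop.
records = (
    [("shop_1", 1)] * 9 + [("shop_1", 0)] * 1
    + [("shop_2", 1)] * 1 + [("shop_2", 0)] * 3
)

print("per shop:", per_shop_rates(records))
print(f"pooled (micro): {pooled_rate(records):.3f}")
print(f"macro:          {macro_rate(records):.3f}")
```

In this toy case the pooled rate is pulled toward the larger shop. Any correlation computed on pooled numbers would inherit that weighting, which is why reporting the per-shop task counts alongside the aggregation rule helps.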

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of ShopGym to address methodological challenges in e-commerce agent evaluation. We address the major comment below, providing additional context from our experiments while committing to strengthen the reporting in revision.

point-by-point responses
  1. Referee: The headline claim that synthetic shops preserve structural properties and yield positively correlated agent performance (abstract) rests on behavioral evaluation across only six shops total. No correlation coefficient, p-value, exact performance metric (e.g., success rate or steps), or controls for task difficulty/shop size are reported, which is load-bearing for the external-validity argument that ShopArena faithfully captures navigation, catalog, and policy affordances without systematic biases.

    Authors: We agree that quantitative details on the correlation strengthen the external-validity argument and will add them in revision. The behavioral evaluation uses task success rate (binary completion of the specified shopping goal within a step budget) as the primary metric, averaged across the 224 tasks. We will report the Pearson correlation coefficient and p-value computed over the three matched shop pairs (synthetic vs. live), along with per-shop success rates and standard deviations. Task difficulty was controlled by generating tasks from the same seven skill categories with equivalent grounding in catalog size, navigation depth, and policy complexity for each pair; shop size was matched by selecting live and synthetic instances with comparable numbers of products and categories. The structural analysis (detailed in Section 4.1) uses graph metrics including average shortest path length, degree distribution, and clustering coefficient to demonstrate preservation independent of the behavioral results. While the sample of three pairs limits statistical power, the consistent positive trend across pairs supports the claim as preliminary evidence; we will explicitly note the small n as a limitation and outline plans for larger-scale validation. revision: yes
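The limited statistical power of three matched pairs can be quantified. For n = 3, the usual test statistic for Pearson's r, t = r / sqrt(1 - r^2), has one degree of freedom, and under the standard null hypothesis (bivariate normal data, zero correlation) the two-sided p-value reduces to 1 - (2/π)·arcsin(|r|). The sketch below applies this; it is not the authors' analysis, the success rates are placeholders rather than results from the paper, and the function names are hypothetical.

```python
from math import asin, pi, sqrt


def pearson_r(xs, ys):
    """Pearson correlation, clamped to [-1, 1] to absorb rounding error."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return max(-1.0, min(1.0, cov / sqrt(vx * vy)))


def two_sided_p_three_pairs(r):
    """Two-sided p-value for Pearson r when n == 3 (t-test with 1 d.o.f.).

    Valid only for exactly three pairs and under the usual null-hypothesis
    assumptions; other sample sizes need the general t-distribution.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    return 1.0 - (2.0 / pi) * asin(abs(r))


# Placeholder per-shop success rates for three matched pairs (NOT paper data).
synthetic = [0.40, 0.55, 0.70]
live = [0.35, 0.60, 0.62]

r = pearson_r(synthetic, live)
p = two_sided_p_three_pairs(r)
print(f"r = {r:.3f}, two-sided p (n = 3) = {p:.3f}")
```

Under this test, r has to exceed roughly 0.997 before p drops below 0.05 with three pairs, so even a visibly strong positive trend would not reach conventional significance. That is consistent with treating the result as preliminary evidence.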

Circularity Check

0 steps flagged

No significant circularity; validation is externally grounded

full rationale

The paper's core contribution is an empirical framework (ShopArena for generating sandbox shops from live seeds via anonymized specs and staged process, plus ShopGuru for task synthesis) whose validation rests on separate graph-structural metrics and agent behavioral runs across six independent shops (three synthetic, three real-data). The reported positive correlation between synthetic and live agent performance is computed from these external evaluations rather than defined into existence or fitted by construction within the work. No equations, self-definitional reductions, load-bearing self-citations, or ansatz smuggling appear; the derivation chain remains self-contained against the live storefront benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on domain assumptions about data fidelity during anonymization and generation rather than new mathematical axioms or fitted parameters.

axioms (1)
  • domain assumption: Live storefronts can be converted via anonymized specifications and staged generation into self-contained simulations that preserve structural properties and agent-evaluation signals.
    Invoked in the description of ShopArena's conversion process from seed storefronts.
invented entities (2)
  • ShopArena (no independent evidence)
    purpose: Simulation layer that converts live seed storefronts into sandbox shops.
    Core new component of the framework, introduced to address the realism-control tradeoff.
  • ShopGuru (no independent evidence)
    purpose: Task synthesis layer that generates benchmark tasks grounded in shop catalog, navigation, and policies.
    Core new component for creating scalable, grounded evaluation tasks.


