ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

Alberto Castelo; Chinmay Savadikar; Han Li; Lingyun Wang; Mingyu Zhao; Shuang Xie; Tianfu Wu; Yuanzheng Zhu

arxiv: 2605.16116 · v1 · pith:F4YLIHERnew · submitted 2026-05-15 · 💻 cs.AI

Chinmay Savadikar, Mingyu Zhao, Yuanzheng Zhu, Han Li, Shuang Xie, Alberto Castelo, Tianfu Wu, Lingyun Wang

Pith reviewed 2026-05-20 17:51 UTC · model grok-4.3

classification 💻 cs.AI

keywords e-commerce web agents · simulation environments · benchmark tasks · synthetic shops · agent evaluation · ShopArena · reproducible benchmarks

The pith

ShopGym turns live e-commerce sites into controllable sandbox shops that keep the same structural properties and produce matching agent performance signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ShopGym as a framework to build simulation environments and benchmark tasks for e-commerce web agents. Its ShopArena component converts real storefronts into self-contained sandbox versions using anonymized specifications and a staged generation process. ShopGuru then creates tasks grounded in each shop's catalog, navigation, policies, and interaction patterns. This setup aims to combine the realism of live sites with the control, reproducibility, and scalability that hand-built tests lack. Validation on six shops and 224 tasks shows that synthetic versions preserve key structures and that agent results on them correlate positively with results on the original live sites.

Core claim

ShopGym produces self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. ShopArena converts live seed storefronts into sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these, ShopGuru synthesizes benchmark tasks across seven skill categories while grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Results from graph-based structural analysis and agent-based behavioral evaluation indicate that the synthetic shops maintain key properties of live storefronts and that agent performance on synthetic shops is positively correlated with performance on live storefronts.

What carries the argument

ShopArena, the simulation layer that converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process.

If this is right

Evaluation of web agents can use many diverse shops while remaining fully reproducible and inspectable.
Benchmark tasks stay grounded in real catalog, navigation, and policy details rather than abstract templates.
Agent performance measured in the synthetic environments tracks performance on the live versions they derive from.
The same seed storefronts can generate multiple controlled variants for systematic comparison across skill categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could support iterative agent development by allowing rapid resets and targeted variations without hitting live-site rate limits or non-stationarity.
If the correlation between synthetic and live performance holds for a wider range of agent architectures, the framework might serve as a pre-deployment filter before live testing.
Similar staged-generation methods could be applied to other interactive web domains such as travel booking or news reading to create analogous controlled environments.

Load-bearing premise

The anonymized shop specifications and staged generation process in ShopArena capture the essential navigation structure, catalog, policies, and interaction affordances of original live storefronts without introducing systematic biases that alter agent behavior or evaluation signals.

What would settle it

A controlled experiment in which the same agents are run on both the generated sandbox shops and their corresponding live storefronts, showing that success rates, error patterns, or performance rankings fail to correlate.
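
A minimal sketch of how such a comparison could be scored, assuming each agent's per-task outcomes are recorded as booleans in both environments. The agent names and outcome lists are placeholders invented for illustration, not results from the paper, and Spearman rank correlation is one reasonable choice for comparing rankings rather than the paper's stated method.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks completed successfully."""
    return sum(outcomes) / len(outcomes)


def average_ranks(values: List[float]) -> List[float]:
    """Rank values 1..n (ascending); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder outcomes (NOT from the paper): one list per agent.
sandbox: Dict[str, List[bool]] = {
    "agent_a": [True, True, True, False],
    "agent_b": [True, True, False, False],
    "agent_c": [True, False, False, False],
    "agent_d": [True, True, True, True],
}
live: Dict[str, List[bool]] = {
    "agent_a": [True, True, False, False],
    "agent_b": [True, False, False, False],
    "agent_c": [False, False, False, False],
    "agent_d": [True, True, True, False],
}

agents = sorted(sandbox)
sandbox_rates = [success_rate(sandbox[a]) for a in agents]
live_rates = [success_rate(live[a]) for a in agents]
print("rank agreement:", spearman(sandbox_rates, live_rates))
```

A rank correlation near zero or below, or a markedly different error profile between the two environments, would be the kind of evidence that counts against the premise.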

Figures

Figures reproduced from arXiv: 2605.16116 by Alberto Castelo, Chinmay Savadikar, Han Li, Lingyun Wang, Mingyu Zhao, Shuang Xie, Tianfu Wu, Yuanzheng Zhu.

**Figure 1.** ShopGym comprises two components. ShopArena provides a simulation environment populated with synthetic sandbox shops, along with a scalable pipeline that generates new sandbox shops from one or more live seed storefronts through specification synthesis followed by data and code generation. ShopGuru then consumes the resulting catalog, collections, pages, and shop statistics to generate both short-horizon t…

**Figure 2.** The exploration phase fetches key pages, and the public product catalogue. A planner agent …

**Figure 3.** The generation phase. A staged sequence of feature-scoped steps is each driven by an …

**Figure 4.** The ShopGuru task generation pipeline. Deterministic generators emit short-horizon tasks …

**Figure 5.** (a) An example directed graph visualization of the website; (b) the State Transition statistics; …

**Figure 6.** Success rates on real storefronts vs. ShopArena twins.

**Figure 7.** Success rates on the synthetic sandbox shops.

**Figure 8.** Screenshots of Sandbox Shop Homepages

**Figure 9.** Figure 9 shows more detailed interaction surfaces within the generated sandbox shops. Built from specifications, the examples include a collection page with faceted filters, a homepage with a promotional popup, and a product detail page with search suggestions and purchasing controls. These screenshots highlight that the generated shops contain not only realistic visual layouts, but also functional e-commerce eleme…

**Figure 10.** LLM-authored E2E polish loop. The same validator that audits deterministic generators …

Original abstract

Developing and evaluating e-commerce web agents requires environments that preserve meaningful task structure while enabling controllable, reproducible, and scalable scientific comparison. Existing methodologies force a tradeoff: live storefronts provide realism but are non-stationary, difficult to inspect, and irreproducible, while hand-built sandbox benchmarks provide control but cover only a narrow range of layouts, catalogs, policies, and interaction patterns. We argue that the core bottleneck is methodological: the field lacks a scalable way to construct evaluation settings that are simultaneously realistic, diverse, controllable, inspectable, and reproducible. We introduce ShopGym, an integrated framework for realistic simulation and scalable benchmarking of e-commerce web agents. ShopGym is a framework for constructing e-commerce simulation environments and grounded benchmark tasks. Its simulation layer, ShopArena, converts live seed storefronts into self-contained sandbox shops through anonymized shop specifications and a staged, validated generation process. On top of these simulated storefronts, ShopGuru synthesizes benchmark tasks across seven skill categories, grounding each task in the shop's catalog, navigation structure, policies, and interaction affordances. Together, ShopArena and ShopGuru produce self-contained, resettable, inspectable, and stable evaluation artifacts that preserve structural properties and agent-evaluation signals relevant to shopping tasks. We validate the framework through graph-based structural analysis and agent-based behavioral evaluation with 224 generated tasks across six sandbox shops: three constructed with synthetic data and three with real data. Our results show that the synthetic shops preserve key structural properties of live storefronts, with agent performance on synthetic shops positively correlated with performance on live storefronts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ShopGym gives a workable path from live e-commerce sites to controllable, inspectable benchmarks, but the key correlation result sits on a very small sample of shops with no numbers or controls shown.

The main contribution is that ShopGym converts live storefront seeds into self-contained sandbox shops via anonymized specs and a staged generation process, then layers grounded task synthesis across skill categories on top. This directly tackles the live-versus-sandbox tradeoff that has limited agent evaluation in e-commerce. The framework ships two named pieces, ShopArena for the simulation layer and ShopGuru for task creation, and the authors validate it with graph-based structural checks plus agent runs on 224 tasks spread over six shops, three built from synthetic data and three from real data. The claim is that the synthetic versions keep enough of the navigation, catalog, and policy structure that agent performance on them correlates positively with performance on live storefronts.

That is the concrete advance: a reproducible pipeline that starts from real data rather than hand-crafted toys. It is useful because it produces resettable, inspectable environments that still aim to reflect actual shopping affordances. The approach is grounded in external live seeds, so there is no obvious circularity and no free parameters driving the results. The validation protocol itself is a reasonable first step for this kind of work.

The soft spot is the behavioral correlation. It rests on only six shops total, and the abstract supplies neither the actual coefficient, the precise success metric, nor any controls for shop size or task difficulty. With so few independent cases, even a moderate positive link could reflect artifacts in how the staged generation was tuned rather than faithful capture of the properties that matter for agents. If the full paper adds more shops, reports the numbers, or shows robustness checks, the result strengthens; otherwise the external-validity link stays preliminary.

This paper is aimed at groups building web agents for retail or similar interactive domains who need scalable, reproducible testbeds. Readers who want a practical framework with some empirical backing will find it worth reading. It deserves a serious referee because the problem is real, the components are clearly described, and the validation direction is honest even if the current evidence is thin. I would send it for review and ask specifically for the correlation details and any additional shops or controls.

Referee Report

1 major / 2 minor

Summary. The paper introduces ShopGym, a framework for realistic simulation and scalable benchmarking of e-commerce web agents. Its core components are ShopArena, which converts live seed storefronts into self-contained sandbox shops via anonymized specifications and a staged, validated generation process, and ShopGuru, which synthesizes benchmark tasks across seven skill categories grounded in each shop's catalog, navigation structure, policies, and interaction affordances. The authors validate the approach through graph-based structural analysis and behavioral evaluation on 224 tasks across six sandbox shops (three synthetic, three real), claiming that the synthetic shops preserve key structural properties of live storefronts and that agent performance on synthetic shops is positively correlated with performance on live storefronts.

Significance. If the central claims hold, ShopGym would address a key methodological bottleneck in e-commerce agent research by providing environments that are simultaneously realistic (grounded in live data), controllable, inspectable, and reproducible. The dual validation strategy combining structural graph metrics with behavioral agent runs on tasks derived from real storefront properties is a strength that could support more standardized and scalable evaluation protocols.
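
The structural half of that validation can be illustrated with a small sketch. The authors' response later in this review names average shortest path length, degree distribution, and clustering coefficient as the graph metrics; the toy page graph, its node names, and the decision to ignore edge direction when computing clustering are illustrative assumptions here, not the paper's implementation.

```python
from collections import Counter, deque
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # page -> pages it links to (directed)


def average_shortest_path_length(g: Graph) -> float:
    """Mean BFS distance over ordered pairs (u, v), u != v, with v reachable from u."""
    total, pairs = 0, 0
    for source in g:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in g.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def out_degree_distribution(g: Graph) -> Dict[int, int]:
    """Map each out-degree to the number of pages that have it."""
    return dict(Counter(len(targets) for targets in g.values()))


def average_clustering(g: Graph) -> float:
    """Average local clustering coefficient, treating links as undirected."""
    und: Dict[str, Set[str]] = {n: set() for n in g}
    for u, targets in g.items():
        for v in targets:
            if u != v:
                und.setdefault(u, set()).add(v)
                und.setdefault(v, set()).add(u)
    coeffs = []
    for nbrs in und.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)
            continue
        nb = list(nbrs)
        links = sum(
            1 for i in range(k) for j in range(i + 1, k) if nb[j] in und[nb[i]]
        )
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


# Hypothetical toy storefront graph (page types are placeholders).
toy_shop: Graph = {
    "home": {"collection", "search"},
    "collection": {"product", "home"},
    "search": {"product"},
    "product": {"cart", "home"},
    "cart": {"home"},
}

print(average_shortest_path_length(toy_shop))
print(out_degree_distribution(toy_shop))
print(average_clustering(toy_shop))
```

Computing the same three numbers on a live storefront's crawl graph and on its sandbox twin gives a direct, if coarse, check on whether navigation structure was preserved.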

major comments (1)

The headline claim that synthetic shops preserve structural properties and yield positively correlated agent performance (abstract) rests on behavioral evaluation across only six shops total. No correlation coefficient, p-value, exact performance metric (e.g., success rate or steps), or controls for task difficulty/shop size are reported, which is load-bearing for the external-validity argument that ShopArena faithfully captures navigation, catalog, and policy affordances without systematic biases.

minor comments (2)

The abstract states the correlation exists but supplies limited detail on exact metrics, exclusion criteria, or statistical controls; adding these would improve clarity without altering the core contribution.
Clarify how the 224 tasks are distributed across the six shops and whether aggregation was performed before computing correlations.
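
The aggregation question in the last comment matters because pooling all tasks and averaging per-shop rates give different numbers whenever shops have unequal task counts. The sketch below shows the two options side by side; the shop labels and outcomes are made-up placeholders, not data from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, bool]  # (shop label, task succeeded?)


def per_shop_success(records: Iterable[Record]) -> Dict[str, Tuple[int, float]]:
    """Group binary task outcomes by shop; return (task count, success rate)."""
    outcomes: Dict[str, List[bool]] = defaultdict(list)
    for shop, ok in records:
        outcomes[shop].append(ok)
    return {shop: (len(v), sum(v) / len(v)) for shop, v in outcomes.items()}


def pooled_success(records: Iterable[Record]) -> float:
    """Success rate over all tasks, ignoring which shop they came from."""
    rows = list(records)
    return sum(ok for _, ok in rows) / len(rows)


def macro_success(records: Iterable[Record]) -> float:
    """Unweighted mean of the per-shop success rates."""
    rates = [rate for _, rate in per_shop_success(records).values()]
    return sum(rates) / len(rates)


# Hypothetical placeholder records (NOT from the paper).
records: List[Record] = [
    ("shop_1", True), ("shop_1", True), ("shop_1", True), ("shop_1", False),
    ("shop_2", False), ("shop_2", False),
]

print(per_shop_success(records))  # task counts and rates per shop
print(pooled_success(records))    # weighted by task count
print(macro_success(records))     # each shop counts equally
```

Reporting the per-shop task counts alongside whichever aggregate feeds the correlation would answer both parts of the comment.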

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of ShopGym to address methodological challenges in e-commerce agent evaluation. We address the major comment below, providing additional context from our experiments while committing to strengthen the reporting in revision.

Referee: The headline claim that synthetic shops preserve structural properties and yield positively correlated agent performance (abstract) rests on behavioral evaluation across only six shops total. No correlation coefficient, p-value, exact performance metric (e.g., success rate or steps), or controls for task difficulty/shop size are reported, which is load-bearing for the external-validity argument that ShopArena faithfully captures navigation, catalog, and policy affordances without systematic biases.

Authors: We agree that quantitative details on the correlation strengthen the external-validity argument and will add them in revision. The behavioral evaluation uses task success rate (binary completion of the specified shopping goal within a step budget) as the primary metric, averaged across the 224 tasks. We will report the Pearson correlation coefficient and p-value computed over the three matched shop pairs (synthetic vs. live), along with per-shop success rates and standard deviations. Task difficulty was controlled by generating tasks from the same seven skill categories with equivalent grounding in catalog size, navigation depth, and policy complexity for each pair; shop size was matched by selecting live and synthetic instances with comparable numbers of products and categories. The structural analysis (detailed in Section 4.1) uses graph metrics including average shortest path length, degree distribution, and clustering coefficient to demonstrate preservation independent of the behavioral results. While the sample of three pairs limits statistical power, the consistent positive trend across pairs supports the claim as preliminary evidence; we will explicitly note the small n as a limitation and outline plans for larger-scale validation.

Revision: yes
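
To make the statistical-power caveat concrete, here is a small sketch of a Pearson correlation and its two-sided p-value for exactly three pairs. With n = 3 the usual test statistic t = r·sqrt((n − 2)/(1 − r²)) has one degree of freedom, so it follows a standard Cauchy distribution under the null and the p-value reduces to 1 − (2/π)·asin(|r|). This assumes the standard t-test for Pearson's r, which itself assumes bivariate normality. The success rates below are hypothetical placeholders, not numbers from the paper.

```python
import math
from typing import List


def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def p_value_three_points(r: float) -> float:
    """Two-sided p-value for Pearson's r when n = 3 (t-test with 1 degree of freedom)."""
    return 1.0 - (2.0 / math.pi) * math.asin(min(1.0, abs(r)))


# Hypothetical per-pair success rates (placeholders, NOT from the paper).
sandbox_rates = [0.60, 0.45, 0.70]
live_rates = [0.50, 0.40, 0.65]

r = pearson_r(sandbox_rates, live_rates)
print(f"r = {r:.3f}, p = {p_value_three_points(r):.3f}")

# |r| needed to reach p < 0.05 with three pairs: sin(0.95 * pi / 2), roughly 0.997.
print(f"threshold |r| = {math.sin(0.95 * math.pi / 2):.4f}")
```

Under these assumptions a coefficient around 0.97 still yields a p-value near 0.15, which is why a consistent positive trend across three pairs is better read as preliminary evidence than as a significance result.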

Circularity Check

0 steps flagged

No significant circularity; validation is externally grounded

The paper's core contribution is an empirical framework (ShopArena for generating sandbox shops from live seeds via anonymized specs and staged process, plus ShopGuru for task synthesis) whose validation rests on separate graph-structural metrics and agent behavioral runs across six independent shops (three synthetic, three real-data). The reported positive correlation between synthetic and live agent performance is computed from these external evaluations rather than defined into existence or fitted by construction within the work. No equations, self-definitional reductions, load-bearing self-citations, or ansatz smuggling appear; the derivation chain remains self-contained against the live storefront benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on domain assumptions about data fidelity during anonymization and generation rather than new mathematical axioms or fitted parameters.

axioms (1)

Domain assumption: Live storefronts can be converted via anonymized specifications and staged generation into self-contained simulations that preserve structural properties and agent-evaluation signals.
Invoked in the description of ShopArena's conversion process from seed storefronts.

invented entities (2)

ShopArena (no independent evidence)
Purpose: Simulation layer that converts live seed storefronts into sandbox shops.
Core new component of the framework, introduced to address the realism-control tradeoff.

ShopGuru (no independent evidence)
Purpose: Task synthesis layer that generates benchmark tasks grounded in shop catalog, navigation, and policies.
Core new component for creating scalable, grounded evaluation tasks.

pith-pipeline@v0.9.0 · 5847 in / 1284 out tokens · 60310 ms · 2026-05-20T17:51:27.232975+00:00 · methodology

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 8 internal anchors

[1]

Gym-Anything: Turn any Software into an Agent Environment

Pranjal Aggarwal, Graham Neubig, and Sean Welleck. Gym-anything: Turn any software into an agent environment.arXiv preprint arXiv:2604.06126, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

Xu, Siva Reddy, Gra- ham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados

Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexan- dre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Gra- ham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados. The browsergym ecosys...

work page 2025
[5]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samual Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Alice Oh, Tris- tan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Syst...

work page 2023
[6]

Go-browse: Training web agents with structured explo- ration

Apurva Gandhi and Graham Neubig. Go-browse: Training web agents with structured explo- ration. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=IpzRWE52yw

work page 2026
[7]

Webvoyager: Building an end-to-end web agent with large multimodal models

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024,...

work page doi:10.18653/v1/2024.acl-long.371 2024
[8]

Visualwebarena: Evaluating multimodal agents on realistic visual web tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024,

work page 2024
[9]

URLhttps://arxiv.org/abs/2401.13649

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2023. URL https://api. semanticscholar.org/CorpusID:259360665

work page 2023
[11]

Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Qi He, Dakuo Wang, et al. Can llm agents simulate multi-turn human behavior? evidence from real online customer behavior data.arXiv preprint arXiv:2503.20749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Deepshop: A benchmark for deep research shopping agents.arXiv preprint arXiv:2506.02839, 2025

Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuyi Chen. Deepshop: A benchmark for deep research shopping agents.ArXiv, abs/2506.02839, 2025. URLhttps://api.semanticscholar.org/CorpusID:279118560. 10

work page arXiv 2025
[13]

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, and Christian Bizer. Webmall–a multi-shop benchmark for evaluating web agents [technical report].arXiv preprint arXiv:2508.13024, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

OpenAI GPT-5 System Card

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

The illusion of diminishing returns: Measuring long horizon execution in llms.ArXiv, abs/2509.09677, 2025

Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in llms.ArXiv, abs/2509.09677, 2025. URLhttps://api.semanticscholar.org/CorpusID:281252776

work page arXiv 2025
[16]

Agenta/b: Automated and scalable web a/btesting with interactive llm agents.arXiv preprint arXiv:2504.09723, 2025

Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, et al. Agenta/b: Automated and scalable web a/btesting with interactive llm agents.arXiv preprint arXiv:2504.09723, 2025

work page arXiv 2025
[17]

Shoppingbench: A real-world intent-grounded shopping benchmark for llm-based agents

Jiang Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jiandong Zhang, and Xiaoyi Zeng. Shoppingbench: A real-world intent-grounded shopping benchmark for llm-based agents. In AAAI Conference on Artificial Intelligence, 2025. URL https://api.semanticscholar. org/CorpusID:280536823

work page 2025
[19]

Shopsimulator: Evaluating and exploring rl-driven llm agent for shopping assistants.ArXiv, abs/2601.18225, 2026

Pei Wang, Yanan Wu, Xiaoshuai Song, Weixun Wang, Gengru Chen, Zhongwen Li, Ke Yan, Ken Deng, Qi Liu, Shu-Man Zhao, Shaopan Xiong, Xuepeng Liu, Xuefeng Chen, Wanxi Deng, Wenbo Su, and Bo Zheng. Shopsimulator: Evaluating and exploring rl-driven llm agent for shopping assistants.ArXiv, abs/2601.18225, 2026. URL https://api.semanticscholar. org/CorpusID:285050373

work page arXiv 2026
[20]

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Ziyi Wang, Yuxuan Lu, Wenbo Li, Amir A. Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia B. Chilton, and Dakuo Wang. Opera: A dataset of observation, persona, rationale, and action for evaluating llms on human online shopping behavior simulation.ArXiv, abs/2506.05606, 2025....

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Customer-r1: Personal- ized simulation of human behaviors via rl-based llm agent in online shopping.arXiv preprint arXiv:2510.07230, 2025

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Customer-r1: Personal- ized simulation of human behaviors via rl-based llm agent in online shopping.arXiv preprint arXiv:2510.07230, 2025

work page arXiv 2025
[22]

An illusion of progress? assessing the current state of web agents.arXiv preprint arXiv:2504.01382, 2025

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents.CoRR, abs/2504.01382,

work page arXiv
[24]

Webshop: Towards scal- able real-world web interaction with grounded language agents

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scal- able real-world web interaction with grounded language agents. In Sanmi Koyejo, S. Mo- hamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Systems 35: Annual Conference on Neural Information Pro- cessing Systems 2022,...

work page 2022
[25]

Shopping companion: A memory-augmented LLM agent for real-world e-commerce tasks.CoRR, abs/2603.14864,

Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, and Xiaoyi Zeng. Shopping companion: A memory-augmented LLM agent for real-world e-commerce tasks.CoRR, abs/2603.14864,

work page arXiv
[27]

WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

Peng Yuan, Yuyang Yin, Yuxuan Cai, and Zheng Wei. Webforge: Breaking the realism-reproducibility-scalability trilemma in browser agent benchmark.arXiv preprint arXiv:2604.10988, 2026. 11

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Na, Sungwon Kim, Junseok Lee, and Chanyoung Park

Shuo Zhang, Boci Peng, Xinping Zhao, Boren Hu, Yun Zhu, Yanjia Zeng, and Xuming Hu. Llasa: Large language and e-commerce shopping assistant.CoRR, abs/2408.02006, 2024. doi: 10.48550/ARXIV .2408.02006. URLhttps://doi.org/10.48550/arXiv.2408.02006

work page internal anchor Pith review doi:10.48550/arxiv 2024
[29]

See, think, act: Online shopper behavior simulation with vlm agents.arXiv preprint arXiv:2510.19245, 2025

Yimeng Zhang, Jiri Gesi, Ran Xue, Tian Wang, Ziyi Wang, Yuxuan Lu, Sinong Zhan, Huimin Zeng, Qingjun Cui, Yufan Guo, et al. See, think, act: Online shopper behavior simulation with vlm agents.arXiv preprint arXiv:2510.19245, 2025

work page arXiv 2025
[30]

Webarena-infinity: Generating browser environments with verifiable tasks at scale.shuyanzhou.com, March 2026

Shuyan Zhou. Webarena-infinity: Generating browser environments with verifiable tasks at scale.shuyanzhou.com, March 2026. URL https://webarena.dev/webarena-infinity/

work page 2026
[31]

Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2...

work page 2024
[32]

WebMall [12] expands to four simulated shops with authentic product offers and comparison-shopping tasks

introduces a Chinese web environment for multi-turn shopping dialog and fine-grained product differentiation. WebMall [12] expands to four simulated shops with authentic product offers and comparison-shopping tasks. Although these benchmarks enable controlled evaluation, they remain tied to a fixed set of layouts, taxonomies, and policies, and therefore c...

work page
[33]

click on the filter menu titledPrice

proposes a multi-agent pipeline for generating tasks and trajectories by navigating websites and maintaining a graph of visited URLs for efficient exploration. 13 B Implementation Details B.1 Models used for agent implementations • Planning Agent: Claude Code with Claude Opus 4.6 • Specification Agent: Claude Code with Claude Opus 4.6 • Collections genera...

work page
[34]

Search & atomic add-to-cart

work page
[35]

Nav drilldown (menu -> sub-menu -> collection -> product)

work page
[36]

Format=Hardcover then sort by price)

Filter + sort (e.g. Format=Hardcover then sort by price)

work page
[37]

Filter that returns zero results, then recover

work page
[38]

Substitute-match discovery (intended product missing -> close alternative)

work page
[39]

Review / detail read on a product page

work page
[40]

Size chart / fit guide lookup

work page
[41]

Shipping policy lookup (with cart action after)

work page
[42]

Returns / refunds lookup (with cart action after)

work page
[43]

Gift card purchase (only if the shop sells gift cards)

work page
[44]

Multi-product cart with edit (add A, add B, remove A, set qty 2 on B)

work page
[45]

Cross-collection or cross-brand comparison (compare A and B, pick one)

work page
[46]

Contact / store locator / about page

work page
[47]

Search for X. Pick a Color and Size variant. Add to cart

Free-shipping threshold or sale-discount calculation (if banner exists) ### High-quality reference tasks (from other shops in this benchmark suite) [See Appendix sections below for three representative few-shot examples. Five additional examples are included in the released benchmark code.] ### Anti-patterns to AVOID - "Search for X. Pick a Color and Size...

work page

[1] [1]

Gym-Anything: Turn any Software into an Agent Environment

Pranjal Aggarwal, Graham Neubig, and Sean Welleck. Gym-anything: Turn any software into an agent environment.arXiv preprint arXiv:2604.06126, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [4]

Xu, Siva Reddy, Gra- ham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados

Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexan- dre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Gra- ham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados. The browsergym ecosys...

work page 2025

[3] [5]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samual Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Alice Oh, Tris- tan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Syst...

work page 2023

[4] [6]

Go-browse: Training web agents with structured explo- ration

Apurva Gandhi and Graham Neubig. Go-browse: Training web agents with structured explo- ration. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=IpzRWE52yw

work page 2026

[5] [7]

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024,... doi:10.18653/v1/2024.acl-long.371

[6] [8]

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, ... URL https://arxiv.org/abs/2401.13649

[8] [10]

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2023. URL https://api.semanticscholar.org/CorpusID:259360665

[9] [11]

Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Qi He, Dakuo Wang, et al. Can llm agents simulate multi-turn human behavior? evidence from real online customer behavior data. arXiv preprint arXiv:2503.20749, 2025.

[10] [12]

Yougang Lyu, Xiaoyu Zhang, Lingyong Yan, Maarten de Rijke, Zhaochun Ren, and Xiuyi Chen. Deepshop: A benchmark for deep research shopping agents. ArXiv, abs/2506.02839, 2025. URL https://api.semanticscholar.org/CorpusID:279118560

[11] [13]

Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, and Christian Bizer. Webmall–a multi-shop benchmark for evaluating web agents [technical report]. arXiv preprint arXiv:2508.13024, 2025.

[12] [14]

Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267, 2025.

[13] [15]

Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, and Jonas Geiping. The illusion of diminishing returns: Measuring long horizon execution in llms. ArXiv, abs/2509.09677, 2025. URL https://api.semanticscholar.org/CorpusID:281252776

[14] [16]

Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, et al. Agenta/b: Automated and scalable web a/btesting with interactive llm agents. arXiv preprint arXiv:2504.09723, 2025.

[15] [17]

Jiang Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jiandong Zhang, and Xiaoyi Zeng. Shoppingbench: A real-world intent-grounded shopping benchmark for llm-based agents. In AAAI Conference on Artificial Intelligence, 2025. URL https://api.semanticscholar.org/CorpusID:280536823

[16] [19]

Pei Wang, Yanan Wu, Xiaoshuai Song, Weixun Wang, Gengru Chen, Zhongwen Li, Ke Yan, Ken Deng, Qi Liu, Shu-Man Zhao, Shaopan Xiong, Xuepeng Liu, Xuefeng Chen, Wanxi Deng, Wenbo Su, and Bo Zheng. Shopsimulator: Evaluating and exploring rl-driven llm agent for shopping assistants. ArXiv, abs/2601.18225, 2026. URL https://api.semanticscholar.org/CorpusID:285050373

[17] [20]

Ziyi Wang, Yuxuan Lu, Wenbo Li, Amir A. Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia B. Chilton, and Dakuo Wang. Opera: A dataset of observation, persona, rationale, and action for evaluating llms on human online shopping behavior simulation. ArXiv, abs/2506.05606, 2025....

[18] [21]

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, and Dakuo Wang. Customer-r1: Personalized simulation of human behaviors via rl-based llm agent in online shopping. arXiv preprint arXiv:2510.07230, 2025.

[19] [22]

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An illusion of progress? assessing the current state of web agents. CoRR, abs/2504.01382, 2025.

[20] [24]

Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022,...

[21] [25]

Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, and Xiaoyi Zeng. Shopping companion: A memory-augmented LLM agent for real-world e-commerce tasks. CoRR, abs/2603.14864,

[22] [27]

Peng Yuan, Yuyang Yin, Yuxuan Cai, and Zheng Wei. Webforge: Breaking the realism-reproducibility-scalability trilemma in browser agent benchmark. arXiv preprint arXiv:2604.10988, 2026.

[23] [28]

Shuo Zhang, Boci Peng, Xinping Zhao, Boren Hu, Yun Zhu, Yanjia Zeng, and Xuming Hu. Llasa: Large language and e-commerce shopping assistant. CoRR, abs/2408.02006, 2024. doi: 10.48550/ARXIV.2408.02006. URL https://doi.org/10.48550/arXiv.2408.02006

[24] [29]

Yimeng Zhang, Jiri Gesi, Ran Xue, Tian Wang, Ziyi Wang, Yuxuan Lu, Sinong Zhan, Huimin Zeng, Qingjun Cui, Yufan Guo, et al. See, think, act: Online shopper behavior simulation with vlm agents. arXiv preprint arXiv:2510.19245, 2025.

[25] [30]

Shuyan Zhou. Webarena-infinity: Generating browser environments with verifiable tasks at scale. shuyanzhou.com, March 2026. URL https://webarena.dev/webarena-infinity/

[26] [31]

Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2...

introduces a Chinese web environment for multi-turn shopping dialog and fine-grained product differentiation. WebMall [12] expands to four simulated shops with authentic product offers and comparison-shopping tasks. Although these benchmarks enable controlled evaluation, they remain tied to a fixed set of layouts, taxonomies, and policies, and therefore c...

proposes a multi-agent pipeline for generating tasks and trajectories by navigating websites and maintaining a graph of visited URLs for efficient exploration.

B Implementation Details

B.1 Models used for agent implementations

• Planning Agent: Claude Code with Claude Opus 4.6
• Specification Agent: Claude Code with Claude Opus 4.6
• Collections genera...

- Search & atomic add-to-cart
- Nav drilldown (menu -> sub-menu -> collection -> product)
- Filter + sort (e.g. Format=Hardcover then sort by price)
- Filter that returns zero results, then recover
- Substitute-match discovery (intended product missing -> close alternative)
- Review / detail read on a product page
- Size chart / fit guide lookup
- Shipping policy lookup (with cart action after)
- Returns / refunds lookup (with cart action after)
- Gift card purchase (only if the shop sells gift cards)
- Multi-product cart with edit (add A, add B, remove A, set qty 2 on B)
- Cross-collection or cross-brand comparison (compare A and B, pick one)
- Contact / store locator / about page
