A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3
The pith
A new benchmark called Amazon-Bench evaluates web agents on diverse e-commerce tasks while checking for safety risks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes Amazon-Bench to generate functionality-grounded user queries using webpage content and interactive elements for tasks beyond search, such as address and wish list management. It also introduces an automated evaluation framework that measures both performance and safety of web agents. Systematic evaluations reveal that current agents struggle with complex queries and pose safety risks on e-commerce platforms.
What carries the argument
The data generation pipeline that uses webpage content and interactive elements like buttons and checkboxes to create diverse, functionality-grounded queries, paired with an automated evaluation framework assessing performance and safety.
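To make the idea of grounding queries in interactive elements concrete, here is a minimal sketch using only the Python standard library. It is not the paper's pipeline: the set of tags treated as interactive, the prompt wording, and the names `InteractiveElementParser`, `extract_elements`, and `build_query_prompt` are all assumptions made for illustration, and the sample HTML is invented.

```python
from html.parser import HTMLParser

# Assumption: which tags and input types count as "interactive" in this sketch.
INTERACTIVE_TAGS = {"button", "a", "select"}
INTERACTIVE_INPUT_TYPES = {"checkbox", "radio", "submit", "button"}


class InteractiveElementParser(HTMLParser):
    """Collects buttons, links, selects, and clickable inputs from a page."""

    def __init__(self):
        super().__init__()
        self.elements = []  # finished elements: dicts with a tag and a label
        self._open = None   # element whose text label is still being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            # Inputs have no closing tag, so record them immediately.
            if attrs.get("type", "text") in INTERACTIVE_INPUT_TYPES:
                label = attrs.get("aria-label") or attrs.get("value") or attrs.get("name") or ""
                self.elements.append({"tag": "input", "label": label})
        elif tag in INTERACTIVE_TAGS:
            self._open = {"tag": tag, "label": attrs.get("aria-label") or ""}

    def handle_data(self, data):
        # Use the first non-empty text inside the element as its label.
        if self._open is not None and not self._open["label"]:
            self._open["label"] = data.strip()

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open["tag"]:
            self.elements.append(self._open)
            self._open = None


def extract_elements(html):
    parser = InteractiveElementParser()
    parser.feed(html)
    parser.close()
    return parser.elements


def build_query_prompt(page_title, elements):
    """Formats a hypothetical prompt asking a language model for grounded queries."""
    lines = [f"Page: {page_title}", "Interactive elements:"]
    lines += [f"- {e['tag']}: {e['label']}" for e in elements if e["label"]]
    lines.append("Write user queries that require using these elements.")
    return "\n".join(lines)


# Invented sample page, for illustration only.
SAMPLE = """
<h1>Your Addresses</h1>
<button>Add address</button>
<a href="/edit/1">Edit</a>
<input type="checkbox" name="default" aria-label="Make this my default address">
<input type="text" name="zip">
"""

if __name__ == "__main__":
    print(build_query_prompt("Your Addresses", extract_elements(SAMPLE)))
```

The plain text field is skipped because it is not in the assumed list of interactive input types; a real pipeline would likely need a broader and more careful notion of interactivity.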
If this is right
- Agents must be improved to handle complex multi-step tasks in e-commerce without errors.
- Evaluation of web agents should routinely include checks for unintended account changes and safety violations.
- Development of web agents needs to prioritize robustness to avoid negative impacts on user accounts.
- Broader functionalities like brand following and gift card operations should be standard test cases.
Where Pith is reading between the lines
- Similar benchmarks could be adapted for other website domains like banking or travel to test agent safety more widely.
- Future agents might incorporate explicit safety checks before executing actions on live sites.
- The findings suggest that scaling up training data with safety examples could reduce risks in deployed agents.
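One way to picture the "explicit safety check before executing actions" idea from the list above is a confirmation gate in front of risky clicks. The following is a rough sketch under stated assumptions: the keyword list, the `Action` type, and the `perform`/`confirm` callbacks are hypothetical and do not come from the paper or from any particular agent framework.

```python
from dataclasses import dataclass

# Hypothetical keyword list; a real deployment would need a vetted policy.
RISKY_KEYWORDS = ("buy", "place order", "delete", "remove", "auto-reload")


@dataclass
class Action:
    kind: str          # e.g. "click" or "type"
    target_label: str  # visible label of the element being acted on


def needs_confirmation(action):
    """True when a click targets an element whose label looks state-changing."""
    label = action.target_label.lower()
    return action.kind == "click" and any(word in label for word in RISKY_KEYWORDS)


def execute(action, perform, confirm):
    """Runs perform(action) unless the action is risky and confirm(action) declines."""
    if needs_confirmation(action) and not confirm(action):
        return "blocked"
    perform(action)
    return "executed"
```

Keyword matching is deliberately crude here; it only illustrates where such a gate would sit in the action loop, not how risk should actually be classified.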
Load-bearing premise
The data generation pipeline that leverages webpage content and interactive elements produces queries that are representative of real user tasks and potential safety risks on actual e-commerce platforms.
What would settle it
If a large-scale study of real Amazon user logs shows that the generated queries in Amazon-Bench do not match the distribution of actual user behaviors or risk scenarios, the benchmark's validity would be undermined.
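The comparison described above could, for example, be run over task categories with a divergence measure. The sketch below uses the Jensen-Shannon divergence (base 2, so values lie between 0 and 1); the choice of measure, the category names, and the function names are assumptions for illustration, and no real log data is implied.

```python
import math
from collections import Counter


def to_distribution(labels, categories):
    """Relative frequency of each category, in a fixed order (labels must be non-empty)."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    return [counts[c] / total for c in categories]


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: 0 for identical distributions, 1 for disjoint ones."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A value near 0 would indicate that the benchmark's task mix resembles the logged behavior at the category level; it would say nothing about whether individual queries are realistic.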
Original abstract
Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user's account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, checkboxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Amazon-Bench, a new benchmark for web agents operating on e-commerce platforms such as Amazon. It identifies two gaps in prior benchmarks: narrow focus on product-search tasks and omission of safety risks arising from unintended account changes. To address these, the authors describe a data-generation pipeline that extracts webpage content and interactive elements (buttons, checkboxes) to produce functionality-grounded queries spanning address management, wish-list operations, brand following, and similar tasks. They further introduce an automated evaluation framework that jointly measures task completion and safety violations. Systematic experiments on several agents lead to the conclusion that current agents struggle with complex queries and can trigger safety issues such as erroneous purchases or deletion of saved addresses.
Significance. If the generated tasks prove representative and the safety metrics are shown to be reliable, the benchmark would supply a useful instrument for assessing web agents beyond simple search. The explicit inclusion of safety evaluation alongside performance is a constructive direction, as it directly targets risks that matter for real deployment. The work also supplies concrete examples of failure modes (wrong purchases, address deletion, misconfigured auto-reload) that future agent designs could target.
major comments (2)
- [Data Generation Pipeline] Data Generation Pipeline: the central claim that agents pose safety risks (wrong purchases, deleted addresses, misconfigured auto-reload) rests on the assumption that queries produced by extracting static webpage content and interactive elements correspond to plausible unintended state changes on live Amazon. No external anchor—user logs, expert review, or comparison against real sessions—is described to test this assumption. This is load-bearing for the headline safety finding.
- [Automated Evaluation Framework] Automated Evaluation Framework: the abstract states that the framework assesses both performance and safety, yet supplies no concrete definitions of the safety metrics, error-handling rules, or how unintended state changes are detected and scored. Without these operational details it is impossible to determine whether the reported safety risks are supported by reproducible measurements.
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement of the number of queries generated, the number of agents evaluated, and the primary quantitative results (e.g., success rates or safety-violation counts) to give readers an immediate sense of scale.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
-
Referee: [Data Generation Pipeline] Data Generation Pipeline: the central claim that agents pose safety risks (wrong purchases, deleted addresses, misconfigured auto-reload) rests on the assumption that queries produced by extracting static webpage content and interactive elements correspond to plausible unintended state changes on live Amazon. No external anchor—user logs, expert review, or comparison against real sessions—is described to test this assumption. This is load-bearing for the headline safety finding.
Authors: We thank the referee for this observation. The data-generation pipeline extracts interactive elements (buttons, forms, checkboxes) directly from live Amazon pages, so the resulting queries are grounded in actual site functionalities that can produce state changes such as purchases or address deletions. We did not, however, validate the generated queries against user logs, expert review, or recorded sessions. In the revised manuscript we will add a subsection that (a) explains the design rationale for relying on webpage elements and (b) explicitly notes the absence of external validation as a limitation, together with a brief discussion of how future work could address it. This addition clarifies the claim without changing the pipeline itself.
revision: partial
-
Referee: [Automated Evaluation Framework] Automated Evaluation Framework: the abstract states that the framework assesses both performance and safety, yet supplies no concrete definitions of the safety metrics, error-handling rules, or how unintended state changes are detected and scored. Without these operational details it is impossible to determine whether the reported safety risks are supported by reproducible measurements.
Authors: We agree that the current description of the automated evaluation framework is insufficiently detailed for reproducibility. The manuscript introduces the framework at a conceptual level but does not supply the precise definitions, detection rules, or scoring procedures. In the revised version we will expand the relevant section to include (i) explicit safety-metric definitions, (ii) the state-comparison method used to detect unintended changes, (iii) error-handling conventions, and (iv) the joint performance-safety scoring scheme. These additions will make the reported safety findings fully traceable.
revision: yes
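Since the exchange above turns on how unintended state changes would be detected and scored, a small illustration may help. This is only a sketch of one possible state-comparison scheme, not the framework described in the manuscript: the account snapshot format, the `expected` mapping, and the function names `diff_state` and `score_episode` are assumptions introduced here.

```python
def diff_state(before, after):
    """Returns {key: (old, new)} for every key whose value changed."""
    keys = before.keys() | after.keys()
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


def score_episode(before, after, expected):
    """Scores one agent run.

    `expected` maps each key the task should change to its intended new value.
    A run is complete if all intended values are present afterwards, and safe
    if nothing else changed and no intended key ended up with a wrong value.
    """
    changes = diff_state(before, after)
    completed = all(after.get(k) == v for k, v in expected.items())
    unintended = sorted(
        k for k, (_, new) in changes.items()
        if k not in expected or expected[k] != new
    )
    return {"completed": completed, "safe": not unintended, "unintended": unintended}


if __name__ == "__main__":
    # Invented snapshots: the task is "make Office my default address".
    before = {"default_address": "Home", "saved_addresses": ("Home", "Office"), "auto_reload": False}
    after = {"default_address": "Office", "saved_addresses": ("Office",), "auto_reload": False}
    print(score_episode(before, after, expected={"default_address": "Office"}))
```

In this invented run the agent completes the task but also deletes a saved address, so the episode counts as completed and unsafe. Reporting the two outcomes separately, rather than folding them into one number, keeps that distinction visible.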
Circularity Check
No circularity: benchmark construction and external agent evaluation are independent
The paper constructs Amazon-Bench via a data-generation pipeline that extracts from webpage content and interactive elements to produce functionality-grounded queries, then applies an automated evaluation framework to measure performance and safety of external web agents on those queries. No equations, fitted parameters, or predictions are defined in terms of the evaluation outputs; no self-citations are invoked as load-bearing uniqueness theorems or ansatzes; and the reported findings (agents struggle with complex queries and exhibit safety risks) are direct observations from running agents on the newly generated tasks rather than reductions to the pipeline inputs by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Webpage content and interactive elements (buttons, checkboxes) can be used to generate representative user queries for e-commerce tasks.
- domain assumption: Safety risks such as unintended account changes can be automatically detected and quantified by an evaluation framework.
invented entities (1)
- Amazon-Bench: no independent evidence
Forward citations
Cited by 1 Pith paper
- RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management. RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.