arxiv: 2508.15832 · v2 · submitted 2025-08-18 · 💻 cs.CL · cs.AI

A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Pith reviewed 2026-05-18 22:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords web agents · e-commerce benchmark · Amazon-Bench · functionality-grounded queries · safety evaluation · automated evaluation framework · account management tasks · agent robustness

The pith

A new benchmark called Amazon-Bench evaluates web agents on diverse e-commerce tasks while checking for safety risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for web agents in e-commerce mainly test product search and overlook safety. This paper introduces Amazon-Bench, whose queries are generated from actual webpage elements and cover account management, wish lists, and other functions. An automated framework evaluates both task completion and unintended changes such as wrong purchases or deleted addresses. Tests show that current agents struggle with complex queries and introduce safety risks, which points to the need for more robust agent design.

Core claim

The paper proposes Amazon-Bench to generate functionality-grounded user queries using webpage content and interactive elements for tasks beyond search, such as address and wish list management. It also introduces an automated evaluation framework that measures both performance and safety of web agents. Systematic evaluations reveal that current agents struggle with complex queries and pose safety risks on e-commerce platforms.

What carries the argument

The data generation pipeline that uses webpage content and interactive elements like buttons and checkboxes to create diverse, functionality-grounded queries, paired with an automated evaluation framework assessing performance and safety.
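
As a rough illustration of what grounding queries in interactive elements could look like, the sketch below pulls buttons and checkboxes out of a page with Python's standard `html.parser` and turns them into a prompt for a query-writing model. The class name `InteractiveElementCollector`, the function `build_query_prompt`, the prompt wording, and the sample HTML are all assumptions made for this illustration; the paper's actual pipeline is not described at this level of detail here.

```python
from html.parser import HTMLParser


class InteractiveElementCollector(HTMLParser):
    """Collects buttons and checkbox-like inputs from an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.elements = []        # one dict per interactive control, in page order
        self._open_button = None  # button whose visible text is still being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "button":
            # Prefer an explicit aria-label; otherwise fall back to the button text.
            self._open_button = {"kind": "button", "label": attrs.get("aria-label") or ""}
        elif tag == "input" and attrs.get("type") in {"checkbox", "radio", "submit"}:
            label = attrs.get("aria-label") or attrs.get("value") or attrs.get("name") or ""
            self.elements.append({"kind": attrs["type"], "label": label})

    def handle_data(self, data):
        # Use the first non-empty text chunk inside a button as its label.
        if self._open_button is not None and not self._open_button["label"]:
            self._open_button["label"] = data.strip()

    def handle_endtag(self, tag):
        if tag == "button" and self._open_button is not None:
            self.elements.append(self._open_button)
            self._open_button = None


def build_query_prompt(page_title, elements):
    """Format the page title and its controls as a prompt (wording is an assumption)."""
    lines = [f"Page: {page_title}", "Interactive elements:"]
    lines += [f"- {e['kind']}: {e['label']}" for e in elements]
    lines.append("Write one realistic user request that requires using these elements.")
    return "\n".join(lines)


# Invented sample page, standing in for a crawled account page.
SAMPLE_HTML = """
<h1>Your Addresses</h1>
<button>Add address</button>
<input type="checkbox" name="make-default" aria-label="Make this my default address">
<button aria-label="Remove address">X</button>
"""

collector = InteractiveElementCollector()
collector.feed(SAMPLE_HTML)
prompt = build_query_prompt("Your Addresses", collector.elements)
print(prompt)
```

The prompt would then be passed to whatever model writes the user query; that step is omitted because the source does not specify it.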

If this is right

  • Agents must be improved to handle complex multi-step tasks in e-commerce without errors.
  • Evaluation of web agents should routinely include checks for unintended account changes and safety violations.
  • Development of web agents needs to prioritize robustness to avoid negative impacts on user accounts.
  • Broader functionalities like brand following and gift card operations should be standard test cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks could be adapted for other website domains like banking or travel to test agent safety more widely.
  • Future agents might incorporate explicit safety checks before executing actions on live sites.
  • The findings suggest that scaling up training data with safety examples could reduce risks in deployed agents.

Load-bearing premise

The data generation pipeline that leverages webpage content and interactive elements produces queries that are representative of real user tasks and potential safety risks on actual e-commerce platforms.

What would settle it

If a large-scale study of real Amazon user logs shows that the generated queries in Amazon-Bench do not match the distribution of actual user behaviors or risk scenarios, the benchmark's validity would be undermined.
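
One way such a comparison could be made concrete is to compare task-category frequencies in the benchmark with those observed in real logs. The sketch below computes the total variation distance between two category histograms (0 means identical distributions, 1 means no overlap). The category names and counts are invented placeholders, not data from the paper or from Amazon, and the choice of total variation distance is an assumption rather than anything the paper proposes.

```python
def normalize(counts):
    """Turn raw category counts into a probability distribution."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an empty histogram")
    return {category: n / total for category, n in counts.items()}


def total_variation(counts_a, counts_b):
    """Total variation distance between two category histograms."""
    p, q = normalize(counts_a), normalize(counts_b)
    categories = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Placeholder counts for illustration only.
benchmark_counts = {"search": 2, "address": 2}
log_counts = {"search": 3, "address": 1}

distance = total_variation(benchmark_counts, log_counts)
print(distance)
```

A large distance on real data would indicate that the generated queries over- or under-represent some task types relative to actual usage.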

Figures

Figures reproduced from arXiv: 2508.15832 by Di Wang, Mat Hans, Qiuhai Zeng, Shreyas Prasad, Suhang Wang, Wenbo Yan, Xianren Zhang.

Figure 1. On the left: Amazon-Bench provides user queries related to diverse tasks such as adding …
Figure 2. Amazon-Bench: functionality-grounded user query generation pipeline. The pipeline can …
Figure 3. The distribution of categories of webpages. Over half of them are product pages.
Breadth-First Exploration. Amazon hosts millions of webpages, each serving different purposes. To discover diverse functionalities on Amazon, we first explore webpages with a breadth-first search. The detailed algorithm of exploration is described in Algorithm 1 in Appendix 7.2.
Webpage Categorization. To organize the crawle…
Figure 4. Efficiency of different agents. From left to right: (1) The efficiency score of different agents …
Figure 5. Distribution of queries by tasks.
Figure 6. The diversity score of each webpage category. We can observe that account related pages …
Figure 7. Case Study 1: Consecutive screenshots showing the agent entering a loop without making …
Figure 8. Case Study 2: A harmful failure where the agent adds two coach bags to cart but the …
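
The breadth-first exploration mentioned in the text captured with Figure 3 can be sketched as follows. This is a generic breadth-first traversal over an in-memory link graph, not the paper's Algorithm 1 (which is not reproduced on this page); the function name `bfs_explore`, the `get_links` callback, the `max_pages` cap, and the toy site are all assumptions for illustration.

```python
from collections import deque


def bfs_explore(start_url, get_links, max_pages=100):
    """Discover pages level by level; returns URLs in discovery order."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        for link in get_links(url):
            # Skip pages already discovered and stop once the cap is reached.
            if link not in seen and len(visited) < max_pages:
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited


# Toy link graph standing in for a real site (hypothetical paths).
TOY_SITE = {
    "/home": ["/account", "/product/1"],
    "/account": ["/account/addresses", "/home"],
    "/product/1": ["/wishlist"],
    "/account/addresses": [],
    "/wishlist": [],
}

order = bfs_explore("/home", lambda url: TOY_SITE.get(url, []))
print(order)
```

In a real crawl, `get_links` would fetch and parse a page; it is kept as a callback here so the traversal logic can be read and tested without network access.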
Abstract

Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Amazon-Bench, a new benchmark for web agents operating on e-commerce platforms such as Amazon. It identifies two gaps in prior benchmarks: narrow focus on product-search tasks and omission of safety risks arising from unintended account changes. To address these, the authors describe a data-generation pipeline that extracts webpage content and interactive elements (buttons, checkboxes) to produce functionality-grounded queries spanning address management, wish-list operations, brand following, and similar tasks. They further introduce an automated evaluation framework that jointly measures task completion and safety violations. Systematic experiments on several agents lead to the conclusion that current agents struggle with complex queries and can trigger safety issues such as erroneous purchases or deletion of saved addresses.

Significance. If the generated tasks prove representative and the safety metrics are shown to be reliable, the benchmark would supply a useful instrument for assessing web agents beyond simple search. The explicit inclusion of safety evaluation alongside performance is a constructive direction, as it directly targets risks that matter for real deployment. The work also supplies concrete examples of failure modes (wrong purchases, address deletion, misconfigured auto-reload) that future agent designs could target.

major comments (2)
  1. [Data Generation Pipeline] The central claim that agents pose safety risks (wrong purchases, deleted addresses, misconfigured auto-reload) rests on the assumption that queries produced by extracting static webpage content and interactive elements correspond to plausible unintended state changes on live Amazon. No external anchor (user logs, expert review, or comparison against real sessions) is described to test this assumption. This is load-bearing for the headline safety finding.
  2. [Automated Evaluation Framework] The abstract states that the framework assesses both performance and safety, yet supplies no concrete definitions of the safety metrics, error-handling rules, or how unintended state changes are detected and scored. Without these operational details it is impossible to determine whether the reported safety risks are supported by reproducible measurements.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the number of queries generated, the number of agents evaluated, and the primary quantitative results (e.g., success rates or safety-violation counts) to give readers an immediate sense of scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

  1. Referee: [Data Generation Pipeline] The central claim that agents pose safety risks (wrong purchases, deleted addresses, misconfigured auto-reload) rests on the assumption that queries produced by extracting static webpage content and interactive elements correspond to plausible unintended state changes on live Amazon. No external anchor (user logs, expert review, or comparison against real sessions) is described to test this assumption. This is load-bearing for the headline safety finding.

    Authors: We thank the referee for this observation. The data-generation pipeline extracts interactive elements (buttons, forms, checkboxes) directly from live Amazon pages, so the resulting queries are grounded in actual site functionalities that can produce state changes such as purchases or address deletions. We did not, however, validate the generated queries against user logs, expert review, or recorded sessions. In the revised manuscript we will add a subsection that (a) explains the design rationale for relying on webpage elements and (b) explicitly notes the absence of external validation as a limitation, together with a brief discussion of how future work could address it. This addition clarifies the claim without changing the pipeline itself. revision: partial

  2. Referee: [Automated Evaluation Framework] The abstract states that the framework assesses both performance and safety, yet supplies no concrete definitions of the safety metrics, error-handling rules, or how unintended state changes are detected and scored. Without these operational details it is impossible to determine whether the reported safety risks are supported by reproducible measurements.

    Authors: We agree that the current description of the automated evaluation framework is insufficiently detailed for reproducibility. The manuscript introduces the framework at a conceptual level but does not supply the precise definitions, detection rules, or scoring procedures. In the revised version we will expand the relevant section to include (i) explicit safety-metric definitions, (ii) the state-comparison method used to detect unintended changes, (iii) error-handling conventions, and (iv) the joint performance-safety scoring scheme. These additions will make the reported safety findings fully traceable. revision: yes
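
The response above names a state-comparison method without specifying it. A minimal sketch of one such check is below, assuming account state can be snapshotted as a flat dictionary before and after the agent runs. The snapshot fields, the function names `diff_state` and `evaluate_episode`, and the rule that any change outside the expected set counts as a safety violation are assumptions for illustration, not the paper's definitions.

```python
def diff_state(before, after):
    """Return {field: (old, new)} for every field whose value changed.

    A missing field is treated as None, so this sketch cannot tell
    "absent" apart from "present with value None".
    """
    changes = {}
    for field in before.keys() | after.keys():
        old, new = before.get(field), after.get(field)
        if old != new:
            changes[field] = (old, new)
    return changes


def evaluate_episode(before, after, expected):
    """Score one agent run.

    expected maps each field the query should change to its target value.
    Any other change, or a change to the wrong value, is flagged as unintended.
    """
    changes = diff_state(before, after)
    completed = all(after.get(field) == value for field, value in expected.items())
    unintended = {
        field: change
        for field, change in changes.items()
        if field not in expected or change[1] != expected[field]
    }
    return {"completed": completed, "unintended": unintended, "safe": not unintended}


# Invented example: the query asks only to make "Office" the default address,
# but the agent also deletes the "Home" address along the way.
before = {"addresses": ("Home", "Office"), "default_address": "Home", "cart": ()}
after = {"addresses": ("Office",), "default_address": "Office", "cart": ()}
expected = {"default_address": "Office"}

result = evaluate_episode(before, after, expected)
print(result)
```

Reporting completion and safety as separate fields keeps a run that finishes the task while damaging the account distinguishable from a clean success.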

Circularity Check

0 steps flagged

No circularity: benchmark construction and external agent evaluation are independent

The paper constructs Amazon-Bench via a data-generation pipeline that extracts from webpage content and interactive elements to produce functionality-grounded queries, then applies an automated evaluation framework to measure performance and safety of external web agents on those queries. No equations, fitted parameters, or predictions are defined in terms of the evaluation outputs; no self-citations are invoked as load-bearing uniqueness theorems or ansatzes; and the reported findings (agents struggle with complex queries and exhibit safety risks) are direct observations from running agents on the newly generated tasks rather than reductions to the pipeline inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central contribution rests on the assumption that webpage-derived queries capture real user functionality and risks, plus standard assumptions that automated evaluation can reliably detect safety issues without human judgment.

axioms (2)
  • domain assumption: Webpage content and interactive elements (buttons, checkboxes) can be used to generate representative user queries for e-commerce tasks.
    Invoked in the description of the data generation pipeline.
  • domain assumption: Safety risks such as unintended account changes can be automatically detected and quantified by an evaluation framework.
    Invoked in the proposal of the automated evaluation framework.
invented entities (1)
  • Amazon-Bench (no independent evidence)
    Purpose: new benchmark dataset and evaluation framework for web agents.
    Newly proposed construct whose validity depends on the data generation and evaluation methods.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.
