pith. machine review for the scientific record.

arxiv: 2604.09581 · v2 · submitted 2026-02-25 · 💻 cs.AI · cs.CY · cs.HC

Recognition: no theorem link

Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding

Authors on Pith no claims yet

Pith reviewed 2026-05-15 19:17 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.HC
keywords automated UX evaluation · GUI grounding · web simulation · usability scores · user experience agent · SUS evaluation

The pith

Avenir-UX simulates user behavior on real web pages using GUI grounding to generate standardized usability scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Avenir-UX, an automated agent for evaluating web usability by simulating how humans interact with websites. Traditional methods depend on parsing the document object model or require costly human studies, but this system grounds its actions directly in the graphical user interface to maintain a consistent journey trace across real pages. It combines this interaction capability with predefined user profiles and a protocol that applies the System Usability Scale along with step-by-step ease questions and think-aloud commentary. The result is a full UX report produced without manual testing. This setup aims to speed up iteration for developers by making usability assessment continuous and scalable.
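
As a rough illustration of the kind of loop such a system implies, here is a minimal sketch in Python. Every name in it (UserProfile, StepRecord, the agent's run_steps and answer_sus methods) is hypothetical and invented for this sketch; the authors' actual interfaces live in their repository and may look nothing like this.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class UserProfile:
    """Hypothetical simulated-user persona (not the paper's actual schema)."""
    name: str
    tech_savviness: str                      # e.g. "novice", "expert"
    goals: List[str] = field(default_factory=list)

@dataclass
class StepRecord:
    """One step of the simulated journey: the action taken, its SEQ rating,
    and the think-aloud commentary produced for that step."""
    action: str
    seq_rating: int                          # Single Ease Question, 1 (hard) .. 7 (easy)
    think_aloud: str

def evaluate_site(url: str, task: str, profile: UserProfile, agent) -> dict:
    """Drive a GUI-grounded agent through one task and assemble a UX report.

    `agent` is assumed to expose run_steps(url, task, profile) yielding
    StepRecord objects and answer_sus(profile) returning ten 1-5 Likert
    ratings; both are stand-ins for whatever interface the real system exposes.
    """
    journey = list(agent.run_steps(url, task, profile))
    sus_responses = agent.answer_sus(profile)
    return {
        "url": url,
        "task": task,
        "profile": profile.name,
        "journey": journey,                                   # coherent trace of the user journey
        "sus_responses": sus_responses,                       # fed to a SUS scorer
        "mean_seq": sum(s.seq_rating for s in journey) / max(len(journey), 1),
        "think_aloud": [s.think_aloud for s in journey],
    }
```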

Core claim

Avenir-UX is a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability scores. By grounding actions and observations in the GUI rather than relying on DOM parsing, it interacts with real web pages end-to-end while preserving a coherent trace of the user journey. Integrated with simulated behavior profiles and an evaluation protocol using SUS, SEQ, and Think Aloud, it generates comprehensive UX reports.
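
The SUS component, at least, has a fixed published scoring rule (Brooke 1996): odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the raw sum is multiplied by 2.5 to give a 0–100 score. A minimal sketch of that scoring step, independent of the paper's implementation:

```python
def sus_score(responses: list[int]) -> float:
    """Standard System Usability Scale scoring (Brooke, 1996).

    `responses` are the ten Likert ratings (1-5), in questionnaire order.
    Odd-numbered items are positively worded and contribute (r - 1);
    even-numbered items are negatively worded and contribute (5 - r).
    The 0-40 raw sum is scaled by 2.5 to the familiar 0-100 range.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten ratings between 1 and 5")
    raw = sum((r - 1) if i % 2 == 1 else (5 - r)
              for i, r in enumerate(responses, start=1))
    return raw * 2.5

# Example: a fairly positive response set scores 85.0
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))
```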

What carries the argument

GUI grounding of actions and observations for end-to-end web interaction simulation
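
To make the contrast with DOM parsing concrete, the sketch below shows what GUI-grounded interaction could look like: the agent observes raw pixels and acts through screen coordinates, never querying the DOM. The Playwright calls are real API; ground_instruction is a hypothetical stand-in for whatever multimodal grounding model the system actually uses.

```python
from playwright.sync_api import sync_playwright

def ground_instruction(screenshot_png: bytes, instruction: str) -> tuple[int, int]:
    """Hypothetical stand-in for a multimodal grounding model: given a raw
    screenshot and a natural-language instruction, return (x, y) to click."""
    raise NotImplementedError("plug a vision-language grounding model in here")

def grounded_click(url: str, instruction: str) -> bytes:
    """Observe pixels, not the DOM: screenshot -> grounded coordinates -> click."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        shot = page.screenshot()                      # observation is the rendered page only
        x, y = ground_instruction(shot, instruction)  # grounding, not a DOM selector query
        page.mouse.click(x, y)                        # action expressed in screen coordinates
        after = page.screenshot()                     # next observation for the journey trace
        browser.close()
        return after
```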

If this is right

  • Produces standardized usability scores without human participants
  • Generates reports combining SUS, SEQ, and Think Aloud methods
  • Enables interaction with arbitrary real-world web pages through coherent journey traces
  • Supports agile workflows by reducing reliance on time-consuming studies

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Developers could integrate this into continuous integration pipelines for ongoing UX monitoring
  • Extensions to non-web interfaces like mobile apps might follow from the grounding approach
  • Accuracy on highly dynamic or JavaScript-heavy sites remains a key area for validation

Load-bearing premise

Multimodal grounding reliably produces coherent user journeys and accurate usability scores on arbitrary real-world web pages without typical automated agent errors.

What would settle it

Running the agent on a set of complex real websites and comparing the generated usability scores against those from actual human user studies or expert reviews: close agreement would support the core claim, while systematic mismatch would undercut it.

Figures

Figures reproduced from arXiv: 2604.09581 by Aiden Yiliu Li, Karim Obegi, Shashank Durgad, Wee Joe Tan, Zi Rui Lucas Lim.

Figure 1: From deployment to insights: Avenir-UX’s web …
Figure 2: Avenir-UX system architecture built on the Avenir-Web framework.
Figure 3: Recreation.gov task execution workflow showing …
read the original abstract

Evaluating web usability typically requires time-consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present Avenir-UX, a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability. Unlike traditional tools that rely on DOM parsing, Avenir-UX grounds actions and observations, enabling it to interact with real web pages end-to-end while maintaining a coherent trace of the user journey. Building on Avenir-Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report will be generated. We discuss the architecture of Avenir-UX and illustrate how its multimodal grounding improves robustness for web-based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data-driven usability testing that empowers every developer to build web interfaces that are usable. Code is available at: https://github.com/Onflow-AI/Avenir-UX

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Avenir-UX, an automated UX evaluation agent that simulates user behavior on websites via multimodal GUI grounding for actions and observations. This enables end-to-end interaction with real web pages while maintaining coherent user journey traces, in contrast to traditional DOM parsing tools. The system incorporates simulated user profiles and a structured protocol integrating the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud to generate comprehensive UX reports.

Significance. If the multimodal grounding reliably yields coherent journeys and scores that correlate with human judgments, Avenir-UX could enable scalable, continuous usability testing accessible to small teams, reducing dependence on resource-intensive user studies and supporting faster, data-driven web interface iteration.

major comments (1)
  1. [Abstract] The manuscript describes the Avenir-UX architecture, simulated profiles, and evaluation protocol (SUS + SEQ + Think Aloud) but supplies no empirical results, baseline comparisons, error rates, grounding failure metrics, trace coherence measures, or validation against human judgments. This leaves the central claim of reliable standardized usability scoring on arbitrary real-world pages entirely untested.
minor comments (1)
  1. [Abstract] The closing claim that the work paves 'the way for a new era' is promotional; a more measured statement of potential impact would be appropriate.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for major revision. We agree that the absence of empirical validation leaves key claims untested and will strengthen the manuscript by adding a dedicated evaluation section with results, metrics, and human comparisons.

read point-by-point responses
  1. Referee: [Abstract] The manuscript describes the Avenir-UX architecture, simulated profiles, and evaluation protocol (SUS + SEQ + Think Aloud) but supplies no empirical results, baseline comparisons, error rates, grounding failure metrics, trace coherence measures, or validation against human judgments. This leaves the central claim of reliable standardized usability scoring on arbitrary real-world pages entirely untested.

    Authors: We acknowledge this limitation. The submitted manuscript is primarily a system description focused on architecture, multimodal GUI grounding, simulated profiles, and the integrated SUS/SEQ/Think Aloud protocol. In the revision we will add a new Experiments section that reports: (1) grounding success/failure rates across real-world sites, (2) trace coherence metrics (e.g., journey completion and deviation from expected paths), (3) baseline comparisons against both human evaluators and existing DOM-based tools, and (4) correlation analysis between Avenir-UX-generated SUS/SEQ scores and those obtained from human participants. These additions will directly substantiate the reliability claims. (Revision: yes.)
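
If that Experiments section materializes, item (4) is easy to pin down in advance. A minimal sketch of the comparison, assuming paired per-site SUS scores from the agent and from human participants (the scores below are placeholders, not data from the paper):

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder paired scores: one agent-generated and one human-study SUS value
# per evaluated site. Real numbers would come from the proposed experiments.
agent_sus = [72.5, 85.0, 60.0, 90.0, 55.0]
human_sus = [70.0, 80.0, 65.0, 88.0, 58.0]

r, r_p = pearsonr(agent_sus, human_sus)        # linear agreement
rho, rho_p = spearmanr(agent_sus, human_sus)   # agreement in rank ordering

print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
```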

Circularity Check

0 steps flagged

No circularity: pure system description with no derivations or fitted quantities

full rationale

The paper is a descriptive account of the Avenir-UX architecture, its GUI grounding mechanism, simulated user profiles, and SUS/SEQ/Think-Aloud protocol. It references prior work via the phrase 'Building on Avenir-Web' but supplies no equations, no parameter fitting, no quantitative predictions, and no uniqueness theorems or ansatzes that could reduce to self-definition. All content is architectural and conceptual; the central claim (robust end-to-end interaction via multimodal grounding) is presented as a design choice rather than a derived result. No step in the manuscript exhibits any of the six enumerated circularity patterns. The paper is therefore self-contained as a system description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a system proposal rather than a mathematical model. No free parameters, axioms, or invented physical entities are described in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1055 out tokens · 54051 ms · 2026-05-15T19:17:02.703567+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Gati Aher, Rosa I. Arriaga, and Adam Tauman Kalai. 2023. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. arXiv:2208.10264 [cs.CL] https://arxiv.org/abs/2208.10264

  2. [2]

    Aaron Bangor, Philip T. Kortum, and James T. Miller. 2008. An Empirical Evaluation of the System Usability Scale. International Journal of Human-Computer Interaction 24, 6 (2008), 574–594. doi:10.1080/10447310802205776

  3. [3]

    John Brooke. 1996. SUS: A Quick and Dirty Usability Scale. In Usability Evaluation in Industry, P. W. Jordan, B. Thomas, B. A. Weerdmeester, and I. L. McClelland (Eds.). Academic Press, London, UK, 189–194

  4. [4]

    John Brooke. 2013. SUS: A Retrospective. Journal of Usability Studies 8, 2 (2013), 29–40

  5. [5]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070 [cs.CL] https://arxiv.org/abs/2306.06070

  6. [6]

    Chen Gao, Xiaochong Lan, Nian Li, Yuan Yuan, Jingtao Ding, Zhilun Zhou, Fengli Xu, and Yong Li. 2023. Large Language Models Empowered Agent-based Modeling and Simulation: A Survey and Perspectives. arXiv:2312.11970 [cs.AI] https://arxiv.org/abs/2312.11970

  7. [7]

    Nien-Lin Hsueh, Hsuen-Jen Lin, and Lien-Chi Lai. 2024. Applying Large Language Model to User Experience Testing. Electronics 13 (11 2024), 4633. doi:10.3390/electronics13234633

  8. [8]

    James R. Lewis. 2018. The System Usability Scale: Past, Present, and Future. International Journal of Human-Computer Interaction 34, 7 (2018), 577–590. doi:10.1080/10447318.2018.1455307

  9. [9]

    Aiden Yiliu Li, Xinyue Hao, Shilong Liu, and Mengdi Wang. 2026. Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts. arXiv:2602.02468 [cs.AI] https://arxiv.org/abs/2602.02468

  10. [10]

    Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. 2025. UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents. arXiv:2504.09407 [cs.CL] https://arxiv.org/abs/2504.09407

  11. [11]

    Reuben A. Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, Subhojyoti Mukherjee, Puneet Mathur, Ruiyi Zhang, Jihyung Kil, Nedim Lipka, Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, and Branislav Kveton. 2025. MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces. arXiv:25...

  12. [12]

    Usability Engineering

    Jakob Nielsen. 1993. Usability Engineering. Academic Press, Boston, MA

  13. [13]

    Generative Agents: Interactive Simulacra of Human Behavior

    Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC] https://arxiv.org/abs/2304.03442

  14. [14]

    Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests (2nd ed.)

    Jeffrey Rubin and Dana Chisnell. 2008. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests (2nd ed.). Wiley Publishing, Inc., Indianapolis, IN

  15. [15]

    Jeff Sauro. 2012. 10 Things To Know About The Single Ease Question (SEQ). MeasuringU. https://measuringu.com/seq10/ Accessed: February 10, 2026

  16. [16]

    How Much Does a Usability Test Cost?

    Jeff Sauro. 2018. How Much Does a Usability Test Cost? MeasuringU. https://measuringu.com/usability-cost/ Accessed: 2026-02-10

  17. [17]

    Jeff Sauro and Joseph S. Dumas. 2009. Comparison of Three One-Question, Post-Task Usability Questionnaires. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’09). ACM, New York, NY, USA, 1599–1608. doi:10.1145/1518701.1518946

  18. [18]

    Jeff Sauro and James R. Lewis. 2016. Quantifying the User Experience: Practical Statistics for User Research (2nd ed.). Morgan Kaufmann, Cambridge, MA, USA

  19. [19]

    Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, and Jessie Wang. 2025. AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents. arXiv:2504.09723 [cs.HC] https://arxiv.org/abs/2504.09723

  20. [20]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI] https://arxiv.org/abs/2307.13854

  21. [21]

    Solving Layout Ambiguity via MoGE. The Discogs homepage is dense with marketplace listings and ads. A traditional DOM-based agent might struggle to distinguish between a “Help” link and a commercial product listing. Avenir-UX’s Visual Perception module (using MoGE) allows the agent to “see” the page layout as a human does, discarding styles and layout ambi...

  22. [22]

    Strategic Navigation via EIP. The task required the agent to ignore the prominent search bar, which is usually the primary interaction point, and instead seek documentation. This behavior is powered by Experience-Imitation Planning (EIP), which allows the agent to emulate the strategy of an informed user who knows that guidelines are typically “infrastructur...

  23. [23]

    State Consistency via Adaptive Memory. Transitioning from the main www.discogs.com domain to support.discogs.com often resets the DOM context. The Adaptive Memory module ensures the agent retains the original goal (“find submission guidelines”) across this boundary, preventing the navigational drift often seen in less capable agents. By completing this tas...