Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
Pith reviewed 2026-05-15 19:17 UTC · model grok-4.3
The pith
Avenir-UX simulates user behavior on real web pages using GUI grounding to generate standardized usability scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Avenir-UX is a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability scores. By grounding actions and observations in the GUI rather than relying on DOM parsing, it interacts with real web pages end-to-end while preserving a coherent trace of the user journey. Integrated with simulated behavior profiles and an evaluation protocol using SUS, SEQ, and Think Aloud, it generates comprehensive UX reports.
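The SUS component of the protocol has a fixed, well-documented scoring rule, so the agent's questionnaire output can be reduced to a 0–100 score deterministically. A minimal sketch of standard SUS scoring (Brooke, 1996); the function name is ours, not taken from the paper:

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 Likert responses.

    Standard SUS scoring: odd-numbered items contribute (response - 1),
    even-numbered items contribute (5 - response); the sum of contributions
    is scaled by 2.5 to yield a score in [0, 100].
    """
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS requires ten responses on a 1-5 scale")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # 0-based index: even = items 1,3,5,7,9
        for i, r in enumerate(responses)
    )
    return total * 2.5

print(sus_score([3] * 10))  # a fully neutral response set scores 50.0
```

A score of 68 is commonly cited as the empirical SUS average, which is why reports typically frame results relative to that benchmark.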
What carries the argument
GUI grounding of actions and observations for end-to-end web interaction simulation
If this is right
- Produces standardized usability scores without human participants
- Generates reports combining SUS, SEQ, and Think Aloud methods
- Enables interaction with arbitrary real-world web pages through coherent journey traces
- Supports agile workflows by reducing reliance on time-consuming studies
Where Pith is reading between the lines
- Developers could integrate this into continuous integration pipelines for ongoing UX monitoring
- Extensions to non-web interfaces like mobile apps might follow from the grounding approach
- Accuracy on highly dynamic or JavaScript-heavy sites remains a key area for validation
Load-bearing premise
Multimodal grounding reliably produces coherent user journeys and accurate usability scores on arbitrary real-world web pages without typical automated agent errors.
What would settle it
Running the agent on a set of complex real websites and comparing the generated usability scores against those from actual human user studies or expert reviews: close agreement would support the core claim, while systematic divergence would undermine it.
Figures
Original abstract
Evaluating web usability typically requires time-consuming user studies and expert reviews, which often limits iteration speed during product development, especially for small teams and agile workflows. We present Avenir-UX, a user-experience evaluation agent that simulates user behavior on websites and produces standardized usability scores. Unlike traditional tools that rely on DOM parsing, Avenir-UX grounds actions and observations in the GUI, enabling it to interact with real web pages end-to-end while maintaining a coherent trace of the user journey. Building on Avenir-Web, our system pairs this robust interaction with simulated user behavior profiles and a structured evaluation protocol that integrates the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud. Subsequently, a comprehensive User Experience (UX) report is generated. We discuss the architecture of Avenir-UX and illustrate how its multimodal grounding improves robustness for web-based interaction and UX evaluation scenarios, paving the way for a new era of continuous, scalable, and data-driven usability testing that empowers every developer to build web interfaces that are usable. Code is available at: https://github.com/Onflow-AI/Avenir-UX
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Avenir-UX, an automated UX evaluation agent that simulates user behavior on websites via multimodal GUI grounding for actions and observations. This enables end-to-end interaction with real web pages while maintaining coherent user journey traces, in contrast to traditional DOM parsing tools. The system incorporates simulated user profiles and a structured protocol integrating the System Usability Scale (SUS), step-wise Single Ease Questions (SEQ), and concurrent Think Aloud to generate comprehensive UX reports.
Significance. If the multimodal grounding reliably yields coherent journeys and scores that correlate with human judgments, Avenir-UX could enable scalable, continuous usability testing accessible to small teams, reducing dependence on resource-intensive user studies and supporting faster, data-driven web interface iteration.
major comments (1)
- [Abstract] The manuscript describes the Avenir-UX architecture, simulated profiles, and evaluation protocol (SUS + SEQ + Think Aloud) but supplies no empirical results, baseline comparisons, error rates, grounding failure metrics, trace coherence measures, or validation against human judgments. This leaves the central claim of reliable standardized usability scoring on arbitrary real-world pages entirely untested.
minor comments (1)
- [Abstract] The closing claim that the work paves 'the way for a new era' is promotional; a more measured statement of potential impact would be appropriate.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and the recommendation for major revision. We agree that the absence of empirical validation leaves key claims untested and will strengthen the manuscript by adding a dedicated evaluation section with results, metrics, and human comparisons.
Point-by-point responses
-
Referee: [Abstract] The manuscript describes the Avenir-UX architecture, simulated profiles, and evaluation protocol (SUS + SEQ + Think Aloud) but supplies no empirical results, baseline comparisons, error rates, grounding failure metrics, trace coherence measures, or validation against human judgments. This leaves the central claim of reliable standardized usability scoring on arbitrary real-world pages entirely untested.
Authors: We acknowledge this limitation. The submitted manuscript is primarily a system description focused on architecture, multimodal GUI grounding, simulated profiles, and the integrated SUS/SEQ/Think Aloud protocol. In the revision we will add a new Experiments section that reports: (1) grounding success/failure rates across real-world sites, (2) trace coherence metrics (e.g., journey completion and deviation from expected paths), (3) baseline comparisons against both human evaluators and existing DOM-based tools, and (4) correlation analysis between Avenir-UX-generated SUS/SEQ scores and those obtained from human participants. These additions will directly substantiate the reliability claims. revision: yes
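Point (4) of the promised revision, correlating agent-generated SUS/SEQ scores with human-participant scores, amounts to a plain correlation analysis. A self-contained sketch using a sample Pearson coefficient; the score values are illustrative placeholders, not results reported by the paper:

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical per-site scores: Avenir-UX vs. a human study on the same sites.
agent_sus = [72.5, 65.0, 80.0, 55.0, 90.0]
human_sus = [70.0, 60.0, 85.0, 50.0, 88.0]
print(round(pearson_r(agent_sus, human_sus), 3))  # → 0.978
```

Reporting a correlation alone can hide systematic offsets, so a revision would ideally also report mean absolute error between agent and human scores.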
Circularity Check
No circularity: pure system description with no derivations or fitted quantities
Full rationale
The paper is a descriptive account of the Avenir-UX architecture, its GUI grounding mechanism, simulated user profiles, and SUS/SEQ/Think-Aloud protocol. It references prior work via the phrase 'Building on Avenir-Web' but supplies no equations, no parameter fitting, no quantitative predictions, and no uniqueness theorems or ansatzes that could reduce to self-definition. All content is architectural and conceptual; the central claim (robust end-to-end interaction via multimodal grounding) is presented as a design choice rather than a derived result. No step in the manuscript exhibits any of the six enumerated circularity patterns. The paper is therefore self-contained as a system description.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
Reference graph
Works this paper leans on
- [1] Gati Aher, Rosa I. Arriaga, and Adam Tauman Kalai. 2023. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. arXiv:2208.10264 [cs.CL] https://arxiv.org/abs/2208.10264
- [2] Aaron Bangor, Philip T. Kortum, and James T. Miller. 2008. An Empirical Evaluation of the System Usability Scale. International Journal of Human-Computer Interaction 24, 6 (2008), 574–594. doi:10.1080/10447310802205776
- [3] John Brooke. 1996. SUS: A Quick and Dirty Usability Scale. In Usability Evaluation in Industry, P. W. Jordan, B. Thomas, B. A. Weerdmeester, and I. L. McClelland (Eds.). Academic Press, London, UK, 189–194
- [4] John Brooke. 2013. SUS: A Retrospective. Journal of Usability Studies 8, 2 (2013), 29–40
- [5] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. arXiv:2306.06070 [cs.CL] https://arxiv.org/abs/2306.06070
- [6]
- [7] Nien-Lin Hsueh, Hsuen-Jen Lin, and Lien-Chi Lai. 2024. Applying Large Language Model to User Experience Testing. Electronics 13 (11 2024), 4633. doi:10.3390/electronics13234633
- [8]
- [9]
- [10]
- [11] Reuben A. Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, Subhojyoti Mukherjee, Puneet Mathur, Ruiyi Zhang, Jihyung Kil, Nedim Lipka, Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, and Branislav Kveton. 2025. MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces. arXiv:25...
- [12] Jakob Nielsen. 1993. Usability Engineering. Academic Press, Boston, MA
- [13] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC] https://arxiv.org/abs/2304.03442
- [14] Jeffrey Rubin and Dana Chisnell. 2008. Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests (2nd ed.). Wiley Publishing, Inc., Indianapolis, IN
- [15] Jeff Sauro. 2012. 10 Things To Know About The Single Ease Question (SEQ). MeasuringU. https://measuringu.com/seq10/ Accessed: February 10, 2026
- [16] Jeff Sauro. 2018. How Much Does a Usability Test Cost? MeasuringU. https://measuringu.com/usability-cost/ Accessed: February 10, 2026
- [17] Jeff Sauro and Joseph S. Dumas. 2009. Comparison of Three One-Question, Post-Task Usability Questionnaires. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’09). ACM, New York, NY, USA, 1599–1608. doi:10.1145/1518701.1518946
- [18] Jeff Sauro and James R. Lewis. 2016. Quantifying the User Experience: Practical Statistics for User Research (2nd ed.). Morgan Kaufmann, Cambridge, MA, USA
- [19] Dakuo Wang, Ting-Yao Hsu, Yuxuan Lu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headean, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, and Jessie Wang. 2025. AgentA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents. arXiv:2504.09723 [cs.HC] https://arxiv.org/abs/2504.09723
- [20] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854 [cs.AI] https://arxiv.org/abs/2307.13854
- [21] Solving Layout Ambiguity via MoGE. The Discogs homepage is dense with marketplace listings and ads. A traditional DOM-based agent might struggle to distinguish between a “Help” link and a commercial product listing. Avenir-UX’s Visual Perception module (using MoGE) allows the agent to “see” the page layout as a human does, discarding styles and layout ambi...
- [22] Strategic Navigation via EIP. The task required the agent to ignore the prominent search bar—which is usually the primary interaction point—and instead seek documentation. This behavior is powered by Experience-Imitation Planning (EIP), which allows the agent to emulate the strategy of an informed user who knows that guidelines are typically “infrastructur...
- [23] State Consistency via Adaptive Memory. Transitioning from the main www.discogs.com domain to support.discogs.com often resets the DOM context. The Adaptive Memory module ensures the agent retains the original goal (“find submission guidelines”) across this boundary, preventing the navigational drift often seen in less capable agents. By completing this tas...
discussion (0)