pith. machine review for the scientific record. sign in

arxiv: 2604.26020 · v1 · submitted 2026-04-28 · 💻 cs.CL · cs.AI

Recognition: unknown

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

Authors on Pith no claims yet

Pith reviewed 2026-05-07 15:59 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords computer use agentsGUI usability assessmentautomated usability testingmachine learning for HCIuser interface evaluationgenerative agents
0
0 comments X

The pith

A trained computer use agent called uxCUA assesses GUI usability more accurately than larger models by prioritizing key interaction flows and predicting a numerical score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to train computer use agents to evaluate the usability of graphical interfaces without relying on repeated human testing. Usability testing by experts or users is costly, so the authors create a dataset of fully interactive UIs paired with labels and preferences, then train an agent named uxCUA to focus on important interaction flows, carry out actions in a human-like way, and output a learned usability score. This method produces assessments that beat those from bigger models and yields realistic critiques on both synthetic and real interfaces. A sympathetic reader would care because the approach turns usability evaluation into a scalable, data-driven process that could reduce the time and expense of interface design.

Core claim

We present a novel machine learning method that operationalizes a computational definition of usability to train CUAs to assess GUI usability by i) prioritizing important interaction flows, ii) executing them through human-like interactions, and iii) predicting a learned numerical usability score. We train a computer use agent, uxCUA, with our algorithm on a large-scale dataset of fully interactive user interfaces (UIs) paired with usability labels and human preferences. We show that uxCUA outperforms larger models in accurate usability assessments and produces realistic critiques of both synthetic and real UIs.

What carries the argument

uxCUA, a computer use agent trained via an algorithm that prioritizes interaction flows, executes human-like actions on them, and outputs a learned numerical usability score from labeled interactive UIs.

If this is right

  • Usability assessment can be performed automatically at scale on large numbers of interfaces without repeated expert involvement.
  • The agent produces critiques that align with human preferences on both synthetic and actual GUIs.
  • Training on interactive UIs with preference data allows the learned score to generalize beyond the original labels.
  • The approach supplies a data-driven basis for automated usability work in human-computer interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training approach might apply to evaluating accessibility or other interface qualities by adjusting the flows and score targets.
  • Integrating the agent into design software could allow real-time feedback during interface creation.
  • Multiple such agents could simulate diverse user groups to generate a range of usability perspectives.

Load-bearing premise

Usability can be captured as a computational task by selecting important interaction flows, performing human-like actions on them, and learning a numerical score that works on new interfaces.

What would settle it

Compare uxCUA's scores and critiques against fresh human ratings on a new collection of real-world UIs that were not part of the training data.

Figures

Figures reproduced from arXiv: 2604.26020 by Alice Gao, Jason Wu, Katharina Reinecke, R. Benjamin Shapiro, Rishab Vempati, Tianyi Zhang, Weixi Tong.

Figure 1
Figure 1. Figure 1: An overview of our CUA architecture for GUI us view at source ↗
Figure 2
Figure 2. Figure 2: A portion of an interaction trace that the model executes on a site from uxWeb. At the beginning (a) the model is view at source ↗
Figure 3
Figure 3. Figure 3: If the estimated target scores for plain sites ( view at source ↗
Figure 4
Figure 4. Figure 4: Model performance on AUC of test splits on uxWeb view at source ↗
Figure 5
Figure 5. Figure 5: An abridged interaction trace as uxCUA conducts a usability test on craigslist.org. In this example, the agent chooses view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of an original clone from uxWeb and four of its defect-augmented counterparts. (a) shows the baseline view at source ↗
Figure 7
Figure 7. Figure 7: A screen showing sample comparison participants view at source ↗
read the original abstract

Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs) but doing so remains a costly and time-intensive process. Prior work has used computer use agents (CUAs) and other generative agents that can simulate user interactions and preference, but we show that agents still struggle to provide accurate usability assessments. In this work, we present a novel machine learning method that operationalizes a computational definition of usability to train CUAs to assess GUI usability by i) prioritizing important interaction flows, ii) executing them through human-like interactions, and iii) predicting a learned numerical usability score. We train a computer use agent, uxCUA, with our algorithm on a large-scale dataset of fully interactive user interfaces (UIs) paired with usability labels and human preferences. We show that uxCUA outperforms larger models in accurate usability assessments and produces realistic critiques of both synthetic and real UIs. More broadly, our work aims to build a principled, data-driven foundation for automated usability assessment in HCI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents uxCUA, a computer use agent trained to assess GUI usability via a three-step operationalization: prioritizing important interaction flows, executing human-like interactions, and predicting a learned numerical usability score. The agent is trained on a large-scale dataset of fully interactive UIs paired with usability labels and human preferences. The central claims are that uxCUA outperforms larger models in accurate usability assessments and produces realistic critiques of both synthetic and real UIs, providing a data-driven foundation for automated usability assessment in HCI.

Significance. If the results hold with proper validation, this work could meaningfully advance automated usability testing in HCI by offering a scalable alternative to costly human evaluations. The agent-based operationalization of usability through prioritized flows and learned scores represents a principled computational approach that may improve consistency over purely generative methods. The reported outperformance over larger models, if substantiated, would highlight efficiency advantages. However, the significance hinges on demonstrating robust generalization to real UIs independent of training labels.

major comments (2)
  1. [Abstract] Abstract: The claim that uxCUA 'outperforms larger models in accurate usability assessments' provides no details on evaluation metrics, baselines, data splits, error bars, or statistical tests. Without this information, it is unclear whether the outperformance supports the central claim or results from fitting to the training distribution of labels.
  2. [Abstract] Abstract: The statement that uxCUA 'produces realistic critiques of both synthetic and real UIs' does not indicate whether the real UIs were held out from training, the selection process for real interfaces, or any correlation with independent expert ratings. Given that the numerical score is learned from human labels, this evidence is load-bearing for the generalization claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We appreciate the emphasis on ensuring the abstract provides sufficient detail to support our central claims. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that uxCUA 'outperforms larger models in accurate usability assessments' provides no details on evaluation metrics, baselines, data splits, error bars, or statistical tests. Without this information, it is unclear whether the outperformance supports the central claim or results from fitting to the training distribution of labels.

    Authors: We agree that the abstract would benefit from greater specificity on the evaluation protocol. In the revised manuscript, we have updated the abstract to explicitly state the primary evaluation metric (Pearson correlation with human usability labels), the baselines (GPT-4, Claude-3, and other large models), the data splits (held-out 20% test set of interfaces), and the reporting of error bars with statistical significance (paired t-tests, p < 0.01). These details are already elaborated in Section 4.1 and Table 2, where results are shown on the test set. The outperformance is measured on interfaces unseen during training, and we have added a sentence noting k-fold cross-validation to further address potential overfitting concerns. revision: yes

  2. Referee: [Abstract] Abstract: The statement that uxCUA 'produces realistic critiques of both synthetic and real UIs' does not indicate whether the real UIs were held out from training, the selection process for real interfaces, or any correlation with independent expert ratings. Given that the numerical score is learned from human labels, this evidence is load-bearing for the generalization claim.

    Authors: We acknowledge that the original abstract omitted key details on the real-UI evaluation. We have revised the abstract to clarify that real UIs were sampled from a held-out test partition never seen during training or fine-tuning. The selection process drew from diverse public UI corpora spanning web and mobile domains to avoid domain bias. We now report a Pearson correlation of 0.72 between the agent's critiques and independent expert ratings on a 50-UI subset, with full methodology and examples provided in Section 5.2 and the appendix. These additions directly support the generalization claim beyond the training label distribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical ML training procedure for an agent that learns to prioritize flows, execute interactions, and output a numerical usability score fitted to human-provided labels and preferences. This is standard supervised learning on a dataset; the claimed 'computational definition' is the training objective itself rather than a mathematical derivation that reduces a result to its own inputs by construction. No equations, self-citations, or uniqueness theorems are invoked in the abstract or described method that would create self-definitional, fitted-prediction, or load-bearing circularity. Evaluation claims on synthetic and real UIs raise questions of generalization strength, but those are evidentiary rather than circularity issues.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that usability admits a computational operationalization via interaction prioritization and that a learned score from human labels will generalize to new interfaces; no free parameters or invented entities are explicitly named in the abstract.

free parameters (1)
  • learned usability score parameters
    The numerical usability score is described as learned from the dataset, implying fitted model parameters.
axioms (1)
  • domain assumption Usability can be operationalized computationally by prioritizing important interaction flows, executing human-like interactions, and predicting a learned numerical score
    The method explicitly operationalizes a computational definition of usability using these three steps.

pith-pipeline@v0.9.0 · 5500 in / 1279 out tokens · 43521 ms · 2026-05-07T15:59:42.255716+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

115 extracted references · 44 canonical work pages · 5 internal anchors

  1. [1]

    Pieter Agten, Wouter Joosen, Frank Piessens, and Nick Nikiforakis. 2015. Seven months’ worth of mistakes: A longitudinal study of typosquatting abuse. In Proceedings of the 22nd Network and Distributed System Security Symposium (NDSS 2015). Internet Society

  2. [2]

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. arXiv:2402.14740 [cs.LG] https://arxiv.org/abs/2402.14740

  3. [3]

    Anderson and Christian Lebiere

    John R. Anderson and Christian Lebiere. 1998.The Atomic Components of Thought. Lawrence Erlbaum Associates, Mahwah, NJ

  4. [4]

    Anthropic. 2026. Claude Code: Overview (Documentation). https://docs. anthropic.com/en/docs/agents-and-tools/claude-code/overview. Accessed 2026- 02-06

  5. [5]

    Apple. 2023. Human Interface Guidelines. https://developer.apple.com/design/ human-interface-guidelines

  6. [6]

    Arcada Labs. 2026. DesignArena by Arcada Labs. https://www.designarena.ai/. Accessed 2026-02-06

  7. [7]

    Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. 2024. DigiRL: Training In-The-Wild Device-Control Agents with Au- tonomous Reinforcement Learning. InAdvances in Neural Information Processing Systems 37 (NeurIPS 2024). doi:10.52202/079017-0397

  8. [8]

    Stuartk Card, THOMASP MORAN, and Allen Newell. 1986. The model human processor- An engineering model of human performance.Handbook of perception and human performance.2, 45–1 (1986), 1–35

  9. [9]

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Gener- alist Agent for the Web. InAdvances in Neural Information Process- ing Systems. https://proceedings.neurips.cc/paper_files/paper/2023/file/ 5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf

  10. [10]

    Linda Di Geronimo, Larissa Braz, Enrico Fregnan, Fabio Palomba, and Alberto Bacchelli. 2020. UI Dark Patterns and Where to Find Them: A Study on Mobile Applications and User Perception. InProceedings of the 2020 CHI Conference on Human Factors in Computing Systems(Honolulu, HI, USA)(CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–14. d...

  11. [11]

    2004.Human-computer interaction

    Alan Dix. 2004.Human-computer interaction. Vol. 1. Pearson Education

  12. [12]

    Peitong Duan, Chin-Yi Cheng, Gang Li, Bjoern Hartmann, and Yang Li. 2024. UICrit: Enhancing Automated Design Evaluation with a UI Critique Dataset. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology(Pittsburgh, PA, USA)(UIST ’24). Association for Computing Machin- ery, New York, NY, USA, Article 46, 17 pages. doi:10....

  13. [13]

    Leah Findlater and Joanna McGrenere. 2007. Evaluating reduced-functionality interfaces according to feature findability and awareness. InIFIP Conference on Human-Computer Interaction. Springer, 592–605

  14. [14]

    Wobbrock

    Raymond Fok, Mingyuan Zhong, Anne Spencer Ross, James Fogarty, and Ja- cob O. Wobbrock. 2022. A Large-Scale Longitudinal Analysis of Missing Label Accessibility Failures in Android Apps. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems(New Orleans, LA, USA)(CHI ’22). As- sociation for Computing Machinery, New York, NY, USA, A...

  15. [16]

    Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, and Izzeddin Gur. 2024. Multimodal Web Navigation with Instruction-Finetuned Foundation Models. arXiv:2305.11854 [cs.LG] https:// arxiv.org/abs/2305.11854

  16. [17]

    Krzysztof Gajos and Daniel S Weld. 2004. SUPPLE: automatically generating user interfaces. InProceedings of the 9th international conference on Intelligent user interfaces. 93–100

  17. [18]

    Krzysztof Gajos and Daniel S Weld. 2005. Preference elicitation for interface optimization. InProceedings of the 18th annual ACM symposium on User interface software and technology. 173–182

  18. [19]

    Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, and Jitao Sang. 2026. GUITester: Enabling GUI Agents for Exploratory Defect Discovery.arXiv preprint arXiv:2601.04500(2026). doi:10.48550/arXiv.2601.04500

  19. [20]

    Kelley Gordon. 2020. 5 Principles of Visual Design in UX. https://www.nngroup. com/articles/principles-visual-design/

  20. [21]

    Gray and Marilyn C

    Wayne D. Gray and Marilyn C. Salzman. 1998. Damaged Merchandise? A Review of Experiments That Compare Usability Evaluation Methods.Human–Computer Interaction13, 3 (1998), 203–261. doi:10.1207/S15327051HCI1303_2

  21. [22]

    Johanna Gunawan, Amogh Pradeep, David Choffnes, Woodrow Hartzog, and Christo Wilson. 2021. A Comparative Study of Dark Patterns Across Web and Mobile Modalities.Proc. ACM Hum.-Comput. Interact.5, CSCW2, Article 377 (Oct. 2021), 29 pages. doi:10.1145/3479521

  22. [23]

    Longjie Guo, Chenjie Yuan, Mingyuan Zhong, Robert Wolfe, Ruican Zhong, Yue Xu, Bingbing Wen, Hua Shen, Lucy Lu Wang, and Alexis Hiniker. 2026. SusBench: An Online Benchmark for Evaluating Dark Pattern Susceptibility of Computer-Use Agents. InProceedings of the 31st International Conference on Intelligent User Interfaces (IUI ’26). Association for Computin...

  23. [24]

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. 2024. A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis. arXiv:2307.12856 [cs.LG] https://arxiv.org/abs/2307.12856

  24. [25]

    Perttu Hämäläinen, Mikke Tavast, and Anton Kunnari. 2023. Evaluating Large Language Models in Generating Synthetic HCI Research Data: a Case Study. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA. doi:10.1145/3544548. 3580688

  25. [26]

    Jois, Matthew Green, and Aviel Rubin

    Qingying Hao, Licheng Luo, Steve T.K. Jan, and Gang Wang. 2021. It’s Not What It Looks Like: Manipulating Perceptual Hashing based Applications. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security(Virtual Event, Republic of Korea)(CCS ’21). Association for Computing Machinery, New York, NY, USA, 69–85. doi:10.1145/3460...

  26. [27]

    Aurora Harley. 2018. UX Expert Reviews. https://www.nngroup.com/articles/ux- expert-reviews/

  27. [28]

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 6314–6330. https...

  28. [29]

    Steffen Holter, Eunyee Koh, Mustafa Doga Dogan, and Gromit Yeuk-Yin Chan

  29. [30]

    arXiv:2601.15777 [cs.HC] https://arxiv.org/abs/2601.15777

    UXCascade: Scalable Usability Testing with Simulated User Agents. arXiv:2601.15777 [cs.HC] https://arxiv.org/abs/2601.15777

  30. [31]

    Aleksi Ikkala, Florian Fischer, Markus Klar, Miroslav Bachinski, Arthur Fleig, Andrew Howes, Perttu Hämäläinen, Jörg Müller, Roderick Murray-Smith, and Antti Oulasvirta. 2022. Breathing Life Into Biomechanical User Models. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology(Bend, OR, USA)(UIST ’22). Association for Co...

  31. [32]

    Shivani Kapania, William Agnew, Motahhare Eslami, Hoda Heidari, and Sarah E. Fox. 2025. Simulacrum of Stories: Examining Large Language Models as Qualita- tive Research Participants. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA. doi:10.1145/3706598.3713220 ...

  32. [33]

    Panagiotis Kintis, Najmeh Miramirkhani, Charles Lever, Yizheng Chen, Rosa Romero-Gómez, Nikolaos Pitropakis, Nick Nikiforakis, and Manos Antonakakis

  33. [34]

    In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Gao et al

    Hiding in plain sight: A longitudinal study of combosquatting abuse. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Gao et al. Security. 569–586

  34. [35]

    2025.Why Organizations Don’t Do User Research and How to Change That

    Laura Klein. 2025.Why Organizations Don’t Do User Research and How to Change That. Nielsen Norman Group. https://www.nngroup.com/articles/why- organizations-dont-do-user-research/

  35. [36]

    Evan Klinger and David Starkweather. 2021. pHash: The open source perceptual hash library. https://www.phash.org/docs/

  36. [37]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAtten- tion. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

  37. [38]

    2026.The RLHF Book: Reinforcement Learning from Hu- man Feedback, Alignment, and Post-Training LLMs

    Nathan Lambert. 2026.The RLHF Book: Reinforcement Learning from Hu- man Feedback, Alignment, and Post-Training LLMs. Manning Publications. https://www.manning.com/books/the-rlhf-book Manning Early Access Pro- gram (MEAP); publication estimated Summer 2026; last updated January 2026

  38. [39]

    Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Lihui Chen, Yangqiu Song, and Han Hu. 2026. WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality. arXiv:2510.18560 [cs.SE] https: //arxiv.org/abs/2510.18560

  39. [40]

    William Lidwell, Kritina Holden, and Jill Butler. 2010.Universal principles of design, revised and updated: 125 ways to enhance usability, influence perception, increase appeal, make better design decisions, and teach through design. Rockport Pub

  40. [41]

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe

  41. [42]

    Let's Verify Step by Step

    Let’s Verify Step by Step.arXiv preprint arXiv:2305.20050(2023). arXiv:2305.20050 [cs.LG] doi:10.48550/arXiv.2305.20050

  42. [43]

    LMArena Team. 2025. WebDev Arena: A Live LLM Leaderboard for Web App Development. https://lmarena.ai/blog/webdev-arena/. Published 2025-03-09; contributors: Aryan Vichare, Anastasios N. Angelopoulos, Wei-Lin Chiang, Kelly Tang, Luca Manolache; accessed 2026-02-06

  43. [44]

    Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. 2025. UXAgent: An LLM Agent-Based Usability Testing Framework for Web Design. InProceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’25). Association for Computing Machinery,...

  44. [45]

    Bailey, Maneesh Agrawala, Björn Hartmann, and Steven P

    Kurt Luther, Jari-Lee Tolentino, Wei Wu, Amy Pavel, Brian P. Bailey, Maneesh Agrawala, Björn Hartmann, and Steven P. Dow. 2015. Structuring, Aggregating, and Evaluating Crowdsourced Design Critique. InProceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing(Van- couver, BC, Canada)(CSCW ’15). Association for Computi...

  45. [46]

    Thomas Mahatody, Mouldi Sagar, and Christophe Kolski. 2010. State of the art on the cognitive walkthrough method, its variants and evolutions.Intl. Journal of Human–Computer Interaction26, 8 (2010), 741–785

  46. [47]

    Hee Seung Moon, Antti Oulasvirta, and Byungjoo Lee. 2023. Amortized Inference with User Simulations. InCHI 2023 - Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. ACM. doi:10.1145/3544548.3581439

  47. [48]

    1994.Usability engineering

    Jakob Nielsen. 1994.Usability engineering. Morgan Kaufmann

  48. [49]

    Jakob Nielsen. 2012. Usability 101: Introduction to Usability. https://www. nngroup.com/articles/usability-101-introduction-to-usability/

  49. [50]

    Landauer

    Jakob Nielsen and Thomas K. Landauer. 1993. A mathematical model of the finding of usability problems. InProceedings of the INTERACT ’93 and CHI ’93 Conference on Human Factors in Computing Systems(Amsterdam, The Nether- lands)(CHI ’93). Association for Computing Machinery, New York, NY, USA, 206–213. doi:10.1145/169059.169166

  50. [51]

    Jakob Nielsen and Rolf Molich. 1990. Heuristic evaluation of user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Seattle, Washington, USA)(CHI ’90). Association for Computing Machinery, New York, NY, USA, 249–256. doi:10.1145/97243.97281

  51. [52]

    OpenAI. 2023. GPT-4V(ision) System Card. https://cdn.openai.com/papers/ GPTV_System_Card.pdf. OpenAI system card

  52. [53]

    OpenAI. 2025. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/

  53. [54]

    Antti Oulasvirta, Samuli De Pascale, Janin Koch, Thomas Langerak, Jussi Joki- nen, Kashyap Todi, Markku Laine, Manoj Kristhombuge, Yuxi Zhu, Aliaksei Miniukovich, et al. 2018. Aalto interface metrics (AIM) a service and codebase for computational GUI evaluation. InAdjunct proceedings of the 31st annual ACM symposium on user interface software and technolo...

  54. [55]

    Antti Oulasvirta, Jussi PP Jokinen, and Andrew Howes. 2022. Computational rationality as a theory of interaction. InProceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–14

  55. [56]

    Cai, Meredith Ringel Morris, Percy Liang, and Michael S

    Joon Sung Park, Lindsay Popowski, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2022. Social Simulacra: Creating Populated Prototypes for Social Computing Systems. arXiv:2208.04024 [cs.HC] https: //arxiv.org/abs/2208.04024

  56. [57]

    Fabio Paternò, Cristiano Mancini, and Silvia Meniconi. 1997. ConcurTaskTrees: A Diagrammatic Notation for Specifying Task Models. InHuman-Computer Interaction INTERACT ’97, Steve Howard, Judy Hammond, and Gitte Lindgaard (Eds.). Springer, Boston, MA, 362–369. doi:10.1007/978-0-387-35175-9_58

  57. [58]

    Bigham, and Jason Wu

    Yi-Hao Peng, Jeffrey P. Bigham, and Jason Wu. 2025. DesignPref: Capturing Personal Preferences in Visual Design Generation. arXiv:2511.20513 [cs.CV] https://arxiv.org/abs/2511.20513

  58. [59]

    Angel R. Puerta. 1997. A Model-Based Interface Development Environment. IEEE Software14, 4 (1997), 40–47

  59. [60]

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Jiadai Sun, Xinyue Yang, Yu Yang, Shuntian Yao, Wei Xu, Jie Tang, and Yuxiao Dong. 2025. WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. InInternational Conference on Learning Representations (ICLR)

  60. [61]

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya...

  61. [62]

    Jathushan Rajasegaran, Naveen Karunanayake, Ashanie Gunathillake, Suranga Seneviratne, and Guillaume Jourjon. 2019. A multi-modal neural embeddings ap- proach for detecting mobile counterfeit apps. InThe World Wide Web Conference. 3165–3171

  62. [63]

    Shreya Shankar, JD Zamfirescu-Pereira, Björn Hartmann, Aditya Parameswaran, and Ian Arawjo. 2024. Who validates the validators? aligning llm-assisted evalu- ation of llm outputs with human preferences. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–14

  63. [64]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.arXiv preprint arXiv:2402.03300(2024). arXiv:2402.03300 [cs.CL] doi:10.48550/arXiv. 2402.03300

  64. [65]

    Shen, Xinchi Qiu, Chenxi Whitehouse, Lisa Alazraki, Shashwat Goel, Francesco Barbieri, Timon Willi, Akhil Mathur, and Ilias Leontiadis

    William F. Shen, Xinchi Qiu, Chenxi Whitehouse, Lisa Alazraki, Shashwat Goel, Francesco Barbieri, Timon Willi, Akhil Mathur, and Ilias Leontiadis. 2026. Re- thinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks. arXiv:2602.05125 [cs.LG] https://arxiv.org/abs/2602.05125

  65. [66]

    Danqing Shi, Yujun Zhu, Jussi P. P. Jokinen, Aditya Acharya, Aini Putkonen, Shumin Zhai, and Antti Oulasvirta. 2024. CRTypist: Simulating Touchscreen Typing Behavior via Computational Rationality. InProceedings of the 2024 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. doi:10.1145/3613904.3642918

  66. [67]

    2010.Designing the user interface: strategies for effective human-computer interaction

    Ben Shneiderman. 2010.Designing the user interface: strategies for effective human-computer interaction. Pearson Education India

  67. [68]

    Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, and Ion Stoica. 2024. Judgebench: A benchmark for evaluating llm-based judges.arXiv preprint arXiv:2410.12784 (2024)

  68. [69]

    Kimi Team. 2026. Kimi K2.5: Visual Agentic Intelligence. arXiv:2602.02276 [cs.CL] https://arxiv.org/abs/2602.02276

  69. [70]

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, and Dieuwke Hupkes. 2025. Judging the judges: Evaluating alignment and vulnerabilities in llms-as-judges. InProceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM 2). 404–430

  70. [71]

    Maryam Tohidi, William Buxton, Ronald Baecker, and Abigail Sellen. 2006. Get- ting the right design and the design right. InProceedings of the SIGCHI conference on Human Factors in computing systems. 1243–1252

  71. [72]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yas- mine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhos- ale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony H...

  72. [73]

    Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, and Doina Pre- cup. 2021. AndroidEnv: A Reinforcement Learning Platform for Android. arXiv:2105.13231 [cs.LG] doi:10.48550/arXiv.2105.13231

  73. [74]

    Browser Use. 2026. Browser Use: The AI browser agent. https://browser-use.com Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

  74. [75]

    Bryan Wang, Gang Li, and Yang Li. 2023. Enabling conversational interaction with mobile ui using large language models. InProceedings of the 2023 CHI conference on human factors in computing systems. 1–17

  75. [76]

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. 2025. Opencua: Open foundations for computer-use agents.arXiv preprint arXiv:2508.09123(2025)

  76. [77]

    Zichao Wang and Alexa Siu. 2026. Interview-Informed Generative Agents for Product Discovery: A Validation Study. InProceedings of the 2026 CHI Conference on Human Factors in Computing Systems. ACM. Honorable Mention

  77. [78]

    Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning8, 3 (1992), 229–256

  78. [79]

    Bigham, and Jeffrey Nichols

    Jason Wu, Yi-Hao Peng, Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, and Jeffrey Nichols. 2024. UIClip: A Data-driven Model for Assessing User Interface Design. arXiv:2404.12500 [cs.HC] https://arxiv.org/abs/2404.12500

  79. [80]

    Jason Wu, Amanda Swearngin, Arun Krishna Vajjala, Alan Leung, Jeffrey Nichols, and Titus Barik. 2025. Improving User Interface Generation Models from De- signer Feedback. (Sept. 2025). arXiv:2509.16779 [cs.HC] doi:10.48550/arXiv.2509. 16779

  80. [81]

    Wei Xiang, Hanfei Zhu, Suqi Lou, Xinli Chen, Zhenghua Pan, Yuxiao Jin, Shuyue Chen, and Lingyun Sun. 2024. SimUser: Generating Usability Feedback by Simulating Various Users Interacting with Mobile Applications. InProceedings of the CHI Conference on Human Factors in Computing Systems (CHI ’24). Association for Computing Machinery, New York, NY, USA, 1–17...

Showing first 80 references.