pith. machine review for the scientific record.

arxiv: 2604.09876 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.AI · cs.CV · cs.HC

Recognition: unknown

Efficient Personalization of Generative User Interfaces

Jason Wu, Jeffrey P. Bigham, Samarth Das, Yi-Hao Peng

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.HC
keywords generative user interfaces · personalization · preference modeling · pairwise judgments · designers · sample efficiency · UI adaptation

The pith

A method represents new users through combinations of prior designers' preferences to personalize generative UIs efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that even trained designers disagree substantially on UI quality, with average agreement of kappa 0.25, and that similar-sounding concepts like hierarchy get defined and prioritized differently. It introduces a dataset of pairwise judgments from 20 designers over 600 generated interfaces to study this divergence directly. From these observations the authors build a personalization approach that models any new user as a combination of the existing designers rather than a universal list of design rules. This representation supports learning from limited feedback and leads to generated interfaces that new designers prefer over those from direct prompting or other baselines. If the approach holds, personalization becomes feasible without needing to articulate subjective tastes from scratch each time.
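
The agreement figure above (kappa = 0.25) is Cohen's kappa averaged over designer pairs. A minimal sketch of the statistic for two raters' binary A/B picks (the function and variable names here are illustrative, not the paper's):

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' binary choices over the same comparisons."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                      # observed agreement
    pe = (np.mean(a == 0) * np.mean(b == 0)   # agreement expected by chance
          + np.mean(a == 1) * np.mean(b == 1))
    return (po - pe) / (1 - pe)

d1 = [0, 0, 1, 1]            # designer 1's picks (0 = screen A, 1 = screen B)
d2 = [0, 1, 1, 1]            # designer 2 agrees on 3 of the 4 comparisons
print(cohens_kappa(d1, d2))  # -> 0.5: raw agreement 0.75, chance level 0.5
```

A dataset-level value like 0.25 would come from averaging this statistic over all pairs of designers judging the same UIs.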

Core claim

By representing a new user's preferences as a linear combination of the judgments from 20 prior designers, the method enables effective personalization of generative UIs with limited feedback, leading to interfaces that better match individual tastes compared to fixed rubrics or larger models.

What carries the argument

The preference model that represents new users in terms of prior designers rather than a fixed rubric of design concepts, allowing sample-efficient adaptation from sparse pairwise feedback.
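
The review does not give the model's exact form. As one hedged reading, "representing a new user in terms of prior designers" can be sketched as fitting mixture weights over per-designer scores from sparse pairwise feedback with a Bradley-Terry-style logistic loss; everything below (names, synthetic scores, the loss itself) is an assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 20, 600                       # prior designers, candidate UIs
S = rng.normal(size=(D, N))          # S[d, i]: designer d's latent score for UI i

def fit_weights(pairs, prefs, S, steps=500, lr=0.1):
    """Fit per-designer weights w so that u(i) = w @ S[:, i] explains
    the new user's sparse pairwise choices (logistic loss, full-batch GD)."""
    w = np.zeros(S.shape[0])
    for _ in range(steps):
        grad = np.zeros_like(w)
        for (i, j), y in zip(pairs, prefs):
            d = S[:, i] - S[:, j]                           # per-designer score gap
            p = 1 / (1 + np.exp(-np.clip(w @ d, -30, 30)))  # P(user prefers i over j)
            grad += (p - y) * d
        w -= lr * grad / len(pairs)
    return w

# Simulate a new user who is a 70/30 blend of designers 0 and 3,
# observed through only 50 pairwise judgments.
w_true = np.zeros(D)
w_true[0], w_true[3] = 0.7, 0.3
pairs = [(rng.integers(N), rng.integers(N)) for _ in range(50)]
prefs = [int(w_true @ (S[:, i] - S[:, j]) > 0) for i, j in pairs]
w_hat = fit_weights(pairs, prefs, S)
```

The recovered `w_hat` should correlate with the true blend, which is the sample-efficiency intuition: 20 coefficients are far cheaper to fit from sparse feedback than a preference model learned from scratch.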

If this is right

  • Personalization requires fewer feedback samples than methods that learn directly from the new user or from general evaluators.
  • Performance improves as more pairwise judgments from the new user are collected, with better scaling than larger multimodal models.
  • Generated interfaces are preferred by new designers over those produced by baseline approaches including direct user prompting.
  • Subjective design preferences can be captured without requiring users to articulate concepts like hierarchy or cleanliness explicitly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same representation strategy might apply to other creative tasks where expert opinions diverge but share underlying structure, such as graphic design or product aesthetics.
  • Collecting a small fixed panel of expert judgments once could serve as a reusable foundation for many future users instead of repeated large studies.
  • If disagreement patterns prove stable across broader populations, the approach could support testing whether certain UI domains have more or less predictable preference structures.

Load-bearing premise

That the preferences of arbitrary new users can be adequately captured by linear or low-dimensional combinations of the 20 prior designers' judgments, and that the observed disagreement pattern generalizes beyond the specific set of designers and UIs studied.

What would settle it

A new designer whose preferences cannot be well approximated by any combination of the existing 20 shows no improvement or worse performance with the model compared to direct prompting or pretrained evaluators when given the same amount of feedback.

Figures

Figures reproduced from arXiv: 2604.09876 by Jason Wu, Jeffrey P. Bigham, Samarth Das, Yi-Hao Peng.

Figure 1. Overview of our personalization pipeline for generative UIs. (1) We collect repeated pairwise preference judgments
Figure 2. Cohen’s kappa scores for pairwise binary preference
Figure 3. t-SNE embedding of preference rationale themes.
Figure 4. Rationales for divergent preferences. Each comparison shows an even preference split between screen A and B.
Figure 5. Our query selection algorithm queries a new user
Figure 6. Pairwise prediction accuracy as a function of the
Figure 7. Arena-style Elo ratings and win rates across four
Figure 8. Personalized UI widget editor prototype. (1) The
Figure 10. Personalized design suggestions in a slide editor.
Figure 11. Our annotation UI with a screen description, two
Original abstract

Generative user interfaces (UIs) create new opportunities to adapt interfaces to individual users on demand, but personalization remains difficult because desirable UI properties are subjective, hard to articulate, and costly to infer from sparse feedback. We study this problem through a new dataset in which 20 trained designers each provide pairwise judgments over the same 600 generated UIs, enabling direct analysis of preference divergence. We find substantial disagreement across designers (average kappa = 0.25), and written rationales reveal that even when designers appeal to similar concepts such as hierarchy or cleanliness, designers differ in how they define, prioritize, and apply those concepts. Motivated by these findings, we develop a sample-efficient personalization method that represents a new user in terms of prior designers rather than a fixed rubric of design concepts. In a technical evaluation, our preference model outperforms both a pretrained UI evaluator and a larger multimodal model, and scales better with additional feedback. When used to personalize generation, it also produces interfaces preferred by 12 new designers over baseline approaches, including direct user prompting. Our findings suggest that lightweight preference elicitation can serve as a practical foundation for personalized generative UI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces a dataset of pairwise preference judgments from 20 trained designers over 600 generated UIs, revealing substantial inter-designer disagreement (average kappa = 0.25) even on shared concepts like hierarchy. Motivated by this, the authors develop a sample-efficient personalization method that represents new users via combinations of the prior designers' judgments rather than fixed rubrics or full retraining. Technical evaluations show the resulting preference model outperforming a pretrained UI evaluator and a larger multimodal model while scaling better with added feedback. A user study with 12 new designers finds that UIs generated via this personalization are preferred over baselines including direct prompting.

Significance. If the results hold, the work offers a practical, data-driven route to personalizing generative UIs that directly leverages observed preference divergence instead of assuming a universal rubric. The collected dataset with both judgments and rationales is a clear strength, enabling quantitative study of subjectivity in design and providing a reusable resource. The approach demonstrates that lightweight elicitation over a small designer pool can outperform standard prompting or large-model baselines, with potential extension to other subjective generation domains. The user study supplies direct preference evidence rather than proxy metrics alone.

major comments (2)
  1. [Section 4] Section 4 (personalization method): The technique encodes new users via low-dimensional (linear or near-linear) coefficients over the 20 prior designers' judgments. With average kappa = 0.25 indicating high disagreement, new users may possess preference components orthogonal to this span; the 12-designer study does not report alignment statistics or extrapolation error for these cases, which is load-bearing for the claim that the method reliably personalizes for arbitrary new users.
  2. [Section 5] Section 5 (technical evaluation): The outperformance and superior scaling claims versus the pretrained UI evaluator and larger multimodal model are central, yet the manuscript provides insufficient detail on the exact metrics (pairwise accuracy, AUC, etc.), statistical tests, baseline implementations, and controls for implementation differences, leaving the quantitative superiority only partially supported.
minor comments (3)
  1. [Abstract] Abstract: The summary of results would be strengthened by including one or two concrete quantitative indicators (e.g., accuracy delta or preference win rate) rather than qualitative statements alone.
  2. [Section 3] Section 3 (dataset): The description of how the 600 UIs were generated and sampled should explicitly state the generative model, prompt distribution, and any diversity controls to allow replication.
  3. [User study] User study results: The preference comparisons would benefit from reporting exact win rates, confidence intervals, and inter-rater agreement for the 12 new designers to improve interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us identify areas for improvement in the manuscript. We address each major comment below.

Point-by-point responses
  1. Referee: [Section 4] Section 4 (personalization method): The technique encodes new users via low-dimensional (linear or near-linear) coefficients over the 20 prior designers' judgments. With average kappa = 0.25 indicating high disagreement, new users may possess preference components orthogonal to this span; the 12-designer study does not report alignment statistics or extrapolation error for these cases, which is load-bearing for the claim that the method reliably personalizes for arbitrary new users.

    Authors: We agree that the high inter-designer disagreement (kappa=0.25) raises the possibility of preference components outside the linear span of the 20 designers. Our method is designed to provide a practical approximation for personalization using limited feedback, and the user study demonstrates that the personalized models are preferred by new designers over non-personalized baselines. However, we did not include an explicit analysis of the alignment between the 12 new designers and the existing span or quantify extrapolation errors. In the revised version, we will add this analysis, including metrics such as the norm of residuals when projecting new users' preference vectors onto the designer space, and discuss limitations for users with highly orthogonal preferences. revision: yes
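
The promised residual analysis could be prototyped as a least-squares projection onto the designers' span; this is a hypothetical sketch on synthetic scores, not the paper's data or code:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 20, 600
S = rng.normal(size=(D, N))      # prior designers' per-UI score vectors

def span_residual(u, S):
    """Relative norm of the part of user vector u outside the designers' span."""
    w, *_ = np.linalg.lstsq(S.T, u, rcond=None)   # project u onto rows of S
    return np.linalg.norm(u - S.T @ w) / np.linalg.norm(u)

u_in = S.T @ rng.normal(size=D)  # a user inside the span: residual near 0
u_out = rng.normal(size=N)       # a generic 600-dim user: mostly outside a 20-dim span
```

A residual near 1 for a new designer would flag exactly the failure case the referee raises: preferences largely orthogonal to the prior-designer basis.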

  2. Referee: [Section 5] Section 5 (technical evaluation): The outperformance and superior scaling claims versus the pretrained UI evaluator and larger multimodal model are central, yet the manuscript provides insufficient detail on the exact metrics (pairwise accuracy, AUC, etc.), statistical tests, baseline implementations, and controls for implementation differences, leaving the quantitative superiority only partially supported.

    Authors: We acknowledge that the technical evaluation section would benefit from greater detail to substantiate the claims. In the revision, we will provide: (1) exact definitions and formulas for all reported metrics including pairwise accuracy and AUC; (2) results of statistical tests (e.g., p-values from appropriate tests comparing our method to baselines); (3) comprehensive descriptions of baseline implementations, including model architectures, training procedures, and any hyperparameter choices; and (4) additional controls or ablations to address potential implementation differences. These additions will be incorporated into Section 5 and the supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

The paper introduces a new empirical dataset of 20 designers' pairwise judgments on 600 UIs and validates its personalization method via a separate user study with 12 new designers. The core modeling choice (representing new users via coefficients over prior designers) is motivated by observed disagreement (kappa=0.25) but is not self-definitional, nor does any reported result reduce by construction to fitted parameters or self-citations. No equations appear in the provided text, and the outperformance claims rest on direct preference comparisons rather than tautological re-use of inputs. This is a standard non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a modest set of expert designers provides a sufficient basis for representing new users' subjective preferences; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Pairwise judgments collected from a small group of trained designers can be used to infer generalizable preference models for new users
    This underpins both the dataset analysis and the personalization method described.

pith-pipeline@v0.9.0 · 5511 in / 1233 out tokens · 63818 ms · 2026-05-10T16:48:23.684655+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

112 extracted references · 52 canonical work pages · 3 internal anchors
