pith. machine review for the scientific record.

arxiv: 2604.07427 · v1 · submitted 2026-04-08 · 💻 cs.CV

Recognition: no theorem link

Personalizing Text-to-Image Generation to Individual Taste

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords personalized text-to-image · aesthetic reward model · subjective preferences · multi-rater dataset · prompt optimization · individual taste prediction · visual quality assessment

The pith

A multi-rater dataset lets models predict what one person will like in generated images more accurately than most methods predict average appeal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Text-to-image models currently optimize for what most people like on average, yet aesthetic judgment varies widely between individuals. The authors collected 70,000 ratings on 5,000 images, with each image rated by 15 different users, and trained a reward model jointly on these annotations plus existing aesthetic datasets. This personalized model predicts individual liking scores with higher accuracy than the majority of current state-of-the-art methods achieve when they predict population-level preferences. The same predictor can then be used with simple prompt optimization to shift generated images toward a specific user's taste. The work shows that dense per-image, per-user data is needed to handle the inherent subjectivity of visual appeal.
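The review text does not spell out the predictor's internals; Figure 3 only labels it a personalized aesthetic quality predictor, and Figure 12 mentions SigLIP embeddings. As a hedged illustration of how a user-conditioned reward head of this kind is commonly wired, here is a minimal sketch in which the encoder choice, dimensions, and loss are assumptions for exposition, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PersonalizedRewardHead(nn.Module):
    """Toy user-conditioned liking predictor (illustrative, not the paper's model).

    Assumes a frozen image encoder (e.g. SigLIP) supplies `img_emb`; each known
    user gets a learned embedding concatenated with the image features.
    """

    def __init__(self, img_dim: int = 768, user_count: int = 1000, user_dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(user_count, user_dim)
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + user_dim, 512),
            nn.GELU(),
            nn.Linear(512, 1),  # scalar liking score for one (image, user) pair
        )

    def forward(self, img_emb: torch.Tensor, user_id: torch.Tensor) -> torch.Tensor:
        x = torch.cat([img_emb, self.user_emb(user_id)], dim=-1)
        return self.mlp(x).squeeze(-1)

# One plausible joint-training setup (an assumption, not the released recipe):
# mix batches from the per-user ratings with batches from existing aesthetic
# datasets, and regress each predicted score against its rating.
model = PersonalizedRewardHead()
scores = model(torch.randn(4, 768), torch.tensor([0, 3, 3, 7]))
loss = nn.functional.mse_loss(scores, torch.tensor([0.7, 0.4, 0.9, 0.2]))
```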

Core claim

We introduce the PAMELA dataset of 70,000 ratings across 5,000 images generated by state-of-the-art text-to-image models, where each image receives ratings from 15 unique users, and train a personalized reward model jointly on these annotations and existing aesthetic assessment subsets. This model predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences, and enables simple prompt optimization to steer generations toward an individual user's preferences.

What carries the argument

The personalized reward model trained jointly on the new 15-rater-per-image dataset and existing aesthetic subsets to output user-specific liking predictions.

If this is right

  • Simple prompt optimization guided by the personalized predictor can steer text-to-image outputs to better match a specific user's taste (a sketch of such a loop follows this list).
  • The model achieves higher accuracy on individual predictions than most existing methods achieve on population averages.
  • High-quality, multi-rater annotations per image are required to model subjective preferences instead of averaged scores.
  • Joint training with existing aesthetic datasets supports the personalization task without major performance loss.
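On the first point, the figures describe the steering step as iterative prompt refinement: a language model rewrites the prompt, the generator re-renders with a fixed seed, and the personalized predictor scores the result (Figures 1, 4, 5, 6, 9). A minimal sketch of such a loop follows, with greedy acceptance assumed; rewrite_prompt, generate_image, and reward are placeholders for components the page does not specify:

```python
def steer_to_user_taste(prompt, user_id, rewrite_prompt, generate_image, reward,
                        n_rounds=5, seed=0):
    """Greedy iterative prompt refinement guided by a personalized reward model.

    `rewrite_prompt(prompt, feedback)` -> new prompt string (e.g. an LLM call),
    `generate_image(prompt, seed)`     -> image (fixed seed, as in the paper's runs),
    `reward(image, user_id)`           -> scalar liking prediction for this user.
    All three are hypothetical stand-ins for the actual components.
    """
    best_prompt = prompt
    best_image = generate_image(best_prompt, seed)
    best_score = reward(best_image, user_id)

    for _ in range(n_rounds):
        candidate_prompt = rewrite_prompt(best_prompt, feedback=best_score)
        candidate_image = generate_image(candidate_prompt, seed)
        candidate_score = reward(candidate_image, user_id)
        # keep the rewrite only if this user's predicted liking improves
        if candidate_score > best_score:
            best_prompt, best_image, best_score = candidate_prompt, candidate_image, candidate_score

    return best_prompt, best_image, best_score
```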

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If individual preferences remain consistent over time, the same model could support persistent user profiles for repeated generation tasks.
  • Similar dense multi-rater collection could be applied to personalize other generative domains such as video or 3D content.
  • Population-averaged reward models may systematically fail in settings where taste diversity is large, such as creative tools or design assistance.

Load-bearing premise

That 15 ratings per image from distinct users on 5,000 generated images capture stable individual aesthetic preferences and that joint training with existing subsets does not introduce biases that undermine personalization.

What would settle it

If the same users re-rate the same images after several weeks, the personalized model's accuracy on those new individual ratings drops to the level of population-based models.
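Operationally, that test reduces to comparing how well the personalized predictions and a population-average baseline track the delayed re-ratings. A minimal sketch, assuming aligned score arrays over the same (user, image) pairs; the function name and interface are illustrative:

```python
import numpy as np

def retest_gap(personalized_pred, population_pred, reratings):
    """Compare two predictors against delayed re-ratings from the same users.

    All inputs are aligned 1-D arrays over identical (user, image) pairs.
    If the personalized correlation collapses to the population-level one,
    the 'stable individual taste' premise is in trouble.
    """
    r_personal = np.corrcoef(personalized_pred, reratings)[0, 1]
    r_population = np.corrcoef(population_pred, reratings)[0, 1]
    return r_personal, r_population, r_personal - r_population
```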

Figures

Figures reproduced from arXiv: 2604.07427 by Ameya Prabhu, Anne-Sofie Maerten, Johan Wagemans, Juliane Verwiebe, Matthias Bethge, Shyamgopal Karthik.

Figure 1. Prior work on preference alignment focuses on steering generations towards samples with a global reward. In contrast, we are able to steer samples with prompt optimization [1, 38] to generate different samples tailored to individual taste. view at source ↗
Figure 2. Visual diversity in the PAM∃LA benchmark. The dataset spans two primary domains: Art and Photography. It comprises 21 distinct thematic categories, shown with examples. This structure isolates a model's ability to judge stylized artistic compositions from its ability to evaluate real-world, photographic subjects. view at source ↗
Figure 3. Architecture of our personalized aesthetic quality predictor. view at source ↗
Figure 4. Comparison of iterative prompt refinement steered with PAM∃LA compared to HPSv3 [36] and Q-Align [55]. Unlike global reward models, prompt optimization with PAM∃LA leads to more natural and appealing samples without reward-hacking. view at source ↗
Figure 5. Comparison of iterative prompt optimization results for three different users with our PAM∃LA predictor, HPSv3 [36] and Q-Align [55]. Unlike existing reward models, we are able to customize the generation to individual user taste. view at source ↗
Figure 6. Illustration of iterative prompt refinement for a fictional image prompt. While other reward models steer towards surreal outputs, PAM∃LA favors photorealism. Prompt: "Floor-to-ceiling bookshelves where the books share space with rare variegated Philodendrons. A vintage leather armchair in the center." view at source ↗
Figure 7. Comparison of iterative prompt refinement steered with PAM∃LA for different age ranges. Our PAM∃LA predictor remains agnostic to user id and instead relies on learned patterns from the demographic embeddings. view at source ↗
Figure 8. Preference ranking of optimized images. We find that users prefer images optimized to their taste using PAM∃LA, and that users prefer images optimized with our method in general. Popular reward models like HPSv3 [36] and Q-Align [55] seem to degrade perceptual quality and are chosen less than the unoptimized image. view at source ↗
Figure 9. Comparison of image steering outcomes across two distinct runs for User 1. We find that the LLM learns consistent patterns in the prompt, leading to a consistent pattern of steered images. view at source ↗
Figure 10. Examples of image pairs from our PAM∃LA test set where users differed in their preferences. Each image and user pair shows 2 images where user x preferred image A over image B and user y preferred image B over image A. The ground truth score and model prediction are plotted below each image. view at source ↗
Figure 11. We vary the margin considered to be a tie while calculating pairwise accuracy. Increasing the threshold up to the next anchor point in the annotation scale gives consistent improvements to the pairwise accuracy, suggesting that the relatively low pairwise accuracy stems from the large number of samples with similar scores. view at source ↗
Figure 12. Effect of number of context images N and Top-K neighbors on unseen user performance for PAM∃LA. Performance peaks at N = 15, K = 5 across all metrics, with diminishing returns beyond this point. Temperature = 0.1 throughout. view at source ↗
Figure 13. Example trial of our rating study to collect the user annotations in our PAM∃LA dataset. Given the prompt and image, we allow the user to select the exact score using a continuous slider, with 5 anchor points along the scale. view at source ↗
Figure 14. Example trial of our user study to validate our PAM∃LA reward model for image steering. Users were shown two images side-by-side and asked to indicate which image they preferred. The study compared images optimized to the user's taste with PAM∃LA, images optimized for other users with PAM∃LA, images optimized with generic reward models (HPSv3 and Q-Align), and unoptimized images. view at source ↗
Figure 15. Comparison of image steering outcomes across two distinct runs for User 2. We find that the LLM learns consistent patterns in the prompt, leading to a consistent pattern of steered images. view at source ↗
Figure 16. Comparison of image steering outcomes across two distinct runs for User 3. We find that the LLM learns consistent patterns in the prompt, leading to a consistent pattern of steered images. view at source ↗
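Figure 12 attributes unseen-user handling to a retrieval step: each user is summarized by a rating-weighted average of SigLIP embeddings of the images they rated, and a new user borrows from the K most similar training users by cosine similarity. A minimal sketch of that matching step under those assumptions; names, normalizations, and the blending comment are illustrative, not the released code:

```python
import numpy as np

def user_profile(img_embs: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Rating-weighted average of a user's N context-image embeddings."""
    w = ratings / (ratings.sum() + 1e-8)
    profile = (w[:, None] * img_embs).sum(axis=0)
    return profile / (np.linalg.norm(profile) + 1e-8)

def top_k_similar_users(new_profile: np.ndarray, train_profiles: np.ndarray, k: int = 5):
    """Cosine similarity of an unseen user's profile against all training users."""
    train_norm = train_profiles / (np.linalg.norm(train_profiles, axis=1, keepdims=True) + 1e-8)
    sims = train_norm @ new_profile
    top = np.argsort(-sims)[:k]
    return top, sims[top]  # neighbor indices and their similarities

# Predictions for the unseen user could then, for instance, be a softmax-weighted
# blend of the neighbors' predicted scores (Figure 12 reports temperature 0.1).
```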
read the original abstract

Modern text-to-image (T2I) models generate high-fidelity visuals but remain indifferent to individual user preferences. While existing reward models optimize for "average" human appeal, they fail to capture the inherent subjectivity of aesthetic judgment. In this work, we introduce a novel dataset and predictive framework, called PAMELA, designed to model personalized image evaluations. Our dataset comprises 70,000 ratings across 5,000 diverse images generated by state-of-the-art models (Flux 2 and Nano Banana). Each image is evaluated by 15 unique users, providing a rich distribution of subjective preferences across domains such as art, design, fashion, and cinematic photography. Leveraging this data, we propose a personalized reward model trained jointly on our high-quality annotations and existing aesthetic assessment subsets. We demonstrate that our model predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences. Using our personalized predictor, we demonstrate how simple prompt optimization methods can be used to steer generations towards individual user preferences. Our results highlight the importance of data quality and personalization to handle the subjectivity of user preferences. We release our dataset and model to facilitate standardized research in personalized T2I alignment and subjective visual quality assessment.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper introduces PAMELA, a dataset of 70,000 ratings on 5,000 T2I-generated images (each rated by 15 users) spanning art, design, fashion, and photography, along with a reward model trained jointly on this data and existing aesthetic assessment subsets. It claims this personalized model predicts individual user liking with higher accuracy than most current SOTA methods achieve on population-level preferences and demonstrates its utility via prompt optimization to steer generations toward personal taste. The dataset and model are released publicly.

Significance. If substantiated, the work would meaningfully advance personalized T2I alignment by supplying a benchmark dataset that explicitly captures inter-user subjectivity rather than averaging it away, plus a practical demonstration of prompt-based steering. The public release of data and model is a clear strength for reproducibility and follow-on research in subjective visual quality assessment.

major comments (3)
  1. [Abstract] The central claim states that the model 'predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences,' yet no metrics (accuracy, correlation, etc.), baselines, test splits, cross-validation, or significance tests are described. This information is load-bearing for assessing whether the reported gains reflect true personalization.
  2. [Abstract, dataset description] With 5,000 images each rated by 15 users (70k total ratings), the manuscript provides no statistics on the number of unique users or ratings per user. If most users contribute only a handful of ratings, the model cannot reliably separate stable individual taste from noise or image-specific effects, directly threatening the personalization claim.
  3. [Abstract] The model is 'trained jointly on our high-quality annotations and existing aesthetic assessment subsets,' but no ablation studies, regularization analysis, or embedding comparisons are reported to show that user-specific signals are preserved rather than collapsed toward population averages. This joint objective is load-bearing for the claim that gains arise from personalization rather than the joint training procedure.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the abstract would benefit from greater specificity on metrics, dataset statistics, and training ablations. We will revise the abstract and add supporting material to the manuscript to address these points directly.

read point-by-point responses
  1. Referee: [Abstract] The central claim states that the model 'predicts individual liking with higher accuracy than the majority of current state-of-the-art methods predict population-level preferences,' yet no metrics (accuracy, correlation, etc.), baselines, test splits, cross-validation, or significance tests are described. This information is load-bearing for assessing whether the reported gains reflect true personalization.

    Authors: The abstract summarizes the main findings at a high level. The full manuscript describes the evaluation protocol, including accuracy and correlation metrics, baselines, user-based test splits, cross-validation, and significance tests in the experimental section. We will revise the abstract to incorporate key quantitative results and a brief reference to the evaluation setup so that the central claim is supported within the abstract itself. revision: yes

  2. Referee: [Abstract, dataset description] With 5,000 images each rated by 15 users (70k total ratings), the manuscript provides no statistics on the number of unique users or ratings per user. If most users contribute only a handful of ratings, the model cannot reliably separate stable individual taste from noise or image-specific effects, directly threatening the personalization claim.

    Authors: We agree that explicit statistics on unique users and ratings per user are necessary to substantiate the personalization claim. We will add these figures to the revised abstract and expand the dataset description section to report the total number of unique users along with the distribution (minimum, average, and maximum ratings per user). revision: yes

  3. Referee: [Abstract] The model is 'trained jointly on our high-quality annotations and existing aesthetic assessment subsets,' but no ablation studies, regularization analysis, or embedding comparisons are reported to show that user-specific signals are preserved rather than collapsed toward population averages. This joint objective is load-bearing for the claim that gains arise from personalization rather than the joint training procedure.

    Authors: We acknowledge that ablations would help isolate the benefit of the joint objective and confirm preservation of user-specific signals. We will add ablation experiments to the revised manuscript that compare the joint model against single-dataset variants, together with analysis of user embeddings to demonstrate that individual signals are retained rather than averaged out. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a new dataset of 70,000 ratings from 15 users per image across 5,000 generated images and trains PAMELA jointly on this data plus existing aesthetic subsets. The central claim of superior individual-liking prediction is evaluated against this newly collected data rather than reducing by construction to prior fitted quantities or self-citations. No self-definitional, fitted-input-called-prediction, or load-bearing self-citation steps are identifiable in the provided text; the claim is checked against new ratings and external benchmarks rather than against its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that collected subjective ratings reflect stable individual preferences and that prompt optimization based on the predictor produces preferred images. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: User ratings on generated images capture true individual aesthetic preferences.
    Invoked implicitly when claiming the model predicts individual liking.

pith-pipeline@v0.9.0 · 5536 in / 1098 out tokens · 48732 ms · 2026-05-10T18:02:51.129950+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

64 extracted references · 29 canonical work pages · 11 internal anchors

  1. [1] Ashutosh, K., Gandelsman, Y., Chen, X., Misra, I., Girdhar, R.: LLMs can see and hear without any training (2025), https://arxiv.org/abs/2501.18096
  2. [2] Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., Oldridge, E.: Llama-Embed-Nemotron-8B: A universal text embedding model for multilingual and cross-lingual tasks (2025), https://arxiv.org/abs/2511.07025
  3. [3] Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al.: Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)
  4. [4] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al.: Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022)
  5. [5] Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. OpenAI Technical Report (2023)
  6. [6] Black Forest Labs: FLUX.2. https://github.com/black-forest-labs/flux (2025)
  7. [7] Chen, C., Wang, A., Wu, H., Liao, L., Sun, W., Yan, Q., Lin, W.: Enhancing diffusion models with text-encoder reinforcement learning. In: ECCV (2024)
  8. [8] Chen, J., Ge, C., Xie, E., Wu, Y., Yao, L., Ren, X., Wang, Z., Luo, P., Lu, H., Li, Z.: PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692 (2024)
  9. [9] Chen, Z., Zhang, L., Weng, F., Pan, L., Lan, Z.: Tailored visions: Enhancing text-to-image generation with personalized prompt rewriting. In: CVPR (2024)
  10. [10] Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep reinforcement learning from human preferences. NIPS (2017)
  11. [11] Clark, K., Vicol, P., Swersky, K., Fleet, D.J.: Directly fine-tuning diffusion models on differentiable rewards. In: ICLR (2024)
  12. [12] Dang, M., Singh, A., Zhou, L., Ermon, S., Song, J.: Personalized preference fine-tuning of diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8020–8030 (2025)
  13. [13] Dunlop, C., Zheng, M., Venkatesh, K., Yanardag, P.: Personalized image editing in text-to-image diffusion models via collaborative direct preference optimization. arXiv preprint arXiv:2511.05616 (2025)
  14. [14] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 (2024)
  15. [15] Eyring, L., Karthik, S., Roth, K., Dosovitskiy, A., Akata, Z.: ReNO: Enhancing one-step text-to-image models through reward-based noise optimization. NeurIPS (2024)
  16. [16] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 (2022)
  17. [17] Google DeepMind: Introducing Gemini 2.5 Flash Image, our state-of-the-art image model. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/ (Aug 2025)
  18. [18] Goree, S., Khoo, W., Crandall, D.J.: Correct for whom? Subjectivity and the evaluation of personalized image aesthetics assessment models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 11818–11827 (2023)
  19. [19] Grattafiori, A., Dubey, A., Jauhri, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024), https://arxiv.org/abs/2407.21783
  20. [20] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
  21. [21] Hong, J., Paul, S., Lee, N., Rasul, K., Thorne, J., Jeong, J.: Margin-aware preference optimization for aligning diffusion models without reference. arXiv preprint arXiv:2406.06424 (2024)
  22. [22] Hou, J., Lin, W., Yue, G., Liu, W., Zhao, B.: Interaction-matrix based personalized image aesthetics assessment. IEEE Transactions on Multimedia 25, 5263–5278 (2022)
  23. [23] Karthik, S., Coskun, H., Akata, Z., Tulyakov, S., Ren, J., Kag, A.: Scalable ranked preference optimization for text-to-image generation. arXiv preprint arXiv:2410.18013 (2024)
  24. [24] Karthik, S., Roth, K., Mancini, M., Akata, Z.: If at first you don't succeed, try, try again: Faithful diffusion-based text-to-image generation by selection. arXiv preprint arXiv:2305.13308 (2023)
  25. [25] Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-Pic: An open dataset of user preferences for text-to-image generation. NeurIPS (2023)
  26. [26] Labs, B.F.: FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742 (2025)
  27. [27] Lee, H., Phatale, S., Mansoor, H., Lu, K., Mesnard, T., Bishop, C., Carbune, V., Rastogi, A.: RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv preprint arXiv:2309.00267 (2023)
  28. [28] Lee, J., Park, D., Kim, H.J.: Personalized reward modeling for text-to-image generation. arXiv preprint arXiv:2511.19458 (2025), https://arxiv.org/abs/2511.19458
  29. [29] Li, C., Zhang, Z., Wu, H., Sun, W., Min, X., Liu, X., Zhai, G., Lin, W.: AGIQA-3K: An open database for AI-generated image quality assessment. IEEE Transactions on Circuits and Systems for Video Technology (2023)
  30. [30] Li, L., Zhu, H., Zhao, S., Ding, G., Lin, W.: Personality-assisted multi-task learning for generic and personalized image aesthetics assessment. IEEE Transactions on Image Processing 29, 3898–3910 (2020)
  31. [31] Li, S., Kallidromitis, K., Gokul, A., Kato, Y., Kozuka, K.: Aligning diffusion models by optimizing human utility. arXiv preprint arXiv:2404.04465 (2024)
  32. [32] Lin, H., Hosu, V., Saupe, D.: KADID-10k: A large-scale artificially distorted IQA database. In: Proceedings of the 11th International Conference on Quality of Multimedia Experience (QoMEX). pp. 1–3. IEEE (2019). https://doi.org/10.1109/QoMEX.2019.8743252
  33. [33] Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., Ouyang, W.: Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470 (2025)
  34. [34] Lv, P., Wang, M., Xu, Y., Peng, Z., Sun, J., Su, S., Zhou, B., Xu, M.: USAR: An interactive user-specific aesthetic ranking framework for images. In: Proceedings of the 26th ACM International Conference on Multimedia. pp. 1328–1336 (2018)
  35. [35] Ma, N., Tong, S., Jia, H., Hu, H., Su, Y.C., Zhang, M., Yang, X., Li, Y., Jaakkola, T., Jia, X., et al.: Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732 (2025)
  36. [36] Ma, Y., Wu, X., Sun, K., Li, H.: HPSv3: Towards wide-spectrum human preference score. In: ICCV (2025)
  37. [37] Maerten, A.S., Chen, L.W., De Winter, S., Bossens, C., Wagemans, J.: LAPIS: A novel dataset for personalized image aesthetic assessment. In: CVPR Workshops (2025)
  38. [38] Mañas, O., Astolfi, P., Hall, M., Ross, C., Urbanek, J., Williams, A., Agrawal, A., Romero-Soriano, A., Drozdzal, M.: Improving text-to-image consistency via automatic prompt optimization. arXiv preprint arXiv:2403.17804 (2024)
  39. [39] Murray, N., Marchesotti, L., Perronnin, F.: AVA: A large-scale database for aesthetic visual analysis. In: CVPR (2012)
  40. [40] Park, K., Hong, S., Baek, M., Han, B.: Personalized image aesthetic quality assessment by joint regression and ranking. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1206–1214 (2017). https://doi.org/10.1109/WACV.2017.139
  41. [41] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: SDXL: Improving latent diffusion models for high-resolution image synthesis (2023)
  42. [42] Prabhudesai, M., Goyal, A., Pathak, D., Fragkiadaki, K.: Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739 (2023)
  43. [43] Pressman, J.D., Crowson, K., Simulacra Captions Contributors: Simulacra aesthetic captions. Tech. Rep. Version 1.0, Stability AI (2022), https://github.com/JD-P/simulacra-aesthetic-captions
  44. [44] Rajagopalan, R., Dutta, D., Wei, Y.L., Choudhury, R.R.: Personalized image generation via human-in-the-loop Bayesian optimization. arXiv preprint arXiv:2602.02388 (2026)
  45. [45] Ren, J., Shen, X., Lin, Z., Mech, R., Foran, D.J.: Personalized image aesthetics. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 638–647 (2017)
  46. [46] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR (2023)
  47. [47] Salehi, S., Shafiei, M., Yeo, T., Bachmann, R., Zamir, A.: ViPer: Visual personalization of generative models via individual preference learning. In: European Conference on Computer Vision. pp. 391–406. Springer (2024)
  48. [48] Schuhmann, C.: Improved aesthetic predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor (2022)
  49. [49] Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., et al.: LAION-5B: An open large-scale dataset for training next generation image-text models. NeurIPS (2022)
  50. [50] Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing 15(11), 3440–3451 (2006). https://doi.org/10.1109/TIP.2006.881959
  51. [51] Shi, H., Guo, J., Ke, Y., Wang, K., Yang, S., Qin, F., Chen, L.: Personalized image aesthetics assessment based on graph neural network and collaborative filtering. Knowledge-Based Systems 294, 111749 (2024)
  52. [52] Tschannen, M., Gritsenko, A., Wang, X., Naeem, M.F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O., Harmsen, J., Steiner, A., Zhai, X.: SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features (2025), https://arxiv.org/abs/2502.14786
  53. [53] Uehara, M., Zhao, Y., Black, K., Hajiramezanali, E., Scalia, G., Diamant, N.L., Tseng, A.M., Biancalani, T., Levine, S.: Fine-tuning of continuous-time diffusion models as entropy-regularized control (2024), https://arxiv.org/abs/2402.15194
  54. [54] Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., Ermon, S., Xiong, C., Joty, S., Naik, N.: Diffusion model alignment using direct preference optimization. In: CVPR (2024)
  55. [55] Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., et al.: Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2024)
  56. [56] Wu, X., Hao, Y., Sun, K., Chen, Y., Zhu, F., Zhao, R., Li, H.: Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341 (2023)
  57. [57] Wu, X., Sun, K., Zhu, F., Zhao, R., Li, H.: Better aligning text-to-image models with human preference. In: ICCV (2023)
  58. [58] Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: ImageReward: Learning and evaluating human preferences for text-to-image generation. NeurIPS (2023)
  59. [59] Yang, Y., Xu, L., Li, L., Qie, N., Li, Y., Zhang, P., Guo, Y.: Personalized image aesthetics assessment with rich attributes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19861–19869 (2022)
  60. [60] You, Z., Cai, X., Gu, J., Xue, T., Dong, C.: Teaching large language models to regress accurate image quality scores using score distribution. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14483–14494 (2025)
  61. [61] Zhang, S., Wang, B., Wu, J., Li, Y., Gao, T., Zhang, D., Wang, Z.: Learning multi-dimensional human preference for text-to-image generation. In: CVPR (2024)
  62. [62] Zhang, Y., Tzeng, E., Du, Y., Kislyuk, D.: Large-scale reinforcement learning for diffusion models. In: ECCV (2024)
  63. [63] Zhu, H., Zhou, Y., Li, L., Li, Y., Guo, Y.: Learning personalized image aesthetics from subjective and objective attributes. IEEE Transactions on Multimedia 25, 179–190 (2021)
  64. [64] Zhu, H., Zhou, Y., Shao, Z., Du, W., Wang, G., Li, Q.: Personalized image aesthetics assessment via multi-attribute interactive reasoning. Mathematics 10(22), 4181 (2022)