pith. sign in

arxiv: 2606.05478 · v1 · pith:VV5TZCCKnew · submitted 2026-06-03 · 💻 cs.CV · cs.LG

Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?

Pith reviewed 2026-06-28 06:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords diffusion modelshuman preference metricstext-to-image generationnoise seed predictionimage quality improvementstochastic generation
0
0 comments X

The pith

Scalar human preference scores for text-to-image diffusion outputs can be predicted from the initial noise seed before generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether human preference metric scores for diffusion model images can be forecasted directly from the starting random noise seed. It further checks whether these forecasts can guide seed selection to raise final image quality while adding almost no hardware cost. A sympathetic reader would care because the stochastic nature of diffusion generation often wastes compute on low-preference results that could be avoided by early prediction. The investigation covers multiple models and prompts and finds the approach feasible.

Core claim

Human preference scores can be predicted prior to generation from the initial random noise seed, and this prediction can be leveraged to improve generated image quality with negligible hardware overhead.

What carries the argument

Direct prediction of scalar HPM scores from the initial random noise seed without running the full diffusion process.

If this is right

  • Poor seeds can be rejected before any generation compute is spent.
  • Better seeds can be chosen to produce higher-preference images.
  • The added cost of prediction remains negligible across tested models.
  • Certain HPMs perform better than others when used for this pre-generation filtering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Local deployment of small diffusion models could avoid repeated generation attempts by filtering seeds early.
  • The same noise-based prediction idea might extend to other stochastic sampling processes such as video or 3D generation.
  • End users might achieve more consistent high-preference results without manual trial-and-error on multiple seeds.

Load-bearing premise

The initial random noise seed contains enough information to allow accurate prediction of final HPM scores across different models and prompts without requiring the full generation process.

What would settle it

Measure the correlation between HPM scores predicted from noise seeds and the actual HPM scores of the generated images across many prompts and models; low correlation would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.05478 by Joong Ho Kim, Keith G. Mills.

Figure 1
Figure 1. Figure 1: Illustrating the effect of initial random noise on DM output, both qualitatively and [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: HPM predictor design. Given the DM initial conditions, we consider three types [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative visual examples. DM and prompt details provided. We provide further [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative example for the preference-metric analysis on the prompt “Four dogs [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Additional qualitative image examples [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
read the original abstract

Diffusion Models (DM) have revolutionized text-driven generation by enabling the synthesis of high-quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation such as VAEs and GANs were primarily evaluated on perceptual or visual similarity metrics such as FID PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPM) that model and quantify human judgment as scalar values. However, DMs synthesize content using an inherently stochastic process where random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is pronounced in smaller models for local deployment scenarios. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. Further, we then investigate to what extent we can leverage such prediction to improve the quality of generated images, and also study which HPMs are best suited for this task. Our investigation reveals that not only is this possible, but that it is feasible to achieve negligible hardware overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper claims that scalar Human Preference Metric (HPM) scores for text-to-image diffusion model outputs can be predicted from the initial random noise seed (plus prompt) prior to running the full denoising process, and that such predictions can be used to select better seeds that improve final image quality while incurring only negligible hardware overhead. The work further examines which HPMs are most suitable for this pre-generation selection task and asserts feasibility across models and prompts.

Significance. If the central empirical claims were supported by rigorous, reproducible experiments, the result would offer a practical way to mitigate the stochastic variability in diffusion sampling without extra generation cost, which is especially relevant for smaller locally-deployed models. The approach could reduce wasted compute on low-HPM outputs and provide a new axis for generation-quality control. However, the manuscript as presented supplies no quantitative evidence, so its potential impact cannot yet be assessed.

major comments (3)
  1. [Abstract] Abstract: the feasibility of accurate pre-generation HPM prediction and the claim of negligible overhead are asserted without any description of the predictor architecture, training regime, dataset, or quantitative metrics (e.g., correlation, MAE, or downstream HPM improvement); the central claim therefore cannot be evaluated from the supplied text.
  2. [(entire manuscript)] No section supplies cross-model or cross-prompt correlation numbers, ablation studies on the information content of the initial Gaussian noise, or error bars that would test the weakest assumption that a cheap function of the seed alone can forecast the outcome of hundreds of conditioned denoising steps.
  3. [(entire manuscript)] The manuscript contains no equations, algorithms, or experimental protocol detailing how the mapping from noise realization to scalar HPM is learned or validated, leaving the required approximation of the UNet-prompt interaction unexamined.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed review. The comments highlight important gaps in the presentation of our empirical results and methodology. We address each major comment below and commit to revisions that strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the feasibility of accurate pre-generation HPM prediction and the claim of negligible overhead are asserted without any description of the predictor architecture, training regime, dataset, or quantitative metrics (e.g., correlation, MAE, or downstream HPM improvement); the central claim therefore cannot be evaluated from the supplied text.

    Authors: We agree the abstract is too high-level. In revision we will add one sentence summarizing the lightweight predictor (MLP on flattened noise + prompt embedding), the training set size, and key metrics (Pearson correlation and downstream HPM lift). revision: yes

  2. Referee: [(entire manuscript)] No section supplies cross-model or cross-prompt correlation numbers, ablation studies on the information content of the initial Gaussian noise, or error bars that would test the weakest assumption that a cheap function of the seed alone can forecast the outcome of hundreds of conditioned denoising steps.

    Authors: The current draft indeed omits these quantitative elements. We will insert a dedicated experimental subsection reporting cross-model and cross-prompt correlations, noise-only ablations, and error bars with statistical tests. revision: yes

  3. Referee: [(entire manuscript)] The manuscript contains no equations, algorithms, or experimental protocol detailing how the mapping from noise realization to scalar HPM is learned or validated, leaving the required approximation of the UNet-prompt interaction unexamined.

    Authors: We acknowledge the absence of formal notation and pseudocode. Revision will add an algorithm box, the explicit loss and architecture equations, and a short protocol subsection describing data collection and validation splits. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical investigation of noise-to-HPM mapping contains no self-referential derivation

full rationale

The provided abstract and context describe an empirical study: the authors train or evaluate a predictor that maps initial noise (plus prompt) to scalar HPM values and then test whether selecting high-predicted seeds improves output quality. No equations, fitted parameters renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatzes are quoted or referenced. The central claim is a factual assertion about feasibility across models, which stands or falls on external validation data rather than reducing to its own inputs by construction. Because the manuscript supplies no derivation chain that collapses to a tautology, the circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities; ledger left empty.

pith-pipeline@v0.9.1-grok · 5721 in / 1029 out tokens · 24638 ms · 2026-06-28T06:11:43.839961+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 2 canonical work pages

  1. [1]

    Stable diffusion webui.https://github.com/ AUTOMATIC1111/stable-diffusion-webui, 2022

    AUTOMATIC1111. Stable diffusion webui.https://github.com/ AUTOMATIC1111/stable-diffusion-webui, 2022

  2. [2]

    Flux.https://github.com/black-forest-labs/ flux, 2024

    Black Forest Labs. Flux.https://github.com/black-forest-labs/ flux, 2024

  3. [3]

    Fast differen- tiable sorting and ranking

    Mathieu Blondel, Olivier Teboul, Quentin Berthet, and Josip Djolonga. Fast differen- tiable sorting and ranking. InInternational Conference on Machine Learning, pages 950–959. PMLR, 2020

  4. [4]

    Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv preprint arXiv:2407.21787, 2024

  5. [5]

    Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James T. Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. InThe Twelfth Inter- national Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024...

  6. [6]

    PixArt-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2025. 16KIM AND MILLS: HUMAN PREFERENCE PREDICTION FOR T2I GENERA TION

  7. [7]

    Sana-sprint: One-step diffusion with continuous-time consistency distillation

    Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time consistency distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16185–16195, 2025

  8. [8]

    Mills, Liyao Jiang, Chao Gao, and Di Niu

    Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, and Di Niu. Re-ttention: Ultra sparse visual generation via attention statistical reshape. InNeurIPS, 2025

  9. [9]

    Fp4dit: Towards effective floating point quantization for diffusion transformers.arXiv preprint arXiv:2503.15465, 2025

    Ruichen Chen, Keith G Mills, and Di Niu. Fp4dit: Towards effective floating point quantization for diffusion transformers.arXiv preprint arXiv:2503.15465, 2025

  10. [10]

    Comfyui: A node-based gui for stable diffusion, 2025

    ComfyUI Contributors. Comfyui: A node-based gui for stable diffusion, 2025. URL https://github.com/Comfy-Org/ComfyUI

  11. [11]

    CyberRealistic Pony - | Stable Diffusion XL Checkpoint | Civitai — civitai.com.https://civitai.com/models/443821/ cyberrealistic-pony, 2026

    Cyberdelia. CyberRealistic Pony - | Stable Diffusion XL Checkpoint | Civitai — civitai.com.https://civitai.com/models/443821/ cyberrealistic-pony, 2026. [Accessed 02-03-2026]

  12. [12]

    A survey on diffusion models for inverse problems.arXiv preprint arXiv:2410.00083, 2024

    Giannis Daras, Hyungjin Chung, Chieh-Hsin Lai, Yuki Mitsufuji, Jong Chul Ye, Pey- man Milanfar, Alexandros G Dimakis, and Mauricio Delbracio. A survey on diffusion models for inverse problems.arXiv preprint arXiv:2410.00083, 2024

  13. [13]

    Accelerating the super-resolution convolutional neural network

    Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. InEuropean conference on computer vision, pages 391–

  14. [14]

    An image is worth 16x16 words: Trans- formers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Trans- formers for image recognition at scale. In9th International Conference on Learning Representations, ICLR 202...

  15. [15]

    URLhttps://openreview.net/forum?id=YicbFdNTTy

  16. [16]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, ICML 2024, Vienn...

  17. [17]

    Reno: Enhancing one-step text-to-image models through reward-based noise optimization.Advances in Neural Information Processing Systems, 37:125487– 125519, 2024

    Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization.Advances in Neural Information Processing Systems, 37:125487– 125519, 2024

  18. [18]

    Noise hypernetworks: Amortizing test-time compute in diffusion models.arXiv preprint arXiv:2508.09968, 2025

    Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, and Zeynep Akata. Noise hypernetworks: Amortizing test-time compute in diffusion models.arXiv preprint arXiv:2508.09968, 2025

  19. [19]

    Guided by gut: Efficient test-time scaling with reinforced intrinsic confidence.arXiv preprint arXiv:2505.20325, 2025

    Amirhosein Ghasemabadi, Keith G Mills, Baochun Li, and Di Niu. Guided by gut: Efficient test-time scaling with reinforced intrinsic confidence.arXiv preprint arXiv:2505.20325, 2025. KIM AND MILLS: HUMAN PREFERENCE PREDICTION FOR T2I GENERA TION17

  20. [20]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014

  21. [21]

    Initno: Boosting text-to-image diffusion models via initial noise optimization

    Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. Initno: Boosting text-to-image diffusion models via initial noise optimization. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9380–9389, 2024

  22. [22]

    Han, Keith G

    Fred X. Han, Keith G. Mills, Fabian Chudak, Parsa Riahi, Mohammad Salameh, Jialin Zhang, Wei Lu, Shangling Jui, and Di Niu.A General-Purpose Transferable Predictor for Neural Architecture Search, pages 721–729. Society for Industrial and Applied Mathematics, 2023. doi: 10.1137/1.9781611977653.ch81. URLhttps://epubs. siam.org/doi/abs/10.1137/1.9781611977653.ch81

  23. [23]

    Identity mappings in deep residual networks

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. InEuropean Conference on Computer Vision, pages 630–645. Springer, 2016

  24. [24]

    CLIPS core: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIP- Score: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana,...

  25. [25]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

  26. [26]

    Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  27. [27]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  28. [28]

    Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022

  29. [29]

    Make-an-audio: Text-to-audio genera- tion with prompt-enhanced diffusion models

    Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhen- hui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-an-audio: Text-to-audio genera- tion with prompt-enhanced diffusion models. InInternational Conference on Machine Learning, pages 13916–13932. PMLR, 2023

  30. [30]

    Raise: Requirement-adaptive evolutionary refinement for training-free text-to-image alignment, 2026

    Liyao Jiang, Ruichen Chen, Chao Gao, and Di Niu. Raise: Requirement-adaptive evolutionary refinement for training-free text-to-image alignment, 2026. URLhttps: //arxiv.org/abs/2603.00483. 18KIM AND MILLS: HUMAN PREFERENCE PREDICTION FOR T2I GENERA TION

  31. [31]

    Ess-flow: Training-free guidance of flow-based models as inference in source space.arXiv preprint arXiv:2510.05849, 2025

    Adhithyan Kalaivanan, Zheng Zhao, Jens Sjölund, and Fredrik Lindsten. Ess-flow: Training-free guidance of flow-based models as inference in source space.arXiv preprint arXiv:2510.05849, 2025

  32. [32]

    Juggernaut XL - Ragnarok_by_RunDiffusion | Stable Diffusion XL Check- point | Civitai — civitai.com.https://civitai.com/models/133005/ juggernaut-xl, 2025

    KandooAI. Juggernaut XL - Ragnarok_by_RunDiffusion | Stable Diffusion XL Check- point | Civitai — civitai.com.https://civitai.com/models/133005/ juggernaut-xl, 2025. [Accessed 27-01-2026]

  33. [33]

    Elucidating the design space of diffusion-based generative models.Advances in neural information processing sys- tems, 35:26565–26577, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing sys- tems, 35:26565–26577, 2022

  34. [34]

    Kingma and Max Welling

    Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Yoshua Bengio and Yann LeCun, editors,2nd International Conference on Learning Repre- sentations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Pro- ceedings, 2014. URLhttp://arxiv.org/abs/1312.6114

  35. [35]

    Panoptic feature pyramid networks

    Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6399–6408, 2019

  36. [36]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems, 36:36652–36663, 2023

  37. [37]

    Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

  38. [38]

    Noisear: Autoregressing initial noise prior for diffusion models.arXiv preprint arXiv:2506.01337, 2025

    Zeming Li, Xiangyue Liu, Xiangyu Zhang, Ping Tan, and Heung-Yeung Shum. Noisear: Autoregressing initial noise prior for diffusion models.arXiv preprint arXiv:2506.01337, 2025

  39. [39]

    Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding, 2024

    Zhimin Li, Jianwei Zhang, Qin Lin, Jiangfeng Xiong, Yanxin Long, Xinchi Deng, Yingfang Zhang, Xingchao Liu, Minbin Huang, Zedong Xiao, Dayou Chen, Jiajun He, Jiahao Li, Wenyue Li, Chen Zhang, Rongwei Quan, Jianxiang Lu, Jiabin Huang, Xiaoyan Yuan, Xiaoxiao Zheng, Yixuan Li, Jihong Zhang, Chao Zhang, Meng Chen, Jie Liu, Zheng Fang, Weiyan Wang, Jinbao Xue,...

  40. [40]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URLhttps://openreview.net/ forum?id=Bkg6RiCqY7

  41. [41]

    Pinat: A permutation invariance augmented transformer for nas predictor

    Shun Lu, Yu Hu, Peihao Wang, Yan Han, Jianchao Tan, Jixiang Li, Sen Yang, and Ji Liu. Pinat: A permutation invariance augmented transformer for nas predictor. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2023. KIM AND MILLS: HUMAN PREFERENCE PREDICTION FOR T2I GENERA TION19

  42. [42]

    Dreamshaper - stable diffusion fine-tune.https://civitai.com/ models/4384/dreamshaper, 2023

    Lykon. Dreamshaper - stable diffusion fine-tune.https://civitai.com/ models/4384/dreamshaper, 2023

  43. [43]

    Hpsv3: Towards wide- spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide- spectrum human preference score. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 15086–15095, 2025

  44. [44]

    Mills, Mohammad Salameh, Ruichen Chen, Wei Hassanpour, Negar Lu, and Di Niu

    Keith G. Mills, Mohammad Salameh, Ruichen Chen, Wei Hassanpour, Negar Lu, and Di Niu. Qua 2sedimo: Quantifiable quantization sensitivity of diffusion models. In AAAI, 2025

  45. [45]

    Poste- rior inference in latent space for scalable constrained black-box optimization.arXiv preprint arXiv:2507.00480, 2025

    Kiyoung Om, Kyuil Sim, Taeyoung Yun, Hyeongyu Kang, and Jinkyoo Park. Poste- rior inference in latent space for scalable constrained black-box optimization.arXiv preprint arXiv:2507.00480, 2025

  46. [46]

    Illustrious XL 2.0 - v2.0 | Illustrious Checkpoint | Civitai — civitai.com

    ONOMAAI. Illustrious XL 2.0 - v2.0 | Illustrious Checkpoint | Civitai — civitai.com. https://civitai.com/models/1369089/illustrious-xl-20, 2025. [Accessed 27-01-2026]

  47. [47]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  48. [48]

    SDXL: improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URLhttps://openreview.net/fo...

  49. [49]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Om- mer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022

  50. [50]

    Mills, Negar Hassanpour, Fred Han, Shuting Zhang, Wei Lu, Shangling Jui, Chunhua Zhou, Fengyu Sun, and Di Niu

    Mohammad Salameh, Keith G. Mills, Negar Hassanpour, Fred Han, Shuting Zhang, Wei Lu, Shangling Jui, Chunhua Zhou, Fengyu Sun, and Di Niu. Autogo: Automated computation graph optimization for neural network evolution. InAdvances in Neural Information Processing Systems, 2023

  51. [51]

    The surprising effectiveness of diffusion models for op- tical flow and monocular depth estimation.Advances in Neural Information Processing Systems, 36:39443–39469, 2023

    Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek Kar, Mohammad Norouzi, Deqing Sun, and David J Fleet. The surprising effectiveness of diffusion models for op- tical flow and monocular depth estimation.Advances in Neural Information Processing Systems, 36:39443–39469, 2023

  52. [52]

    A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

    Claude Elwood Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

  53. [53]

    Calibrating generative models.arXiv preprint arXiv:2510.10020, 2025

    Henry D Smith, Nathaniel L Diamant, and Brian L Trippe. Calibrating generative models.arXiv preprint arXiv:2510.10020, 2025

  54. [54]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational confer- ence on machine learning, pages 2256–2265. PMLR, 2015. 20KIM AND MILLS: HUMAN PREFERENCE PREDICTION FOR T2I GENERA TION

  55. [55]

    Denoising diffusion implicit mod- els.arXiv preprint arXiv:2010.02502, 2020

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit mod- els.arXiv preprint arXiv:2010.02502, 2020

  56. [56]

    Diffusion-based neural network weights generation.arXiv preprint arXiv:2402.18153, 2024

    Bedionita Soro, Bruno Andreis, Hayeon Lee, Wonyong Jeong, Song Chong, Frank Hutter, and Sung Ju Hwang. Diffusion-based neural network weights generation.arXiv preprint arXiv:2402.18153, 2024

  57. [57]

    Pixart-alpha prompt list.https://github.com/ PixArt-alpha/PixArt-alpha/blob/master/asset/samples.txt,

    PixArt-Alpha Team. Pixart-alpha prompt list.https://github.com/ PixArt-alpha/PixArt-alpha/blob/master/asset/samples.txt,

  58. [58]

    Accessed: 2026-01-15

  59. [59]

    Deep learning and the information bottleneck principle

    Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In2015 ieee information theory workshop (itw), pages 1–5. Ieee, 2015

  60. [60]

    The information bottleneck method.arXiv preprint physics/0004057, 2000

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method.arXiv preprint physics/0004057, 2000

  61. [61]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.,

  62. [62]

    URLhttps://proceedings.neurips.cc/paper_files/paper/ 2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

  63. [63]

    Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models.arXiv preprint arXiv:2502.06999, 2025

    Siddarth Venkatraman, Mohsin Hasan, Minsu Kim, Luca Scimeca, Marcin Sendera, Yoshua Bengio, Glen Berseth, and Nikolay Malkin. Outsourced diffusion sampling: Efficient posterior inference in latent spaces of generative models.arXiv preprint arXiv:2502.06999, 2025

  64. [64]

    Diffusers: State-of-the-art diffusion models

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022

  65. [65]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Pu- rushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8228–8238, June 2024

  66. [66]

    Picking winning tickets before training by preserving gradient flow.arXiv preprint arXiv:2002.07376, 2020

    Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow.arXiv preprint arXiv:2002.07376, 2020

  67. [67]

    Dif- fusion models for molecules: A survey of methods and tasks.arXiv preprint arXiv:2502.09511, 2025

    Liang Wang, Chao Song, Zhiyuan Liu, Yu Rong, Qiang Liu, and Shu Wu. Dif- fusion models for molecules: A survey of methods and tasks.arXiv preprint arXiv:2502.09511, 2025

  68. [68]

    The silent as- sistant: Noisequery as implicit guidance for goal-driven image generation

    Ruoyu Wang, Huayang Huang, Ye Zhu, Olga Russakovsky, and Yu Wu. The silent as- sistant: Noisequery as implicit guidance for goal-driven image generation. InProceed- ings of the IEEE/CVF International Conference on Computer Vision, pages 17618– 17628, 2025. KIM AND MILLS: HUMAN PREFERENCE PREDICTION FOR T2I GENERA TION21

  69. [69]

    The lambdaloss framework for ranking metric optimization

    Xuanhui Wang, Cheng Li, Nadav Golbandi, Michael Bendersky, and Marc Najork. The lambdaloss framework for ranking metric optimization. InProceedings of the 27th ACM international conference on information and knowledge management, pages 1313–1322, 2018

  70. [70]

    Source-guided flow matching.arXiv preprint arXiv:2508.14807, 2025

    Zifan Wang, Alice Harting, Matthieu Barreau, Michael M Zavlanos, and Karl H Jo- hansson. Source-guided flow matching.arXiv preprint arXiv:2508.14807, 2025

  71. [71]

    How powerful are performance predictors in neural architecture search?NeurIPS, 2021

    Colin White, Arber Zela, Robin Ru, Yang Liu, and Frank Hutter. How powerful are performance predictors in neural architecture search?NeurIPS, 2021

  72. [72]

    Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  73. [73]

    Densedpo: Fine-grained temporal preference optimization for video diffusion models.Advances in Neural In- formation Processing Systems, 38:171632–171668, 2026

    Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, and Aliaksandr Siarohin. Densedpo: Fine-grained temporal preference optimization for video diffusion models.Advances in Neural In- formation Processing Systems, 38:171632–171668, 2026

  74. [74]

    SANA: efficient high- resolution text-to-image synthesis with linear diffusion transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: efficient high- resolution text-to-image synthesis with linear diffusion transformers. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. U...

  75. [75]

    Imagereward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Processing Systems, 36:15903– 15935, 2023

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Processing Systems, 36:15903– 15935, 2023

  76. [76]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. 2025. URL https://openreview.net/forum?id=LQzN6TRFg9

  77. [77]

    Neural meth- ods for amortized inference.Annual Review of Statistics and Its Application, 12(1): 311–335, 2025

    Andrew Zammit-Mangion, Matthew Sainsbury-Dale, and Raphaël Huser. Neural meth- ods for amortized inference.Annual Review of Statistics and Its Application, 12(1): 311–335, 2025

  78. [78]

    Learning multi-dimensional human preference for text-to-image generation

    Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, and Zhongyuan Wang. Learning multi-dimensional human preference for text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 8018–8027, June 2024

  79. [79]

    A photo of a male android, half robot and half humanoid, resembling actor Liam Hemsworth, posing stoically on display at a museum

    Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17688–17697, 2025. 22KIM AND MILLS: HUMAN PREFERENCE PREDICTION FOR T2I GENERA TION Supplementary Materials We provide additional im...