pith. machine review for the scientific record.

arxiv: 2605.06070 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · preference optimization · offline reward · Arena scores · direct preference optimization · text-to-image · fine-grained feedback · pairwise preferences

The pith

ArenaPO estimates absolute quality gaps from pairwise preferences by modeling each generator's capability as a Gaussian distribution, and uses those gaps as fine-grained offline rewards for diffusion model optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes ArenaPO to enhance preference alignment in text-to-image diffusion models beyond the limitations of binary direct preference optimization. It constructs a model Arena where each model's capability is modeled as a Gaussian distribution, inferred from annotated pairwise preferences. For each image pair, the absolute quality gap is estimated via latent-variable inference with a truncated normal distribution, providing refined feedback. This approach delivers the benefits of rich rewards from RLHF combined with the efficiency of DPO, all computed offline without a reward model. Training on Pick-a-Pic v2 and HPD v3 datasets demonstrates consistent outperformance over baselines.

Core claim

ArenaPO treats each image as a sample from its source model's capability distribution and uses truncated normal inference, conditioned on the observed preference, to compute the absolute quality gap between paired images. That gap then acts as the reward signal during preference optimization, without requiring an explicit reward model.

What carries the argument

Model Arena with Gaussian capability distributions and truncated normal latent-variable inference to estimate absolute quality gaps from pairwise preferences.
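
That machinery is compact enough to sketch. Under the stated assumptions the unobserved gap between the two images is itself Gaussian, and the observed preference truncates it at zero, so one natural fine-grained reward is the mean of that truncated normal. A minimal sketch follows, assuming independent per-model Gaussians and taking the truncated-normal mean as the gap estimate; the function name and the example numbers are illustrative, not taken from the paper.

```python
# Minimal sketch (not the paper's exact estimator): expected quality gap between
# a preferred ("winner") and rejected ("loser") image, assuming each image is a
# sample from its source model's Gaussian capability distribution and
# conditioning on the observed preference (gap > 0) via a truncated normal.
import numpy as np
from scipy.stats import truncnorm

def expected_quality_gap(mu_w, sigma_w, mu_l, sigma_l):
    """E[q_w - q_l | q_w > q_l] with q_w ~ N(mu_w, sigma_w^2), q_l ~ N(mu_l, sigma_l^2)."""
    loc = mu_w - mu_l                          # mean of the unconditioned gap
    scale = np.sqrt(sigma_w**2 + sigma_l**2)   # std of the gap (independent samples)
    a = (0.0 - loc) / scale                    # truncation point: winner was preferred
    return truncnorm.mean(a, np.inf, loc=loc, scale=scale)

# Illustrative values only:
print(expected_quality_gap(1.5, 0.4, 0.2, 0.5))  # clearly separated models: larger gap
print(expected_quality_gap(0.6, 0.5, 0.5, 0.5))  # near-tied models: smaller gap
```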

If this is right

  • ArenaPO achieves more precise optimization than binary DPO by using continuous quality gap rewards.
  • The method requires no online reward model training or inference, reducing computational overhead.
  • Models trained with ArenaPO show improved performance on standard preference datasets like Pick-a-Pic v2 and HPD v3.
  • Offline computation of rewards allows seamless integration into existing DPO pipelines (one possible integration is sketched below).
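
Because the gap is precomputed offline, it can enter a standard DPO-style objective as nothing more than a per-pair weight. The sketch below is a hypothetical integration, not the paper's loss; the coefficient gamma echoes the γ the authors grid-search in Figure 7, but exactly how ArenaPO incorporates the quality difference (QD) is not specified in the material above.

```python
# Hypothetical integration sketch (not the paper's actual objective): a
# Diffusion-DPO-style logistic loss in which the precomputed offline quality
# gap, scaled by gamma, weights each chosen/rejected pair.
import torch
import torch.nn.functional as F

def gap_weighted_dpo_loss(logratio_w, logratio_l, quality_gap, beta=0.1, gamma=1.0):
    """
    logratio_w / logratio_l: log pi_theta / pi_ref terms for the chosen and
    rejected image (e.g., negated denoising-loss differences in Diffusion-DPO).
    quality_gap: precomputed offline Arena reward for the pair (larger = clearer win).
    """
    margin = beta * (logratio_w - logratio_l)
    weight = gamma * quality_gap          # fine-grained, per-pair signal
    return -(weight * F.logsigmoid(margin)).mean()
```

Because quality_gap is a table lookup rather than a reward-model forward pass, a training step under this sketch costs the same as plain DPO.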

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique might generalize to other domains where pairwise preferences are available, such as text generation, by applying similar Gaussian modeling.
  • The probabilistic nature of the inference could provide uncertainty estimates for rewards, potentially improving training stability.
  • If the quality gap estimates align with human perception across diverse prompts, it could scale preference learning with less annotation effort.

Load-bearing premise

Modeling each model's capability as a Gaussian distribution and estimating quality gaps with truncated normal inference on pairwise preferences produces accurate absolute rewards.

What would settle it

A direct comparison where humans rate images generated by ArenaPO-trained models versus DPO-trained models on the same prompts, checking if the preference rates match the predicted quality gaps.
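
One way to operationalize that check, under an assumed Thurstone-style noise model: a predicted gap d implies a predicted preference rate Φ(d / s) for some comparison-noise scale s, which can be correlated with the per-prompt rates human raters actually produce. The sketch below is illustrative; the probit link and the noise scale are assumptions, not choices stated by the paper.

```python
# Hedged sketch of the settling experiment: do observed human preference rates
# track the quality gaps the Arena model predicts? Assumes a probit link
# (Thurstone-style) between predicted gap and win probability.
import numpy as np
from scipy.stats import norm, pearsonr

def predicted_win_rate(predicted_gap, noise_scale=1.0):
    """Probability a rater prefers the output with the higher predicted quality."""
    return norm.cdf(np.asarray(predicted_gap) / noise_scale)

def agreement(predicted_gaps, human_wins, human_totals):
    """Pearson correlation between predicted and observed per-prompt win rates."""
    observed = np.asarray(human_wins) / np.asarray(human_totals)
    return pearsonr(predicted_win_rate(predicted_gaps), observed)

# e.g. agreement([0.8, 0.1, 1.4], [41, 27, 48], [50, 50, 50])
```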

Figures

Figures reproduced from arXiv: 2605.06070 by Edward Zhongwei Zhang, Jing Zhang, Qingyi Gu, Xuewen Liu, Yue Zhao, Zhen Dong, Zhikai Li.

Figure 1
Figure 1: Comparison of RLHF, DPO, and ArenaPO. The proposed ArenaPO extracts and leverages rich offline rewards from the model Arena, without requiring a reward model, enabling both fine-grained and efficient preference alignment. In this work, we aim to explore how fine-grained preference can be incorporated into DPO in an offline manner without relying on an explicit reward model. Following Thurstone's Case V theo… view at source ↗
Figure 2
Figure 2: Motivation of the proposed ArenaPO. Each output image can be viewed as one sample drawn from the model's capability distribution N(µ, σ²). Thus, for two output images, their quality gap can be modeled as a function of their source models' µ_w − µ_l, σ_w, and σ_l. With the above insight, we propose ArenaPO, which leverages Arena information to provide fine-grained preference in an offline… view at source ↗
Figure 3
Figure 3: Overview of the proposed ArenaPO. First, we build a model Arena that converts image… view at source ↗
Figure 4
Figure 4: Win-rates of models trained using different methods on Pick-a-Pic v2 dataset versus the… view at source ↗
Figure 5
Figure 5: Qualitative comparison of different methods for training Stable Diffusion 1.5 on Pick-a-Pic v2 dataset. Prompts: 1) Cute grey cat, digital oil painting by Monet. 2) A pale half-elf in a dark, silver-trimmed robe, with long hair partly tied back. 3) An astronaut floating in space, with Earth dazzling starlight shining in the background. Columns show training steps 0, 500, 1000, 1500, and 2000. view at source ↗
Figure 7
Figure 7: Grid search of the hyperparameter γ. Effect of Scaling Coefficient γ: to align the scale when incorporating QD, we introduce a scaling coefficient γ. We perform a grid search over γ, as shown in… view at source ↗
Figure 8
Figure 8: More results of qualitative comparison of different methods for training Stable Diffusion… view at source ↗
read the original abstract

Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for a image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on truncated normal distribution, which serves as fine-grained feedback during training. It does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes ArenaPO for fine-grained preference optimization of text-to-image diffusion models. It constructs a model Arena in which each model's capability is represented as a Gaussian distribution inferred by traversing annotated pairwise preferences. Each output image is treated as a sample from its model's capability distribution. For any image pair, the absolute quality gap is then estimated via latent-variable inference under a truncated normal distribution conditioned on the two capability distributions and the observed preference; this gap serves as an offline reward signal in a DPO-style training procedure. The method is evaluated on the Pick-a-Pic v2 and HPD v3 datasets and is reported to outperform existing baselines while requiring no separate reward model and incurring no additional training overhead.

Significance. If the inferred quality gaps prove faithful to underlying human-perceived differences and supply information beyond the binary preference labels, ArenaPO would usefully combine the richness of RLHF-style rewards with the computational efficiency of DPO. The offline, reward-model-free design is a concrete practical strength that could scale to larger diffusion-model alignment pipelines.

major comments (3)
  1. [§3.2] §3.2 (latent-variable inference): The absolute quality gap is obtained by conditioning a truncated normal on the same pairwise preference labels that were already used to fit the per-model Gaussian parameters. No held-out human correlation study or ablation that isolates the contribution of the continuous gap versus the binary label is reported; without such evidence the claim that the procedure supplies genuinely fine-grained, non-circular feedback remains unverified and load-bearing for the central advantage over standard DPO.
  2. [§4.1] §4.1 (experimental protocol): Both Arena construction and subsequent ArenaPO training are performed on the identical Pick-a-Pic v2 and HPD v3 splits. This shared data usage risks inflating apparent gains; a clear separation (e.g., Arena built on a disjoint subset) or cross-dataset transfer results are needed to substantiate that the inferred rewards generalize.
  3. [§3.1] §3.1 (Gaussian capability model): The assumption that model capabilities follow symmetric Gaussian distributions is introduced without diagnostic checks against the empirical distribution of pairwise outcomes. If the true capability distribution is skewed or multimodal, the downstream truncated-normal gap estimates become biased; a sensitivity analysis or alternative distributional assumption would be required to bound the robustness of the reward signal.
minor comments (3)
  1. [Abstract] Abstract: 'a image pair' should read 'an image pair'.
  2. [Figure 1] Figure 1 (Arena diagram): The arrow labeled 'traversing pairwise preferences' does not indicate whether the inference is performed once globally or per training batch; clarify the computational graph.
  3. [§4.3] §4.3 (baseline comparison): The reported metrics lack error bars or the number of random seeds; adding these would strengthen the claim of consistent outperformance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed and constructive feedback. We appreciate the opportunity to clarify and strengthen our manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will make.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (latent-variable inference): The absolute quality gap is obtained by conditioning a truncated normal on the same pairwise preference labels that were already used to fit the per-model Gaussian parameters. No held-out human correlation study or ablation that isolates the contribution of the continuous gap versus the binary label is reported; without such evidence the claim that the procedure supplies genuinely fine-grained, non-circular feedback remains unverified and load-bearing for the central advantage over standard DPO.

    Authors: We thank the referee for highlighting this important point. The inference procedure is designed to extract a continuous quality gap from the binary preference by modeling the latent capabilities as Gaussians and using truncated normal conditioning, which is not merely reusing the label but deriving a magnitude estimate based on the distributional assumptions. However, we acknowledge that empirical validation isolating this contribution is necessary. In the revised version, we will add an ablation study that compares ArenaPO using the inferred continuous gaps against a variant that uses only binary preferences (equivalent to standard DPO). We will also attempt to correlate the inferred gaps with any available human rating data on held-out pairs from the datasets to provide additional evidence of faithfulness. revision: yes

  2. Referee: [§4.1] §4.1 (experimental protocol): Both Arena construction and subsequent ArenaPO training are performed on the identical Pick-a-Pic v2 and HPD v3 splits. This shared data usage risks inflating apparent gains; a clear separation (e.g., Arena built on a disjoint subset) or cross-dataset transfer results are needed to substantiate that the inferred rewards generalize.

    Authors: This is a valid concern regarding potential data leakage or overfitting to the specific splits. To address it, we will revise the experimental protocol to construct the Arena model using a disjoint subset of the training data (e.g., 50% split for Arena construction and the remaining for preference optimization). We will report the performance under this separated setting. Additionally, we will include cross-dataset transfer experiments, such as building the Arena on Pick-a-Pic v2 and evaluating ArenaPO on HPD v3, and vice versa, to demonstrate generalization of the inferred rewards. revision: yes

  3. Referee: [§3.1] §3.1 (Gaussian capability model): The assumption that model capabilities follow symmetric Gaussian distributions is introduced without diagnostic checks against the empirical distribution of pairwise outcomes. If the true capability distribution is skewed or multimodal, the downstream truncated-normal gap estimates become biased; a sensitivity analysis or alternative distributional assumption would be required to bound the robustness of the reward signal.

    Authors: We agree that the Gaussian assumption should be validated. In the revised manuscript, we will include diagnostic plots comparing the empirical distribution of pairwise preference outcomes to the fitted Gaussian model. Furthermore, we will conduct a sensitivity analysis by experimenting with alternative distributions, such as a Laplace distribution or a Gaussian mixture model, and report the impact on the final performance metrics to bound the robustness of our approach. revision: yes
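
One concrete form the promised diagnostic could take (an editorial sketch, not the authors' stated procedure): the Gaussian capability model constrains pairwise data only through the win probability it implies for each model pair, so comparing that prediction with the empirical head-to-head win rate is a direct calibration check.

```python
# Illustrative calibration check, not the authors' code: predicted vs. empirical
# win rates for every model pair under the fitted Gaussian capability model.
import numpy as np
from scipy.stats import norm

def calibration_table(mu, sigma, wins, totals):
    """
    mu, sigma: fitted per-model capability parameters (dicts: model -> float).
    wins[(i, j)]: times model i beat model j; totals[(i, j)]: comparisons of i vs j.
    Returns rows of (pair, predicted win rate, empirical win rate).
    """
    rows = []
    for (i, j), n in totals.items():
        pred = norm.cdf((mu[i] - mu[j]) / np.sqrt(sigma[i] ** 2 + sigma[j] ** 2))
        rows.append(((i, j), float(pred), wins[(i, j)] / n))
    return rows

# Systematic discrepancies between the two rate columns would be the signature
# of skew or multimodality that a single Gaussian per model cannot capture.
```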

Circularity Check

0 steps flagged

No circularity: the parametric inference produces continuous rewards from binary preferences, and by construction the output does not simply restate its input.

full rationale

The paper's core chain models each model's capability as a Gaussian inferred from annotated pairwise preferences, treats images as samples from those distributions, and then applies latent-variable inference under a truncated normal (conditioned on the observed preference) to estimate an absolute quality gap used as the offline reward. This is an explicit statistical estimation procedure that maps binary labels to continuous values via modeling assumptions; it does not equate the output gap to the input preference by definition or by fitting a parameter that is then renamed as the target. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the steps. The subsequent DPO-style training on the diffusion model uses these precomputed gaps as an independent objective, making the derivation self-contained, although its usefulness still rests on external validation of the Gaussian/truncated-normal model.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on two modeling choices whose justification is not supplied in the abstract: representing model capability as a Gaussian and treating the quality gap as a latent variable under a truncated normal. Both introduce fitted parameters and domain assumptions whose validity cannot be checked from the given text.
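
For concreteness, one plausible instantiation of the first choice (an editorial sketch under assumed hyperparameters, not the paper's fitting code): traverse the annotated preferences and, after each comparison, update the two models' Gaussian parameters by moment-matching a TrueSkill/Thurstone-style Bayesian posterior. The per-sample noise beta and the grid approximation below are assumptions.

```python
# Illustrative Arena fit: Bayesian update of a model's Gaussian capability
# after one observed win, approximated on a grid and moment-matched back to a
# Gaussian. beta (per-sample noise) is an assumed hyperparameter.
import numpy as np
from scipy.stats import norm

def update_after_win(mu_w, sig_w, mu_l, sig_l, beta=1.0, grid=2001, span=6.0):
    """Return the winner's updated (mu, sigma); the loser's mirror update is analogous."""
    theta = np.linspace(mu_w - span * sig_w, mu_w + span * sig_w, grid)
    prior = norm.pdf(theta, mu_w, sig_w)
    # Likelihood that a sample from capability theta beat a sample from the
    # losing model, with the loser's capability and both samples' noise marginalized out.
    lik = norm.cdf((theta - mu_l) / np.sqrt(2 * beta ** 2 + sig_l ** 2))
    w = prior * lik
    w /= w.sum()                                  # normalize on the uniform grid
    new_mu = float((theta * w).sum())
    new_sig = float(np.sqrt(((theta - new_mu) ** 2 * w).sum()))
    return new_mu, new_sig

# Traversing the preference list then amounts to repeated calls to this update
# (and its counterpart for the losing model) over all annotated pairs.
```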

free parameters (2)
  • per-model Gaussian mean and variance
    Inferred by traversing the annotated pairwise preferences; exact fitting procedure and any regularization not stated in abstract.
  • parameters of the truncated-normal inference
    Used to estimate absolute quality gap from capability distributions and observed preference.
axioms (2)
  • domain assumption Each model's capability can be faithfully represented by a single Gaussian distribution over image quality.
    Invoked when constructing the model Arena from pairwise data.
  • domain assumption The absolute quality gap between two images is a latent variable whose posterior can be recovered via truncated-normal inference conditioned on the observed preference.
    Central step that converts binary labels into continuous rewards.
invented entities (1)
  • Model Arena with per-model Gaussian capability distributions (no independent evidence)
    purpose: To serve as an offline source of fine-grained reward signals for preference optimization.
    New construct introduced to bridge pairwise data and continuous rewards; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5553 in / 1667 out tokens · 75780 ms · 2026-05-08T14:12:26.064671+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 12 canonical work pages · 9 internal anchors

  1. [1]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. InInternational Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

  2. [2]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

  3. [3]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  4. [4]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  5. [5]

    Getting it right: Improving spatial consistency in text-to-image models

    Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, et al. Getting it right: Improving spatial consistency in text-to-image models. InEuropean Conference on Computer Vision, pages 204–222. Springer, 2024

  6. [6]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

  7. [7]

    Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  8. [8]

    Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 45 (9):10850–10869, 2023

    Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 45 (9):10850–10869, 2023

  9. [9]

    Diffusion-sdpo: Safeguarded direct preference optimization for diffusion models.arXiv preprint arXiv:2511.03317, 2025

    Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, and Kaifu Zhang. Diffusion-sdpo: Safeguarded direct preference optimization for diffusion models.arXiv preprint arXiv:2511.03317, 2025

  10. [10]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  11. [11]

    Margin- aware preference optimization for aligning diffusion models without reference

    Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, and Jongheon Jeong. Margin- aware preference optimization for aligning diffusion models without reference. InFirst Work- shop on Scalable Optimization for Efficient and Adaptive Foundation Models, 2024

  12. [12]

    Continuous univariate distributions, volume 2

    Norman L Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous univariate distributions, volume 2. John Wiley & Sons, 1995

  13. [13]

    Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35: 26565–26577, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35: 26565–26577, 2022

  14. [14]

    Scalable ranked preference optimization for text-to-image generation

    Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata, Sergey Tulyakov, Jian Ren, and Anil Kag. Scalable ranked preference optimization for text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18399–18410, 2025

  15. [15]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

  16. [16]

    Divergence minimization preference optimization for diffusion model alignment.arXiv preprint arXiv:2507.07510,

    Binxu Li, Minkai Xu, Jiaqi Han, Meihua Dang, and Stefano Ermon. Divergence minimization preference optimization for diffusion model alignment.arXiv preprint arXiv:2507.07510, 2025

  17. [17]

    Aligning diffusion models by optimizing human utility

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility. Advances in Neural Information Processing Systems, 37:24897–24925, 2024

  18. [18]

    K-sort eval: Efficient preference evaluation for visual generation via corrected vlm-as-a-judge

    Zhikai Li, Xuewen Liu, Wangbo Zhao, Pan Du, Kaicheng Zhou, Qingyi Gu, Yang You, Zhen Dong, Kurt Keutzer, et al. K-sort eval: Efficient preference evaluation for visual generation via corrected vlm-as-a-judge. In The Fourteenth International Conference on Learning Representations

  19. [19]

    K-sort arena: Efficient and reliable benchmarking for generative models via k-wise human preferences

    Zhikai Li, Xuewen Liu, Dongrong Joe Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, and Zhen Dong. K-sort arena: Efficient and reliable benchmarking for generative models via k-wise human preferences. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9131–9141, 2025

  20. [20]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  21. [21]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  22. [22]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  23. [23]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  24. [24]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  26. [26]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  27. [27]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  28. [28]

    LAION-Aesthetics: Predicting the aesthetic quality of images. https://laion.ai/blog/laion-aesthetics/, 2022

    Christoph Schuhmann, Romain Vencu, Romai Beaumont, et al. LAION-Aesthetics: Predicting the aesthetic quality of images.https://laion.ai/blog/laion-aesthetics/, 2022

  29. [29]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  30. [30]

    Freeu: Free lunch in diffusion u-net

    Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. Freeu: Free lunch in diffusion u-net. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4733–4743, 2024

  31. [31]

    Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019

  32. [32]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

  33. [33]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

  34. [34]

    A law of comparative judgment

    LL Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927

  35. [35]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  36. [36]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  37. [37]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  38. [38]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  39. [39]

    Diffusion models: A comprehensive survey of methods and applications.ACM computing surveys, 56(4):1–39, 2023

    Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications.ACM computing surveys, 56(4):1–39, 2023

  40. [40]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

  41. [41]

    Self-rewarding language models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. InForty-first International Conference on Machine Learning, 2024

  42. [42]

    Dspo: Direct score preference optimization for diffusion model alignment

    Huaisheng Zhu, Teng Xiao, and Vasant G Honavar. Dspo: Direct score preference optimization for diffusion model alignment. In The Thirteenth International Conference on Learning Representations, 2025
