pith. machine review for the scientific record.

arxiv: 2605.06070 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · preference optimization · offline reward · Arena scores · direct preference optimization · text-to-image · fine-grained feedback · pairwise preferences

The pith

ArenaPO estimates absolute quality gaps from pairwise preferences by modeling each generator's capability as a Gaussian distribution, and uses those gaps as fine-grained offline rewards for diffusion model optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes ArenaPO to enhance preference alignment in text-to-image diffusion models beyond the limitations of binary direct preference optimization. It constructs a model Arena where each model's capability is modeled as a Gaussian distribution, inferred from annotated pairwise preferences. For each image pair, the absolute quality gap is estimated via latent-variable inference with a truncated normal distribution, providing refined feedback. This approach delivers the benefits of rich rewards from RLHF combined with the efficiency of DPO, all computed offline without a reward model. Training on Pick-a-Pic v2 and HPD v3 datasets demonstrates consistent outperformance over baselines.

Core claim

ArenaPO treats each image as a sample from its source model's capability distribution and uses truncated normal inference, conditioned on the observed preference, to compute the absolute quality gap between paired images. That gap then acts as the reward signal during preference optimization, without requiring an explicit reward model.

What carries the argument

Model Arena with Gaussian capability distributions and truncated normal latent-variable inference to estimate absolute quality gaps from pairwise preferences.
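
That machinery is compact enough to sketch. Under the stated assumptions the unobserved gap between the two images is itself Gaussian, and the observed preference truncates it at zero, so one natural fine-grained reward is the mean of that truncated normal. A minimal sketch follows, assuming independent per-model Gaussians and taking the truncated-normal mean as the gap estimate; the function name and the example numbers are illustrative, not taken from the paper.

```python
# Minimal sketch (not the paper's exact estimator): expected quality gap between
# a preferred ("winner") and rejected ("loser") image, assuming each image is a
# sample from its source model's Gaussian capability distribution and
# conditioning on the observed preference (gap > 0) via a truncated normal.
import numpy as np
from scipy.stats import truncnorm

def expected_quality_gap(mu_w, sigma_w, mu_l, sigma_l):
    """E[q_w - q_l | q_w > q_l] with q_w ~ N(mu_w, sigma_w^2), q_l ~ N(mu_l, sigma_l^2)."""
    loc = mu_w - mu_l                          # mean of the unconditioned gap
    scale = np.sqrt(sigma_w**2 + sigma_l**2)   # std of the gap (independent samples)
    a = (0.0 - loc) / scale                    # truncation point: winner was preferred
    return truncnorm.mean(a, np.inf, loc=loc, scale=scale)

# Illustrative values only:
print(expected_quality_gap(1.5, 0.4, 0.2, 0.5))  # clearly separated models: larger gap
print(expected_quality_gap(0.6, 0.5, 0.5, 0.5))  # near-tied models: smaller gap
```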

If this is right

  • ArenaPO achieves more precise optimization than binary DPO by using continuous quality gap rewards.
  • The method requires no online reward model training or inference, reducing computational overhead.
  • Models trained with ArenaPO show improved performance on standard preference datasets like Pick-a-Pic v2 and HPD v3.
  • Offline computation of rewards allows seamless integration into existing DPO pipelines (one possible integration is sketched below).
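
Because the gap is precomputed offline, it can enter a standard DPO-style objective as nothing more than a per-pair weight. The sketch below is a hypothetical integration, not the paper's loss; the coefficient gamma echoes the γ the authors grid-search in Figure 7, but exactly how ArenaPO incorporates the quality difference (QD) is not specified in the material above.

```python
# Hypothetical integration sketch (not the paper's actual objective): a
# Diffusion-DPO-style logistic loss in which the precomputed offline quality
# gap, scaled by gamma, weights each chosen/rejected pair.
import torch
import torch.nn.functional as F

def gap_weighted_dpo_loss(logratio_w, logratio_l, quality_gap, beta=0.1, gamma=1.0):
    """
    logratio_w / logratio_l: log pi_theta / pi_ref terms for the chosen and
    rejected image (e.g., negated denoising-loss differences in Diffusion-DPO).
    quality_gap: precomputed offline Arena reward for the pair (larger = clearer win).
    """
    margin = beta * (logratio_w - logratio_l)
    weight = gamma * quality_gap          # fine-grained, per-pair signal
    return -(weight * F.logsigmoid(margin)).mean()
```

Because quality_gap is a table lookup rather than a reward-model forward pass, a training step under this sketch costs the same as plain DPO.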

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique might generalize to other domains where pairwise preferences are available, such as text generation, by applying similar Gaussian modeling.
  • The probabilistic nature of the inference could provide uncertainty estimates for rewards, potentially improving training stability.
  • If the quality gap estimates align with human perception across diverse prompts, it could scale preference learning with less annotation effort.

Load-bearing premise

Modeling each model's capability as a Gaussian distribution and estimating quality gaps with truncated normal inference on pairwise preferences produces accurate absolute rewards.

What would settle it

A direct comparison where humans rate images generated by ArenaPO-trained models versus DPO-trained models on the same prompts, checking if the preference rates match the predicted quality gaps.
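
One way to operationalize that check, under an assumed Thurstone-style noise model: a predicted gap d implies a predicted preference rate Φ(d / s) for some comparison-noise scale s, which can be correlated with the per-prompt rates human raters actually produce. The sketch below is illustrative; the probit link and the noise scale are assumptions, not choices stated by the paper.

```python
# Hedged sketch of the settling experiment: do observed human preference rates
# track the quality gaps the Arena model predicts? Assumes a probit link
# (Thurstone-style) between predicted gap and win probability.
import numpy as np
from scipy.stats import norm, pearsonr

def predicted_win_rate(predicted_gap, noise_scale=1.0):
    """Probability a rater prefers the output with the higher predicted quality."""
    return norm.cdf(np.asarray(predicted_gap) / noise_scale)

def agreement(predicted_gaps, human_wins, human_totals):
    """Pearson correlation between predicted and observed per-prompt win rates."""
    observed = np.asarray(human_wins) / np.asarray(human_totals)
    return pearsonr(predicted_win_rate(predicted_gaps), observed)

# e.g. agreement([0.8, 0.1, 1.4], [41, 27, 48], [50, 50, 50])
```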

Figures

Figures reproduced from arXiv: 2605.06070 by Edward Zhongwei Zhang, Jing Zhang, Qingyi Gu, Xuewen Liu, Yue Zhao, Zhen Dong, Zhikai Li.

Figure 1
Figure 1: Comparison of RLHF, DPO, and ArenaPO. The proposed ArenaPO extracts and leverages rich offline rewards from the model Arena, without requiring a reward model, enabling both fine-grained and efficient preference alignment. In this work, we aim to explore how fine-grained preference can be incorporated into DPO in an offline manner without relying on an explicit reward model. Following Thurstone's Case V theo… view at source ↗
Figure 2
Figure 2: Motivation of the proposed ArenaPO. Each output image can be viewed as one sample drawn from the model's capability distribution N(µ, σ²). Thus, for two output images, their quality gap can be modeled as a function of their source models' µ_w − µ_l, σ_w, and σ_l. With the above insight, we propose ArenaPO, which leverages Arena information to provide fine-grained preference in an offline… view at source ↗
Figure 3
Figure 3: Overview of the proposed ArenaPO. First, we build a model Arena that converts image… view at source ↗
Figure 4
Figure 4: Win-rates of models trained using different methods on Pick-a-Pic v2 dataset versus the… view at source ↗
Figure 5
Figure 5: Qualitative comparison of different methods for training Stable Diffusion 1.5 on Pick-a-Pic v2 dataset. Prompts: 1) Cute grey cat, digital oil painting by Monet. 2) A pale half-elf in a dark, silver-trimmed robe, with long hair partly tied back. 3) An astronaut floating in space, with Earth dazzling starlight shining in the background. Columns show training steps 0, 500, 1000, 1500, and 2000. view at source ↗
Figure 7
Figure 7: Grid search of the hyperparameter γ. Effect of Scaling Coefficient γ: to align the scale when incorporating QD, we introduce a scaling coefficient γ. We perform a grid search over γ, as shown in… view at source ↗
Figure 8
Figure 8: More results of qualitative comparison of different methods for training Stable Diffusion… view at source ↗
read the original abstract

Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for a image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on truncated normal distribution, which serves as fine-grained feedback during training. It does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes ArenaPO for fine-grained preference optimization of text-to-image diffusion models. It constructs a model Arena in which each model's capability is represented as a Gaussian distribution inferred by traversing annotated pairwise preferences. Each output image is treated as a sample from its model's capability distribution. For any image pair, the absolute quality gap is then estimated via latent-variable inference under a truncated normal distribution conditioned on the two capability distributions and the observed preference; this gap serves as an offline reward signal in a DPO-style training procedure. The method is evaluated on the Pick-a-Pic v2 and HPD v3 datasets and is reported to outperform existing baselines while requiring no separate reward model and incurring no additional training overhead.

Significance. If the inferred quality gaps prove faithful to underlying human-perceived differences and supply information beyond the binary preference labels, ArenaPO would usefully combine the richness of RLHF-style rewards with the computational efficiency of DPO. The offline, reward-model-free design is a concrete practical strength that could scale to larger diffusion-model alignment pipelines.

major comments (3)
  1. [§3.2] §3.2 (latent-variable inference): The absolute quality gap is obtained by conditioning a truncated normal on the same pairwise preference labels that were already used to fit the per-model Gaussian parameters. No held-out human correlation study or ablation that isolates the contribution of the continuous gap versus the binary label is reported; without such evidence the claim that the procedure supplies genuinely fine-grained, non-circular feedback remains unverified and load-bearing for the central advantage over standard DPO.
  2. [§4.1] §4.1 (experimental protocol): Both Arena construction and subsequent ArenaPO training are performed on the identical Pick-a-Pic v2 and HPD v3 splits. This shared data usage risks inflating apparent gains; a clear separation (e.g., Arena built on a disjoint subset) or cross-dataset transfer results are needed to substantiate that the inferred rewards generalize.
  3. [§3.1] §3.1 (Gaussian capability model): The assumption that model capabilities follow symmetric Gaussian distributions is introduced without diagnostic checks against the empirical distribution of pairwise outcomes. If the true capability distribution is skewed or multimodal, the downstream truncated-normal gap estimates become biased; a sensitivity analysis or alternative distributional assumption would be required to bound the robustness of the reward signal.
minor comments (3)
  1. [Abstract] Abstract: 'a image pair' should read 'an image pair'.
  2. [Figure 1] Figure 1 (Arena diagram): The arrow labeled 'traversing pairwise preferences' does not indicate whether the inference is performed once globally or per training batch; clarify the computational graph.
  3. [§4.3] §4.3 (baseline comparison): The reported metrics lack error bars or the number of random seeds; adding these would strengthen the claim of consistent outperformance.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed and constructive feedback. We appreciate the opportunity to clarify and strengthen our manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will make.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (latent-variable inference): The absolute quality gap is obtained by conditioning a truncated normal on the same pairwise preference labels that were already used to fit the per-model Gaussian parameters. No held-out human correlation study or ablation that isolates the contribution of the continuous gap versus the binary label is reported; without such evidence the claim that the procedure supplies genuinely fine-grained, non-circular feedback remains unverified and load-bearing for the central advantage over standard DPO.

    Authors: We thank the referee for highlighting this important point. The inference procedure is designed to extract a continuous quality gap from the binary preference by modeling the latent capabilities as Gaussians and using truncated normal conditioning, which is not merely reusing the label but deriving a magnitude estimate based on the distributional assumptions. However, we acknowledge that empirical validation isolating this contribution is necessary. In the revised version, we will add an ablation study that compares ArenaPO using the inferred continuous gaps against a variant that uses only binary preferences (equivalent to standard DPO). We will also attempt to correlate the inferred gaps with any available human rating data on held-out pairs from the datasets to provide additional evidence of faithfulness. revision: yes

  2. Referee: [§4.1] §4.1 (experimental protocol): Both Arena construction and subsequent ArenaPO training are performed on the identical Pick-a-Pic v2 and HPD v3 splits. This shared data usage risks inflating apparent gains; a clear separation (e.g., Arena built on a disjoint subset) or cross-dataset transfer results are needed to substantiate that the inferred rewards generalize.

    Authors: This is a valid concern regarding potential data leakage or overfitting to the specific splits. To address it, we will revise the experimental protocol to construct the Arena model using a disjoint subset of the training data (e.g., 50% split for Arena construction and the remaining for preference optimization). We will report the performance under this separated setting. Additionally, we will include cross-dataset transfer experiments, such as building the Arena on Pick-a-Pic v2 and evaluating ArenaPO on HPD v3, and vice versa, to demonstrate generalization of the inferred rewards. revision: yes

  3. Referee: [§3.1] §3.1 (Gaussian capability model): The assumption that model capabilities follow symmetric Gaussian distributions is introduced without diagnostic checks against the empirical distribution of pairwise outcomes. If the true capability distribution is skewed or multimodal, the downstream truncated-normal gap estimates become biased; a sensitivity analysis or alternative distributional assumption would be required to bound the robustness of the reward signal.

    Authors: We agree that the Gaussian assumption should be validated. In the revised manuscript, we will include diagnostic plots comparing the empirical distribution of pairwise preference outcomes to the fitted Gaussian model. Furthermore, we will conduct a sensitivity analysis by experimenting with alternative distributions, such as a Laplace distribution or a Gaussian mixture model, and report the impact on the final performance metrics to bound the robustness of our approach. revision: yes
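
One concrete form the promised diagnostic could take (an editorial sketch, not the authors' stated procedure): the Gaussian capability model constrains pairwise data only through the win probability it implies for each model pair, so comparing that prediction with the empirical head-to-head win rate is a direct calibration check.

```python
# Illustrative calibration check, not the authors' code: predicted vs. empirical
# win rates for every model pair under the fitted Gaussian capability model.
import numpy as np
from scipy.stats import norm

def calibration_table(mu, sigma, wins, totals):
    """
    mu, sigma: fitted per-model capability parameters (dicts: model -> float).
    wins[(i, j)]: times model i beat model j; totals[(i, j)]: comparisons of i vs j.
    Returns rows of (pair, predicted win rate, empirical win rate).
    """
    rows = []
    for (i, j), n in totals.items():
        pred = norm.cdf((mu[i] - mu[j]) / np.sqrt(sigma[i] ** 2 + sigma[j] ** 2))
        rows.append(((i, j), float(pred), wins[(i, j)] / n))
    return rows

# Systematic discrepancies between the two rate columns would be the signature
# of skew or multimodality that a single Gaussian per model cannot capture.
```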

Circularity Check

0 steps flagged

No circularity: the parametric inference produces continuous rewards from binary preferences, and by construction the output does not simply restate its input.

full rationale

The paper's core chain models each model's capability as a Gaussian inferred from annotated pairwise preferences, treats images as samples from those distributions, and then applies latent-variable inference under a truncated normal (conditioned on the observed preference) to estimate an absolute quality gap used as the offline reward. This is an explicit statistical estimation procedure that maps binary labels to continuous values via modeling assumptions; it does not equate the output gap to the input preference by definition or by fitting a parameter that is then renamed as the target. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the steps. The subsequent DPO-style training on the diffusion model uses these precomputed gaps as an independent objective, making the derivation self-contained, although its usefulness still rests on external validation of the Gaussian/truncated-normal model.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on two modeling choices whose justification is not supplied in the abstract: representing model capability as a Gaussian and treating the quality gap as a latent variable under a truncated normal. Both introduce fitted parameters and domain assumptions whose validity cannot be checked from the given text.
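
For concreteness, one plausible instantiation of the first choice (an editorial sketch under assumed hyperparameters, not the paper's fitting code): traverse the annotated preferences and, after each comparison, update the two models' Gaussian parameters by moment-matching a TrueSkill/Thurstone-style Bayesian posterior. The per-sample noise beta and the grid approximation below are assumptions.

```python
# Illustrative Arena fit: Bayesian update of a model's Gaussian capability
# after one observed win, approximated on a grid and moment-matched back to a
# Gaussian. beta (per-sample noise) is an assumed hyperparameter.
import numpy as np
from scipy.stats import norm

def update_after_win(mu_w, sig_w, mu_l, sig_l, beta=1.0, grid=2001, span=6.0):
    """Return the winner's updated (mu, sigma); the loser's mirror update is analogous."""
    theta = np.linspace(mu_w - span * sig_w, mu_w + span * sig_w, grid)
    prior = norm.pdf(theta, mu_w, sig_w)
    # Likelihood that a sample from capability theta beat a sample from the
    # losing model, with the loser's capability and both samples' noise marginalized out.
    lik = norm.cdf((theta - mu_l) / np.sqrt(2 * beta ** 2 + sig_l ** 2))
    w = prior * lik
    w /= w.sum()                                  # normalize on the uniform grid
    new_mu = float((theta * w).sum())
    new_sig = float(np.sqrt(((theta - new_mu) ** 2 * w).sum()))
    return new_mu, new_sig

# Traversing the preference list then amounts to repeated calls to this update
# (and its counterpart for the losing model) over all annotated pairs.
```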

free parameters (2)
  • per-model Gaussian mean and variance
    Inferred by traversing the annotated pairwise preferences; exact fitting procedure and any regularization not stated in abstract.
  • parameters of the truncated-normal inference
    Used to estimate absolute quality gap from capability distributions and observed preference.
axioms (2)
  • domain assumption Each model's capability can be faithfully represented by a single Gaussian distribution over image quality.
    Invoked when constructing the model Arena from pairwise data.
  • domain assumption The absolute quality gap between two images is a latent variable whose posterior can be recovered via truncated-normal inference conditioned on the observed preference.
    Central step that converts binary labels into continuous rewards.
invented entities (1)
  • Model Arena with per-model Gaussian capability distributions (no independent evidence)
    purpose: To serve as an offline source of fine-grained reward signals for preference optimization.
    New construct introduced to bridge pairwise data and continuous rewards; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5553 in / 1667 out tokens · 75780 ms · 2026-05-08T14:12:26.064671+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 12 canonical work pages · 9 internal anchors

  1. [1]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. InInternational Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

  2. [2]

    Improving image generation with better captions

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023

  3. [3]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

  4. [4]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  5. [5]

    Getting it right: Improving spatial consistency in text-to-image models

    Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, et al. Getting it right: Improving spatial consistency in text-to-image models. InEuropean Conference on Computer Vision, pages 204–222. Springer, 2024

  6. [6]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis.arXiv preprint arXiv:2310.00426, 2023

  7. [7]

    Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  8. [8]

    Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 45 (9):10850–10869, 2023

    Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 45 (9):10850–10869, 2023

  9. [9]

    Diffusion-sdpo: Safeguarded direct preference optimization for diffusion models.arXiv preprint arXiv:2511.03317, 2025

    Minghao Fu, Guo-Hua Wang, Tianyu Cui, Qing-Guo Chen, Zhao Xu, Weihua Luo, and Kaifu Zhang. Diffusion-sdpo: Safeguarded direct preference optimization for diffusion models.arXiv preprint arXiv:2511.03317, 2025

  10. [10]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  11. [11]

    Margin- aware preference optimization for aligning diffusion models without reference

    Jiwoo Hong, Sayak Paul, Noah Lee, Kashif Rasul, James Thorne, and Jongheon Jeong. Margin- aware preference optimization for aligning diffusion models without reference. InFirst Work- shop on Scalable Optimization for Efficient and Adaptive Foundation Models, 2024

  12. [12]

    Continuous univariate distributions, volume 2

    Norman L Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous univariate distributions, volume 2. John Wiley & Sons, 1995

  13. [13]

    Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35: 26565–26577, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35: 26565–26577, 2022

  14. [14]

    Scalable ranked preference optimization for text-to-image generation

    Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata, Sergey Tulyakov, Jian Ren, and Anil Kag. Scalable ranked preference optimization for text-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18399–18410, 2025

  15. [15]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

  16. [16]

    Divergence minimization preference optimization for diffusion model alignment.arXiv preprint arXiv:2507.07510,

    Binxu Li, Minkai Xu, Jiaqi Han, Meihua Dang, and Stefano Ermon. Divergence minimization preference optimization for diffusion model alignment.arXiv preprint arXiv:2507.07510, 2025

  17. [17]

    Aligning diffusion models by optimizing human utility

    Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion models by optimizing human utility. Advances in Neural Information Processing Systems, 37:24897–24925, 2024

  18. [18]

    K-sort eval: Efficient preference evaluation for visual generation via corrected vlm-as-a-judge

    Zhikai Li, Xuewen Liu, Wangbo Zhao, Pan Du, Kaicheng Zhou, Qingyi Gu, Yang You, Zhen Dong, Kurt Keutzer, et al. K-sort eval: Efficient preference evaluation for visual generation via corrected vlm-as-a-judge. In The Fourteenth International Conference on Learning Representations

  19. [19]

    K-sort arena: Efficient and reliable benchmarking for generative models via k-wise human preferences

    Zhikai Li, Xuewen Liu, Dongrong Joe Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, and Zhen Dong. K-sort arena: Efficient and reliable benchmarking for generative models via k-wise human preferences. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 9131–9141, 2025

  20. [20]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  21. [21]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv preprint arXiv:2209.03003, 2022

  22. [22]

    Hpsv3: Towards wide-spectrum human preference score

    Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. Hpsv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15086–15095, 2025

  23. [23]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  24. [24]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  26. [26]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  27. [27]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  28. [28]

    LAION-Aesthetics: Predicting the aesthetic quality of images. https://laion.ai/blog/laion-aesthetics/, 2022

    Christoph Schuhmann, Romain Vencu, Romai Beaumont, et al. LAION-Aesthetics: Predicting the aesthetic quality of images.https://laion.ai/blog/laion-aesthetics/, 2022

  29. [29]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  30. [30]

    Freeu: Free lunch in diffusion u-net

    Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. Freeu: Free lunch in diffusion u-net. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4733–4743, 2024

  31. [31]

    Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.Advances in neural information processing systems, 32, 2019

  32. [32]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations.arXiv preprint arXiv:2011.13456, 2020

  33. [33]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in neural information processing systems, 33:3008–3021, 2020

  34. [34]

    A law of comparative judgment

    LL Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927

  35. [35]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  36. [36]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  37. [37]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023

  38. [38]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  39. [39]

    Diffusion models: A comprehensive survey of methods and applications.ACM computing surveys, 56(4):1–39, 2023

    Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications.ACM computing surveys, 56(4):1–39, 2023

  40. [40]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation.arXiv preprint arXiv:2206.10789, 2(3):5, 2022

  41. [41]

    Self-rewarding language models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. Self-rewarding language models. InForty-first International Conference on Machine Learning, 2024

  42. [42]

    Dspo: Direct score preference optimization for diffusion model alignment

    Huaisheng Zhu, Teng Xiao, and Vasant G Honavar. Dspo: Direct score preference optimization for diffusion model alignment. In The Thirteenth International Conference on Learning Representations, 2025
