pith. machine review for the scientific record.

arxiv: 2605.04494 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.CV

Recognition: unknown

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:48 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords diffusion models · preference alignment · Nash equilibrium · self-play · text-to-image · RLHF · DPO

The pith

Diffusion models can align with human preferences by competing against themselves in a Nash game.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing approaches to aligning text-to-image diffusion models with preferences depend on reward models and the assumption that preferences follow a simple pairwise ranking rule. The paper instead casts alignment as a two-player game in which the current policy must reach equilibrium against a copy of itself. This self-play produces training signals that drive iterative improvement without extra parameters or explicit reward fitting. If the approach holds, it supplies a more direct route to handling richer preference structures on generative tasks.

Core claim

We formulate diffusion alignment from a game-theoretic perspective and propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self-improvement and better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics, where it consistently outperforms existing preference-based diffusion alignment methods.

What carries the argument

Diffusion Nash Preference Optimization (Diff.-NPO), a self-play mechanism in which the diffusion policy is trained to reach Nash equilibrium against a frozen copy of itself, generating its own preference signals.
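The self-play mechanism can be sketched in miniature (our toy construction, not the paper's actual algorithm: we assume a discrete three-way generation space and a multiplicative-weights update, where the real method operates on diffusion policies). A policy plays a zero-sum preference game against a frozen copy of itself; under a cyclic preference structure, the time-averaged policy approaches the mixed Nash equilibrium, which no single reward-maximizing policy could represent.

```python
import numpy as np

# Toy self-play sketch (illustration only). payoff[i, j] = +1 if
# generation i is preferred to generation j, -1 if dispreferred.
# The cyclic (rock-paper-scissors) structure has no consistent reward.
payoff = np.array([[0., -1., 1.],
                   [1., 0., -1.],
                   [-1., 1., 0.]])

policy = np.array([0.7, 0.2, 0.1])  # deliberately skewed start
eta = 0.05                          # step size (assumed, not tuned)
avg = np.zeros(3)
steps = 5000

for _ in range(steps):
    opponent = policy.copy()        # frozen copy of the current policy
    advantage = payoff @ opponent   # expected payoff against the copy
    policy = policy * np.exp(eta * advantage)  # multiplicative weights
    policy /= policy.sum()
    avg += policy
avg /= steps

# Under cyclic preferences, the averaged self-play policy mixes all
# three options roughly uniformly -- the Nash equilibrium.
print(np.round(avg, 3))
```

The point of the sketch: the equilibrium target is a distribution, not an argmax of any scalar reward, which is what lets a Nash formulation express preference structures that reward-based or Bradley-Terry pipelines cannot.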

Load-bearing premise

That a self-play Nash game between a policy and its own copy can represent the full complexity of human preferences without needing reward models or additional selection steps.

What would settle it

A head-to-head test on a preference dataset with clearly intransitive or multi-way choices would settle it: if Diff.-NPO produced no measurable gain over standard direct preference optimization on such data, the central claim would be falsified.
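Why intransitive choices are the decisive test can be made concrete with a toy check (our illustration, not from the paper): a three-way preference cycle admits no Bradley-Terry fit, because BT predicts P(x ≻ y) > 1/2 exactly when u(x) > u(y), forcing preferences to follow a total order on scalar utilities.

```python
from itertools import permutations

# Hypothetical cyclic preference data: A > B, B > C, C > A.
items = ["A", "B", "C"]
cycle = [("A", "B"), ("B", "C"), ("C", "A")]  # left item is preferred

def bt_consistent(utilities):
    """True if these scalar utilities reproduce every preference in the cycle."""
    return all(utilities[w] > utilities[l] for w, l in cycle)

# Exhaustively try every strict ordering of utilities over the three items.
orderings = [dict(zip(items, perm)) for perm in permutations([3, 2, 1])]
assert not any(bt_consistent(u) for u in orderings)
print("no utility ordering reproduces the cycle -> BT cannot fit it")
```

A method that genuinely drops the BT assumption should show a gain precisely on data of this shape; no gain there would undercut the generality claim.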

Figures

Figures reproduced from arXiv: 2605.04494 by Debarghya Mukherjee, Haoyu Wang, Ioannis Ch. Paschalidis, Jiaming Hu, Jiamu Bai.

Figure 1: Winrates of different τ/η values for PickScore and ImageReward. Full results in …
Figure 2: Qualitative comparison on SDXL across six representative prompts. Rows correspond to …
Figure 3: Full win-rate comparison for the ablation study of different …
Figure 4: Additional qualitative comparison on SDXL. Compared with the baselines, Diff.-NPO …
Figure 5: Qualitative comparison on SD1.5. Diff.-NPO consistently improves the realism, semantic …
original abstract

Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling and has been widely adopted in diffusion alignment. However, existing preference-based methods for diffusion alignment still rely on reward-induced preference signals and typically assume that human preferences can be adequately modeled by the Bradley--Terry (BT) model, which may fail to capture the full complexity of human preferences. In this paper, we formulate diffusion alignment from a game-theoretic perspective. We propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self improvement and lead to a better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics. Diff.-NPO consistently outperforms existing preference-based diffusion alignment methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing diffusion alignment methods rely on reward-induced signals and the Bradley-Terry model, which may not capture complex human preferences. It proposes Diffusion Nash Preference Optimization (Diff.-NPO), a self-play Nash equilibrium framework in which the current policy competes against itself to achieve self-improvement and general preference alignment without explicit reward modeling or BT assumptions. The authors report that this yields better alignment and empirically outperforms prior preference-based diffusion methods on text-to-image generation across various metrics.

Significance. If the central claim holds—that the Nash self-play objective provides a genuinely general preference framework without reintroducing latent utility functions, auxiliary models, or post-hoc selection rules—it would offer a meaningful advance over DPO-style methods by potentially handling non-transitive or context-dependent preferences. The reported empirical gains on T2I tasks add practical value, though significance hinges on whether the formulation is parameter-free and directly derived from preference data.

major comments (2)
  1. [§3] §3 (Nash formulation and loss derivation): the payoff function for the self-play game must be shown to be computed directly from raw preference pairs without deriving a latent utility or applying any selection rule that effectively recreates a reward signal; otherwise the generality claim reduces to a reparameterized DPO variant. This is load-bearing for the central contribution.
  2. [§4] §4 (experimental setup): the outperformance claims require explicit confirmation that no post-hoc hyperparameter tuning or baseline selection was performed after seeing results, and that all compared methods used identical preference data and evaluation protocols.
minor comments (2)
  1. [§3.2] Notation for the policy and opponent in the self-play game should be introduced with a clear table or diagram to avoid ambiguity when the same network is used for both roles.
  2. [Abstract] The abstract and introduction should cite the specific prior diffusion DPO papers being outperformed so readers can immediately locate the baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications on the formulation and experiments, and will incorporate explicit additions in the revised manuscript.

point-by-point responses
  1. Referee: [§3] §3 (Nash formulation and loss derivation): the payoff function for the self-play game must be shown to be computed directly from raw preference pairs without deriving a latent utility or applying any selection rule that effectively recreates a reward signal; otherwise the generality claim reduces to a reparameterized DPO variant. This is load-bearing for the central contribution.

    Authors: In Diff.-NPO the payoff is defined directly on raw preference pairs: for any pair of generations sampled from the current policy and its copy, the payoff is the binary preference label from the data (1 if the first is preferred, 0 otherwise). No latent utility or reward is estimated; the Nash objective is obtained by setting the expected payoff gradient to zero under the self-play distribution. This yields a loss that does not invoke the Bradley-Terry model. We will insert a step-by-step derivation in the revised §3 that starts from the preference pairs and arrives at the equilibrium condition without intermediate reward modeling. revision: yes

  2. Referee: [§4] §4 (experimental setup): the outperformance claims require explicit confirmation that no post-hoc hyperparameter tuning or baseline selection was performed after seeing results, and that all compared methods used identical preference data and evaluation protocols.

    Authors: All hyperparameters were chosen on a held-out validation split before the final test runs; no values were altered after inspecting the reported numbers. Every baseline was re-run on the identical preference dataset splits, with the same sampling settings, number of generations, and evaluation metrics. We will add a short paragraph in the revised §4 that states these facts explicitly and lists the shared data and protocol details. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper formulates diffusion alignment as a self-play Nash game and introduces Diff.-NPO as a general preference framework that avoids explicit reward modeling and BT assumptions. No equations or derivation steps are provided in the abstract that reduce by construction to fitted inputs, self-definitions, or self-citations. The central claim introduces independent game-theoretic structure (policy playing against itself for self-improvement) that is not shown to be equivalent to prior DPO-style losses via renaming or ansatz smuggling. The method is positioned against external benchmarks (existing preference-based diffusion methods), making the derivation self-contained rather than circular. No load-bearing self-citation chains or uniqueness theorems imported from the authors' prior work appear in the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities can be extracted. The Nash equilibrium formulation itself may implicitly introduce an equilibrium-finding procedure whose computational cost or convergence assumptions are not stated.

pith-pipeline@v0.9.0 · 5493 in / 1093 out tokens · 20555 ms · 2026-05-08T16:48:33.578501+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 27 canonical work pages · 10 internal anchors
