pith. machine review for the scientific record.

arxiv: 2605.04494 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.CV

Recognition: unknown

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:48 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords diffusion models · preference alignment · Nash equilibrium · self-play · text-to-image · RLHF · DPO

The pith

Diffusion models can align with human preferences by competing against themselves in a Nash game.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing approaches to aligning text-to-image diffusion models with preferences depend on reward models and the assumption that preferences follow a simple pairwise ranking rule. The paper instead casts alignment as a two-player game in which the current policy must reach equilibrium against a copy of itself. This self-play produces training signals that drive iterative improvement without extra parameters or explicit reward fitting. If the approach holds, it supplies a more direct route to handling richer preference structures on generative tasks.

Core claim

We formulate diffusion alignment from a game-theoretic perspective and propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self-improvement and better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics, where it consistently outperforms existing preference-based diffusion alignment methods.

What carries the argument

Diffusion Nash Preference Optimization (Diff.-NPO), a self-play mechanism in which the diffusion policy is trained to reach Nash equilibrium against a frozen copy of itself, generating its own preference signals.
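The self-play mechanism can be sketched in miniature (our toy construction, not the paper's actual algorithm: we assume a discrete three-way generation space and a multiplicative-weights update, where the real method operates on diffusion policies). A policy plays a zero-sum preference game against a frozen copy of itself; under a cyclic preference structure, the time-averaged policy approaches the mixed Nash equilibrium, which no single reward-maximizing policy could represent.

```python
import numpy as np

# Toy self-play sketch (illustration only). payoff[i, j] = +1 if
# generation i is preferred to generation j, -1 if dispreferred.
# The cyclic (rock-paper-scissors) structure has no consistent reward.
payoff = np.array([[0., -1., 1.],
                   [1., 0., -1.],
                   [-1., 1., 0.]])

policy = np.array([0.7, 0.2, 0.1])  # deliberately skewed start
eta = 0.05                          # step size (assumed, not tuned)
avg = np.zeros(3)
steps = 5000

for _ in range(steps):
    opponent = policy.copy()        # frozen copy of the current policy
    advantage = payoff @ opponent   # expected payoff against the copy
    policy = policy * np.exp(eta * advantage)  # multiplicative weights
    policy /= policy.sum()
    avg += policy
avg /= steps

# Under cyclic preferences, the averaged self-play policy mixes all
# three options roughly uniformly -- the Nash equilibrium.
print(np.round(avg, 3))
```

The point of the sketch: the equilibrium target is a distribution, not an argmax of any scalar reward, which is what lets a Nash formulation express preference structures that reward-based or Bradley-Terry pipelines cannot.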

Load-bearing premise

That a self-play Nash game between a policy and its own copy can represent the full complexity of human preferences without needing reward models or additional selection steps.

What would settle it

A head-to-head test on a preference dataset with clearly intransitive or multi-way choices would settle it: if Diff.-NPO produced no measurable gain over standard direct preference optimization on such data, the central claim would be falsified.
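Why intransitive choices are the decisive test can be made concrete with a toy check (our illustration, not from the paper): a three-way preference cycle admits no Bradley-Terry fit, because BT predicts P(x ≻ y) > 1/2 exactly when u(x) > u(y), forcing preferences to follow a total order on scalar utilities.

```python
from itertools import permutations

# Hypothetical cyclic preference data: A > B, B > C, C > A.
items = ["A", "B", "C"]
cycle = [("A", "B"), ("B", "C"), ("C", "A")]  # left item is preferred

def bt_consistent(utilities):
    """True if these scalar utilities reproduce every preference in the cycle."""
    return all(utilities[w] > utilities[l] for w, l in cycle)

# Exhaustively try every strict ordering of utilities over the three items.
orderings = [dict(zip(items, perm)) for perm in permutations([3, 2, 1])]
assert not any(bt_consistent(u) for u in orderings)
print("no utility ordering reproduces the cycle -> BT cannot fit it")
```

A method that genuinely drops the BT assumption should show a gain precisely on data of this shape; no gain there would undercut the generality claim.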

Figures

Figures reproduced from arXiv: 2605.04494 by Debarghya Mukherjee, Haoyu Wang, Ioannis Ch. Paschalidis, Jiaming Hu, Jiamu Bai.

Figure 1: Winrates of different τ/η values for PickScore and ImageReward. Full results in …
Figure 2: Qualitative comparison on SDXL across six representative prompts. Rows correspond to …
Figure 3: Full win-rate comparison for the ablation study of different …
Figure 4: Additional qualitative comparison on SDXL. Compared with the baselines, Diff.-NPO …
Figure 5: Qualitative comparison on SD1.5. Diff.-NPO consistently improves the realism, semantic …
original abstract

Reinforcement learning from human feedback (RLHF) has been popular for aligning text-to-image (T2I) diffusion models with human preferences. As a mainstream branch of RLHF, Direct Preference Optimization (DPO) offers a computationally efficient alternative that avoids explicit reward modeling and has been widely adopted in diffusion alignment. However, existing preference-based methods for diffusion alignment still rely on reward-induced preference signals and typically assume that human preferences can be adequately modeled by the Bradley--Terry (BT) model, which may fail to capture the full complexity of human preferences. In this paper, we formulate diffusion alignment from a game-theoretic perspective. We propose Diffusion Nash Preference Optimization (Diff.-NPO), an intuitive general preference framework for diffusion alignment. Diff.-NPO encourages the current policy to play against itself to achieve self improvement and lead to a better alignment. Empirically, we demonstrate the effectiveness of Diff.-NPO on the text-to-image generation task via various metrics. Diff.-NPO consistently outperforms existing preference-based diffusion alignment methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing diffusion alignment methods rely on reward-induced signals and the Bradley-Terry model, which may not capture complex human preferences. It proposes Diffusion Nash Preference Optimization (Diff.-NPO), a self-play Nash equilibrium framework in which the current policy competes against itself to achieve self-improvement and general preference alignment without explicit reward modeling or BT assumptions. The authors report that this yields better alignment and empirically outperforms prior preference-based diffusion methods on text-to-image generation across various metrics.

Significance. If the central claim holds—that the Nash self-play objective provides a genuinely general preference framework without reintroducing latent utility functions, auxiliary models, or post-hoc selection rules—it would offer a meaningful advance over DPO-style methods by potentially handling non-transitive or context-dependent preferences. The reported empirical gains on T2I tasks add practical value, though significance hinges on whether the formulation is parameter-free and directly derived from preference data.

major comments (2)
  1. [§3] §3 (Nash formulation and loss derivation): the payoff function for the self-play game must be shown to be computed directly from raw preference pairs without deriving a latent utility or applying any selection rule that effectively recreates a reward signal; otherwise the generality claim reduces to a reparameterized DPO variant. This is load-bearing for the central contribution.
  2. [§4] §4 (experimental setup): the outperformance claims require explicit confirmation that no post-hoc hyperparameter tuning or baseline selection was performed after seeing results, and that all compared methods used identical preference data and evaluation protocols.
minor comments (2)
  1. [§3.2] Notation for the policy and opponent in the self-play game should be introduced with a clear table or diagram to avoid ambiguity when the same network is used for both roles.
  2. [Abstract] The abstract and introduction should cite the specific prior diffusion DPO papers being outperformed so readers can immediately locate the baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications on the formulation and experiments, and will incorporate explicit additions in the revised manuscript.

point-by-point responses
  1. Referee: [§3] §3 (Nash formulation and loss derivation): the payoff function for the self-play game must be shown to be computed directly from raw preference pairs without deriving a latent utility or applying any selection rule that effectively recreates a reward signal; otherwise the generality claim reduces to a reparameterized DPO variant. This is load-bearing for the central contribution.

    Authors: In Diff.-NPO the payoff is defined directly on raw preference pairs: for any pair of generations sampled from the current policy and its copy, the payoff is the binary preference label from the data (1 if the first is preferred, 0 otherwise). No latent utility or reward is estimated; the Nash objective is obtained by setting the expected payoff gradient to zero under the self-play distribution. This yields a loss that does not invoke the Bradley-Terry model. We will insert a step-by-step derivation in the revised §3 that starts from the preference pairs and arrives at the equilibrium condition without intermediate reward modeling. revision: yes

  2. Referee: [§4] §4 (experimental setup): the outperformance claims require explicit confirmation that no post-hoc hyperparameter tuning or baseline selection was performed after seeing results, and that all compared methods used identical preference data and evaluation protocols.

    Authors: All hyperparameters were chosen on a held-out validation split before the final test runs; no values were altered after inspecting the reported numbers. Every baseline was re-run on the identical preference dataset splits, with the same sampling settings, number of generations, and evaluation metrics. We will add a short paragraph in the revised §4 that states these facts explicitly and lists the shared data and protocol details. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper formulates diffusion alignment as a self-play Nash game and introduces Diff.-NPO as a general preference framework that avoids explicit reward modeling and BT assumptions. No equations or derivation steps are provided in the abstract that reduce by construction to fitted inputs, self-definitions, or self-citations. The central claim introduces independent game-theoretic structure (policy playing against itself for self-improvement) that is not shown to be equivalent to prior DPO-style losses via renaming or ansatz smuggling. The method is positioned against external benchmarks (existing preference-based diffusion methods), making the derivation self-contained rather than circular. No load-bearing self-citation chains or uniqueness theorems imported from the authors' prior work appear in the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities can be extracted. The Nash equilibrium formulation itself may implicitly introduce an equilibrium-finding procedure whose computational cost or convergence assumptions are not stated.

pith-pipeline@v0.9.0 · 5493 in / 1093 out tokens · 20555 ms · 2026-05-08T16:48:33.578501+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 27 canonical work pages · 10 internal anchors
