Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey
Pith reviewed 2026-05-22 02:24 UTC · model grok-4.3
The pith
A survey organizes diffusion model alignment literature along five axes to map methods and highlight gaps for safe deployment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The literature on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning can be organized along five axes: the source of feedback, the form of the reward or preference signal, the optimization mechanism, the treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint rather than a generic preference. This framework covers techniques including reinforcement learning from human feedback, KL-regularized policy optimization, direct preference optimization, binary utility optimization, and safety-oriented variants while synthesizing open
What carries the argument
Five axes for organizing the alignment literature: source of feedback, form of reward or preference signal, optimization mechanism, treatment of distribution shift and reward overoptimization, and explicit safety constraints.
Load-bearing premise
The five chosen axes are sufficient to organize the literature without leaving out major approaches or creating artificial boundaries between methods.
What would settle it
A new alignment method for diffusion models that cannot be placed under any of the five axes or that exposes a large unaddressed category would show the organizational map is incomplete.
Original abstract
Diffusion models have become a central paradigm for image and multimodal generation, yet their deployment raises persistent questions about alignment, safety, preference satisfaction, and robustness to misuse. This survey reviews recent progress on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning. We organize the literature along five axes: the source of feedback, the form of the reward or preference signal, the optimization mechanism, the treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint rather than a generic preference. The review covers reinforcement learning from human feedback, KL-regularized policy optimization, direct preference optimization, binary utility optimization, differentiable reward fine-tuning, surrogate reward learning, region-aware fine-tuning, and safety-oriented DPO variants. To make the survey accessible, we include tutorial explanations of diffusion sampling, reward modeling, and preference optimization, and briefly connect image diffusion alignment to emerging text and masked language diffusion models. We also compare representative methods in terms of feedback requirements, computational cost, scalability, susceptibility to reward hacking, and suitability for safety-critical deployment. Finally, we synthesize the literature into a set of open challenges: multi-objective alignment, feedback-efficient preference learning, adversarially robust safety alignment, continual alignment under changing norms, and interpretable reward modeling. The goal of this survey is to provide a coherent technical map of the emerging area of diffusion model alignment and to identify the methodological gaps that must be addressed before aligned generative models can be reliably deployed.
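As a rough illustration of the "direct preference optimization" family named in the abstract, the sketch below computes the generic DPO loss for a single preferred/rejected pair from scalar log-probabilities. It is the standard language-model form of the loss, not the diffusion-specific variants the survey reviews (which typically replace exact log-likelihoods with denoising-based approximations). The function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for this example.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred, rejected) pair.

    logp_w / logp_l: policy log-probabilities of the preferred and rejected samples.
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference model.
    All names here are illustrative assumptions, not taken from the survey.
    """
    # Implicit reward margin: how much more the policy favors the preferred
    # sample over the rejected one, measured relative to the reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# When policy and reference agree, the loss equals log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Shifting probability mass toward the preferred sample lowers the loss.
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

The loss depends only on log-probability ratios against the reference model, which is why no separately trained reward model appears in this formulation.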
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey reviews recent progress on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning. It organizes the literature along five axes: source of feedback, form of the reward or preference signal, optimization mechanism, treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint. The review covers RLHF, KL-regularized policy optimization, DPO, binary utility optimization, differentiable reward fine-tuning, surrogate reward learning, region-aware fine-tuning, and safety-oriented DPO variants. It includes tutorial explanations of diffusion sampling and preference optimization, compares representative methods on feedback requirements, cost, scalability, reward hacking susceptibility, and safety suitability, and synthesizes open challenges including multi-objective alignment, feedback-efficient learning, adversarially robust safety, continual alignment, and interpretable rewards.
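To make "KL-regularized policy optimization" concrete, here is a minimal sketch over a small discrete distribution: the objective is the expected reward minus a penalty proportional to the KL divergence from a reference policy. Real diffusion fine-tuning works over denoising trajectories rather than a categorical distribution; the toy setting, the function names, and the reward values are assumptions chosen only to show the shape of the objective.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_objective(policy, reference, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.5, 0.5]
rewards = [0.0, 1.0]  # hypothetical reward-model scores for two outputs

# Staying at the reference incurs no penalty.
at_reference = kl_regularized_objective(reference, reference, rewards, beta=2.0)
# Collapsing onto the high-reward output maximizes reward but pays a KL cost.
collapsed = kl_regularized_objective([0.0, 1.0], reference, rewards, beta=2.0)
```

With a large enough `beta`, the collapsed policy scores worse than the reference even though its raw reward is higher, which is the basic mechanism by which the penalty limits reward overoptimization.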
Significance. If the five-axis taxonomy organizes the literature without major omissions, the survey would offer a useful technical map of an emerging area and clearly identify gaps that must be closed before reliable deployment of aligned generative models. The systematic comparisons across feedback needs, computational cost, and safety suitability, together with the tutorial sections, add practical value for readers entering the field. The manuscript correctly attributes RLHF, DPO variants, and safety-oriented methods to external work rather than claiming new derivations.
major comments (1)
- [Abstract and organization section] The central claim that the five axes (source of feedback, form of reward signal, optimization mechanism, treatment of distribution shift, and explicit safety constraints) provide a coherent map rests on these axes being sufficient and non-artificial. No explicit justification or comparison to alternative organizing principles (e.g., model scale, data modality, or architectural constraints) is supplied, raising the risk that methods combining multiple signals, operating in multimodal/continual settings, or addressing safety only implicitly fall outside the taxonomy or require stretching definitions.
minor comments (2)
- Ensure consistent terminology when referring to 'reward overoptimization' versus 'reward hacking' across the comparison paragraphs and open-challenge list.
- [Comparison section] Add a short table or bullet list in the comparison section that directly maps each reviewed method to the five axes for quick reference.
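The method-to-axes mapping requested in the last comment could be represented along the following lines. This is only a sketch of one possible record layout: the class name `AxisProfile`, the field names, and the example row are hypothetical, and the example values are placeholders rather than the survey's actual classification of any method.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class AxisProfile:
    """One reviewed method described along the survey's five axes."""
    method: str
    feedback_source: str    # axis 1: source of feedback
    signal_form: str        # axis 2: form of the reward or preference signal
    optimization: str       # axis 3: optimization mechanism
    shift_handling: str     # axis 4: distribution shift / reward overoptimization
    safety_treatment: str   # axis 5: explicit constraint vs. generic preference

def to_row(profile):
    """Render a profile as one pipe-separated table row."""
    return " | ".join(str(getattr(profile, f.name)) for f in fields(profile))

# Placeholder entry; the values are illustrative, not taken from the survey.
example = AxisProfile(
    method="hypothetical-method",
    feedback_source="human pairwise labels",
    signal_form="pairwise preference",
    optimization="direct preference optimization",
    shift_handling="implicit KL to reference model",
    safety_treatment="generic preference",
)
row = to_row(example)
```

A fixed record of this kind also makes the referee's concern testable: a method that cannot fill one of the fields without stretching its definition is evidence that the axes are incomplete.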
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. The single major comment identifies a presentational gap in the justification of the taxonomy, which we address directly below.
- Referee: [Abstract and organization section] The central claim that the five axes (source of feedback, form of reward signal, optimization mechanism, treatment of distribution shift, and explicit safety constraints) provide a coherent map rests on these axes being sufficient and non-artificial. No explicit justification or comparison to alternative organizing principles (e.g., model scale, data modality, or architectural constraints) is supplied, raising the risk that methods combining multiple signals, operating in multimodal/continual settings, or addressing safety only implicitly fall outside the taxonomy or require stretching definitions.
Authors: We agree that the manuscript would benefit from an explicit justification of the five axes. These axes were selected because they map directly onto the core design decisions that differentiate practical alignment methods for diffusion models: how supervision is sourced, how it is encoded as a learning signal, how the model parameters are updated, how distributional mismatch is controlled, and whether safety is treated as a distinct constraint. This choice supports the survey's goal of comparing methods on concrete deployment-relevant dimensions such as feedback cost, scalability, and robustness. While the full text motivates the organization through the surveyed literature, we acknowledge that a meta-level discussion of alternatives is absent from the abstract and early sections. In the revised manuscript we will add a concise paragraph in the introduction that (i) states the rationale for the chosen axes, (ii) briefly contrasts them with alternatives such as organization by model scale or architectural family, and (iii) clarifies that hybrid or multimodal methods are classified according to their dominant characteristics, with explicit cross-references to the relevant sections. This change will be incorporated in the next version. revision: yes
Circularity Check
Survey taxonomy is an organizational choice with no self-referential derivation
This is a literature review that cites and categorizes external papers along five explicitly chosen axes (source of feedback, form of reward signal, optimization mechanism, distribution shift treatment, and explicit safety constraints). No equations, fitted parameters, or new derivations appear; the abstract and structure present the five-axis map as a descriptive framework for accessibility rather than a result obtained from the paper's own inputs. All listed techniques (RLHF, DPO, etc.) are referenced to prior work. The organization does not reduce to self-definition or self-citation chains, satisfying the criteria for a self-contained survey against external benchmarks.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47....