Alignment and Safety of Diffusion Models via Reinforcement Learning and Reward Modeling: A Survey

Ankita Kushwaha; Kiran Ravish; Pawan Kumar; Preeti Lamba

arxiv: 2505.17352 · v2 · pith:DOOWQ4BPnew · submitted 2025-05-23 · 💻 cs.CV

Pith reviewed 2026-05-22 02:24 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion models · alignment · reinforcement learning · reward modeling · safety · preference optimization · text-to-image · fine-tuning

The pith

A survey organizes diffusion model alignment literature along five axes to map methods and highlight gaps for safe deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reviews recent progress on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning. It organizes the work along five axes to create a coherent technical map and to point out where methods fall short for real-world use. This matters because current generative models still risk producing misaligned or unsafe outputs, and the structure shows concrete paths toward more reliable alignment. The review also includes tutorial sections on diffusion sampling, reward modeling, and preference optimization, plus direct comparisons of methods on computational cost and susceptibility to reward hacking.

Core claim

The literature on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning can be organized along five axes: the source of feedback, the form of the reward or preference signal, the optimization mechanism, the treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint rather than a generic preference. This framework covers techniques including reinforcement learning from human feedback, KL-regularized policy optimization, direct preference optimization, binary utility optimization, and safety-oriented variants while synthesizing open

What carries the argument

Five axes for organizing the alignment literature: source of feedback, form of reward or preference signal, optimization mechanism, treatment of distribution shift and reward overoptimization, and explicit safety constraints.
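
One way to picture the five-axis map is as a record with one field per axis, which also makes the "cannot be placed" test concrete. The sketch below is illustrative only: the class name, field names, and both example entries are hypothetical and are not classifications taken from the survey.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class AxisPlacement:
    """One method's position on the five axes (field names are hypothetical)."""
    method: str
    feedback_source: Optional[str] = None         # e.g. human raters, AI labeler
    signal_form: Optional[str] = None             # e.g. scalar reward, pairwise preference
    optimization_mechanism: Optional[str] = None  # e.g. policy gradient, DPO-style loss
    shift_treatment: Optional[str] = None         # e.g. KL penalty to a reference model
    safety_handling: Optional[str] = None         # explicit constraint vs. generic preference


def unplaced_axes(entry: AxisPlacement) -> List[str]:
    """Return the axes on which a method has no assigned position."""
    return [
        f.name
        for f in fields(entry)
        if f.name != "method" and getattr(entry, f.name) is None
    ]


# Hypothetical entries for illustration, not drawn from the survey.
placed = AxisPlacement(
    method="method_a",
    feedback_source="human raters",
    signal_form="pairwise preference",
    optimization_mechanism="DPO-style loss",
    shift_treatment="implicit KL to reference model",
    safety_handling="generic preference",
)
partial = AxisPlacement(
    method="method_b",
    feedback_source="learned reward model",
    signal_form="scalar reward",
)

print(unplaced_axes(placed))   # []
print(unplaced_axes(partial))  # ['optimization_mechanism', 'shift_treatment', 'safety_handling']
```

A method that leaves axes empty, or that needs several incompatible values on one axis, is the kind of case the load-bearing premise below is concerned with.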

Load-bearing premise

The five chosen axes are sufficient to organize the literature without leaving out major approaches or creating artificial boundaries between methods.

What would settle it

A new alignment method for diffusion models that cannot be placed under any of the five axes or that exposes a large unaddressed category would show the organizational map is incomplete.

Original abstract

Diffusion models have become a central paradigm for image and multimodal generation, yet their deployment raises persistent questions about alignment, safety, preference satisfaction, and robustness to misuse. This survey reviews recent progress on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning. We organize the literature along five axes: the source of feedback, the form of the reward or preference signal, the optimization mechanism, the treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint rather than a generic preference. The review covers reinforcement learning from human feedback, KL-regularized policy optimization, direct preference optimization, binary utility optimization, differentiable reward fine-tuning, surrogate reward learning, region-aware fine-tuning, and safety-oriented DPO variants. To make the survey accessible, we include tutorial explanations of diffusion sampling, reward modeling, and preference optimization, and briefly connect image diffusion alignment to emerging text and masked language diffusion models. We also compare representative methods in terms of feedback requirements, computational cost, scalability, susceptibility to reward hacking, and suitability for safety-critical deployment. Finally, we synthesize the literature into a set of open challenges: multi-objective alignment, feedback-efficient preference learning, adversarially robust safety alignment, continual alignment under changing norms, and interpretable reward modeling. The goal of this survey is to provide a coherent technical map of the emerging area of diffusion model alignment and to identify the methodological gaps that must be addressed before aligned generative models can be reliably deployed.
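
Two of the optimization mechanisms named in the abstract have compact generic forms that may help a reader new to the area. The sketch below shows a single-sample KL-regularized reward and the standard direct preference optimization (DPO) loss for one preferred/rejected pair. It assumes scalar log-probabilities are available for the policy and a frozen reference model; diffusion-specific variants typically substitute per-step denoising quantities for these log-probabilities, and that substitution is not shown. The function and parameter names are illustrative, not taken from the survey.

```python
import math


def kl_regularized_reward(reward: float, logp_policy: float,
                          logp_ref: float, beta: float) -> float:
    """Single-sample estimate of: reward - beta * log(p_policy / p_ref).

    Averaged over samples from the policy, the subtracted term estimates
    beta * KL(policy || reference), which penalizes drift from the reference.
    """
    return reward - beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float) -> float:
    """Generic DPO loss for one pair: -log sigmoid(beta * margin).

    logp_w / logp_l are policy log-probabilities of the preferred and
    rejected samples; ref_* are the same under the frozen reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# When the policy equals the reference, the margin is zero and the loss is log 2.
print(dpo_loss(-3.0, -4.0, -3.0, -4.0, beta=0.1))

# Raising the preferred sample's log-probability relative to the reference lowers the loss.
print(dpo_loss(-2.0, -4.0, -3.0, -4.0, beta=0.1))
```

In both functions, beta controls how strongly the model is held near the reference, which is one common handle on the reward overoptimization axis.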

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This survey maps diffusion alignment work under a five-axis taxonomy and adds tutorials but introduces no new results and may leave coverage gaps.

This survey pulls together recent work on aligning text-to-image diffusion models through reinforcement learning, reward modeling, and preference optimization. It groups the methods along five axes (feedback source, reward signal form, optimization mechanism, handling of distribution shift, and explicit safety constraints) and includes tutorial sections on sampling and preference learning plus a comparison table on cost and safety fit. The open challenges section on multi-objective alignment and continual updates under shifting norms is a reasonable summary of current limits. Those elements make the paper a practical reference for someone entering the area rather than a source of fresh theorems or experiments.

The taxonomy choice is the main soft spot. The abstract lists techniques like RLHF, DPO, and safety variants but gives no explicit defense for why these five axes are better than alternatives such as modality or scale, and hybrid or implicitly safe methods could require stretching the categories. If the full text does not demonstrate broad coverage without forcing fits, the claimed coherent map weakens.

The paper is aimed at researchers and engineers already working on safe generative models who want a structured overview of the literature. It is not for readers seeking original empirical results or formal proofs. It deserves a serious referee because a well-checked survey can serve as a useful entry point for the subfield even if revisions are needed on completeness. I would send it to review with instructions to verify the taxonomy against the full set of cited papers.

Referee Report

1 major / 2 minor

Summary. This survey reviews recent progress on aligning text-to-image diffusion models through reinforcement learning, reward modeling, preference optimization, and safety-specific fine-tuning. It organizes the literature along five axes: source of feedback, form of the reward or preference signal, optimization mechanism, treatment of distribution shift and reward overoptimization, and the extent to which safety is addressed as an explicit constraint. The review covers RLHF, KL-regularized policy optimization, DPO, binary utility optimization, differentiable reward fine-tuning, surrogate reward learning, region-aware fine-tuning, and safety-oriented DPO variants. It includes tutorial explanations of diffusion sampling and preference optimization, compares representative methods on feedback requirements, cost, scalability, reward hacking susceptibility, and safety suitability, and synthesizes open challenges including multi-objective alignment, feedback-efficient learning, adversarially robust safety, continual alignment, and interpretable rewards.

Significance. If the five-axis taxonomy organizes the literature without major omissions, the survey would offer a useful technical map of an emerging area and clearly identify gaps that must be closed before reliable deployment of aligned generative models. The systematic comparisons across feedback needs, computational cost, and safety suitability, together with the tutorial sections, add practical value for readers entering the field. The manuscript correctly credits the coverage of external work on RLHF, DPO variants, and safety-oriented methods rather than claiming new derivations.

major comments (1)

[Abstract and organization section] The central claim that the five axes (source of feedback, form of reward signal, optimization mechanism, treatment of distribution shift, and explicit safety constraints) provide a coherent map rests on these axes being sufficient and non-artificial. No explicit justification or comparison to alternative organizing principles (e.g., model scale, data modality, or architectural constraints) is supplied, raising the risk that methods combining multiple signals, operating in multimodal/continual settings, or addressing safety only implicitly fall outside the taxonomy or require stretching definitions.

minor comments (2)

Ensure consistent terminology when referring to 'reward overoptimization' versus 'reward hacking' across the comparison paragraphs and open-challenge list.
[Comparison section] Add a short table or bullet list in the comparison section that directly maps each reviewed method to the five axes for quick reference.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The single major comment identifies a presentational gap in the justification of the taxonomy, which we address directly below.

Referee: [Abstract and organization section] The central claim that the five axes (source of feedback, form of reward signal, optimization mechanism, treatment of distribution shift, and explicit safety constraints) provide a coherent map rests on these axes being sufficient and non-artificial. No explicit justification or comparison to alternative organizing principles (e.g., model scale, data modality, or architectural constraints) is supplied, raising the risk that methods combining multiple signals, operating in multimodal/continual settings, or addressing safety only implicitly fall outside the taxonomy or require stretching definitions.

Authors: We agree that the manuscript would benefit from an explicit justification of the five axes. These axes were selected because they map directly onto the core design decisions that differentiate practical alignment methods for diffusion models: how supervision is sourced, how it is encoded as a learning signal, how the model parameters are updated, how distributional mismatch is controlled, and whether safety is treated as a distinct constraint. This choice supports the survey's goal of comparing methods on concrete deployment-relevant dimensions such as feedback cost, scalability, and robustness. While the full text motivates the organization through the surveyed literature, we acknowledge that a meta-level discussion of alternatives is absent from the abstract and early sections. In the revised manuscript we will add a concise paragraph in the introduction that (i) states the rationale for the chosen axes, (ii) briefly contrasts them with alternatives such as organization by model scale or architectural family, and (iii) clarifies that hybrid or multimodal methods are classified according to their dominant characteristics, with explicit cross-references to the relevant sections. This change will be incorporated in the next version.

Revision: yes

Circularity Check

0 steps flagged

Survey taxonomy is an organizational choice with no self-referential derivation

This is a literature review that cites and categorizes external papers along five explicitly chosen axes (source of feedback, form of reward signal, optimization mechanism, distribution shift treatment, and explicit safety constraints). No equations, fitted parameters, or new derivations appear; the abstract and structure present the five-axis map as a descriptive framework for accessibility rather than a result obtained from the paper's own inputs. All listed techniques (RLHF, DPO, etc.) are referenced to prior work. The organization does not reduce to self-definition or self-citation chains, satisfying the criteria for a self-contained survey against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey the paper does not introduce new free parameters, axioms, or invented entities; it reviews methods from the cited literature.

pith-pipeline@v0.9.0 · 5816 in / 1066 out tokens · 34744 ms · 2026-05-22T02:24:02.186564+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
cs.CV · 2026-05 · unverdicted · novelty 6.0

SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47....
