Training Diffusion Models with Reinforcement Learning
Pith reviewed 2026-05-11 20:11 UTC · model grok-4.3
The pith
Diffusion models can be optimized directly for human feedback and practical objectives like compressibility by treating denoising as a multi-step decision process and applying policy gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By posing the denoising process as a multi-step decision-making problem, a class of policy gradient algorithms called denoising diffusion policy optimization (DDPO) can be used to directly optimize diffusion models for objectives such as image compressibility and aesthetic quality derived from human feedback, proving more effective than reward-weighted likelihood approaches. DDPO also improves prompt-image alignment when a vision-language model supplies the reward signal.
What carries the argument
Denoising diffusion policy optimization (DDPO), a policy gradient method that treats the full denoising trajectory as an MDP and updates the diffusion policy to maximize expected reward.
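The MDP framing can be made concrete with a toy sketch (the linear-Gaussian "denoiser", the quadratic reward, and all names below are illustrative, not the paper's model): each denoising step is an action sampled from a Gaussian policy, only the terminal sample receives a scalar reward, and a REINFORCE-style score-function estimator differentiates expected reward with respect to the policy parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T = 0.2, 5  # per-step noise scale and number of denoising steps

def rollout(theta):
    """One reverse trajectory under a toy linear-Gaussian 'denoiser'.

    Each step samples x_{t-1} ~ N(theta * x_t, sigma^2); `score` accumulates
    d/dtheta of the trajectory log-probability.
    """
    x, score = rng.normal(), 0.0  # x_T ~ N(0, 1)
    for _ in range(T):
        mean = theta * x
        x_next = rng.normal(mean, sigma)            # one denoising "action"
        score += (x_next - mean) * x / sigma ** 2   # grad of Gaussian log-prob
        x = x_next
    return x, score

def reward(x0):
    # illustrative terminal reward on the final sample (stands in for
    # compressibility or an aesthetic score)
    return -(x0 - 0.5) ** 2

def ddpo_sf_gradient(theta, n=4096):
    """Score-function (REINFORCE) estimate of d E[reward] / d theta."""
    rollouts = [rollout(theta) for _ in range(n)]
    rs = np.array([reward(x0) for x0, _ in rollouts])
    scores = np.array([s for _, s in rollouts])
    return float(np.mean((rs - rs.mean()) * scores))  # mean-reward baseline
```

Because the reward touches only the terminal sample, every step's log-probability is weighted by the same return, which is exactly the credit-assignment concern the referee report below raises.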
If this is right
- Text-to-image diffusion models can be fine-tuned to produce more compressible images without any change to the original training data or prompts.
- Aesthetic quality can be directly maximized using scalar rewards from human raters or pretrained scorers.
- Prompt-image alignment can be improved by using a fixed vision-language model to generate rewards, eliminating the need for additional human annotation.
- The same policy-gradient machinery applies to any downstream objective that can be expressed as a scalar reward over generated images.
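The last point can be illustrated with a minimal reward function. The paper uses JPEG file size for compressibility; the sketch below substitutes stdlib zlib on raw pixels as a stand-in, which preserves the idea that smoother images compress to fewer bytes and therefore score higher.

```python
import zlib

import numpy as np

def compressibility_reward(img) -> float:
    """Negative compressed size of an 8-bit image array (higher = more compressible).

    Stand-in for the paper's JPEG-size reward: zlib on raw pixels, so the
    absolute numbers differ from JPEG, but the ordering of smooth vs. noisy
    images is preserved.
    """
    raw = np.asarray(img, dtype=np.uint8).tobytes()
    return -float(len(zlib.compress(raw, level=9)))

# A flat image compresses far better than uniform noise:
flat = np.zeros((64, 64), dtype=np.uint8)
noise = np.random.default_rng(0).integers(0, 256, size=(64, 64), dtype=np.uint8)
```

Any scalar function of the generated image with this signature can be dropped into the same policy-gradient loop unchanged.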
Where Pith is reading between the lines
- The MDP framing could be tested on other iterative generative processes such as autoregressive sampling or score-based models in non-image domains.
- Reward functions derived from safety classifiers could be plugged in to reduce generation of harmful content without retraining from scratch.
- Variance-reduction techniques standard in RL might further stabilize DDPO when rewards are sparse or delayed across many denoising steps.
Load-bearing premise
The multi-step denoising process can be treated as a Markov decision process whose policy gradients remain stable and effective without prohibitive variance or credit assignment issues.
What would settle it
If DDPO produces no measurable improvement over reward-weighted likelihood training when optimizing a text-to-image model for compressibility on a fixed set of prompts and images, the claim of superior effectiveness would be falsified.
read the original abstract
Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. The project's website can be found at http://rl-diffusion.github.io .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes denoising diffusion policy optimization (DDPO), which casts the multi-step reverse diffusion process as a Markov decision process and applies policy gradient methods to directly optimize diffusion models for non-likelihood objectives such as image compressibility and aesthetic quality derived from human feedback or vision-language models. It claims DDPO outperforms reward-weighted likelihood baselines and enables adaptation of text-to-image models without additional data collection.
Significance. If the results hold, this provides a practical route to fine-tune diffusion models for objectives that are hard to encode in prompts or likelihoods, with potential impact on alignment and downstream utility in generative modeling. The public website with code and examples is a strength for reproducibility.
major comments (2)
- [Abstract] Abstract: the claim of empirical superiority over reward-weighted likelihood baselines is asserted without any quantitative results, controls, or ablation details, preventing assessment of effect sizes or statistical reliability.
- [Method] The multi-step denoising MDP formulation (with terminal rewards only and trajectories of 50–1000 steps): the REINFORCE-style policy gradient estimator faces severe credit assignment and variance issues; the manuscript must show that variance remains controlled (e.g., via baselines, variance reduction techniques, or empirical variance plots) rather than relying on the assumption that gradients remain stable.
minor comments (1)
- [None] The project website link is helpful; ensure all experimental details (hyperparameters, exact reward models, seed reporting) are also included in the main text or appendix for full reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below with clarifications from the manuscript and indicate revisions where they strengthen the presentation without misrepresenting the existing results.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of empirical superiority over reward-weighted likelihood baselines is asserted without any quantitative results, controls, or ablation details, preventing assessment of effect sizes or statistical reliability.
Authors: The abstract provides a high-level summary of the contribution. Quantitative comparisons to reward-weighted likelihood baselines, including effect sizes on compressibility and aesthetic quality metrics, controls across objectives, and results aggregated over multiple seeds, appear in Section 4 and the associated figures/tables. We will revise the abstract to include a concise quantitative highlight of the observed improvements to better support the claim at the summary level. revision: yes
-
Referee: [Method] The multi-step denoising MDP formulation (with terminal rewards only and trajectories of 50–1000 steps): the REINFORCE-style policy gradient estimator faces severe credit assignment and variance issues; the manuscript must show that variance remains controlled (e.g., via baselines, variance reduction techniques, or empirical variance plots) rather than relying on the assumption that gradients remain stable.
Authors: We agree that long trajectories introduce credit-assignment and variance challenges for REINFORCE. The DDPO formulation in Section 3 incorporates a learned baseline for variance reduction, and the empirical results in Section 4 demonstrate reliable convergence across 50–1000 step trajectories on multiple tasks. We will expand the method section to explicitly describe the baseline and add a brief discussion (with supporting analysis) of observed gradient stability; if space permits, we will include variance-related plots in the appendix. revision: partial
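The baseline's effect can be checked in a one-step stand-in for the terminal-reward MDP (the Gaussian policy and quadratic reward here are illustrative, not the paper's setup): subtracting a constant baseline leaves the score-function estimator essentially unbiased, since the score has zero mean, but shrinks its variance.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n = 0.0, 100_000

# One-step stand-in: action a ~ N(theta, 1), terminal reward r = -(a - 3)^2.
# The true gradient is d/dtheta E[r] = -2 * (theta - 3) = 6.
a = rng.normal(theta, 1.0, size=n)
r = -(a - 3.0) ** 2
score = a - theta                      # d/dtheta log N(a; theta, 1)

g_raw = r * score                      # plain REINFORCE samples
g_base = (r - r.mean()) * score        # mean-reward-baseline samples

print(f"raw:      mean={g_raw.mean():.2f}  var={g_raw.var():.1f}")
print(f"baseline: mean={g_base.mean():.2f}  var={g_base.var():.1f}")
```

Subtracting the batch-mean reward technically couples the samples and adds an O(1/n) bias, but at this batch size the effect is negligible; both estimators recover the true gradient while the baselined one has markedly lower variance.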
Circularity Check
No significant circularity; method introduces independent optimization procedure
full rationale
The paper frames the denoising process as an MDP to enable policy gradient methods (DDPO) and compares them empirically to reward-weighted likelihood baselines. No derivation collapses into its own fitted inputs, self-referential definitions, or load-bearing self-citations; the central claims rest on experimental adaptation to compressibility and aesthetic objectives rather than on algebraic equivalence to prior parameters. The approach is evaluated against external reward models and benchmarks rather than against quantities it defines itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Denoising diffusion can be cast as a multi-step Markov decision process.
- domain assumption Policy gradient methods can be applied directly to the denoising trajectory without prohibitive variance.
invented entities (1)
-
DDPO algorithm
no independent evidence
Forward citations
Cited by 47 Pith papers
-
OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
Muninn: Your Trajectory Diffusion Model But Faster
Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
-
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
-
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...
-
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
-
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...
-
HP-Edit: A Human-Preference Post-Training Framework for Image Editing
HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.
-
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
-
Generative Texture Filtering
A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Step-level Denoising-time Diffusion Alignment with Multiple Objectives
MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.
-
ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
-
Discrete Flow Matching Policy Optimization
DoMinO reformulates discrete flow matching sampling as an MDP for unbiased RL fine-tuning with new TV regularizers, yielding better enhancer activity and naturalness on DNA design tasks.
-
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...
-
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.
-
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
-
Driving Intents Amplify Planning-Oriented Reinforcement Learning
DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
-
SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning
SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
-
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...
-
Response Time Enhances Alignment with Heterogeneous Preferences
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
-
Threshold-Guided Optimization for Visual Generative Models
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
-
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
-
ANO: A Principled Approach to Robust Policy Optimization
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...
-
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...
-
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
-
Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
-
CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.
-
DanceGRPO: Unleashing GRPO on Visual Generation
DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.
-
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
-
Improving Video Generation with Human Feedback
A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
-
Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...
-
MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery
MotionGRPO applies GRPO with noise injection and hybrid rewards to diffusion-based egocentric motion recovery, overcoming vanishing gradients from low intra-group diversity to reach state-of-the-art performance.
-
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Reward-Aware Trajectory Shaping for Few-step Visual Generation
RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.
-
Improved Baselines with Visual Instruction Tuning
Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
Reference graph
Works this paper leans on
-
[1]
Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657,
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv preprint arXiv:2303.04137,
work page internal anchor Pith review arXiv
-
[4]
Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc
Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. arXiv preprint arXiv:2302.11552,
-
[5]
Optimizing ddpm sampling with shortcut fine-tuning.arXiv preprint arXiv:2301.13362,
Ying Fan and Kangwook Lee. Optimizing ddpm sampling with shortcut fine-tuning. arXiv preprint arXiv:2301.13362,
-
[6]
DPOK: Reinforcement Learning for Fine- tuning Text-to-Image Diffusion Models, November 2023
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. arXiv preprint arXiv:2305.16381,
-
[7]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760,
-
[9]
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573,
work page internal anchor Pith review arXiv 2021
-
[10]
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,
work page 2021
-
[12]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192,
work page internal anchor Pith review arXiv
- [14]
-
[15]
Teaching Language Models to Support Answers with Verified Quotes. CoRR, abs/2203.11147,
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147,
-
[16]
AWAC: Accelerating Online Reinforcement Learning with Offline Datasets
Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359,
work page internal anchor Pith review arXiv 2006
-
[17]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXi...
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning,
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[21]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Zero-Shot Text-to-Image Generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092,
work page internal anchor Pith review arXiv
-
[23]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242,
-
[24]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487,
work page internal anchor Pith review arXiv
-
[25]
Imagen Video: High Definition Video Generation with Diffusion Models
Arne Schneuing, Yuanqi Du, Arian Jamasb, Charles Harris, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Michael Bronstein, Max Welling, and Bruno Correia. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.02303,
work page internal anchor Pith review Pith/arXiv arXiv
-
[26]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792,
work page internal anchor Pith review Pith/arXiv arXiv
-
[28]
Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers,
work page 1999
-
[29]
Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193,
-
[30]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...
work page 2020
-
[31]
Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos
work page 2020
- [32]
-
[33]
Lion: Latent point diffusion models for 3d shape generation
Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. Lion: Latent point diffusion models for 3d shape generation. arXiv preprint arXiv:2210.06978,
-
[34]
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[35]
Appendix A: Overoptimization. Figure 7 (Reward model overoptimization; panels: Incompressibility, Counting Animals; DDPO vs. RWR): Examples of RL overoptimizing reward functions. (L) The diffusion model eventually loses all recognizable semantic content and produces noise when optimizing for incompressibility. (R) When optimized for prompts of the form “n anima...
work page 2021
-
[36]
When optimizing the incompressibility objective, the model eventually stops producing semantically meaningful content, degenerating into high-frequency noise. Similarly, we observed that LLaVA is susceptible to typographic attacks (Goh et al., 2021). When optimizing for alignment with respect to prompts of the form “n animals”, DDPO exploited deficiencie...
work page 2021
-
[37]
was originally introduced as a way to improve sample quality for conditional generation using the gradients from an image classifier. For a differentiable reward function such as the LAION aesthetics predictor (Schuhmann, 2022), one could naturally imagine an extension to classifier guidance that uses gradients from such a predictor to improve aesthetic s...
work page 2022
-
[38]
We used the official implementation of universal guidance with the recommended hyperparameters for style transfer, substituting the guidance network with the LAION aesthetics predictor. While universal guidance is able to produce a statistically significant improvement in aesthetic score, the change is small compared to DDPO. We only report results aver...
work page 2023
-
[39]
as the reward function. We evaluate the model using ImageReward and the LAION aesthetics predictor (Schuhmann, 2022). • Unlike DPOK, we do not employ KL regularization. [Figure: ImageReward score and LAION aesthetics score vs. reward queries (0–25k) for the Color, Count, Composition, and Location prompt sets.]
work page 2022
-
[40]
D.1 DDPO Implementation: We collect 256 samples per training iteration
as the base model and finetune only the UNet weights while keeping the text encoder and autoencoder weights frozen. D.1 DDPO Implementation. We collect 256 samples per training iteration. For DDPO_SF, we accumulate gradients across all 256 samples and perform one gradient update. For DDPO_IS, we split the samples into 4 minibatches and perform 4 gradient up...
work page 2017
-
[41]
and ˜ϵθ is the guided ϵ-prediction that is used to compute the next denoised sample. For reinforcement learning, it does not make sense to train on the unconditional objective since the reward may depend on the context. However, we found that when only training on the conditional objective, performance rapidly deteriorated after the first round of finetun...
work page 1999
-
[42]
and is known to underperform other algorithms in more online settings (Duan et al., 2016). However, we can isolate the effect of the data distribution by varying how interleaved the sampling and training are in RWR. At one extreme is a single-round algorithm (Lee et al., 2023), in which N samples are collected from the pretrained model and used for finetu...
work page 2016