pith. machine review for the scientific record.

arxiv: 2305.13301 · v4 · submitted 2023-05-22 · 💻 cs.LG · cs.AI · cs.CV

Recognition: no theorem link

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 20:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords diffusion models · reinforcement learning · policy gradients · text-to-image generation · DDPO · human feedback · generative modeling · reward optimization

The pith

Diffusion models can be optimized directly for human feedback and practical objectives like compressibility by treating denoising as a multi-step decision process and applying policy gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that diffusion models, usually trained only to approximate log-likelihood, can be adapted using reinforcement learning to target downstream goals that are hard to specify in prompts. Framing the iterative denoising steps as actions in a Markov decision process enables a family of policy gradient methods called DDPO. These methods outperform simpler reward-weighted likelihood training on tasks such as maximizing image compressibility and aesthetic quality scores from human preferences. A reader should care because the approach allows fine-tuning generative models on rewards from vision-language models without collecting new labeled data, directly improving alignment.

Core claim

By posing the denoising process as a multi-step decision-making problem, a class of policy gradient algorithms called denoising diffusion policy optimization (DDPO) can be used to directly optimize diffusion models for objectives such as image compressibility and aesthetic quality derived from human feedback, proving more effective than reward-weighted likelihood approaches. DDPO also improves prompt-image alignment when a vision-language model supplies the reward signal.

What carries the argument

Denoising diffusion policy optimization (DDPO), a policy gradient method that treats the full denoising trajectory as an MDP and updates the diffusion policy to maximize expected reward.
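
To make the mechanism concrete, here is a minimal score-function sketch of policy-gradient training over a denoising trajectory, in the spirit of the DDPO-SF variant named in the paper's appendix. Everything below (the toy Gaussian denoiser, the reward, the hyperparameters) is an illustrative assumption, not the paper's Stable Diffusion pipeline.

  import torch

  T, D, BATCH = 10, 4, 64                         # denoising steps, data dim, samples per batch
  mean_net = torch.nn.Linear(D + 1, D)            # toy denoiser: mean of p_theta(x_{t-1} | x_t, t)
  opt = torch.optim.Adam(mean_net.parameters(), lr=1e-3)
  SIGMA = 0.1                                     # fixed per-step policy std (assumption)

  def reward(x0):                                 # stand-in scalar reward on final samples
      return -(x0 ** 2).sum(dim=-1)               # e.g. pull samples toward the origin

  for _ in range(100):
      x = torch.randn(BATCH, D)                   # x_T ~ N(0, I)
      logp = torch.zeros(BATCH)
      for t in reversed(range(T)):                # the reverse process is the "policy" rollout
          tt = torch.full((BATCH, 1), t / T)
          dist = torch.distributions.Normal(mean_net(torch.cat([x, tt], dim=-1)), SIGMA)
          x = dist.sample()                       # action a_t = x_{t-1}
          logp = logp + dist.log_prob(x).sum(-1)  # accumulate log pi_theta(a_t | s_t)
      r = reward(x)
      adv = r - r.mean()                          # baseline-subtracted terminal reward
      loss = -(adv * logp).mean()                 # REINFORCE estimator of the policy gradient
      opt.zero_grad(); loss.backward(); opt.step()

Note that the whole trajectory shares a single terminal reward, which is exactly why the referee's variance concern below is worth taking seriously.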

If this is right

  • Text-to-image diffusion models can be fine-tuned to produce more compressible images without any change to the original training data or prompts (a minimal reward sketch follows this list).
  • Aesthetic quality can be directly maximized using scalar rewards from human raters or pretrained scorers.
  • Prompt-image alignment can be improved by using a fixed vision-language model to generate rewards, eliminating the need for additional human annotation.
  • The same policy-gradient machinery applies to any downstream objective that can be expressed as a scalar reward over generated images.
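
The compressibility objective in the first bullet needs nothing more than a file-size probe: the paper scores images by their size after JPEG compression. A minimal sketch; the quality setting and kilobyte scaling are assumptions, not values quoted from the paper.

  import io
  from PIL import Image

  def jpeg_compressibility_reward(image: Image.Image) -> float:
      # Compress to JPEG in memory and use the negated byte count as reward.
      buf = io.BytesIO()
      image.convert("RGB").save(buf, format="JPEG", quality=95)
      return -buf.tell() / 1000.0                 # fewer kilobytes -> higher reward

  # Usage: score each decoded sample, then feed the scalar into the
  # policy-gradient update; negate the return value to optimize
  # incompressibility instead.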

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The MDP framing could be tested on other iterative generative processes such as autoregressive sampling or score-based models in non-image domains.
  • Reward functions derived from safety classifiers could be plugged in to reduce generation of harmful content without retraining from scratch (see the sketch after this list).
  • Variance-reduction techniques standard in RL might further stabilize DDPO when rewards are sparse or delayed across many denoising steps.
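
The safety-classifier extension in the second bullet is mechanically trivial under the paper's framing: any scalar scorer slots in as the reward. A hedged sketch; safety_model and its interface are hypothetical, not an API from the paper.

  import torch

  def make_classifier_reward(safety_model):
      # Wrap any classifier that returns P(harmful) per image as a scalar reward.
      @torch.no_grad()
      def reward_fn(images):                      # images: (B, C, H, W) tensor in [0, 1]
          p_harmful = safety_model(images)        # assumed to return a (B,) tensor
          return 1.0 - p_harmful                  # safer samples earn higher reward
      return reward_fn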

Load-bearing premise

The multi-step denoising process can be treated as a Markov decision process whose policy gradients remain stable and effective without prohibitive variance or credit assignment issues.

What would settle it

If DDPO produces no measurable improvement over reward-weighted likelihood training when optimizing a text-to-image model for compressibility on a fixed set of prompts and images, the claim of superior effectiveness would be falsified.

read the original abstract

Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. The project's website can be found at http://rl-diffusion.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes denoising diffusion policy optimization (DDPO), which casts the multi-step reverse diffusion process as a Markov decision process and applies policy gradient methods to directly optimize diffusion models for non-likelihood objectives such as image compressibility and aesthetic quality derived from human feedback or vision-language models. It claims DDPO outperforms reward-weighted likelihood baselines and enables adaptation of text-to-image models without additional data collection.

Significance. If the results hold, this provides a practical route to fine-tune diffusion models for objectives that are hard to encode in prompts or likelihoods, with potential impact on alignment and downstream utility in generative modeling. The public website with code and examples is a strength for reproducibility.

major comments (2)
  1. [Abstract] The claim of empirical superiority over reward-weighted likelihood baselines is asserted without any quantitative results, controls, or ablation details, preventing assessment of effect sizes or statistical reliability.
  2. [Method] The multi-step denoising MDP formulation (with terminal rewards only and trajectories of 50–1000 steps): the REINFORCE-style policy gradient estimator faces severe credit assignment and variance issues; the manuscript must show that variance remains controlled (e.g., via baselines, variance reduction techniques, or empirical variance plots) rather than relying on the assumption that gradients remain stable.
minor comments (1)
  1. [None] The project website link is helpful; ensure all experimental details (hyperparameters, exact reward models, seed reporting) are also included in the main text or appendix for full reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below with clarifications from the manuscript and indicate revisions where they strengthen the presentation without misrepresenting the existing results.

read point-by-point responses
  1. Referee: [Abstract] The claim of empirical superiority over reward-weighted likelihood baselines is asserted without any quantitative results, controls, or ablation details, preventing assessment of effect sizes or statistical reliability.

    Authors: The abstract provides a high-level summary of the contribution. Quantitative comparisons to reward-weighted likelihood baselines, including effect sizes on compressibility and aesthetic quality metrics, controls across objectives, and results aggregated over multiple seeds, appear in Section 4 and the associated figures/tables. We will revise the abstract to include a concise quantitative highlight of the observed improvements to better support the claim at the summary level. revision: yes

  2. Referee: [Method] The multi-step denoising MDP formulation (with terminal rewards only and trajectories of 50–1000 steps): the REINFORCE-style policy gradient estimator faces severe credit assignment and variance issues; the manuscript must show that variance remains controlled (e.g., via baselines, variance reduction techniques, or empirical variance plots) rather than relying on the assumption that gradients remain stable.

    Authors: We agree that long trajectories introduce credit-assignment and variance challenges for REINFORCE. The DDPO formulation in Section 3 incorporates a learned baseline for variance reduction, and the empirical results in Section 4 demonstrate reliable convergence across 50–1000 step trajectories on multiple tasks. We will expand the method section to explicitly describe the baseline and add a brief discussion (with supporting analysis) of observed gradient stability; if space permits, we will include variance-related plots in the appendix. revision: partial
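
To make the variance point concrete: one standard variance-reduction choice is to normalize terminal rewards within each prompt group before the policy-gradient update. The sketch below illustrates the kind of baseline the response describes; it is not a quotation of the paper's exact implementation.

  import numpy as np
  from collections import defaultdict

  def per_prompt_advantages(prompts, rewards, eps=1e-8):
      # Subtracting a per-prompt mean is an unbiased baseline (it lowers
      # gradient variance without changing its expectation); dividing by
      # the per-prompt std puts heterogeneous reward scales on one footing.
      rewards = np.asarray(rewards, dtype=np.float64)
      groups = defaultdict(list)
      for i, prompt in enumerate(prompts):
          groups[prompt].append(i)
      adv = np.empty_like(rewards)
      for idx in groups.values():
          r = rewards[idx]
          adv[idx] = (r - r.mean()) / (r.std() + eps)
      return adv

  # Usage: adv = per_prompt_advantages(batch_prompts, batch_rewards)
  #        loss = -(adv * trajectory_log_probs).mean()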

Circularity Check

0 steps flagged

No significant circularity; the method introduces an independent optimization procedure.

full rationale

The paper frames the denoising process as an MDP to enable policy gradient methods (DDPO) and compares them empirically to reward-weighted likelihood baselines. No derivation reduces by construction to its own fitted inputs, no definition is self-referential, and no self-citation is load-bearing; the central claims rest on experimental adaptation to compressibility and aesthetic objectives rather than on algebraic equivalence to prior parameters. The evaluation is grounded in external reward models and benchmarks rather than in the method's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that denoising steps form a valid MDP amenable to policy gradients and that the chosen reward functions (compressibility, aesthetics, VLM alignment) are well-defined and stable.

axioms (2)
  • domain assumption Denoising diffusion can be cast as a multi-step Markov decision process.
    Stated in the abstract as the enabling step for policy gradient algorithms.
  • domain assumption Policy gradient methods can be applied directly to the denoising trajectory without prohibitive variance.
    Implicit in the claim that DDPO is effective.
invented entities (1)
  • DDPO algorithm · no independent evidence
    purpose: Policy optimization for diffusion denoising steps
    New named procedure introduced in the paper.
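
For concreteness, the first axiom unpacks into the standard MDP construction below; the notation is generic and may differ from the paper's exact symbols.

  \begin{aligned}
  s_t &\equiv (c,\, t,\, x_t) && \text{state: prompt, timestep, current noisy sample} \\
  a_t &\equiv x_{t-1} && \text{action: the next, less noisy sample} \\
  \pi_\theta(a_t \mid s_t) &\equiv p_\theta(x_{t-1} \mid x_t, c) && \text{policy: one reverse-diffusion step} \\
  R(s_t, a_t) &\equiv \begin{cases} r(x_0, c) & t = 1 \text{ (final step)} \\ 0 & \text{otherwise} \end{cases} && \text{reward only on the final image} \\
  \nabla_\theta J &= \mathbb{E}\!\left[\, r(x_0, c) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c) \right] && \text{score-function gradient}
  \end{aligned}

Under this mapping the second axiom is an empirical bet: with a single terminal reward spread across T log-probability terms, the estimator is unbiased but its variance grows with trajectory length unless a baseline or importance-sampling correction is applied.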

pith-pipeline@v0.9.0 · 5478 in / 1237 out tokens · 31448 ms · 2026-05-11T20:11:23.197353+00:00 · methodology

discussion (0)


Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  2. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  3. Muninn: Your Trajectory Diffusion Model But Faster

    cs.RO 2026-05 unverdicted novelty 7.0

    Muninn accelerates diffusion trajectory planners up to 4.6x by spending an uncertainty budget to decide when to cache denoiser outputs, preserving performance and certifying bounded deviation from full computation.

  4. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  5. SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

    cs.LG 2026-05 unverdicted novelty 7.0

    SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.

  6. MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

    cs.CV 2026-05 unverdicted novelty 7.0

    MotionGRPO models diffusion sampling as a Markov decision process optimized with Group Relative Policy Optimization, using hybrid rewards and noise injection to boost sample diversity and local joint precision in egoc...

  7. Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream-R1 improves distillation of autoregressive streaming video diffusion models by adaptively weighting supervision with a reward model at both rollout and per-pixel levels.

  8. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  9. Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x highe...

  10. HP-Edit: A Human-Preference Post-Training Framework for Image Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    HP-Edit introduces a post-training framework and RealPref-50K dataset that uses a VLM-based HP-Scorer to align diffusion image editing models with human preferences, improving outputs on Qwen-Image-Edit-2509.

  11. Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.

  12. Generative Texture Filtering

    cs.CV 2026-04 unverdicted novelty 7.0

    A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.

  13. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  14. UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...

  15. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  16. Step-level Denoising-time Diffusion Alignment with Multiple Objectives

    cs.LG 2026-04 unverdicted novelty 7.0

    MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

  17. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  18. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  19. Discrete Flow Matching Policy Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    DoMinO reformulates discrete flow matching sampling as an MDP for unbiased RL fine-tuning with new TV regularizers, yielding better enhancer activity and naturalness on DNA design tasks.

  20. DiffusionNFT: Online Diffusion Reinforcement with Forward Process

    cs.LG 2025-09 unverdicted novelty 7.0

    DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...

  21. MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

    cs.AI 2025-07 unverdicted novelty 7.0

    MixGRPO speeds up GRPO for flow-based image generators by restricting SDE sampling and optimization to a sliding window while using ODE elsewhere, cutting training time by up to 71% with better alignment performance.

  22. CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.

  23. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).

  24. SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

  25. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  26. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  27. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  28. From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

    cs.CV 2026-05 unverdicted novelty 6.0

    The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...

  29. Response Time Enhances Alignment with Heterogeneous Preferences

    cs.LG 2026-05 unverdicted novelty 6.0

    Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

  30. Threshold-Guided Optimization for Visual Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.

  31. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  32. ANO: A Principled Approach to Robust Policy Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...

  33. Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...

  34. V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

    cs.LG 2026-04 unverdicted novelty 6.0

    V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.

  35. Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.

  36. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  37. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  38. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  39. DanceGRPO: Unleashing GRPO on Visual Generation

    cs.CV 2025-05 unverdicted novelty 6.0

    DanceGRPO applies GRPO to visual generation tasks to achieve stable policy optimization across diffusion models, rectified flows, multiple tasks, and diverse reward models, outperforming prior RL methods.

  40. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  41. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  42. Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

    cs.CV 2026-05 unverdicted novelty 5.0

    DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-H...

  43. MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

    cs.CV 2026-05 unverdicted novelty 5.0

    MotionGRPO applies GRPO with noise injection and hybrid rewards to diffusion-based egocentric motion recovery, overcoming vanishing gradients from low intra-group diversity to reach state-of-the-art performance.

  44. Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

    cs.LG 2026-05 unverdicted novelty 5.0

    Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.

  45. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  46. Reward-Aware Trajectory Shaping for Few-step Visual Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

  47. Improved Baselines with Visual Instruction Tuning

    cs.CV 2023-10 conditional novelty 4.0

    Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 45 Pith papers · 17 internal anchors

  1. [1]

    Is Conditional Generative Modeling All You Need for Decision-Making?

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.

  2. [2]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  3. [3]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv preprint arXiv:2303.04137, 2023.

  4. [4]

    Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC

    Yilun Du, Conor Durkan, Robin Strudel, Joshua B Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, and Will Grathwohl. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC. arXiv preprint arXiv:2302.11552, 2023.

  5. [5]

    Optimizing DDPM Sampling with Shortcut Fine-Tuning

    Ying Fan and Kangwook Lee. Optimizing DDPM sampling with shortcut fine-tuning. arXiv preprint arXiv:2301.13362, 2023.

  6. [6]

    DPOK: Reinforcement Learning for Fine-Tuning Text-to-Image Diffusion Models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. arXiv preprint arXiv:2305.16381, 2023.

  7. [7]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.

  8. [8]

    Scaling Laws for Reward Model Overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760, 2022.

  9. [9]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.

  10. [10]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

  11. [12]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

  12. [13]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

  13. [14]

    Compositional Visual Generation with Composable Diffusion Models

    Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. arXiv preprint arXiv:2206.01714, 2022.

  14. [15]

    Teaching Language Models to Support Answers with Verified Quotes

    Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022.

  15. [16]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.

  16. [17]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXi...

  17. [18]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  18. [20]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In International Conference on Machine Learning, 2007.

  19. [21]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

  20. [22]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Scott Gray, Gabriel Goh, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.

  21. [23]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.

  22. [24]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.

  23. [25]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Arne Schneuing, Yuanqi Du, Arian Jamasb, Charles Harris, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Michael Bronstein, Max Welling, and Bruno Correia. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.02303, 2022.

  24. [26]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  25. [27]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.

  26. [28]

    Diffusers: State-of-the-Art Diffusion Models

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.

  27. [29]

    Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.

  28. [30]

    Transformers: State-of-the-Art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In EMNLP: System Demonstrations, Association for Computational Linguistics, 2020. URL https://www.aclweb.org/anthology/2020.emnlp-demos

  30. [32]

    ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.

  31. [33]

    LION: Latent Point Diffusion Models for 3D Shape Generation

    Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. LION: Latent point diffusion models for 3D shape generation. arXiv preprint arXiv:2210.06978, 2022.

  32. [34]

    Fine-Tuning Language Models from Human Preferences

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

  33. [35]

    n animals

    Appendix A (Overoptimization), Figure 7 (reward model overoptimization): examples of RL overoptimizing reward functions. Panel labels: Incompressibility and Counting Animals, each comparing DDPO and RWR. (L) The diffusion model eventually loses all recognizable semantic content and produces noise when optimizing for incompressibility. (R) When optimized for prompts of the form “n anima...

  34. [36]

    n animals

    When optimizing the incompressibility objective, the model eventually stops producing semantically meaningful content, degenerating into high-frequency noise. Similarly, we observed that LLaVA is susceptible to typographic attacks (Goh et al., 2021). When optimizing for alignment with respect to prompts of the form “n animals”, DDPO exploited deficiencie...

  35. [37]

    Classifier guidance was originally introduced as a way to improve sample quality for conditional generation using the gradients from an image classifier. For a differentiable reward function such as the LAION aesthetics predictor (Schuhmann, 2022), one could naturally imagine an extension to classifier guidance that uses gradients from such a predictor to improve aesthetic s...

  36. [38]

    a green colored rabbit

    We used the official implementation of universal guidance with the recommended hyperparameters for style transfer, substituting the guidance network with the LAION aesthetics predictor. While universal guidance is able to produce a statistically significant improvement in aesthetic score, the change is small compared to DDPO. We only report results aver...

  37. [39]

    a green colored rabbit

    as the reward function. We evaluate the model using ImageReward and the LAION aesthetics predictor (Schuhmann, 2022). • Unlike DPOK, we do not employ KL regularization. [Figure: ImageReward score and LAION aesthetics score versus reward queries (0–25k) for Color, Count, Composition, and Location prompts.]

  38. [40]

    D.1 DDPO Implementation: We collect 256 samples per training iteration

    as the base model and finetune only the UNet weights while keeping the text encoder and autoencoder weights frozen. D.1 DDPO Implementation. We collect 256 samples per training iteration. For DDPO-SF, we accumulate gradients across all 256 samples and perform one gradient update. For DDPO-IS, we split the samples into 4 minibatches and perform 4 gradient up...

  39. [41]

    For reinforcement learning, it does not make sense to train on the unconditional objective since the reward may depend on the context

    and ϵ̃_θ is the guided ϵ-prediction used to compute the next denoised sample. For reinforcement learning, it does not make sense to train on the unconditional objective since the reward may depend on the context. However, we found that when only training on the conditional objective, performance rapidly deteriorated after the first round of finetun...

  40. [42]

    unseen animals

    and is known to underperform other algorithms in more online settings (Duan et al., 2016). However, we can isolate the effect of the data distribution by varying how interleaved the sampling and training are in RWR. At one extreme is a single-round algorithm (Lee et al., 2023), in which N samples are collected from the pretrained model and used for finetu...