Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design
Pith reviewed 2026-05-21 13:37 UTC · model grok-4.3
The pith
An ELBO likelihood estimator from the final sample dominates policy-gradient loss choice for stable RL in diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional.
What carries the argument
ELBO-based likelihood estimator computed only from the final generated sample, which approximates the intractable log-likelihood to provide a low-variance signal for policy-gradient updates.
If this is right
- RL training reaches GenEval scores of 0.95 in 90 GPU hours on SD 3.5 Medium, which is 4.6 times more efficient than FlowGRPO.
- The particular functional form of the policy-gradient loss becomes secondary once the ELBO estimator is fixed.
- Performance gains hold consistently across multiple reward benchmarks without reward hacking.
- The same estimator choice improves efficiency by a factor of two relative to the prior state-of-the-art method.
Where Pith is reading between the lines
- The same final-sample ELBO approach could be tested on flow-based or autoregressive generative models that also have intractable likelihoods.
- Practitioners might redirect effort from inventing new RL objectives toward refining likelihood estimators for existing ones.
- Disentangling estimator choice from loss design may reveal similar dominant factors in RL applied to other high-dimensional generative tasks.
Load-bearing premise
The ELBO estimator from the final sample supplies a sufficiently unbiased and low-variance signal for policy-gradient updates across the tested reward benchmarks and model scales.
What would settle it
Running the same RL experiments on SD 3.5 Medium but replacing the final-sample ELBO estimator with an alternative such as a multi-step Monte Carlo estimator and observing equal or superior stability, speed, and final GenEval scores would falsify the dominance claim.
Figures
read the original abstract
Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper systematically analyzes the RL design space for diffusion models by disentangling policy-gradient objectives, likelihood estimators, and rollout sampling schemes. It claims that an ELBO-based model likelihood estimator computed only from the final generated sample is the dominant factor for effective, efficient, and stable RL optimization, outweighing the choice of loss functional. This is validated on SD 3.5 Medium across reward benchmarks, with reported gains such as improving GenEval from 0.24 to 0.95 in 90 GPU hours (4.6× more efficient than FlowGRPO).
Significance. If substantiated, the result would be significant for RL applications to generative models, as it redirects attention from ad-hoc loss engineering to the statistical properties of likelihood estimators. The consistent cross-benchmark trends and concrete efficiency numbers provide a practical contribution that could simplify and stabilize fine-tuning pipelines for text-to-image diffusion models.
major comments (2)
- [§3.2] §3.2 (Likelihood Estimators): The central claim that the final-sample ELBO estimator provides a sufficiently low-variance, low-bias signal for policy-gradient updates is load-bearing, yet the manuscript contains no variance analysis, bias quantification, or direct comparison to full-trajectory ELBO estimators. This is particularly relevant in sparse-reward regimes such as GenEval, where the reported largest gains occur; without such evidence the dominance conclusion over loss functionals cannot be fully verified.
- [§4.3] §4.3 (Experimental Validation): The efficiency comparisons (e.g., 4.6× over FlowGRPO) and stability claims rest on single-run or unreported-variance results. Adding multiple seeds, error bars, or statistical tests would be required to confirm that improvements are attributable to the estimator rather than uncontrolled variance in the high-dimensional setting.
minor comments (2)
- [Abstract] Abstract: The phrase 'without reward hacking' is used without a precise definition or metric; a brief clarification would improve readability.
- [§3] Notation: The distinction between the three disentangled factors could be summarized in a single table for easier reference across sections.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive evaluation of the significance of our analysis. We address the two major comments point by point below, proposing revisions to strengthen the manuscript where the concerns are valid.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Likelihood Estimators): The central claim that the final-sample ELBO estimator provides a sufficiently low-variance, low-bias signal for policy-gradient updates is load-bearing, yet the manuscript contains no variance analysis, bias quantification, or direct comparison to full-trajectory ELBO estimators. This is particularly relevant in sparse-reward regimes such as GenEval, where the reported largest gains occur; without such evidence the dominance conclusion over loss functionals cannot be fully verified.
Authors: We agree that the manuscript would benefit from explicit variance and bias analysis to further substantiate the central claim. While the consistent empirical dominance of the final-sample ELBO estimator across benchmarks (including sparse-reward settings like GenEval) provides supporting evidence for its practical utility, we acknowledge the absence of direct statistical quantification. In the revised manuscript we will add a new subsection to §3.2 that reports variance estimates for the estimator, includes a direct comparison to full-trajectory ELBO variants, and discusses bias considerations where analytically tractable. revision: yes
-
Referee: [§4.3] §4.3 (Experimental Validation): The efficiency comparisons (e.g., 4.6× over FlowGRPO) and stability claims rest on single-run or unreported-variance results. Adding multiple seeds, error bars, or statistical tests would be required to confirm that improvements are attributable to the estimator rather than uncontrolled variance in the high-dimensional setting.
Authors: We accept that the current efficiency and stability claims would be more robust with multi-seed statistics. The reported gains reflect single-run results as presented. In the revision we will rerun the primary experiments (including the GenEval and efficiency comparisons) with multiple random seeds, add error bars to the relevant tables and figures, and include basic statistical tests to quantify the reliability of the observed differences. revision: yes
Circularity Check
No circularity: empirical disentanglement of RL factors for diffusion models
full rationale
The paper conducts a systematic empirical study that disentangles policy-gradient objectives, likelihood estimators, and rollout schemes through direct experimental comparisons on reward benchmarks with SD 3.5 Medium. The central claim—that an ELBO-based likelihood estimator computed from the final sample dominates performance—is supported by observed efficiency gains (e.g., GenEval improvement) rather than any mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No load-bearing self-citations, uniqueness theorems, or ansatzes reduce the result to its inputs by construction; the analysis remains self-contained via external validation on multiple tasks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The ELBO provides a usable approximation to the intractable likelihood for policy gradient purposes in diffusion models.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leancostAlphaLog_high_calibrated_iff unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
ELBO(vθ,x0) = E[ w(t) ||vθ(xt,t) - v||² ]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
-
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...
-
Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models
Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.
-
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
Consistency Regularised Gradient Flows for Inverse Problems
A consistency-regularized Euclidean-Wasserstein-2 gradient flow performs joint posterior sampling and prompt optimization in latent space for efficient low-NFE inverse problem solving with diffusion models.
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Ahmadian, A., Cremer, C., Gall´e, M., Fadaee, M., Kreutzer, J., Pietquin, O., ¨Ust¨un, A., and Hooker, S. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Training Diffusion Models with Reinforcement Learning
Black, K., Janner, M., Du, Y ., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., et al. Minimax- m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025a. Chen, H., Zheng, K., Zhang, Q., Cui, G., Cui, Y ., Ye, H., Lin, T.-Y ., Liu, M.-Y ., Zhu, J., and Wang, H. Bridging supervised learning and...
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Soft Adaptive Policy Optimization
Gao, C., Zheng, C., Chen, X.-H., Dang, K., Liu, S., Yu, B., Yang, A., Bai, S., Zhou, J., and Lin, J. Soft adaptive policy optimization.arXiv preprint arXiv:2511.20347,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: In- centivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
He, X., Fu, S., Zhao, Y ., Li, W., Yang, J., Yin, D., Rao, F., and Zhang, B. Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324,
work page internal anchor Pith review arXiv
-
[8]
Clipscore: A reference-free evaluation metric for image captioning
Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y . Clipscore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 conference on empirical methods in natural language processing, pp. 7514–7528,
work page 2021
-
[9]
Classifier-Free Diffusion Guidance
Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
The Art of Scaling Reinforcement Learning Compute for LLMs
9 Rethinking the Design Space of Reinforcement Learning for Diffusion Models Khatri, D., Madaan, L., Tiwari, R., Bansal, R., Duvvuri, S. S., Zaheer, M., Dhillon, I. S., Brandfonbrener, D., and Agarwal, R. The art of scaling reinforcement learn- ing compute for llms.arXiv preprint arXiv:2510.13786,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Kimi Team, Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599,
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
Back to Basics: Let Denoising Generative Models Denoise
Li, T. and He, K. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Flow Matching for Generative Modeling
Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Flow-GRPO: Training Flow Matching Models via Online RL
Liu, J., Li, Y ., Fu, Y ., Wang, J., Liu, Q., and Jiang, Z. When speed kills stability: Demystifying RL collapse from the training-inference mismatch, September 2025a. URL https://richardli.xyz/rl-collapse. Liu, J., Liu, G., Liang, J., Li, Y ., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl.arX...
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Understanding R1-Zero-Like Training: A Critical Perspective
Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025c. Lu, C., Zhou, Y ., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in neural information...
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al. Deepseekmath: Push- ing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Shi, J. and Titsias, M. K. Demystifying diffusion objectives: Reweighted losses are better variational bounds.arXiv preprint arXiv:2511.19664,
-
[19]
Score-Based Generative Modeling through Stochastic Differential Equations
Song, Y ., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Er- mon, S., and Poole, B. Score-based generative modeling 10 Rethinking the Design Space of Reinforcement Learning for Diffusion Models through stochastic differential equations.arXiv preprint arXiv:2011.13456,
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[20]
Wan: Open and Advanced Large-Scale Video Generative Models
Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Wang, J., Liang, J., Liu, J., Liu, H., Liu, G., Zheng, J., Pang, W., Ma, A., Xie, Z., Wang, X., et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping.arXiv preprint arXiv:2510.22319, 2025a. Wang, Y ., Li, Z., Zang, Y ., Zhou, Y ., Bu, J., Wang, C., Lu, Q., Jin, C., and Wang, J. Pref-grpo: Pairwise preference rewa...
-
[22]
Xue, S., Ge, C., Zhang, S., Li, Y ., and Ma, Z.-M. Ad- vantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025a. Xue, Z., Wu, J., Gao, Y ., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025...
-
[23]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Yu, Q., Zhang, Z., Zhu, R., Yuan, Y ., Zuo, X., Yue, Y ., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
George E Uhlenbeck and Leonard S Ornstein
Zhang, Q., Tao, M., and Chen, Y . gddim: generalized denoising diffusion implicit models. InInternational Conference on Learning Representations, 2023b. Zhang, Y ., Liu, Y ., Yuan, H., Yuan, Y ., Gu, Q., and Yao, A. C.-C. On the design of kl-regularized policy gradient algorithms for llm reasoning.arXiv preprint arXiv:2505.17508,
-
[25]
arXiv preprint arXiv:2507.20673 , year=
Zhao, Y ., Liu, Y ., Liu, J., Chen, J., Wu, X., Hao, Y ., Lv, T., Huang, S., Cui, L., Ye, Q., et al. Geometric-mean policy optimization.arXiv preprint arXiv:2507.20673,
-
[26]
Zheng, C., Dang, K., Yu, B., Li, M., Jiang, H., Lin, J., Liu, Y ., Lin, H., Wu, C., Hu, F., et al. Stabilizing reinforcement learning with llms: Formulation and practices.arXiv preprint arXiv:2512.01374, 2025a. Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y ., Men, R., Yang, A., et al. Group sequence policy optimization.arXiv p...
-
[27]
and its variants (Liu et al., 2025c; Yu et al., 2025; Ahmadian et al., 2024). Recent algorithmic developments on RL methods focus on new objective designs (Zhao et al., 2025; Zheng et al., 2025b; Gao et al., 2025; Chen et al., 2025a; Kimi Team et al.,
work page 2025
-
[28]
and their unbiased estimation (Zhang et al., 2025; Zheng et al., 2025a; Liu et al., 2025a) to ensure stable training. RL for Diffusion and Flow models.RL has also been widely adopted to post-train diffusion and flow models to align model output with human preference (Fan et al., 2023; Black et al., 2023; Domingo-Enrich et al., 2024). FlowGRPO (Liu et al.,...
work page 2025
-
[29]
13 Rethinking the Design Space of Reinforcement Learning for Diffusion Models C
and related variants (Kimi Team et al., 2025; Malkin et al., 2022). 13 Rethinking the Design Space of Reinforcement Learning for Diffusion Models C. Additional Technical Details C.1. ELBO weighting Various ELBO objectives have been proposed for training diffusion and flow models effectively (Song et al., 2020; Kingma et al., 2021; Kingma & Gao, 2023; Karr...
work page 2025
-
[30]
=E t,ϵ 1−t t vθ −v 2 2 (13) Simple weighting: Apart from path-KL weighting, constant weighting across all t is also shown to achieve decent performance in diffusion training (Ho et al., 2020; Shi & Titsias, 2025). Following a similar intuition, we consider the following simply weighted ELBO withw(t) = 1, ELBOsimple(vθ,x
work page 2020
-
[31]
We similarly consider such a formulation, express inv-loss as, ELBOadapt(vθ,x
=E t,ϵ h vθ −v 2 2 i (14) Adaptive weighting: Besides time-dependent only weighting, prior works (Yin et al., 2024; Zheng et al., 2025c) have also adopted data-dependent weighting that self-normalizes the objective to ensure numerical robustness. We similarly consider such a formulation, express inv-loss as, ELBOadapt(vθ,x
work page 2024
-
[32]
benchmark, we use the GenEval score as the sole reward signal. For the OCR task, we combine an OCR-based reward with human preference rewards, including PickScore (Kirstain et al., 2023), CLIPScore (Hessel et al., 2021), and HPSv2.1 (Wu et al., 2023). For experiments on the OCR benchmark, we further consider a composite reward constructed by aggregating P...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.