arxiv: 2602.04663 · v2 · pith:XMD5YC5Pnew · submitted 2026-02-04 · 💻 cs.LG · cs.AI

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

Pith reviewed 2026-05-21 13:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · diffusion models · likelihood estimation · ELBO · policy gradient · text-to-image generation · reward optimization · stable training

The pith

An ELBO likelihood estimator from the final sample dominates policy-gradient loss choice for stable RL in diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper disentangles policy-gradient objectives, likelihood estimators, and rollout schemes to identify what actually drives successful reinforcement learning on diffusion models for image generation. It establishes that an evidence lower bound estimator for the model's likelihood, computed solely from the completed final sample, supplies the critical signal for effective and stable updates. This finding matters because diffusion models lack tractable exact likelihoods, so prior RL methods relied on ad hoc approximations that limited performance; identifying the dominant factor allows simpler, more reliable training pipelines. The result is faster convergence to high reward scores without excessive reward hacking.

Core claim

Adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional.

What carries the argument

ELBO-based likelihood estimator computed only from the final generated sample, which approximates the intractable log-likelihood to provide a low-variance signal for policy-gradient updates.
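
A minimal sketch of what such a final-sample estimator could look like, not the authors' implementation. It assumes a rectified-flow interpolation `x_t = (1 - t) * x0 + t * eps` with velocity target `eps - x0`, which this page does not state; the function name `final_sample_elbo`, the `weight` argument, and the toy models are hypothetical. The sign convention (returning the negated weighted regression error so that larger means "more likely") is also an assumption.

```python
import random

def final_sample_elbo(velocity_model, x0, weight=lambda t: 1.0, num_mc=8, seed=0):
    """Monte Carlo ELBO-style surrogate for log p(x0), using only the final sample x0.

    No sampling trajectory is needed: x0 is re-noised at random times t and the
    model's velocity prediction is scored against the known target.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_mc):
        t = rng.uniform(1e-3, 1.0 - 1e-3)            # avoid the endpoints
        eps = [rng.gauss(0.0, 1.0) for _ in x0]      # fresh Gaussian noise
        x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]  # assumed interpolation
        v_target = [e - a for a, e in zip(x0, eps)]  # assumed velocity target
        v_pred = velocity_model(x_t, t)
        sq_err = sum((p - v) ** 2 for p, v in zip(v_pred, v_target))
        total += weight(t) * sq_err
    return -total / num_mc  # negated error: higher is better (assumed convention)

# Toy check with hypothetical models on a 3-dimensional "image".
x0 = [0.5, -1.0, 2.0]
oracle = lambda x_t, t: [(xt - a) / t for xt, a in zip(x_t, x0)]  # exact velocity for this x0
zero = lambda x_t, t: [0.0 for _ in x_t]                          # ignores its input

print(final_sample_elbo(oracle, x0))                              # close to zero
print(final_sample_elbo(zero, x0))                                # clearly negative
print(final_sample_elbo(zero, x0, weight=lambda t: (1.0 - t) / t))  # a time-dependent weight
```

The `weight` argument is where different ELBO weightings would plug in; constant weighting corresponds to `w(t) = 1`, and `num_mc` controls how many noise draws are averaged per sample.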

If this is right

  • RL training reaches GenEval scores of 0.95 in 90 GPU hours on SD 3.5 Medium, which is 4.6 times more efficient than FlowGRPO.
  • The particular functional form of the policy-gradient loss becomes secondary once the ELBO estimator is fixed.
  • Performance gains hold consistently across multiple reward benchmarks without reward hacking.
  • The same estimator choice is also 2 times more efficient than DiffusionNFT, the prior state-of-the-art method.
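
As a rough worked example of what the efficiency ratios above imply, assuming "N times more efficient" means the baseline needs N times the GPU hours to reach the same GenEval target. The baseline hours computed here are derived from the reported ratios, not figures reported on this page, and the helper name is hypothetical.

```python
OURS_GPU_HOURS = 90.0                                # reported: hours to reach GenEval 0.95
SPEEDUP = {"FlowGRPO": 4.6, "DiffusionNFT": 2.0}     # reported efficiency ratios

def implied_baseline_hours(ours_hours, speedup):
    """GPU hours a baseline would need if efficiency is a pure ratio of GPU hours."""
    return ours_hours * speedup

# Implied (not reported) GPU hours for each baseline.
implied = {name: implied_baseline_hours(OURS_GPU_HOURS, s) for name, s in SPEEDUP.items()}
for name, hours in implied.items():
    print(f"{name}: about {hours:.0f} GPU hours implied")
```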

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same final-sample ELBO approach could be tested on flow-based or autoregressive generative models that also have intractable likelihoods.
  • Practitioners might redirect effort from inventing new RL objectives toward refining likelihood estimators for existing ones.
  • Disentangling estimator choice from loss design may reveal similar dominant factors in RL applied to other high-dimensional generative tasks.

Load-bearing premise

The ELBO estimator from the final sample supplies a sufficiently unbiased and low-variance signal for policy-gradient updates across the tested reward benchmarks and model scales.

What would settle it

Running the same RL experiments on SD 3.5 Medium but replacing the final-sample ELBO estimator with an alternative such as a multi-step Monte Carlo estimator and observing equal or superior stability, speed, and final GenEval scores would falsify the dominance claim.

Figures

Figures reproduced from arXiv: 2602.04663 by Bo Yuan, Jaemoo Choi, Jinbin Bai, Molei Tao, Petr Molodyk, Wei Guo, Yi Xin, Yongxin Chen, Yuchen Zhu.

Figure 1: Training efficiency and design-space analysis for reward-based diffusion fine-tuning. (Left) GenEval performance across training time for various fine-tuning methods on SD3.5-Medium. (Right) Conceptual summary of the design space considered in this work, highlighting policy-gradient loss design, likelihood estimation, and sampling strategy. Which components in the RL design space, the policy gradient objec…
Figure 2: Training time comparison on GenEval. We report the total GPU hours (8×H100) required to reach a GenEval score of 0.95 for different fine-tuning methods. ELBO-based likelihood estimation substantially reduces training cost compared to trajectory-based approaches, and ODE sampling further improves efficiency while achieving the same target performance.
… Gaussian noise according to a forward process, while ge…
Figure 3: Qualitative comparison between benchmarks and our model. See App. E for additional figures.
… to using a = 0. For fast and NFE-efficient sampling, the ODE sampler is preferable due to the many successes of inference acceleration algorithms (Lu et al., 2022; Zhang & Chen, 2022; Zhang et al., 2023b). The stochasticity of trajectories has an important influence on the validity of estimation formulas. Traje…
Figure 4: The # of prompts seen in training vs. GPU hours to reach GenEval score of 0.95.
ODE sampling further boosts efficiency under ELBO. Given ELBO-based likelihood estimation, the choice of sampler primarily affects computational efficiency over performance. As shown in …
Figure 5: Ablation on ELBO estimators on GenEval.
4.3. Further Discussion. Ablation on ELBO Estimations. We study the effect of different ELBO estimation strategies. Specifically, we consider three ELBO formulations: a path-KL weighted estimator (13), a simple weighted ELBO estimator (14), and an adaptive ELBO estimator (15). Each ELBO can be estimated using different Monte Carlo schemes, either by sampling a single…
Figure 6: Performance of OCR (left) and PickScore (right) across training time.
Figure 7: Qualitative comparison between benchmarks and our model on GenEval, OCR, and PickScore prompts.
Figure 8: Qualitative comparison between benchmarks and our model on GenEval prompts.
Figure 9: Qualitative comparison between benchmarks and our model on OCR prompts.
Figure 10: Qualitative comparison between benchmarks and our model on PickScore prompts.
Figure 11: Qualitative comparison of ELBO-based likelihood estimation across a variety of losses and samplers.

Original abstract

Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper systematically analyzes the RL design space for diffusion models by disentangling policy-gradient objectives, likelihood estimators, and rollout sampling schemes. It claims that an ELBO-based model likelihood estimator computed only from the final generated sample is the dominant factor for effective, efficient, and stable RL optimization, outweighing the choice of loss functional. This is validated on SD 3.5 Medium across reward benchmarks, with reported gains such as improving GenEval from 0.24 to 0.95 in 90 GPU hours (4.6× more efficient than FlowGRPO).

Significance. If substantiated, the result would be significant for RL applications to generative models, as it redirects attention from ad-hoc loss engineering to the statistical properties of likelihood estimators. The consistent cross-benchmark trends and concrete efficiency numbers provide a practical contribution that could simplify and stabilize fine-tuning pipelines for text-to-image diffusion models.

major comments (2)
  1. [§3.2] §3.2 (Likelihood Estimators): The central claim that the final-sample ELBO estimator provides a sufficiently low-variance, low-bias signal for policy-gradient updates is load-bearing, yet the manuscript contains no variance analysis, bias quantification, or direct comparison to full-trajectory ELBO estimators. This is particularly relevant in sparse-reward regimes such as GenEval, where the reported largest gains occur; without such evidence the dominance conclusion over loss functionals cannot be fully verified.
  2. [§4.3] §4.3 (Experimental Validation): The efficiency comparisons (e.g., 4.6× over FlowGRPO) and stability claims rest on single-run or unreported-variance results. Adding multiple seeds, error bars, or statistical tests would be required to confirm that improvements are attributable to the estimator rather than uncontrolled variance in the high-dimensional setting.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'without reward hacking' is used without a precise definition or metric; a brief clarification would improve readability.
  2. [§3] Notation: The distinction between the three disentangled factors could be summarized in a single table for easier reference across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the significance of our analysis. We address the two major comments point by point below, proposing revisions to strengthen the manuscript where the concerns are valid.

  1. Referee: [§3.2] §3.2 (Likelihood Estimators): The central claim that the final-sample ELBO estimator provides a sufficiently low-variance, low-bias signal for policy-gradient updates is load-bearing, yet the manuscript contains no variance analysis, bias quantification, or direct comparison to full-trajectory ELBO estimators. This is particularly relevant in sparse-reward regimes such as GenEval, where the reported largest gains occur; without such evidence the dominance conclusion over loss functionals cannot be fully verified.

    Authors: We agree that the manuscript would benefit from explicit variance and bias analysis to further substantiate the central claim. While the consistent empirical dominance of the final-sample ELBO estimator across benchmarks (including sparse-reward settings like GenEval) provides supporting evidence for its practical utility, we acknowledge the absence of direct statistical quantification. In the revised manuscript we will add a new subsection to §3.2 that reports variance estimates for the estimator, includes a direct comparison to full-trajectory ELBO variants, and discusses bias considerations where analytically tractable. revision: yes

  2. Referee: [§4.3] §4.3 (Experimental Validation): The efficiency comparisons (e.g., 4.6× over FlowGRPO) and stability claims rest on single-run or unreported-variance results. Adding multiple seeds, error bars, or statistical tests would be required to confirm that improvements are attributable to the estimator rather than uncontrolled variance in the high-dimensional setting.

    Authors: We accept that the current efficiency and stability claims would be more robust with multi-seed statistics. The reported gains reflect single-run results as presented. In the revision we will rerun the primary experiments (including the GenEval and efficiency comparisons) with multiple random seeds, add error bars to the relevant tables and figures, and include basic statistical tests to quantify the reliability of the observed differences. revision: yes
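
One way the multi-seed reporting discussed above could be summarized is a mean with a standard error across seeds. This is a generic sketch, not the authors' planned analysis; the function name `summarize_runs` is hypothetical and the numbers are placeholders, not results from the paper.

```python
import math
import statistics

def summarize_runs(values):
    """Return (mean, standard error of the mean) for one metric measured once per seed."""
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # sample std / sqrt(n)
    return mean, sem

# Placeholder values for illustration only; NOT measurements from the paper.
placeholder_metric_per_seed = [10.0, 12.0, 11.0, 13.0]
mean, sem = summarize_runs(placeholder_metric_per_seed)
print(f"{mean:.2f} ± {sem:.2f} (n={len(placeholder_metric_per_seed)})")
```

With only a handful of seeds the standard error is itself noisy, so such error bars are better read as a rough indication of run-to-run spread than as a formal significance test.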

Circularity Check

0 steps flagged

No circularity: empirical disentanglement of RL factors for diffusion models

The paper conducts a systematic empirical study that disentangles policy-gradient objectives, likelihood estimators, and rollout schemes through direct experimental comparisons on reward benchmarks with SD 3.5 Medium. The central claim—that an ELBO-based likelihood estimator computed from the final sample dominates performance—is supported by observed efficiency gains (e.g., GenEval improvement) rather than any mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No load-bearing self-citations, uniqueness theorems, or ansatzes reduce the result to its inputs by construction; the analysis remains self-contained via external validation on multiple tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is primarily empirical; the central claim depends on the validity of the ELBO estimator as a proxy for likelihood and on the representativeness of the chosen reward benchmarks and model scale.

axioms (1)
  • Domain assumption: The ELBO provides a usable approximation to the intractable likelihood for policy gradient purposes in diffusion models.
    Invoked when claiming the estimator enables stable RL optimization.

pith-pipeline@v0.9.0 · 5787 in / 1144 out tokens · 49221 ms · 2026-05-21T13:37:52.041470+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Reinforce Adjoint Matching derives a simple consistency loss for RL post-training of diffusion models by tilting the clean distribution toward higher-reward samples under KL regularization while keeping the noising pr...

  2. Reinforce Adjoint Matching: Scaling RL Post-Training of Diffusion and Flow-Matching Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Derives RAM, a reward-adjusted consistency loss extending diffusion pretraining regression to efficient KL-regularized RL post-training, achieving peak rewards up to 50x faster than Flow-GRPO on Stable Diffusion 3.5M.

  3. Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.

  4. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.

  5. Consistency Regularised Gradient Flows for Inverse Problems

    stat.ML 2026-05 unverdicted novelty 5.0

    A consistency-regularized Euclidean-Wasserstein-2 gradient flow performs joint posterior sampling and prompt optimization in latent space for efficient low-NFE inverse problem solving with diffusion models.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 4 Pith papers · 19 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

    Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Pietquin, O., Üstün, A., and Hooker, S. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740,

  3. [3]

    Training Diffusion Models with Reinforcement Learning

    Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,

  4. [4]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025a.
    Chen, H., Zheng, K., Zhang, Q., Cui, G., Cui, Y., Ye, H., Lin, T.-Y., Liu, M.-Y., Zhu, J., and Wang, H. Bridging supervised learning and...

  5. [5]

    Soft Adaptive Policy Optimization

    Gao, C., Zheng, C., Chen, X.-H., Dang, K., Liu, S., Yu, B., Yang, A., Bai, S., Zhou, J., and Lin, J. Soft adaptive policy optimization. arXiv preprint arXiv:2511.20347,

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  7. [7]

    TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

    He, X., Fu, S., Zhao, Y., Li, W., Yang, J., Yin, D., Rao, F., and Zhang, B. Tempflow-grpo: When timing matters for grpo in flow models. arXiv preprint arXiv:2508.04324,

  8. [8]

    Clipscore: A reference-free evaluation metric for image captioning

    Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pp. 7514–7528,

  9. [9]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  10. [10]

    The Art of Scaling Reinforcement Learning Compute for LLMs

    Khatri, D., Madaan, L., Tiwari, R., Bansal, R., Duvvuri, S. S., Zaheer, M., Dhillon, I. S., Brandfonbrener, D., and Agarwal, R. The art of scaling reinforcement learning compute for llms. arXiv preprint arXiv:2510.13786,

  11. [11]

    Kimi Team, Du, A., Gao, B., Xing, B., Jiang, C., Chen, C., Li, C., Xiao, C., Du, C., Liao, C., et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599,

  12. [12]

    Back to Basics: Let Denoising Generative Models Denoise

    Li, T. and He, K. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720,

  13. [13]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,

  14. [14]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Liu, J., Li, Y., Fu, Y., Wang, J., Liu, Q., and Jiang, Z. When speed kills stability: Demystifying RL collapse from the training-inference mismatch, September 2025a. URL https://richardli.xyz/rl-collapse.
    Liu, J., Liu, G., Liang, J., Li, Y., Liu, J., Wang, X., Wan, P., Zhang, D., and Ouyang, W. Flow-grpo: Training flow matching models via online rl.arX...

  15. [15]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., and Lin, M. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025c.
    Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in neural information...

  16. [16]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  17. [17]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

  18. [18]

    Demystifying diffusion objectives: Reweighted losses are better variational bounds

    Shi, J. and Titsias, M. K. Demystifying diffusion objectives: Reweighted losses are better variational bounds. arXiv preprint arXiv:2511.19664,

  19. [19]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,

  20. [20]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.-W., Chen, D., Yu, F., Zhao, H., Yang, J., et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

  21. [21]

    Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping

    Wang, J., Liang, J., Liu, J., Liu, H., Liu, G., Zheng, J., Pang, W., Ma, A., Xie, Z., Wang, X., et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025a.
    Wang, Y., Li, Z., Zang, Y., Zhou, Y., Bu, J., Wang, C., Lu, Q., Jin, C., and Wang, J. Pref-grpo: Pairwise preference rewa...

  22. [22]

    Advantage weighted matching: Aligning rl with pretraining in diffusion models

    Xue, S., Ge, C., Zhang, S., Li, Y., and Ma, Z.-M. Advantage weighted matching: Aligning rl with pretraining in diffusion models. arXiv preprint arXiv:2509.25050, 2025a.
    Xue, Z., Wu, J., Gao, Y., Kong, F., Zhu, L., Chen, M., Liu, Z., Liu, W., Guo, Q., Huang, W., et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025...

  23. [23]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  24. [24]

    gDDIM: Generalized denoising diffusion implicit models

    Zhang, Q., Tao, M., and Chen, Y. gddim: generalized denoising diffusion implicit models. In International Conference on Learning Representations, 2023b.
    Zhang, Y., Liu, Y., Yuan, H., Yuan, Y., Gu, Q., and Yao, A. C.-C. On the design of kl-regularized policy gradient algorithms for llm reasoning. arXiv preprint arXiv:2505.17508,

  25. [25]

    Geometric-mean policy optimization

    Zhao, Y., Liu, Y., Liu, J., Chen, J., Wu, X., Hao, Y., Lv, T., Huang, S., Cui, L., Ye, Q., et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673,

  26. [26]

    Stabilizing reinforcement learning with llms: Formulation and practices

    Zheng, C., Dang, K., Yu, B., Li, M., Jiang, H., Lin, J., Liu, Y., Lin, H., Wu, C., Hu, F., et al. Stabilizing reinforcement learning with llms: Formulation and practices. arXiv preprint arXiv:2512.01374, 2025a.
    Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al. Group sequence policy optimization. arXiv p...

  27. [27]

    Recent algorithmic developments on RL methods focus on new objective designs (Zhao et al., 2025; Zheng et al., 2025b; Gao et al., 2025; Chen et al., 2025a; Kimi Team et al.,

    and its variants (Liu et al., 2025c; Yu et al., 2025; Ahmadian et al., 2024). Recent algorithmic developments on RL methods focus on new objective designs (Zhao et al., 2025; Zheng et al., 2025b; Gao et al., 2025; Chen et al., 2025a; Kimi Team et al.,

  28. [28]

    and their unbiased estimation (Zhang et al., 2025; Zheng et al., 2025a; Liu et al., 2025a) to ensure stable training. RL for Diffusion and Flow models. RL has also been widely adopted to post-train diffusion and flow models to align model output with human preference (Fan et al., 2023; Black et al., 2023; Domingo-Enrich et al., 2024). FlowGRPO (Liu et al.,...

  29. [29]

    C. Additional Technical Details

    and related variants (Kimi Team et al., 2025; Malkin et al., 2022). C. Additional Technical Details. C.1. ELBO weighting. Various ELBO objectives have been proposed for training diffusion and flow models effectively (Song et al., 2020; Kingma et al., 2021; Kingma & Gao, 2023; Karr...

  30. [30]

    Following a similar intuition, we consider the following simply weighted ELBO with $w(t) = 1$, $\mathrm{ELBO}_{\mathrm{simple}}(v_\theta, x$ …

    $= \mathbb{E}_{t,\epsilon}\, \frac{1-t}{t}\, \lVert v_\theta - v \rVert_2^2$ (13)
    Simple weighting: Apart from path-KL weighting, constant weighting across all $t$ is also shown to achieve decent performance in diffusion training (Ho et al., 2020; Shi & Titsias, 2025). Following a similar intuition, we consider the following simply weighted ELBO with $w(t) = 1$, $\mathrm{ELBO}_{\mathrm{simple}}(v_\theta, x$ …

  31. [31]

    We similarly consider such a formulation, expressed in v-loss as, $\mathrm{ELBO}_{\mathrm{adapt}}(v_\theta, x$ …

    $= \mathbb{E}_{t,\epsilon}\left[ \lVert v_\theta - v \rVert_2^2 \right]$ (14)
    Adaptive weighting: Besides time-dependent only weighting, prior works (Yin et al., 2024; Zheng et al., 2025c) have also adopted data-dependent weighting that self-normalizes the objective to ensure numerical robustness. We similarly consider such a formulation, expressed in v-loss as, $\mathrm{ELBO}_{\mathrm{adapt}}(v_\theta, x$ …

  32. [32]

    benchmark, we use the GenEval score as the sole reward signal. For the OCR task, we combine an OCR-based reward with human preference rewards, including PickScore (Kirstain et al., 2023), CLIPScore (Hessel et al., 2021), and HPSv2.1 (Wu et al., 2023). For experiments on the OCR benchmark, we further consider a composite reward constructed by aggregating P...