pith. machine review for the scientific record.

arxiv: 2604.17415 · v2 · submitted 2026-04-19 · 💻 cs.LG · cs.AI · cs.CV

Recognition: unknown

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords reward-based fine-tuning · diffusion models · flow models · score matching · reward alignment · generative model fine-tuning · value guidance

The pith

Many reward-based fine-tuning methods for diffusion and flow models reduce to a single score-matching objective against a value-guided target.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that existing reward-based fine-tuning techniques for pretrained diffusion or flow models, though derived from separate starting points, can all be recast as instances of reward score matching. Under this common view, the goal is to adjust the model's score function to match a target score that has been steered by a reward or value signal while staying close to the original pretrained behavior. Differences between methods largely boil down to how the value guidance is estimated and how the strength of the update varies across different timesteps. If this unification holds, it explains why some approaches trade off bias against variance or compute more effectively than others and shows which extra mechanisms add little value.

Core claim

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Many existing methods can be written under the common framework of reward score matching, where alignment becomes score matching against a value-guided target. The main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This view clarifies the bias-variance-compute tradeoffs of existing designs and distinguishes core optimization components from auxiliary mechanisms.

What carries the argument

Reward score matching (RSM): the objective of matching the generative model's score to a value-guided target score, where the target incorporates reward information.
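
In symbols (a schematic rendering assembled from the abstract and the notation guide in the paper's Table 2, not an equation quoted verbatim; s_ref and s_θ are generic notation for the pretrained and fine-tuned scores), the target score is the reference score shifted by the optimal value guidance, and fine-tuning minimizes a temporally reweighted score-matching loss:

```latex
% Schematic RSM objective; Psi symbols follow the paper's notation guide:
%   \Psi^\star_t : optimal value guidance, (1/\alpha) \nabla_{x_t} V_t
%   \hat\Psi_t   : practical estimate of \Psi^\star_t
%   \gamma(t)    : temporal reweighting of the effective guidance
\[
  \Psi^{\star}_{t} = \frac{1}{\alpha}\,\nabla_{x_t} V_t ,
  \qquad
  \mathcal{L}_{\mathrm{RSM}}(\theta)
  = \mathbb{E}_{t,\,x_t}\,\gamma(t)\,
    \Bigl\| \, s_{\theta}(x_t, t)
      - \bigl( s_{\mathrm{ref}}(x_t, t) + \hat{\Psi}_{t} \bigr) \Bigr\|^{2}
\]
```

Under this reading, a method is pinned down by two choices: how the estimate of the value guidance is constructed, and how γ(t) allocates optimization strength across timesteps.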

If this is right

  • Existing methods' performance differences arise mainly from bias-variance-compute tradeoffs in estimator choice and timestep weighting.
  • Auxiliary mechanisms that add complexity without altering the core score-matching objective can be removed without loss.
  • Simpler redesigns become possible for both differentiable and black-box reward alignment tasks.
  • The design space of reward-based fine-tuning shrinks to a smaller, more interpretable set of choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same unification lens could be applied to fine-tuning of other score-based or flow-based generative models not covered in the current experiments.
  • Practitioners could select estimator type and timestep schedule based on whether their reward signal is noisy or expensive to evaluate (see the sketch after this list).
  • Direct optimization of the unified RSM objective might yield new reward functions that bypass intermediate value estimation steps.
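
To make that choice concrete, here is a minimal toy sketch in Python/PyTorch, not the paper's code: the Gaussian forward process, the constants a and s, the bump-shaped reward, and the helper names fo_estimator/zo_estimator are all illustrative assumptions. The first-order branch mirrors the Tweedie-based gradient ∇_{x_t} r(x̂_0) discussed around Figure 6; the zeroth-order branch mirrors the REINFORCE-style form (1/σ_t)E[r(x_0)ε_t] with a group-mean control variate noted in the paper's appendix.

```python
import torch

torch.manual_seed(0)

# Toy Gaussian setup (illustrative, not the paper's pipeline):
# x0 ~ N(0, I) and x_t = a*x0 + s*eps, so the posterior x0 | x_t is
# Gaussian with mean c*x_t and isotropic variance v.
a, s, dim, n = 0.5, 0.8, 4, 200_000
c, v = a / (a**2 + s**2), s**2 / (a**2 + s**2)
reward = lambda x0: torch.exp(-((x0 - 1.0) ** 2).sum(-1))  # toy nonlinear reward

x_t = torch.randn(dim)  # one fixed noisy state

def fo_estimator(x_t):
    """First-order: differentiate the reward through the Tweedie point
    estimate x0_hat = E[x0 | x_t]. Cheap and low-variance, but biased
    whenever the reward gradient is nonlinear in x0."""
    x_t = x_t.clone().requires_grad_(True)
    r = reward(c * x_t)                  # r(x0_hat); a real model supplies x0_hat
    (g,) = torch.autograd.grad(r, x_t)
    return g

def zo_estimator(x_t, n=n):
    """Zeroth-order: REINFORCE-style estimator proportional to
    E[(r(x0) - baseline) * eps], using only reward evaluations.
    Unbiased for grad_{x_t} E[r(x0) | x_t], but noisy at small n."""
    eps = torch.randn(n, dim)
    x0 = c * x_t + v**0.5 * eps          # posterior samples of x0 given x_t
    r = reward(x0)
    baseline = r.mean()                   # group-mean control variate
    return (c / v**0.5) * ((r - baseline).unsqueeze(-1) * eps).mean(0)

print("FO (biased, deterministic):", fo_estimator(x_t))
print("ZO (unbiased, stochastic): ", zo_estimator(x_t))
```

The tradeoff shows up immediately: the first-order estimate is cheap and deterministic but biased under a nonlinear reward, while the zeroth-order estimate is unbiased but needs many reward evaluations to tame its variance. That is precisely the bias-variance-compute axis the unification isolates.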

Load-bearing premise

The primary distinctions among existing methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps, without material loss of generality or overlooked auxiliary mechanisms.

What would settle it

Identification of a reward fine-tuning procedure whose update rule cannot be expressed as score matching to any value-guided target, or whose performance gains cannot be reproduced by varying only the estimator and timestep weighting within the RSM objective.

Figures

Figures reproduced from arXiv: 2604.17415 by Jeongjae Lee, Jeongsol Kim, Jinho Chang, Jong Chul Ye.

Figure 1. Temporal Optimization Strength. (a) Successful first-order methods reduce value guidance at low-SNR timesteps. (b) Improved zeroth-order methods reduce value guidance at high-SNR timesteps. (c) Residual ∇-DB enforces stronger trust-region constraints for low-SNR timesteps. Policy Gradient's C2(t) is depicted for constant r(x0) = 1 and α = 10⁻².

Figure 2. Toy analysis of estimator quality under fixed compute. (a) Reference distribution and its reward-tilted target. (b) RMSE of representative first-order (FO) and zeroth-order (ZO) estimators at two timesteps. (c) RMSE of various estimators by sample size, for different lookahead depths, branching strategies, and stochasticity localizations. #split is the number of recursive branching stages, and #branch is …

Figure 3. Improving high-SNR timesteps is better than merely suppressing them. Making clipping timestep-fair and reallocating budget improves reward efficiency under matched compute. (a) Aesthetic Score vs. GPU hours. (b) Aesthetic Score vs. KL divergence. (c) Clip fraction for t9 (solid) and t8 (dashed).

Figure 4. Validation: zeroth-order methods. Principled budget allocation and temporal weighting improve performance on (a) GenEval with SD3.5-M and (b, c) HPSv2.1 with SD1.5.

Figure 5. Validation: first-order methods. Improved reward guidance for low-SNR timesteps yields faster reward gains, while maintaining a competitive reward–KL tradeoff on (a, b) SD3.5-M and (c, d) SD1.5. See Appendix F.2 for more results.

Figure 6. Ablating the first-order estimator. Replacing ∇_{x_t} r(x̂_0) with ∇_{x_0} r(x_0) improves both reward efficiency and the reward–KL tradeoff. In flow matching, we compare against two linearized baselines that keep the original local Tweedie-based estimator but adopt milder temporal weighting: (a) reward vs. GPU hours; (b) reward vs. KL. In diffusion, we compare against the corresponding baseline with the original es…

Figure 7. Auxiliary metrics suggest no obvious reward hacking. (a) PickScore remains stable throughout GenEval zeroth-order flow-matching fine-tuning. (b–d) DreamSim diversity on HPSv2.1 for zeroth-order diffusion, first-order flow matching, and first-order diffusion, respectively.

Figure 8. ∥g_ϕ∥ is negligible. The learned refinement term g_ϕ is negligible compared to the analytic reward gradient throughout the entire generation process for both (a) Residual ∇-DB and (b) VGG-Flow.

Figure 9. g_ϕ is redundant for training. Removing g_ϕ reduces wall-clock time, while maintaining optimality on the tradeoff between reward and prior/diversity preservation. Averaged across three consecutive random seeds. (a, b) Residual ∇-DB; (c, d) VGG-Flow.

Figure 10. L_backward of Residual ∇-DB does not contribute to effective training.

Figure 11. Online samples suffice. Including past rollouts (offline buffer) does not improve the Pareto frontier for Residual ∇-DB.

Figure 12. Qualitative comparisons on the first-order, SD1.5 validation setting. Images are shown at checkpoints after 0, 50, 100, 150, 200, and 250 training epochs. Prompts: (a) A painting depicting a snowy winter scene featuring a river, a small house on a hill, and a dreamy cloudy sky; (b) abandoned city with ruined buildings, long deserted streets, cars aged by time, trees, flowers, scattered leaves, empty street, v…

Figure 13. Qualitative comparisons on the first-order, SD3.5-M validation setting. Images are shown at checkpoints after 0, 50, 100, 150, and 200 training epochs. Prompts: (a) A blue jay standing on a large basket of rainbow macarons; (b) an illustration of monochrome cityscape vector graphic; (c) isometric style farmhouse from RPG game, unreal engine, vibrant, beautiful, crisp detailed, ultradetailed, intricate; (d) Two…

Figure 14. Qualitative comparisons on the zeroth-order, SD1.5 validation setting. Images are shown at checkpoints after 0, 50, 100, 150, 200, and 250 training epochs. Prompts: (a) A photograph of a giant diamond gem in the ocean, featuring vibrant colors and detailed textures; (b) logo of mountain, hike, modern, colorful, rounded, 2d concept; (c) A colorful tin toy robot runs a steam engine on a path near a beautiful fl…

Figure 15. Qualitative results on the zeroth-order, SD3.5-M validation setting. Images are shown at checkpoints after 0, 60, 120, 180, 240, 300, 360, and 420 training epochs. Prompts: (a) a photo of a brown bed and a pink cell phone; (b) a photo of a cat below a backpack; (c) a photo of a green couch and an orange umbrella; (d) a photo of a refrigerator above a baseball bat; (e) a photo of three donuts.
Original abstract

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Reward Score Matching (RSM) as a unifying framework for reward-based fine-tuning of pretrained diffusion and flow models. It claims that many existing methods, derived from different perspectives, can be rewritten as score matching against a value-guided target distribution, with primary differences reducing to the construction of the value-guidance estimator and the effective optimization strength (weighting) across timesteps. Guided by this view, the authors distinguish core optimization from auxiliary mechanisms and propose simpler, more efficient redesigns for both differentiable and black-box reward alignment tasks.

Significance. If the unification holds with the claimed lack of material loss of generality, the work provides a valuable organizing lens that clarifies bias-variance-compute tradeoffs and reduces the apparent fragmentation of reward fine-tuning methods into a smaller design space. This could facilitate more interpretable and actionable method development. The contribution is conceptual rather than algorithmic, with strength in the reported redesigns; no machine-checked proofs or parameter-free derivations are claimed.

major comments (2)
  1. [§3 (RSM framework)] The central unification claim (that existing methods can be rewritten under RSM without loss of the original behavior) is load-bearing but presented at a high level in the abstract; explicit derivations for representative methods (e.g., the value-guided target and weighting schedule for at least two standard baselines) must be shown in the main text to confirm preservation of objectives and rule out overlooked auxiliary mechanisms.
  2. [Experiments] The redesigns are asserted to be simpler and more efficient, but the experiments section must include direct quantitative comparisons (performance, compute, variance) against the original methods being unified; without these, the practical benefit of the RSM-guided simplifications remains unsubstantiated.
minor comments (2)
  1. Notation for the value-guidance estimator and per-timestep weighting should be introduced with a single consistent definition early in the paper and used uniformly in all equations.
  2. [Abstract] The abstract states that auxiliary mechanisms 'add complexity without clear benefit'; this phrasing should be softened or supported by a brief reference to the specific ablations that demonstrate the lack of benefit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review. We address each major comment point by point below, agreeing where the suggestions strengthen the presentation and providing the requested additions in the revised manuscript.

Point-by-point responses
  1. Referee: [§3 (RSM framework)] The central unification claim (that existing methods can be rewritten under RSM without loss of the original behavior) is load-bearing but presented at a high level in the abstract; explicit derivations for representative methods (e.g., the value-guided target and weighting schedule for at least two standard baselines) must be shown in the main text to confirm preservation of objectives and rule out overlooked auxiliary mechanisms.

    Authors: We agree that explicit derivations will make the unification claim more rigorous and verifiable. In the revised §3, we have added a dedicated subsection with full step-by-step derivations for two representative baselines: one differentiable-reward method (e.g., VGG-Flow) and one black-box method (e.g., DDPO). For each, we explicitly derive the value-guided target distribution and the corresponding timestep weighting schedule, showing that the original objective is recovered exactly as score matching under RSM with no additional auxiliary mechanisms required. These derivations confirm preservation of behavior and clarify how differences reduce to estimator construction and weighting. revision: yes

  2. Referee: [Experiments] The redesigns are asserted to be simpler and more efficient, but the experiments section must include direct quantitative comparisons (performance, compute, variance) against the original methods being unified; without these, the practical benefit of the RSM-guided simplifications remains unsubstantiated.

    Authors: We acknowledge that direct comparisons are necessary to substantiate the practical advantages of the RSM redesigns. The revised experiments section now includes head-to-head quantitative evaluations on both differentiable and black-box tasks. We report reward alignment performance, wall-clock training time, memory usage, and empirical variance (across seeds) for the RSM-based methods versus the original baselines. The results demonstrate that the simplifications achieve comparable or superior performance with lower compute and variance, validating the efficiency claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; unification is an independent re-expression

Full rationale

The paper algebraically rewrites existing reward-based fine-tuning objectives for diffusion and flow models as score matching against a value-guided target, with method differences isolated to the value estimator construction and per-timestep weighting. This re-expression does not reduce any core claim to a fitted input renamed as prediction, a self-citation chain, or a definitional loop; the derivations remain self-contained against the cited prior methods and do not invoke uniqueness theorems or ansatzes from the authors' own prior work. The framework functions as an organizing view that clarifies tradeoffs without forcing results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The framework appears to rest on standard score-matching concepts from diffusion literature without introducing new postulated quantities.

pith-pipeline@v0.9.0 · 5464 in / 1071 out tokens · 61118 ms · 2026-05-10T05:30:29.711746+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
