arxiv: 2509.16117 · v2 · submitted 2025-09-19 · 💻 cs.LG · cs.AI · cs.CV

Recognition: 2 theorem links · Lean Theorem

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu

Pith reviewed 2026-05-13 16:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords diffusion models · online reinforcement learning · flow matching · policy optimization · fine-tuning · image generation · generative models

The pith

DiffusionNFT trains diffusion models online by contrasting positive and negative samples on the forward process via flow matching, reaching high performance in far fewer steps than prior RL methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DiffusionNFT, an online reinforcement learning method that works directly on the forward diffusion process rather than the reverse sampling chain. It defines an implicit policy improvement signal by contrasting good and bad generations through flow matching and folds that signal into a supervised objective. This removes the need to compute likelihoods, places no restriction on the sampler, and eliminates classifier-free guidance (CFG). The abstract reports training that is up to 25 times more efficient than FlowGRPO, with GenEval rising from 0.24 to 0.98 within 1k steps, whereas FlowGRPO reaches 0.95 only after more than 5k steps. Readers should care because it makes post-training large diffusion models practical without the usual solver or trajectory overhead.

Core claim

By optimizing on the forward process with flow matching that contrasts positive and negative generations, DiffusionNFT supplies an implicit policy improvement direction that can be added to supervised learning without likelihood estimation or solver restrictions, producing up to 25 times higher efficiency than FlowGRPO and lifting SD3.5-Medium performance across every benchmark tested while remaining CFG-free.

What carries the argument

The central mechanism is the contrast between positive and negative generations on the forward process via flow matching, which defines an implicit policy improvement direction folded into the supervised objective.
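As a rough illustration of what such a contrast could look like in code, the NumPy sketch below pairs a rectified-flow style forward noising step with a loss that pulls an implicit "positive" velocity toward the flow-matching target on high-reward samples and an implicit "negative" velocity toward it on low-reward samples. The interpolation `x_t = (1 - t) * x0 + t * eps`, the mixing weight `beta`, the reward `r` in [0, 1], and both function names are assumptions made for illustration; the text above does not spell out the paper's exact objective.

```python
import numpy as np


def forward_noising(x0, eps, t):
    """Assumed forward process: x_t = (1 - t) * x0 + t * eps, per-sample t."""
    t = np.asarray(t, dtype=float).reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * eps
    v_target = eps - x0  # velocity that flow matching regresses onto
    return x_t, v_target


def contrastive_fm_loss(v_theta, v_old, v_target, r, beta=1.0):
    """Hypothetical positive/negative contrast on the forward process.

    v_theta:  velocity predicted by the model being trained
    v_old:    velocity predicted by the frozen sampling-time model
    v_target: flow-matching target for the clean sample
    r:        per-sample reward in [0, 1] (1 = positive, 0 = negative)
    """
    # Implicit positive and negative velocities built around v_old.
    v_pos = (1.0 - beta) * v_old + beta * v_theta
    v_neg = (1.0 + beta) * v_old - beta * v_theta
    axes = tuple(range(1, v_target.ndim))
    err_pos = np.mean((v_pos - v_target) ** 2, axis=axes)
    err_neg = np.mean((v_neg - v_target) ** 2, axis=axes)
    # Reward-weighted supervised objective; no likelihoods are involved.
    return float(np.mean(r * err_pos + (1.0 - r) * err_neg))
```

Under these assumptions, `r = 1` with `beta = 1` reduces to ordinary flow matching on the sample, while `r = 0` is minimized when `v_theta` mirrors the target around `v_old`, which is one way a negative sample can push the model away from what it generated.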

If this is right

Training works with any black-box solver without modification.
Only clean images are needed; no sampling trajectories or likelihoods are required.
Classifier-free guidance becomes unnecessary during the RL stage.
GenEval reaches 0.98 within 1k steps, while FlowGRPO needs over 5k steps plus CFG to reach 0.95.
SD3.5-Medium improves on every tested benchmark when multiple reward models are used.
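To make the "black-box solver" point in the list above concrete, here is a minimal sketch in which the sampler only ever calls a velocity function, so an Euler step and a Heun step can be swapped without touching anything else. The toy field `dx/dt = x` stands in for a trained velocity network purely so the result can be checked against a closed form; the function names and the integration direction (from `t = 1` down to `t = 0`) are assumptions, not details taken from the paper.

```python
import math


def euler_step(v, x, t, h):
    """First-order step for dx/dt = v(x, t)."""
    return x + h * v(x, t)


def heun_step(v, x, t, h):
    """Second-order (Heun) step: average the slope at both ends of the interval."""
    k1 = v(x, t)
    k2 = v(x + h * k1, t + h)
    return x + 0.5 * h * (k1 + k2)


def sample(v, x_start, step_fn, n_steps, t_start=1.0, t_end=0.0):
    """Integrate dx/dt = v(x, t) from t_start to t_end with any step function."""
    h = (t_end - t_start) / n_steps  # negative when integrating toward t = 0
    x, t = x_start, t_start
    for _ in range(n_steps):
        x = step_fn(v, x, t, h)
        t += h
    return x


def toy_velocity(x, t):
    # Toy stand-in for a model: dx/dt = x, so x(0) = x(1) * exp(-1).
    return x


exact = math.exp(-1.0)
euler_out = sample(toy_velocity, 1.0, euler_step, n_steps=10)
heun_out = sample(toy_velocity, 1.0, heun_step, n_steps=10)
```

Both calls share the same `sample` loop and the same velocity function; only the step rule changes, which is the sense in which training that never looks inside the solver leaves the choice of solver free.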

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The forward-process formulation could be tested on video or 3D diffusion models to see whether the same efficiency gains appear outside static images.
Because only clean data is required, the method might integrate more easily with existing curated image datasets than trajectory-based RL approaches.
If the implicit direction remains stable across reward models, it could support multi-objective alignment without explicit weighting schedules.

Load-bearing premise

That contrasting positive and negative generations on the forward process via flow matching supplies a valid implicit policy improvement direction that reliably improves diffusion model behavior under arbitrary black-box solvers.

What would settle it

If a model fine-tuned with DiffusionNFT shows no improvement over plain supervised fine-tuning when evaluated on held-out prompts with a sampler different from the one used to generate training samples, the central claim would be falsified.

Original abstract

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DiffusionNFT trains diffusion models with online RL directly on the forward process using flow matching and positive-negative contrast, which sidesteps reverse-sampling drawbacks and reports large efficiency gains.

The main point here is a shift to optimizing diffusion models on the forward process instead of discretizing the reverse sampling trajectory. By using flow matching to contrast positive and negative generations, the method folds reinforcement signals into a supervised objective without needing likelihoods or full sampling paths. This lets it run with any black-box solver and skip CFG, which the abstract positions as a cleaner alternative to prior GRPO-style approaches on the reverse side.

The reported numbers are the strongest part: up to 25× more efficient than FlowGRPO, with GenEval rising from 0.24 to 0.98 within 1k steps while FlowGRPO reaches 0.95 only after more than 5k steps with CFG, plus consistent lifts on SD3.5-Medium across benchmarks when multiple reward models are used. Those concrete deltas and the practical simplifications are worth noting if the experiments hold up.

The soft spot is the missing link between the forward-process contrast and actual expected reward improvement under standard sampling. The paper does not appear to derive why this implicit direction reliably increases reward when the model is later run with arbitrary solvers, so the efficiency claims rest on the empirical results rather than a guaranteed policy improvement. If the full text includes ablations or analysis that close this gap, it would help; otherwise reviewers will likely press on whether the gains are robust or partly heuristic.

This work is aimed at people doing post-training of diffusion models for images or video who want simpler RL loops. A reader focused on practical fine-tuning recipes would find the setup and numbers useful even if the theory stays light. I would send it to peer review because the paradigm is distinct and the efficiency claims are large enough to merit referee time, though revisions on the justification for the update rule are likely needed.

Referee Report

3 major / 2 minor

Summary. The paper introduces DiffusionNFT, an online RL method for diffusion models that optimizes directly on the forward process via flow matching. It defines an implicit policy improvement direction by contrasting positive and negative generations, claims to eliminate the need for likelihood estimation or specific solvers, and reports up to 25× efficiency gains over FlowGRPO along with large benchmark improvements (e.g., GenEval rising from 0.24 to 0.98 in 1k steps, CFG-free) while boosting SD3.5-Medium across tasks using multiple reward models.

Significance. If the central efficiency and performance claims are substantiated, the work would offer a practically useful advance for post-training diffusion models by enabling solver-agnostic, likelihood-free RL that integrates reinforcement signals into a supervised objective. The reported speedups and benchmark deltas, if reproducible, could influence how reward alignment is performed for generative models.

major comments (3)

§3 (Method formulation): the claim that contrasting positive and negative generations on the forward process supplies a valid implicit policy improvement direction lacks any derivation or reduction to a standard RL objective (policy gradient, KL-regularized reward max, or equivalent) that would guarantee an increase in expected reward when the updated model is sampled with arbitrary black-box solvers. Because the forward process is not the inference trajectory, this gap is load-bearing for all reported efficiency and benchmark claims.
§4 (Experiments and efficiency claims): the 25× efficiency advantage over FlowGRPO is stated without defining the precise metric (wall-clock time, NFEs, or steps), without reporting variance across multiple runs, and without ablations isolating the contribution of the forward-process contrast versus other implementation choices.
Results tables (e.g., GenEval and SD3.5-Medium benchmarks): the large jumps (0.24→0.98 on GenEval within 1k steps) are presented without statistical significance tests, without details on how multiple reward models are combined, and without controls confirming that the gains are not artifacts of the particular black-box solver used at evaluation time.
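The §4 comment above turns on which quantity an efficiency multiple actually measures. As a minimal sketch of one candidate definition (training steps needed to first reach a target score), the snippet below computes a ratio from two learning curves. The curves are made-up placeholders, not results from the paper, and the helper names are hypothetical.

```python
def steps_to_reach(curve, target):
    """Return the first step whose score is >= target, or None if never reached.

    curve: list of (step, score) pairs sorted by step.
    """
    for step, score in curve:
        if score >= target:
            return step
    return None


def efficiency_ratio(baseline_curve, method_curve, target):
    """Baseline steps divided by method steps needed to reach the same target."""
    baseline_steps = steps_to_reach(baseline_curve, target)
    method_steps = steps_to_reach(method_curve, target)
    if baseline_steps is None or method_steps is None:
        return None  # the ratio is undefined if either run never gets there
    return baseline_steps / method_steps


# Placeholder curves for illustration only.
baseline = [(100, 0.3), (200, 0.5), (400, 0.7)]
method = [(10, 0.4), (20, 0.7), (40, 0.9)]
ratio = efficiency_ratio(baseline, method, target=0.7)
```

A step-count ratio like this says nothing about wall-clock time or function evaluations per step, which is why the metric needs to be stated alongside the multiple.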

minor comments (2)

Notation for the flow-matching objective could be clarified to explicitly distinguish the forward-process training signal from the reverse-process sampling distribution used at inference.
The abstract and method sections should include a short related-work paragraph contrasting DiffusionNFT with prior forward-process RL attempts to make the novelty explicit.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, derivations, and additional analyses.

Referee: §3 (Method formulation): the claim that contrasting positive and negative generations on the forward process supplies a valid implicit policy improvement direction lacks any derivation or reduction to a standard RL objective (policy gradient, KL-regularized reward max, or equivalent) that would guarantee an increase in expected reward when the updated model is sampled with arbitrary black-box solvers. Because the forward process is not the inference trajectory, this gap is load-bearing for all reported efficiency and benchmark claims.

Authors: We agree that a formal derivation is needed. In the revised manuscript we will add a new subsection in §3 deriving the positive-negative contrast loss as an implicit policy gradient under the flow-matching objective. The derivation shows that the expected gradient of the contrastive loss is proportional to the gradient of expected reward under the forward noising distribution; because the forward and reverse processes share the same marginals at each noise level, the update direction remains valid for any black-box solver used at inference time. We will also include a short proof sketch reducing the objective to a KL-regularized reward maximization form.
Revision: yes
Referee: §4 (Experiments and efficiency claims): the 25× efficiency advantage over FlowGRPO is stated without defining the precise metric (wall-clock time, NFEs, or steps), without reporting variance across multiple runs, and without ablations isolating the contribution of the forward-process contrast versus other implementation choices.

Authors: We will revise §4 to explicitly state that the 25× factor is measured in training steps required to reach a target performance level (GenEval ≥ 0.95). We will report mean and standard deviation across three independent runs with different random seeds, and add an ablation table isolating the forward-process contrast from other design choices (e.g., reward model usage and CFG-free training). These additions will be placed in the main text and appendix.
Revision: yes
Referee: Results tables (e.g., GenEval and SD3.5-Medium benchmarks): the large jumps (0.24→0.98 on GenEval within 1k steps) are presented without statistical significance tests, without details on how multiple reward models are combined, and without controls confirming that the gains are not artifacts of the particular black-box solver used at evaluation time.

Authors: We will augment the results section with paired t-tests (p < 0.01) computed across the three runs for all reported metrics. We will add a paragraph detailing the reward-model combination procedure (normalized weighted sum with weights chosen by validation performance). Finally, we will include a control experiment evaluating the final model with two additional black-box solvers (Euler and Heun) to confirm that the benchmark gains persist independently of the evaluation solver.
Revision: yes
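The "normalized weighted sum" mentioned in the last response can be read in more than one way. The sketch below shows one plausible reading, assuming each reward model's scores are z-normalized within the batch before being mixed with weights that sum to one. The normalization choice, the function name, and the example reward names are assumptions; the response does not specify them.

```python
import numpy as np


def combine_rewards(scores, weights, eps=1e-8):
    """Mix several reward signals into one per-sample score.

    scores:  dict mapping reward name -> per-sample raw rewards (same length)
    weights: dict mapping reward name -> non-negative weight
    """
    names = sorted(scores)
    w = np.array([weights[n] for n in names], dtype=float)
    w = w / w.sum()  # weights are rescaled to sum to 1
    # Assumed normalization: z-score each reward within the batch so that
    # reward models with different ranges contribute on a comparable scale.
    z = np.stack([
        (np.asarray(scores[n], dtype=float) - np.mean(scores[n]))
        / (np.std(scores[n]) + eps)
        for n in names
    ])  # shape: (num_rewards, batch)
    return w @ z


# Hypothetical reward names and values, for illustration only.
raw = {"reward_a": [0.0, 1.0, 2.0], "reward_b": [10.0, 30.0, 20.0]}
combined = combine_rewards(raw, {"reward_a": 1.0, "reward_b": 1.0})
```

Without some per-reward normalization, the reward with the largest numeric range would dominate the sum regardless of its weight.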

Circularity Check

0 steps flagged

DiffusionNFT defines implicit policy improvement via forward-process pos/neg contrast without reducing to self-fitted quantities or self-citation chains

The paper presents DiffusionNFT as an online RL method that contrasts positive and negative generations directly on the forward process using flow matching to incorporate reinforcement signals into a supervised objective. This is framed as extending existing flow matching and RL ideas rather than deriving from parameters fitted inside the paper itself. No equations or derivations are exhibited that reduce the claimed efficiency gains or benchmark improvements (e.g., GenEval 0.24 to 0.98) to quantities defined by construction within the work. The central formulation is described as enabling arbitrary black-box solvers and eliminating likelihood estimation, keeping the derivation self-contained against external benchmarks and prior flow-matching literature.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method is presented as an extension of flow matching and standard RL contrastive signals.

pith-pipeline@v0.9.0 · 5548 in / 1146 out tokens · 47159 ms · 2026-05-13T16:50:31.773737+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV 2026-05 conditional novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
RewardHarness: Self-Evolving Agentic Post-Training
cs.AI 2026-05 unverdicted novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
Generative Texture Filtering
cs.CV 2026-04 unverdicted novelty 7.0

A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
cs.AI 2026-05 unverdicted novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
A unified perspective on fine-tuning and sampling with diffusion and flow models
stat.ML 2026-04 unverdicted novelty 6.0

A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
cs.LG 2026-04 unverdicted novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
cs.LG 2026-03 unverdicted novelty 6.0

CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 5.0

Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
cs.LG 2026-05 unverdicted novelty 5.0

Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
A Systematic Post-Train Framework for Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
cs.CV 2026-04 unverdicted novelty 5.0

FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
Qwen-Image-2.0 Technical Report
cs.CV 2026-05 unverdicted novelty 4.0

Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
cs.CV 2026-04 unverdicted novelty 4.0

Tstars-Tryon 1.0 is a deployed virtual try-on system claiming high robustness, photorealism, multi-reference flexibility, and near real-time speed for diverse fashion items.
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
cs.CV 2026-04 unverdicted novelty 3.0

Tstars-Tryon 1.0 is a robust, photorealistic virtual try-on system with multi-image support and near real-time speed, deployed at industrial scale on Taobao and accompanied by a released benchmark.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 23 Pith papers · 15 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[2]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.
[3]

Visual generation without guidance

Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, and Jun Zhu. Visual generation without guidance. Forty-second International Conference on Machine Learning, 2025a.
Huayu Chen, Hang Su, Peize Sun, and Jun Zhu. Toward guidance-free AR visual generation via condition contrastive alignment. In ICLR, 2025b.
Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ga...
[4]

DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models

[Page header of the reviewed paper: Published as a conference paper at ICLR 2026]

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885.
[5]

Diffusion guidance is a controllable policy improvement operator

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator. arXiv preprint arXiv:2505.23458, 2025.
[6]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
[7]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.
[8]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
[9]

Inference-time alignment control for diffusion models with reinforcement learning guidance

Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, and Xipeng Qiu. Inference-time alignment control for diffusion models with reinforcement learning guidance. arXiv preprint arXiv:2508.21016.
[10]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192.
[11]

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909.
[12]

Divergence minimization preference optimization for diffusion model alignment

Binxu Li, Minkai Xu, Meihua Dang, and Stefano Ermon. Divergence minimization preference optimization for diffusion model alignment. arXiv preprint arXiv:2507.07510, 2025a.
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE. arXiv preprint arXiv:2507.21802, 2025b...
[13]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.
[14]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470.
[15]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003.
[16]

DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022a.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of ...
[17]

Video diffusion alignment via reward gradients

Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737.
[18]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[19]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[20]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Yang Song, Conor Durk...
[21]

Coefficients-preserving sampling for reinforcement learning with flow matching

Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952.
[22]

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236.
[23]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2096–2105.
[24]

A minimalist approach to LLM reasoning: from rejection sampling to reinforce

Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to LLM reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343, 2025.
[25]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818.
[26]

Fast sampling of diffusion models with exponential integrator

Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902.
[27]

Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics

Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion ODEs. In International Conferen...
[28]

A PROOF OF THEOREMS

Lemma A.1 (Distribution Split). Consider the distribution triplet $\pi^+$, $\pi^-$, and $\pi^{\text{old}}$, as defined in Section 3.1:

$$\pi^+(x_0 \mid c) := \pi^{\text{old}}(x_0 \mid o=1, c) = \frac{p(o=1 \mid x_0, c)\,\pi^{\text{old}}(x_0 \mid c)}{p_{\pi^{\text{old}}}(o=1 \mid c)} = \frac{r(x_0, c)}{p_{\pi^{\text{old}}}(o=1 \mid c)}\,\pi^{\text{old}}(x_0 \mid c) \tag{7}$$

$$\pi^-(x_0 \mid c) := \pi^{\text{old}}(x_0 \mid o=0, c) = \frac{p(o=0 \mid x_0, c)\,\pi^{\text{old}}(x_0 \mid c)}{p_{\pi^{\text{old}}}(o=0 \mid c)} = 1 - r\ldots$$
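The definitions in Lemma A.1 can be sanity-checked numerically on a small discrete space. The sketch below assumes a binary optimality variable `o` with `p(o=1 | x0, c) = r(x0, c)`, as in equation (7), and therefore `p(o=0 | x0, c) = 1 - r(x0, c)`; it then verifies that the old policy is the mixture of the positive and negative policies weighted by `p(o=1 | c)` and `p(o=0 | c)`, which is simply the law of total probability. The four-point distribution and the function name are made up for illustration.

```python
import numpy as np


def split_distribution(pi_old, r):
    """Split pi_old into positive and negative parts given r = p(o=1 | x0, c)."""
    pi_old = np.asarray(pi_old, dtype=float)
    r = np.asarray(r, dtype=float)
    p_pos = float(np.sum(r * pi_old))              # p(o=1 | c)
    pi_pos = r * pi_old / p_pos                    # pi^+(x0 | c)
    pi_neg = (1.0 - r) * pi_old / (1.0 - p_pos)    # pi^-(x0 | c)
    return p_pos, pi_pos, pi_neg


# Made-up four-outcome example.
pi_old = np.array([0.1, 0.2, 0.3, 0.4])
r = np.array([0.9, 0.5, 0.2, 0.0])
p_pos, pi_pos, pi_neg = split_distribution(pi_old, r)

# Law of total probability: pi_old = p(o=1|c) * pi^+ + p(o=0|c) * pi^-.
mixture = p_pos * pi_pos + (1.0 - p_pos) * pi_neg
```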
[29]

derive the flow SDE with unexplained hyperparameters $g_t = a\sqrt{\frac{t}{1-t}}$ or additional complexity. We provide a simpler and more principled perspective based solely on the diffusion model framework. To leverage the diffusion SDE formulation in Song et al. (2020b), we need to match its forward SDE $dx_t = f(t)\,x_t\,dt + g(t)\,dw_t$ with the forward transition kernel $x_t =$...
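The matching step described in the excerpt above can be sketched numerically. For a linear Gaussian kernel of the form `x_t = alpha(t) * x0 + sigma(t) * eps`, the standard relations are `f(t) = alpha'(t) / alpha(t)` and `g(t)^2 = d(sigma^2)/dt - 2 * f(t) * sigma(t)^2`. The excerpt is truncated before it states its kernel, so the rectified-flow schedule `alpha(t) = 1 - t`, `sigma(t) = t` used below is an assumption, as are the function names.

```python
def sde_coefficients(alpha, dalpha, sigma, dsigma, t):
    """Drift and squared diffusion of dx = f(t) x dt + g(t) dw matching
    the kernel x_t = alpha(t) * x0 + sigma(t) * eps."""
    f = dalpha(t) / alpha(t)  # f(t) = d log(alpha) / dt
    g2 = 2.0 * sigma(t) * dsigma(t) - 2.0 * f * sigma(t) ** 2
    return f, g2


# Assumed rectified-flow schedule and its derivatives.
def alpha(t):
    return 1.0 - t


def dalpha(t):
    return -1.0


def sigma(t):
    return t


def dsigma(t):
    return 1.0


f_half, g2_half = sde_coefficients(alpha, dalpha, sigma, dsigma, 0.5)
f_quarter, g2_quarter = sde_coefficients(alpha, dalpha, sigma, dsigma, 0.25)
```

Under this assumed schedule the expressions simplify to `f(t) = -1 / (1 - t)` and `g(t)^2 = 2t / (1 - t)`, so `g(t)` has the same functional form as the `a * sqrt(t / (1 - t))` quoted in the excerpt, with the scale fixed by the matching rather than tuned.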
[30]

Following common practices, the first and last steps degrade to the first-order solver, which is the default Euler discretization for flow models.

B.3 INTUITION BEHIND THE FLOWGRPO OBJECTIVE

We provide some insight into reverse-process diffusion RL by inspecting the FlowGRPO objective in a sampler-agnostic manner. For any first-order SDE sampler, the rever...
[31]

[Figure 11 placeholder. Columns: SD3.5-M (w/ CFG), SD3.5-M (w/o CFG), +FlowGRPO (w/ CFG), +DiffusionNFT (w/o CFG). Prompts: "a photo of a red dog", "a photo of a tie above a sink", "a photo of a toothbrush below a pizza", "a photo of a black potted plant and a yellow toilet", "a photo of a brown hot dog and a purple pizza".]

Figure 11: Qualitative comparison between Fl...