Recognition: 2 theorem links · Lean Theorem
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Pith reviewed 2026-05-13 16:50 UTC · model grok-4.3
The pith
DiffusionNFT trains diffusion models online by contrasting positive and negative samples on the forward process via flow matching, reaching high performance in far fewer steps than prior RL methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By optimizing on the forward process with flow matching that contrasts positive and negative generations, DiffusionNFT supplies an implicit policy improvement direction that can be added to supervised learning without likelihood estimation or solver restrictions, producing up to 25 times higher efficiency than FlowGRPO and lifting SD3.5-Medium performance across every benchmark tested while remaining CFG-free.
What carries the argument
The central mechanism is the contrast between positive and negative generations on the forward process via flow matching, which defines an implicit policy improvement direction folded into the supervised objective.
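As a rough illustration of what such a contrast could look like, the sketch below scores one training sample with a reward-weighted pair of flow-matching losses. Everything specific in it is an assumption made for illustration and is not taken from this page: the linear path `x_t = (1 - t) * x0 + t * noise`, the implicit positive and negative velocities built by interpolating around a frozen `v_old`, the mixing strength `beta`, and a reward `r` in [0, 1].

```python
def interpolate(x0, noise, t):
    """Forward process on an assumed linear path: x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, noise)]


def target_velocity(x0, noise):
    """Flow-matching regression target for that path: v = noise - x0."""
    return [e - a for a, e in zip(x0, noise)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def contrastive_fm_loss(v_theta, v_old, v_target, r, beta=1.0):
    """Reward-weighted contrast between implicit positive and negative velocities.

    v_theta: current model prediction at x_t (hypothetical input)
    v_old:   prediction of the frozen sampling policy at x_t (hypothetical input)
    r:       reward in [0, 1]; 1 means a fully positive sample
    The parameterization of v_pos / v_neg is an assumption for this sketch.
    """
    v_pos = [(1.0 - beta) * o + beta * p for o, p in zip(v_old, v_theta)]
    v_neg = [(1.0 + beta) * o - beta * p for o, p in zip(v_old, v_theta)]
    return r * mse(v_pos, v_target) + (1.0 - r) * mse(v_neg, v_target)
```

With `r = 1` and `beta = 1` the expression reduces to the ordinary flow-matching loss on `v_theta`, which is one way to read the claim that the reinforcement signal is folded into a supervised objective. Only clean samples `x0`, fresh noise, and rewards enter the computation; no sampling trajectory or likelihood is used.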
If this is right
- Training works with any black-box solver without modification.
- Only clean images are needed; no sampling trajectories or likelihoods are required.
- Classifier-free guidance becomes unnecessary during the RL stage.
- GenEval reaches 0.98 within 1k steps while prior methods need over 5k steps plus CFG.
- SD3.5-Medium improves on every tested benchmark when multiple reward models are used.
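The step counts quoted in the list above suggest one way to make an efficiency comparison concrete: record the first training step at which each run reaches a target score, then compare the averages. The sketch below does that under stated assumptions. The helper names and every number in the usage lines are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev


def steps_to_target(curve, target):
    """Return the first training step whose score reaches `target`, else None.

    `curve` is a list of (step, score) pairs in increasing step order.
    """
    for step, score in curve:
        if score >= target:
            return step
    return None


def efficiency_ratio(baseline_steps, method_steps):
    """Ratio of mean steps-to-target: values above 1 favor the method."""
    return mean(baseline_steps) / mean(method_steps)


# Placeholder values for three seeds each; NOT measurements from the paper.
method_steps = [100, 120, 80]
baseline_steps = [1000, 1100, 900]
ratio = efficiency_ratio(baseline_steps, method_steps)
spread = stdev(method_steps)
```

Reporting `spread` alongside `ratio` makes it visible whether a headline factor is stable across seeds or driven by a single fast run.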
Where Pith is reading between the lines
- The forward-process formulation could be tested on video or 3D diffusion models to see whether the same efficiency gains appear outside static images.
- Because only clean data is required, the method might integrate more easily with existing curated image datasets than trajectory-based RL approaches.
- If the implicit direction remains stable across reward models, it could support multi-objective alignment without explicit weighting schedules.
Load-bearing premise
That contrasting positive and negative generations on the forward process via flow matching supplies a valid implicit policy improvement direction that reliably improves diffusion model behavior under arbitrary black-box solvers.
What would settle it
If a model fine-tuned with DiffusionNFT shows no improvement over plain supervised fine-tuning when evaluated on held-out prompts with a sampler different from any used during training, the central claim would be falsified.
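A test of this kind needs at least two samplers that consume the same learned velocity field. The sketch below shows an Euler and a Heun integrator sharing one black-box interface, `velocity(x, t)`. The choice of these two solvers, the scalar state, and the toy field `x - mu` are assumptions made for illustration; a real evaluation would plug in the fine-tuned model instead.

```python
import math


def euler_sample(velocity, x, steps):
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down to t = 0 (data)."""
    dt = -1.0 / steps
    t = 1.0
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def heun_sample(velocity, x, steps):
    """Second-order Heun integrator over the same interval and interface."""
    dt = -1.0 / steps
    t = 1.0
    for _ in range(steps):
        k1 = velocity(x, t)
        x_pred = x + dt * k1                 # Euler predictor
        k2 = velocity(x_pred, t + dt)
        x = x + 0.5 * dt * (k1 + k2)         # trapezoidal corrector
        t += dt
    return x


# Toy field (hypothetical): dx/dt = x - mu, with a closed-form solution.
mu = 0.0
toy_velocity = lambda x, t: x - mu
x_start = 2.0
exact = mu + (x_start - mu) * math.exp(-1.0)
x_euler = euler_sample(toy_velocity, x_start, steps=10)
x_heun = heun_sample(toy_velocity, x_start, steps=10)
```

Because both functions treat `velocity` as opaque, the same checkpoint can be scored under either solver, which is the comparison the falsification test calls for.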
Original abstract
Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiffusionNFT, an online RL method for diffusion models that optimizes directly on the forward process via flow matching. It defines an implicit policy improvement direction by contrasting positive and negative generations, claims to eliminate the need for likelihood estimation or specific solvers, and reports up to 25× efficiency gains over FlowGRPO along with large benchmark improvements (e.g., GenEval rising from 0.24 to 0.98 in 1k steps, CFG-free) while boosting SD3.5-Medium across tasks using multiple reward models.
Significance. If the central efficiency and performance claims are substantiated, the work would offer a practically useful advance for post-training diffusion models by enabling solver-agnostic, likelihood-free RL that integrates reinforcement signals into a supervised objective. The reported speedups and benchmark deltas, if reproducible, could influence how reward alignment is performed for generative models.
major comments (3)
- §3 (Method formulation): the claim that contrasting positive and negative generations on the forward process supplies a valid implicit policy improvement direction lacks any derivation or reduction to a standard RL objective (policy gradient, KL-regularized reward max, or equivalent) that would guarantee an increase in expected reward when the updated model is sampled with arbitrary black-box solvers. Because the forward process is not the inference trajectory, this gap is load-bearing for all reported efficiency and benchmark claims.
- §4 (Experiments and efficiency claims): the 25× efficiency advantage over FlowGRPO is stated without defining the precise metric (wall-clock time, NFEs, or steps), without reporting variance across multiple runs, and without ablations isolating the contribution of the forward-process contrast versus other implementation choices.
- Results tables (e.g., GenEval and SD3.5-Medium benchmarks): the large jumps (0.24→0.98 on GenEval within 1k steps) are presented without statistical significance tests, without details on how multiple reward models are combined, and without controls confirming that the gains are not artifacts of the particular black-box solver used at evaluation time.
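For reference, the "KL-regularized reward max" form named in the first comment is usually written as below. This is the standard textbook objective and its closed-form optimum, stated here only as the target a reduction would aim at. It is not a formula taken from the paper, and the temperature $\beta$ is a symbol introduced for this sketch.

```latex
\max_{\pi}\;
  \mathbb{E}_{x_0 \sim \pi(\cdot \mid c)}\bigl[r(x_0, c)\bigr]
  \;-\; \beta\, D_{\mathrm{KL}}\bigl(\pi(\cdot \mid c)\,\|\,\pi^{\mathrm{old}}(\cdot \mid c)\bigr),
\qquad
\pi^{*}(x_0 \mid c) \;\propto\; \pi^{\mathrm{old}}(x_0 \mid c)\,
  \exp\!\bigl(r(x_0, c)/\beta\bigr).
```

A derivation that satisfies the comment would need to show that the forward-process update moves the model toward an optimum of this kind, or toward an equivalent policy-gradient fixed point.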
minor comments (2)
- Notation for the flow-matching objective could be clarified to explicitly distinguish the forward-process training signal from the reverse-process sampling distribution used at inference.
- The abstract and method sections should include a short related-work paragraph contrasting DiffusionNFT with prior forward-process RL attempts to make the novelty explicit.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, derivations, and additional analyses.
Point-by-point responses
-
Referee: §3 (Method formulation): the claim that contrasting positive and negative generations on the forward process supplies a valid implicit policy improvement direction lacks any derivation or reduction to a standard RL objective (policy gradient, KL-regularized reward max, or equivalent) that would guarantee an increase in expected reward when the updated model is sampled with arbitrary black-box solvers. Because the forward process is not the inference trajectory, this gap is load-bearing for all reported efficiency and benchmark claims.
Authors: We agree that a formal derivation is needed. In the revised manuscript we will add a new subsection in §3 deriving the positive-negative contrast loss as an implicit policy gradient under the flow-matching objective. The derivation shows that the expected gradient of the contrastive loss is proportional to the gradient of expected reward under the forward noising distribution; because the forward and reverse processes share the same marginals at each noise level, the update direction remains valid for any black-box solver used at inference time. We will also include a short proof sketch reducing the objective to a KL-regularized reward maximization form.
Revision: yes
-
Referee: §4 (Experiments and efficiency claims): the 25× efficiency advantage over FlowGRPO is stated without defining the precise metric (wall-clock time, NFEs, or steps), without reporting variance across multiple runs, and without ablations isolating the contribution of the forward-process contrast versus other implementation choices.
Authors: We will revise §4 to explicitly state that the 25× factor is measured in training steps required to reach a target performance level (GenEval ≥ 0.95). We will report mean and standard deviation across three independent runs with different random seeds, and add an ablation table isolating the forward-process contrast from other design choices (e.g., reward model usage and CFG-free training). These additions will be placed in the main text and appendix.
Revision: yes
-
Referee: Results tables (e.g., GenEval and SD3.5-Medium benchmarks): the large jumps (0.24→0.98 on GenEval within 1k steps) are presented without statistical significance tests, without details on how multiple reward models are combined, and without controls confirming that the gains are not artifacts of the particular black-box solver used at evaluation time.
Authors: We will augment the results section with paired t-tests (p < 0.01) computed across the three runs for all reported metrics. We will add a paragraph detailing the reward-model combination procedure (normalized weighted sum with weights chosen by validation performance). Finally, we will include a control experiment evaluating the final model with two additional black-box solvers (Euler and Heun) to confirm that the benchmark gains persist independently of the evaluation solver.
Revision: yes
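The "normalized weighted sum" mentioned in the last response can be read in several ways. The sketch below shows one plausible reading: min-max normalize each reward model's scores over a batch, then take a weighted average. The normalization scheme, the function names, and the reward-model names in the usage lines are all assumptions for illustration, since the response does not specify them.

```python
def normalize(scores):
    """Min-max scale a list of scores to [0, 1]; constant lists map to zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_rewards(reward_table, weights):
    """Weighted average of per-model normalized rewards for each sample.

    reward_table: {model_name: [score for each sample]}
    weights:      {model_name: non-negative weight}
    """
    names = list(weights)
    total_weight = sum(weights.values())
    normed = {name: normalize(reward_table[name]) for name in names}
    n_samples = len(normed[names[0]])
    return [
        sum(weights[name] * normed[name][i] for name in names) / total_weight
        for i in range(n_samples)
    ]


# Hypothetical reward models and scores for three generated images.
table = {"model_a": [0.0, 5.0, 10.0], "model_b": [1.0, 1.0, 3.0]}
combined = combine_rewards(table, {"model_a": 1.0, "model_b": 1.0})
```

Normalizing before weighting keeps a reward model with a large raw scale from dominating the sum, which matters if the weights are later tuned on validation performance.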
Circularity Check
DiffusionNFT defines implicit policy improvement via forward-process pos/neg contrast without reducing to self-fitted quantities or self-citation chains
full rationale
The paper presents DiffusionNFT as an online RL method that contrasts positive and negative generations directly on the forward process using flow matching to incorporate reinforcement signals into a supervised objective. This is framed as extending existing flow matching and RL ideas rather than deriving from parameters fitted inside the paper itself. No equations or derivations are exhibited that reduce the claimed efficiency gains or benchmark improvements (e.g., GenEval 0.24 to 0.98) to quantities defined by construction within the work. The central formulation is described as enabling arbitrary black-box solvers and eliminating likelihood estimation, keeping the derivation self-contained against external benchmarks and prior flow-matching literature.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
Cost.FunctionalEquation: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
-
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
-
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Generative Texture Filtering
A two-stage fine-tuning strategy on pre-trained generative models enables effective texture filtering that outperforms prior methods on challenging cases.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
-
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
-
Flow-OPD: On-Policy Distillation for Flow Matching Models
Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
A unified perspective on fine-tuning and sampling with diffusion and flow models
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...
-
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
-
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
-
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
-
Towards General Preference Alignment: Diffusion Models at Nash Equilibrium
Diff.-NPO frames diffusion alignment as a self-play game reaching Nash equilibrium and reports better text-to-image results than prior DPO-style methods.
-
A Systematic Post-Train Framework for Video Generation
A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
-
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
-
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 unifies high-fidelity image generation and precise editing by coupling Qwen3-VL with a Multimodal Diffusion Transformer, improving text rendering, photorealism, and complex prompt following over prior versions.
-
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
-
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
-
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Tstars-Tryon 1.0 is a deployed virtual try-on system claiming high robustness, photorealism, multi-reference flexibility, and near real-time speed for diverse fashion items.
-
Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Tstars-Tryon 1.0 is a robust, photorealistic virtual try-on system with multi-image support and near real-time speed, deployed at industrial scale on Taobao and accompanied by a released benchmark.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
-
[2]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301,
-
[3]
Visual generation without guidance
Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, and Jun Zhu. Visual generation without guidance. Forty-second international conference on machine learning, 2025a.
Huayu Chen, Hang Su, Peize Sun, and Jun Zhu. Toward guidance-free ar visual generation via condition contrastive alignment. In ICLR, 2025b.
Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ga...
-
[4]
Published as a conference paper at ICLR 2026
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885,
-
[5]
Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator. arXiv preprint arXiv:2505.23458,
-
[6]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
-
[7]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718,
-
[8]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
-
[9]
Inference-time alignment control for diffusion models with reinforcement learning guidance
Luozhijie Jin, Zijie Qiu, Jie Liu, Zijie Diao, Lifeng Qiao, Ning Ding, Alex Lamb, and Xipeng Qiu. Inference-time alignment control for diffusion models with reinforcement learning guidance. arXiv preprint arXiv:2508.21016,
-
[10]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192,
-
[11]
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909,
-
[12]
Binxu Li, Minkai Xu, Meihua Dang, and Stefano Ermon. Divergence minimization preference optimization for diffusion model alignment. arXiv preprint arXiv:2507.07510, 2025a.
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde. arXiv preprint arXiv:2507.21802, 2025b...
-
[13]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,
-
[14]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl. arXiv preprint arXiv:2505.05470,
-
[15]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,
-
[16]
Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in neural information processing systems, 35:5775–5787, 2022a.
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of ...
-
[17]
Video diffusion alignment via reward gradients
Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737,
-
[18]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[19]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,
-
[20]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Yang Song, Conor Durk...
-
[21]
Coefficients-preserving sampling for reinforcement learning with flow matching
Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952,
-
[22]
Unified Reward Model for Multimodal Understanding and Generation
Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236,
-
[23]
Human preference score: Better aligning text-to-image models with human preference
Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2096–2105,
-
[24]
A minimalist approach to llm reasoning: from rejection sampling to reinforce, 2025
Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, et al. A minimalist approach to llm reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343,
-
[25]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818,
-
[26]
Fast sampling of diffusion models with exponential integrator
Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902,
-
[27]
Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics
Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion odes. In International Conferen...
-
[28]
A PROOF OF THEOREMS
Lemma A.1 (Distribution Split). Consider the distribution triplet $\pi^{+}$, $\pi^{-}$, and $\pi^{\mathrm{old}}$, as defined in Section 3.1:
$$\pi^{+}(x_0|c) := \pi^{\mathrm{old}}(x_0|o=1,c) = \frac{p(o=1|x_0,c)\,\pi^{\mathrm{old}}(x_0|c)}{p_{\pi^{\mathrm{old}}}(o=1|c)} = \frac{r(x_0,c)}{p_{\pi^{\mathrm{old}}}(o=1|c)}\,\pi^{\mathrm{old}}(x_0|c) \tag{7}$$
$$\pi^{-}(x_0|c) := \pi^{\mathrm{old}}(x_0|o=0,c) = \frac{p(o=0|x_0,c)\,\pi^{\mathrm{old}}(x_0|c)}{p_{\pi^{\mathrm{old}}}(o=0|c)} = 1-r\ldots$$
-
[29]
derive the flow SDE with unexplained hyperparameters $g_t = a\sqrt{\frac{t}{1-t}}$ or additional complexity. We provide a simpler and more principled perspective based solely on the diffusion model framework. To leverage the diffusion SDE formulation in Song et al. (2020b), we need to match its forward SDE $\mathrm{d}x_t = f(t)\,x_t\,\mathrm{d}t + g(t)\,\mathrm{d}w_t$ with the forward transition kernel $x_t =$...
-
[30]
Following common practices, the first and last steps degrade to the first-order solver, which is the default Euler discretization for flow models.
B.3 INTUITION BEHIND THE FLOWGRPO OBJECTIVE
We provide some insight into reverse-process diffusion RL by inspecting the FlowGRPO objective in a sampler-agnostic manner. For any first-order SDE sampler, the rever...
-
[31]
Figure 11: Qualitative comparison between Fl... Panels: SD3.5-M (w/ CFG); SD3.5-M (w/o CFG); +FlowGRPO (w/ CFG); +DiffusionNFT (w/o CFG). Prompts: a photo of a red dog; a photo of a tie above a sink; a photo of a toothbrush below a pizza; a photo of a black potted plant and a yellow toilet; a photo of a brown hot dog and a purple pizza.
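The definitions in Lemma A.1 (entry [28] above) condition the old policy on an optimality variable `o`, with `r(x0, c)` playing the role of `p(o=1 | x0, c)`. Under that reading, the law of total probability gives `pi_old = p(o=1) * pi_pos + p(o=0) * pi_neg`, which is presumably the "split" the lemma's name refers to. The sketch below checks this numerically on a small discrete distribution. The three-outcome support and all numbers are hypothetical.

```python
def split_distribution(pi_old, r):
    """Split a discrete pi_old into positive and negative parts.

    pi_old: probabilities over a finite set of outcomes (sums to 1)
    r:      per-outcome probability of being optimal, each in [0, 1]
    Returns (p_pos, pi_pos, pi_neg), where p_pos = p(o = 1).
    """
    p_pos = sum(p * ri for p, ri in zip(pi_old, r))
    p_neg = 1.0 - p_pos
    pi_pos = [ri * p / p_pos for p, ri in zip(pi_old, r)]
    pi_neg = [(1.0 - ri) * p / p_neg for p, ri in zip(pi_old, r)]
    return p_pos, pi_pos, pi_neg


# Hypothetical three-outcome example.
pi_old = [0.5, 0.3, 0.2]
r = [0.9, 0.5, 0.1]
p_pos, pi_pos, pi_neg = split_distribution(pi_old, r)

# Recombine the two parts with their mixture weights.
mixture = [p_pos * a + (1.0 - p_pos) * b for a, b in zip(pi_pos, pi_neg)]
```

Here `mixture` reproduces `pi_old` up to floating-point error, and both `pi_pos` and `pi_neg` are proper distributions. The function assumes `0 < p_pos < 1`; a batch in which every sample is positive or every sample is negative would need separate handling.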