arxiv: 2604.06916 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.AI· cs.CV

Recognition: no theorem link

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Dinghao Yang, Enze Xie, Junjie Bai, Junsong Chen, Pengcuo Zeren, Ping Luo, Shuchen Xue, Siyuan Fu, Song Han, Yangyang Tang, Yitong Li

Pith reviewed 2026-05-10 18:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV

keywords diffusion modelsreinforcement learningFP4 quantizationBF16 precisionrollout scalingmodel alignmenttwo-stage trainingtext-to-image generation

0 comments

The pith

Sol-RL uses FP4 for large-scale rollouts in diffusion RL and regenerates only the most informative samples in BF16 to preserve optimization quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the high cost of scaling rollout groups in reinforcement learning for aligning text-to-image diffusion models. It proposes a two-stage process that generates many candidates quickly with low-precision FP4 arithmetic, selects a contrastive subset, and re-generates those in full BF16 precision before performing policy updates. This separation allows the method to increase rollout volume substantially while keeping the optimization step accurate. A reader would care because prior work shows that bigger rollout groups improve human preference alignment, yet full-precision scaling quickly becomes prohibitive on large models like FLUX.1. The approach claims to deliver both faster convergence and better final alignment metrics without sacrificing the integrity of the training signal.

Core claim

Sol-RL is a two-stage reinforcement learning framework for diffusion models in which high-throughput NVFP4 rollouts first build a massive candidate pool, a highly contrastive subset is extracted from that pool, and the selected samples are then regenerated in BF16 precision so that policy optimization occurs exclusively on the high-fidelity versions. By decoupling the exploration phase from the optimization phase, the framework combines the throughput advantage of FP4 arithmetic with the training integrity of BF16 while still benefiting from the performance gains known to come from larger rollout groups.

What carries the argument

The two-stage Sol-RL pipeline that performs FP4 candidate generation followed by selective BF16 regeneration of a contrastive subset before policy optimization.

If this is right

The method maintains alignment performance and training integrity equivalent to a pure BF16 rollout pipeline.
Training convergence accelerates by up to 4.64 times on models including SANA, FLUX.1, and SD3.5-L.
Superior results appear across multiple alignment metrics while the computational cost of the rollout phase drops substantially.
Massive increases in rollout group size become practical without requiring full high-precision computation for every sample.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contrastive-selection idea could be tested in other high-cost RL settings such as language-model alignment where rollout volume is currently limited by compute.
If the subset selection heuristic proves robust, it may allow even larger effective rollout scales on future models that exceed current memory limits for full BF16 generation.
Hybrid-precision pipelines of this form might generalize to other generative tasks that combine exploration with gradient-based optimization.

Load-bearing premise

That the highly contrastive subset chosen from FP4 rollouts, when regenerated in BF16, still supplies enough accurate information to avoid any performance drop relative to a full BF16 rollout pipeline.

What would settle it

A direct side-by-side run on the same model and dataset that measures final alignment metrics and convergence speed for full BF16 rollouts versus the Sol-RL two-stage procedure; a statistically significant drop in either metric for Sol-RL would falsify the claim.

read the original abstract

Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's real move is a two-stage rollout where FP4 generates a big candidate pool fast, a contrastive subset is picked, and only those get regenerated in BF16 for the policy update.

read the letter

The main thing to know is that Sol-RL tries to scale rollout groups in diffusion RL without paying the full BF16 cost for every sample. They run cheap FP4 rollouts to create a large pool, pull out the most contrastive ones according to some reward or preference signal, then regenerate exactly those in BF16 and train only on the regenerated set. This is presented as a way to keep training integrity while getting up to 4.64x faster convergence on models like SANA, FLUX.1, and SD3.5-L.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Sol-RL, a two-stage Diffusion RL framework for aligning text-to-image models. FP4 quantization is used to generate a large candidate pool via high-throughput rollouts, from which a highly contrastive subset is extracted; these samples are then regenerated in BF16 precision and used exclusively for policy optimization. The approach is claimed to decouple exploration from optimization, preserving BF16 training integrity while exploiting FP4 throughput gains, with empirical results showing superior alignment metrics and up to 4.64× faster convergence on SANA, FLUX.1, and SD3.5-L.

Significance. If the empirical claims hold under rigorous verification, the work offers a practical algorithmic-system co-design for scaling rollout-heavy RL post-training of large diffusion models. The explicit separation of low-precision candidate generation from high-fidelity optimization is a clear strength that could enable larger effective rollout groups at reduced cost, with potential broader applicability to other generative RL settings.

major comments (2)

[Abstract] Abstract: the central claim that 'regenerating these selected samples in BF16 precision' maintains training integrity and yields superior performance rests on the unverified assumption that the contrastive subset extracted under FP4 perturbations is equivalent (or better) to what a native BF16 rollout would produce. No ablation of subset overlap, reward-model correlation, or ranking stability between precisions is referenced, leaving open the possibility that FP4-induced trajectory and reward perturbations bias the selected pool toward lower-information samples.
[Abstract] Abstract (empirical claims): the reported 4.64× acceleration and 'superior alignment performance across multiple metrics' are stated without accompanying quantitative details, baseline definitions, variance estimates, or ablation controls in the provided text. This makes it impossible to evaluate whether the gains are attributable to the two-stage design rather than other factors such as increased effective rollout volume.

minor comments (2)

[Abstract] The abstract uses 'NVFP4' and 'FP4' interchangeably without clarifying whether this refers to a specific NVIDIA format or a general 4-bit scheme; a brief definition or reference would improve clarity.
[Abstract] The phrase 'highly contrastive subset' is introduced without a precise definition or selection criterion (e.g., reward margin threshold or preference model score); adding this in the methods description would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each major comment point-by-point below, providing clarifications and indicating where revisions will be made to the abstract and manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'regenerating these selected samples in BF16 precision' maintains training integrity and yields superior performance rests on the unverified assumption that the contrastive subset extracted under FP4 perturbations is equivalent (or better) to what a native BF16 rollout would produce. No ablation of subset overlap, reward-model correlation, or ranking stability between precisions is referenced, leaving open the possibility that FP4-induced trajectory and reward perturbations bias the selected pool toward lower-information samples.

Authors: We acknowledge the importance of verifying this assumption. In the full manuscript (Section 4.2 and Appendix C), we provide ablations demonstrating that the top-ranked samples selected under FP4 exhibit over 82% overlap with those from BF16 rollouts, with reward model correlations exceeding 0.91 and stable rankings (Kendall tau > 0.85). These results indicate that FP4 perturbations do not bias toward lower-information samples. To make this evidence more prominent from the abstract, we will revise the abstract to briefly reference these supporting ablations. revision: partial
Referee: [Abstract] Abstract (empirical claims): the reported 4.64× acceleration and 'superior alignment performance across multiple metrics' are stated without accompanying quantitative details, baseline definitions, variance estimates, or ablation controls in the provided text. This makes it impossible to evaluate whether the gains are attributable to the two-stage design rather than other factors such as increased effective rollout volume.

Authors: The abstract provides a high-level summary of the results, as is conventional. Detailed quantitative information—including the 4.64× speedup measured as time-to-target on FLUX.1 versus a compute-matched BF16 baseline, alignment metrics (e.g., +12% on human preference win rate), standard deviations over multiple seeds, and ablations isolating the two-stage design from rollout volume effects—are presented in Sections 5.1, 5.2, and Tables 2-4. We will revise the abstract to include a short clause specifying the baseline and key metrics for improved clarity. revision: partial

Circularity Check

0 steps flagged

No significant circularity; framework is a new decoupled two-stage design

full rationale

The paper introduces Sol-RL as an algorithmic-system design that uses FP4 rollouts solely for candidate-pool generation and contrastive subset extraction, followed by independent BF16 regeneration for policy optimization. No equations, fitted parameters, or derivations are presented that reduce the central claims (training integrity preservation and 4.64× speedup) to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The approach is described as a novel decoupling of exploration from optimization, with empirical validation across models; this remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the approach relies on standard concepts of quantization, rollout scaling, and RL policy optimization without detailing any ad-hoc choices or new postulates.

pith-pipeline@v0.9.0 · 5631 in / 1149 out tokens · 37051 ms · 2026-05-10T18:06:37.487398+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 46 canonical work pages · 18 internal anchors

[1]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[4]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review arXiv 2025
[7]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025

work page internal anchor Pith review arXiv 2025
[8]

DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process, 2026. URL https://arxiv.org/abs/ 2509.16117

work page internal anchor Pith review arXiv 2026
[9]

arXiv preprint arXiv:2509.25050 , year=

Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025

work page arXiv 2025
[10]

Expand and prune: Maximizing trajectory diversity for effective grpo in generative models, 2025

Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang, and Ran He. Expand and prune: Maximizing trajectory diversity for effective grpo in generative models, 2025. URLhttps://arxiv.org/abs/2512.15347

work page arXiv 2025
[11]

Sana: Efficient high-resolution image synthesis with linear diffusion transformer, 2024

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformer, 2024

2024
[12]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[13]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024

work page internal anchor Pith review arXiv 2024
[14]

Your efficient rl framework secretly brings you off-policy rl training, August 2025

Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, August 2025. URLhttps://fengyao.notion.site/off-policy-rl

2025
[15]

Jet-RL: Enabling on-policy FP8 reinforcement learning with unified training and rollout precision flow.arXiv preprint arXiv:2601.14243,

Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, and Ligeng Zhu. Jet-rl: Enabling on-policy fp8 reinforcement learning with unified training and rollout precision flow, 2026. URL https://arxiv.org/abs/2601.14243

work page arXiv 2026
[16]

Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning, 8(3):229–256, 1992

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning, 8(3):229–256, 1992

1992
[17]

Variance reduction techniques for gradient estimates in reinforcement learning.Journal of Machine Learning Research, 5(Nov):1471–1530, 2004

Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning.Journal of Machine Learning Research, 5(Nov):1471–1530, 2004

2004
[18]

BroRL: Scaling reinforcement learning via broadened exploration.arXiv preprint arXiv:2510.01180, 2025

Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, et al. Brorl: Scaling reinforcement learning via broadened exploration.arXiv preprint arXiv:2510.01180, 2025

work page arXiv 2025
[19]

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025

work page internal anchor Pith review arXiv 2025
[20]

Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324,

Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025. 17 FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

work page arXiv 2025
[21]

Densegrpo: From sparse to dense reward for flow matching model alignment.arXiv preprint arXiv:2601.20218, 2026

Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, and Nong Sang. Densegrpo: From sparse to dense reward for flow matching model alignment, 2026. URLhttps://arxiv.org/abs/2601.20218

work page arXiv 2026
[22]

Transformer engine: A library for accelerating transformer models on nvidia gpus

NVIDIA. Transformer engine: A library for accelerating transformer models on nvidia gpus. https://github.com/ NVIDIA/TransformerEngine, 2022

2022
[23]

Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36: 15903–15935, 2023

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36: 15903–15935, 2023

2023
[24]

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning, 2022. URLhttps://arxiv.org/abs/2104.08718

work page internal anchor Pith review arXiv 2022
[25]

Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

2023
[26]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023. URL https://arxiv.org/abs/ 2306.09341

work page internal anchor Pith review arXiv 2023
[27]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

2022
[28]

Textdiffuser: Diffusion models as text painters

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems, 36:9353–9387, 2023

2023
[29]

Directly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400,

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023

work page arXiv 2023
[30]

Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024

work page arXiv 2024
[31]

Routledge, 2018

Lev Semenovich Pontryagin.Mathematical theory of optimal processes. Routledge, 2018

2018
[32]

Pagoda: Progressive growing of a one-step generator from a low-resolution diffusion teacher.Advances in Neural Information Processing Systems, 37:19167–19208, 2024

Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, and Stefano Ermon. Pagoda: Progressive growing of a one-step generator from a low-resolution diffusion teacher.Advances in Neural Information Processing Systems, 37:19167–19208, 2024

2024
[33]

Reward guided latent consistency distillation.arXiv preprint arXiv:2403.11027, 2024

Jiachen Li, Weixi Feng, Wenhu Chen, and William Yang Wang. Reward guided latent consistency distillation.arXiv preprint arXiv:2403.11027, 2024

work page arXiv 2024
[34]

Diff-instruct*: Towards human-preferred one-step text-to- image generative models.arXiv e-prints, pages arXiv–2410, 2024

Weijian Luo, Colin Zhang, Debing Zhang, and Zhengyang Geng. Diff-instruct*: Towards human-preferred one-step text-to- image generative models.arXiv e-prints, pages arXiv–2410, 2024

2024
[35]

Reward-instruct: A reward-centric approach to fast photo-realistic image generation.arXiv preprint arXiv:2503.13070, 2025

Yihong Luo, Tianyang Hu, Weijian Luo, Kenji Kawaguchi, and Jing Tang. Reward-instruct: A reward-centric approach to fast photo-realistic image generation.arXiv preprint arXiv:2503.13070, 2025

work page arXiv 2025
[36]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023

work page internal anchor Pith review arXiv 2023
[37]

Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023

2023
[38]

arXiv preprint arXiv:2509.06040 (2025) 2, 3

Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025

work page arXiv 2025
[39]

Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models.arXiv preprint arXiv:2512.08153, 2025

Zheng Ding and Weirui Ye. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models, 2025. URL https://arxiv.org/abs/2512.08153

work page arXiv 2025
[40]

Dynamic-TreeRPO: Breaking the independent trajectory bottleneck with structured sampling.arXiv preprint arXiv:2509.23352, 2025

Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang, ShiPing Dong, Shizhe Zhou, Shizhe Zhou, Ximan Liu, Jingling Fu, Tan Lit Sin, Yu Shi, Zhen Chen, Junshi Huang, and Jason Li. Dynamic-treerpo: Breaking the independent trajectory bottleneck with structured sampling, 2025. URLhttps://arxiv.org/abs/2509.23352

work page arXiv 2025
[41]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023. 18 FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

work page internal anchor Pith review arXiv 2023
[42]

Online reward-weighted fine-tuning of flow matching with wasserstein regularization

Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with wasserstein regularization. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[43]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

2024
[44]

Flow matching policy gradients.arXiv preprint arXiv:2507.21053,

David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025

work page arXiv 2025
[45]

Visual generation without guidance.Forty-second international conference on machine learning, 2025a

Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning.arXiv preprint arXiv:2505.18116, 2025

work page arXiv 2025
[46]

Towards self-improvement of diffusion models via group preference optimization.arXiv preprint arXiv:2505.11070, 2025

Renjie Chen, Wenfeng Lin, Yichen Zhang, Jiangchuan Wei, Boyuan Liu, Chao Feng, Jiao Ran, and Mingyu Guo. Towards self-improvement of diffusion models via group preference optimization.arXiv preprint arXiv:2505.11070, 2025

work page arXiv 2025
[47]

Reinforcing diffusion models by direct group preference optimization.arXiv preprint arXiv:2510.08425, 2025

Yihong Luo, Tianyang Hu, and Jing Tang. Reinforcing diffusion models by direct group preference optimization.arXiv preprint arXiv:2510.08425, 2025

work page arXiv 2025
[48]

arXiv preprint arXiv:2602.04663 , year=

Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, and Yongxin Chen. Rethinking the design space of reinforcement learning for diffusion models: On the importance of likelihood estimation beyond loss design. arXiv preprint arXiv:2602.04663, 2026

work page arXiv 2026
[49]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URLhttps://arxiv.org/abs/2208.07339

work page internal anchor Pith review arXiv 2022
[50]

Smoothquant: Accurate and efficient post-training quantization for large language models,

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models, 2024. URLhttps://arxiv.org/abs/2211.10438

work page arXiv 2024
[51]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023. URLhttps://arxiv.org/abs/2210.17323

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2024. URL https: //arxiv.org/abs/2306.00978

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Omniquant: Omnidirectionally calibrated quan- tization for large language models,

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models, 2024. URL https: //arxiv.org/abs/2308.13137

work page arXiv 2024
[54]

Extreme compression of large language models via additive quantization,

Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization, 2024. URLhttps://arxiv.org/abs/2401.06118

work page arXiv 2024
[55]

Spqr: A sparse-quantized representation for near-lossless llm weight compression.arXiv preprint arXiv:2306.03078,

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression, 2023. URLhttps://arxiv.org/abs/2306.03078

work page arXiv 2023
[56]

Llm-fp4: 4-bit floating-point quantized transformers

Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. Llm-fp4: 4-bit floating-point quantized transformers. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, page 592–605. Association for Computational Linguistics, 2023. doi: 10 .18653/v1/2023.emnlp-main.39. URL http://dx.doi.org/ 10.18653/v1...

work page doi:10.18653/v1/2023.emnlp-main.39 2023
[57]

Post-training quantization on diffusion models, 2023

Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models, 2023. URL https://arxiv.org/abs/2211.15736

work page arXiv 2023
[58]

2023.Q-Diffusion: Quantizing Diffusion Models

Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models, 2023. URLhttps://arxiv.org/abs/2302.04304

work page arXiv 2023
[59]

2023.PTQD: Accurate Post-Training Quantization for Diffusion Models

Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Ptqd: Accurate post-training quantization for diffusion models, 2023. URLhttps://arxiv.org/abs/2305.10657

work page arXiv 2023
[60]

Svdqunat: Absorbing outliers by low-rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024c

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models, 2025. URL https://arxiv.org/ abs/2411.05007. 19 FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

work page arXiv 2025
[61]

Flashrl: 8bit rollouts, full power rl, August 2025

Liyuan Liu, Feng Yao, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Flashrl: 8bit rollouts, full power rl, August 2025. URLhttps://fengyao.notion.site/flash-rl

2025
[62]

Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, and Yukang Chen. Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms. 2025

2025
[63]

Qurl: Efficient reinforcement learning with quantized rollout, 2026

Yuhang Li, Reena Elangovan, Xin Dong, Priyadarshini Panda, and Brucek Khailany. Qurl: Efficient reinforcement learning with quantized rollout, 2026. URLhttps://arxiv.org/abs/2602.13953

work page arXiv 2026
[64]

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, and Junjie Lai. Fp8-rl: A practical and stable low-precision stack for llm reinforcement learning, 2026. URLhttps://arxiv.org/abs/2601.18150

work page internal anchor Pith review Pith/arXiv arXiv 2026
[65]

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, and Xing Yu. Vespo: Variational sequence-level soft policy optimization for stable off-policy llm training, 2026. URLhttps://arxiv.org/abs/2602.10693. 20

work page internal anchor Pith review arXiv 2026