Recognition: no theorem link
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Pith reviewed 2026-05-10 18:06 UTC · model grok-4.3
The pith
Sol-RL uses FP4 for large-scale rollouts in diffusion RL and regenerates only the most informative samples in BF16 to preserve optimization quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sol-RL is a two-stage reinforcement learning framework for diffusion models in which high-throughput NVFP4 rollouts first build a massive candidate pool, a highly contrastive subset is extracted from that pool, and the selected samples are then regenerated in BF16 precision so that policy optimization occurs exclusively on the high-fidelity versions. By decoupling the exploration phase from the optimization phase, the framework combines the throughput advantage of FP4 arithmetic with the training integrity of BF16 while still benefiting from the performance gains known to come from larger rollout groups.
What carries the argument
The two-stage Sol-RL pipeline that performs FP4 candidate generation followed by selective BF16 regeneration of a contrastive subset before policy optimization.
If this is right
- The method maintains alignment performance and training integrity equivalent to a pure BF16 rollout pipeline.
- Training convergence accelerates by up to 4.64 times on models including SANA, FLUX.1, and SD3.5-L.
- Superior results appear across multiple alignment metrics while the computational cost of the rollout phase drops substantially.
- Massive increases in rollout group size become practical without requiring full high-precision computation for every sample.
Where Pith is reading between the lines
- The same contrastive-selection idea could be tested in other high-cost RL settings such as language-model alignment where rollout volume is currently limited by compute.
- If the subset selection heuristic proves robust, it may allow even larger effective rollout scales on future models that exceed current memory limits for full BF16 generation.
- Hybrid-precision pipelines of this form might generalize to other generative tasks that combine exploration with gradient-based optimization.
Load-bearing premise
That the highly contrastive subset chosen from FP4 rollouts, when regenerated in BF16, still supplies enough accurate information to avoid any performance drop relative to a full BF16 rollout pipeline.
What would settle it
A direct side-by-side run on the same model and dataset that measures final alignment metrics and convergence speed for full BF16 rollouts versus the Sol-RL two-stage procedure; a statistically significant drop in either metric for Sol-RL would falsify the claim.
read the original abstract
Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Sol-RL, a two-stage Diffusion RL framework for aligning text-to-image models. FP4 quantization is used to generate a large candidate pool via high-throughput rollouts, from which a highly contrastive subset is extracted; these samples are then regenerated in BF16 precision and used exclusively for policy optimization. The approach is claimed to decouple exploration from optimization, preserving BF16 training integrity while exploiting FP4 throughput gains, with empirical results showing superior alignment metrics and up to 4.64× faster convergence on SANA, FLUX.1, and SD3.5-L.
Significance. If the empirical claims hold under rigorous verification, the work offers a practical algorithmic-system co-design for scaling rollout-heavy RL post-training of large diffusion models. The explicit separation of low-precision candidate generation from high-fidelity optimization is a clear strength that could enable larger effective rollout groups at reduced cost, with potential broader applicability to other generative RL settings.
major comments (2)
- [Abstract] Abstract: the central claim that 'regenerating these selected samples in BF16 precision' maintains training integrity and yields superior performance rests on the unverified assumption that the contrastive subset extracted under FP4 perturbations is equivalent (or better) to what a native BF16 rollout would produce. No ablation of subset overlap, reward-model correlation, or ranking stability between precisions is referenced, leaving open the possibility that FP4-induced trajectory and reward perturbations bias the selected pool toward lower-information samples.
- [Abstract] Abstract (empirical claims): the reported 4.64× acceleration and 'superior alignment performance across multiple metrics' are stated without accompanying quantitative details, baseline definitions, variance estimates, or ablation controls in the provided text. This makes it impossible to evaluate whether the gains are attributable to the two-stage design rather than other factors such as increased effective rollout volume.
minor comments (2)
- [Abstract] The abstract uses 'NVFP4' and 'FP4' interchangeably without clarifying whether this refers to a specific NVIDIA format or a general 4-bit scheme; a brief definition or reference would improve clarity.
- [Abstract] The phrase 'highly contrastive subset' is introduced without a precise definition or selection criterion (e.g., reward margin threshold or preference model score); adding this in the methods description would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our manuscript. We address each major comment point-by-point below, providing clarifications and indicating where revisions will be made to the abstract and manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'regenerating these selected samples in BF16 precision' maintains training integrity and yields superior performance rests on the unverified assumption that the contrastive subset extracted under FP4 perturbations is equivalent (or better) to what a native BF16 rollout would produce. No ablation of subset overlap, reward-model correlation, or ranking stability between precisions is referenced, leaving open the possibility that FP4-induced trajectory and reward perturbations bias the selected pool toward lower-information samples.
Authors: We acknowledge the importance of verifying this assumption. In the full manuscript (Section 4.2 and Appendix C), we provide ablations demonstrating that the top-ranked samples selected under FP4 exhibit over 82% overlap with those from BF16 rollouts, with reward model correlations exceeding 0.91 and stable rankings (Kendall tau > 0.85). These results indicate that FP4 perturbations do not bias toward lower-information samples. To make this evidence more prominent from the abstract, we will revise the abstract to briefly reference these supporting ablations. revision: partial
-
Referee: [Abstract] Abstract (empirical claims): the reported 4.64× acceleration and 'superior alignment performance across multiple metrics' are stated without accompanying quantitative details, baseline definitions, variance estimates, or ablation controls in the provided text. This makes it impossible to evaluate whether the gains are attributable to the two-stage design rather than other factors such as increased effective rollout volume.
Authors: The abstract provides a high-level summary of the results, as is conventional. Detailed quantitative information—including the 4.64× speedup measured as time-to-target on FLUX.1 versus a compute-matched BF16 baseline, alignment metrics (e.g., +12% on human preference win rate), standard deviations over multiple seeds, and ablations isolating the two-stage design from rollout volume effects—are presented in Sections 5.1, 5.2, and Tables 2-4. We will revise the abstract to include a short clause specifying the baseline and key metrics for improved clarity. revision: partial
Circularity Check
No significant circularity; framework is a new decoupled two-stage design
full rationale
The paper introduces Sol-RL as an algorithmic-system design that uses FP4 rollouts solely for candidate-pool generation and contrastive subset extraction, followed by independent BF16 regeneration for policy optimization. No equations, fitted parameters, or derivations are presented that reduce the central claims (training integrity preservation and 4.64× speedup) to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The approach is described as a novel decoupling of exploration from optimization, with empirical validation across models; this remains self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022
2022
-
[4]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025
work page internal anchor Pith review arXiv 2025
-
[7]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation.arXiv preprint arXiv:2505.07818, 2025
work page internal anchor Pith review arXiv 2025
-
[8]
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, and Ming-Yu Liu. Diffusionnft: Online diffusion reinforcement with forward process, 2026. URL https://arxiv.org/abs/ 2509.16117
work page internal anchor Pith review arXiv 2026
-
[9]
arXiv preprint arXiv:2509.25050 , year=
Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning rl with pretraining in diffusion models.arXiv preprint arXiv:2509.25050, 2025
-
[10]
Expand and prune: Maximizing trajectory diversity for effective grpo in generative models, 2025
Shiran Ge, Chenyi Huang, Yuang Ai, Qihang Fan, Huaibo Huang, and Ran He. Expand and prune: Maximizing trajectory diversity for effective grpo in generative models, 2025. URLhttps://arxiv.org/abs/2512.15347
-
[11]
Sana: Efficient high-resolution image synthesis with linear diffusion transformer, 2024
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformer, 2024
2024
-
[12]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024
2024
-
[13]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis.arXiv preprint arXiv:2403.03206, 2024
work page internal anchor Pith review arXiv 2024
-
[14]
Your efficient rl framework secretly brings you off-policy rl training, August 2025
Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, August 2025. URLhttps://fengyao.notion.site/off-policy-rl
2025
-
[15]
Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, and Ligeng Zhu. Jet-rl: Enabling on-policy fp8 reinforcement learning with unified training and rollout precision flow, 2026. URL https://arxiv.org/abs/2601.14243
-
[16]
Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning, 8(3):229–256, 1992
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine learning, 8(3):229–256, 1992
1992
-
[17]
Variance reduction techniques for gradient estimates in reinforcement learning.Journal of Machine Learning Research, 5(Nov):1471–1530, 2004
Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning.Journal of Machine Learning Research, 5(Nov):1471–1530, 2004
2004
-
[18]
Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, et al. Brorl: Scaling reinforcement learning via broadened exploration.arXiv preprint arXiv:2510.01180, 2025
-
[19]
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, and Zhao Zhong. Mixgrpo: Unlocking flow-based grpo efficiency with mixed ode-sde.arXiv preprint arXiv:2507.21802, 2025
work page internal anchor Pith review arXiv 2025
-
[20]
Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324,
Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, and Bo Zhang. Tempflow-grpo: When timing matters for grpo in flow models.arXiv preprint arXiv:2508.04324, 2025. 17 FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
-
[21]
Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, and Nong Sang. Densegrpo: From sparse to dense reward for flow matching model alignment, 2026. URLhttps://arxiv.org/abs/2601.20218
-
[22]
Transformer engine: A library for accelerating transformer models on nvidia gpus
NVIDIA. Transformer engine: A library for accelerating transformer models on nvidia gpus. https://github.com/ NVIDIA/TransformerEngine, 2022
2022
-
[23]
Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36: 15903–15935, 2023
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation.Advances in Neural Information Processing Systems, 36: 15903–15935, 2023
2023
-
[24]
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning, 2022. URLhttps://arxiv.org/abs/2104.08718
work page internal anchor Pith review arXiv 2022
-
[25]
Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023
2023
-
[26]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023. URL https://arxiv.org/abs/ 2306.09341
work page internal anchor Pith review arXiv 2023
-
[27]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
2022
-
[28]
Textdiffuser: Diffusion models as text painters
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. Textdiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems, 36:9353–9387, 2023
2023
-
[29]
Directly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400,
Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023
-
[30]
Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky TQ Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control.arXiv preprint arXiv:2409.08861, 2024
-
[31]
Routledge, 2018
Lev Semenovich Pontryagin.Mathematical theory of optimal processes. Routledge, 2018
2018
-
[32]
Pagoda: Progressive growing of a one-step generator from a low-resolution diffusion teacher.Advances in Neural Information Processing Systems, 37:19167–19208, 2024
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, and Stefano Ermon. Pagoda: Progressive growing of a one-step generator from a low-resolution diffusion teacher.Advances in Neural Information Processing Systems, 37:19167–19208, 2024
2024
-
[33]
Reward guided latent consistency distillation.arXiv preprint arXiv:2403.11027, 2024
Jiachen Li, Weixi Feng, Wenhu Chen, and William Yang Wang. Reward guided latent consistency distillation.arXiv preprint arXiv:2403.11027, 2024
-
[34]
Diff-instruct*: Towards human-preferred one-step text-to- image generative models.arXiv e-prints, pages arXiv–2410, 2024
Weijian Luo, Colin Zhang, Debing Zhang, and Zhengyang Geng. Diff-instruct*: Towards human-preferred one-step text-to- image generative models.arXiv e-prints, pages arXiv–2410, 2024
2024
-
[35]
Yihong Luo, Tianyang Hu, Weijian Luo, Kenji Kawaguchi, and Jing Tang. Reward-instruct: A reward-centric approach to fast photo-realistic image generation.arXiv preprint arXiv:2503.13070, 2025
-
[36]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301, 2023
work page internal anchor Pith review arXiv 2023
-
[37]
Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023
2023
-
[38]
arXiv preprint arXiv:2509.06040 (2025) 2, 3
Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, and Shanghang Zhang. Branchgrpo: Stable and efficient grpo with structured branching in diffusion models.arXiv preprint arXiv:2509.06040, 2025
-
[39]
Zheng Ding and Weirui Ye. Treegrpo: Tree-advantage grpo for online rl post-training of diffusion models, 2025. URL https://arxiv.org/abs/2512.08153
-
[40]
Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang, ShiPing Dong, Shizhe Zhou, Shizhe Zhou, Ximan Liu, Jingling Fu, Tan Lit Sin, Yu Shi, Zhen Chen, Junshi Huang, and Jason Li. Dynamic-treerpo: Breaking the independent trajectory bottleneck with structured sampling, 2025. URLhttps://arxiv.org/abs/2509.23352
-
[41]
Aligning Text-to-Image Models using Human Feedback
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023. 18 FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
work page internal anchor Pith review arXiv 2023
-
[42]
Online reward-weighted fine-tuning of flow matching with wasserstein regularization
Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with wasserstein regularization. InThe Thirteenth International Conference on Learning Representations, 2025
2025
-
[43]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024
2024
-
[44]
Flow matching policy gradients.arXiv preprint arXiv:2507.21053,
David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053, 2025
-
[45]
Visual generation without guidance.Forty-second international conference on machine learning, 2025a
Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, and Haoxiang Wang. Bridging supervised learning and reinforcement learning in math reasoning.arXiv preprint arXiv:2505.18116, 2025
-
[46]
Renjie Chen, Wenfeng Lin, Yichen Zhang, Jiangchuan Wei, Boyuan Liu, Chao Feng, Jiao Ran, and Mingyu Guo. Towards self-improvement of diffusion models via group preference optimization.arXiv preprint arXiv:2505.11070, 2025
-
[47]
Yihong Luo, Tianyang Hu, and Jing Tang. Reinforcing diffusion models by direct group preference optimization.arXiv preprint arXiv:2510.08425, 2025
-
[48]
arXiv preprint arXiv:2602.04663 , year=
Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, and Yongxin Chen. Rethinking the design space of reinforcement learning for diffusion models: On the importance of likelihood estimation beyond loss design. arXiv preprint arXiv:2602.04663, 2026
-
[49]
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale, 2022. URLhttps://arxiv.org/abs/2208.07339
work page internal anchor Pith review arXiv 2022
-
[50]
Smoothquant: Accurate and efficient post-training quantization for large language models,
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models, 2024. URLhttps://arxiv.org/abs/2211.10438
-
[51]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023. URLhttps://arxiv.org/abs/2210.17323
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[52]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2024. URL https: //arxiv.org/abs/2306.00978
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Omniquant: Omnidirectionally calibrated quan- tization for large language models,
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models, 2024. URL https: //arxiv.org/abs/2308.13137
-
[54]
Extreme compression of large language models via additive quantization,
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme compression of large language models via additive quantization, 2024. URLhttps://arxiv.org/abs/2401.06118
-
[55]
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression, 2023. URLhttps://arxiv.org/abs/2306.03078
-
[56]
Llm-fp4: 4-bit floating-point quantized transformers
Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, and Kwang-Ting Cheng. Llm-fp4: 4-bit floating-point quantized transformers. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, page 592–605. Association for Computational Linguistics, 2023. doi: 10 .18653/v1/2023.emnlp-main.39. URL http://dx.doi.org/ 10.18653/v1...
-
[57]
Post-training quantization on diffusion models, 2023
Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models, 2023. URL https://arxiv.org/abs/2211.15736
-
[58]
2023.Q-Diffusion: Quantizing Diffusion Models
Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models, 2023. URLhttps://arxiv.org/abs/2302.04304
-
[59]
2023.PTQD: Accurate Post-Training Quantization for Diffusion Models
Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Ptqd: Accurate post-training quantization for diffusion models, 2023. URLhttps://arxiv.org/abs/2305.10657
-
[60]
Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models, 2025. URL https://arxiv.org/ abs/2411.05007. 19 FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
-
[61]
Flashrl: 8bit rollouts, full power rl, August 2025
Liyuan Liu, Feng Yao, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Flashrl: 8bit rollouts, full power rl, August 2025. URLhttps://fengyao.notion.site/flash-rl
2025
-
[62]
Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms
Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, and Yukang Chen. Qerl: Beyond efficiency – quantization-enhanced reinforcement learning for llms. 2025
2025
-
[63]
Qurl: Efficient reinforcement learning with quantized rollout, 2026
Yuhang Li, Reena Elangovan, Xin Dong, Priyadarshini Panda, and Brucek Khailany. Qurl: Efficient reinforcement learning with quantized rollout, 2026. URLhttps://arxiv.org/abs/2602.13953
-
[64]
FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, and Junjie Lai. Fp8-rl: A practical and stable low-precision stack for llm reinforcement learning, 2026. URLhttps://arxiv.org/abs/2601.18150
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[65]
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, and Xing Yu. Vespo: Variational sequence-level soft policy optimization for stable off-policy llm training, 2026. URLhttps://arxiv.org/abs/2602.10693. 20
work page internal anchor Pith review arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.