FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
Pith reviewed 2026-05-16 11:04 UTC · model grok-4.3
The pith
FP8 rollout with token-level importance sampling corrections delivers up to 44 percent throughput gains in LLM reinforcement learning while matching BF16 learning behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An FP8 rollout stack that combines blockwise W8A8 quantization, per-step QKV scale recalibration for the KV cache, and token-level importance sampling corrections (TIS/MIS) achieves up to 44 percent higher rollout throughput across dense and MoE models while producing learning curves and final policies that are statistically comparable to BF16 baselines.
What carries the argument
The FP8 rollout stack: blockwise FP8 quantization for weights, per-step QKV recalibration for KV-cache, and token-level importance-sampling corrections (TIS/MIS) to offset train-inference mismatch.
If this is right
- Rollout generation time drops substantially for long output sequences without extra hardware.
- Memory capacity for KV cache increases, allowing longer contexts or larger batch sizes during RL.
- The same corrections can be reused when swapping between different inference engines.
- MoE models see the same relative gains as dense models, broadening applicability.
- Overall RL training wall-clock time decreases while policy quality stays the same.
Where Pith is reading between the lines
- The approach may extend to even lower precisions such as FP4 if the importance-sampling corrections are retuned.
- Similar mismatch-correction logic could stabilize low-precision inference in other online learning loops beyond RL.
- Because the stack is backend-agnostic, it could be dropped into existing RL pipelines with minimal code changes.
- The per-step recalibration overhead might become negligible at very large batch sizes, further improving net gains.
Load-bearing premise
Token-level importance sampling fully removes any bias or instability introduced by FP8 rollouts so that the trainer still learns the intended policy.
What would settle it
A side-by-side run on the same task, model size, and random seed where the FP8 setup produces a statistically different reward curve or final policy compared with the BF16 baseline.
Figures
read the original abstract
Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FP8-RL, a practical low-precision stack for LLM reinforcement learning rollouts. It enables FP8 W8A8 linear layers via blockwise quantization, extends FP8 to the KV cache with per-step QKV scale recalibration, and applies token-level importance sampling corrections (TIS/MIS variants) to mitigate train-inference mismatch caused by dynamic weight changes and quantization. The central empirical claim is up to 44% rollout throughput improvement across dense and MoE models while preserving learning behavior comparable to BF16 baselines, implemented in the veRL ecosystem with support for FSDP/Megatron training and vLLM/SGLang inference.
Significance. If the importance-sampling corrections are shown to fully neutralize distribution shift without introducing bias, the work would provide a directly usable efficiency lever for scaling RL post-training of large models, where rollout time dominates. The engineering focus on repeated quantization, KV-cache handling, and backend integration is a concrete contribution that could be adopted in production pipelines.
major comments (2)
- [Experimental results] Experimental results section: the claim of 'comparable learning behavior' to BF16 baselines rests on learning curves without reported error bars, statistical tests, or ablations isolating the contribution of TIS/MIS corrections versus quantization noise. This leaves open whether the policy-gradient estimator remains unbiased when FP8 logits deviate from BF16 probabilities.
- [Method] Mismatch-correction description: the token-level TIS/MIS formulation is presented at a high level; it is unclear whether the importance weights exactly recover the BF16 policy probabilities given blockwise scale factors and per-step KV recalibration, or whether residual quantization error in the logits propagates into the trainer.
minor comments (2)
- [Abstract] The abstract states 'up to 44%' throughput gain; the main text should tabulate the precise model sizes, sequence lengths, and hardware configurations that achieve this number.
- [Method] Notation for TIS versus MIS should be defined explicitly with equations rather than left as acronyms.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will incorporate revisions to improve the clarity and rigor of the manuscript.
read point-by-point responses
-
Referee: [Experimental results] Experimental results section: the claim of 'comparable learning behavior' to BF16 baselines rests on learning curves without reported error bars, statistical tests, or ablations isolating the contribution of TIS/MIS corrections versus quantization noise. This leaves open whether the policy-gradient estimator remains unbiased when FP8 logits deviate from BF16 probabilities.
Authors: We agree that additional statistical rigor would strengthen the empirical claims. In the revised manuscript we will add error bars computed across multiple random seeds for all learning curves and include statistical significance tests (e.g., paired t-tests on final performance metrics). We will also insert new ablation experiments that isolate the contribution of the TIS/MIS corrections versus quantization noise alone, demonstrating that the corrections are necessary to keep learning trajectories aligned with the BF16 baseline. On the unbiasedness question, the token-level importance weights are constructed precisely to reweight the FP8-sampled trajectories back to the distribution induced by the BF16 policy; under standard importance-sampling assumptions the policy-gradient estimator remains unbiased in expectation, with quantization error primarily increasing variance rather than introducing systematic bias. We will make this derivation explicit in the revision. revision: yes
-
Referee: [Method] Mismatch-correction description: the token-level TIS/MIS formulation is presented at a high level; it is unclear whether the importance weights exactly recover the BF16 policy probabilities given blockwise scale factors and per-step KV recalibration, or whether residual quantization error in the logits propagates into the trainer.
Authors: We will expand the method section with the full mathematical formulation. The importance weights are computed as the ratio of the BF16 policy probability (obtained from the original high-precision logits) to the FP8 policy probability (obtained by dequantizing the FP8 logits using the blockwise scale factors). The per-step QKV recalibration updates the KV-cache scales at every generation step to minimize accumulated quantization error in the attention computation. Because the final logits still contain residual quantization noise, the recovered probabilities are approximate rather than exact; however, the importance-sampling correction ensures that the expectation of the gradient estimator matches the BF16 case. We will add a short discussion of how any remaining logit error affects variance (but not bias) and is absorbed by the trainer. These clarifications will be included in the revised manuscript. revision: yes
Circularity Check
No circularity: empirical claims rest on direct measurements
full rationale
The paper describes an engineering stack for FP8 rollouts in LLM RL, including blockwise quantization, KV-cache recalibration, and token-level importance sampling corrections. All central claims—up to 44% throughput gains and comparable learning behavior to BF16—are presented as outcomes of direct experimental comparisons rather than any derivation, prediction, or first-principles result. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the importance-sampling mitigation is introduced as an empirical fix and validated by measurement, leaving the argument chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- blockwise FP8 quantization scales
axioms (1)
- domain assumption FP8 linear layers and KV-cache with per-step recalibration preserve sufficient numerical fidelity for RL rollouts when paired with importance sampling
Forward citations
Cited by 2 Pith papers
-
AIS: Adaptive Importance Sampling for Quantized RL
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
-
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
Reference graph
Works this paper leans on
-
[1]
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
R. Qin et al., “Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning.” [On- line]. Available: https://arxiv.org/abs/2511.14617
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
NVIDIA, “TensorRT LLM.” [Online]. Available: https://github.com/NVIDIA/TensorRT-LLM
-
[3]
Efficient Memory Management for Large Language Model Serving with PagedAtten - tion,
W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAtten - tion,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
work page 2023
-
[4]
Available: https://github.com/sgl-project/sglang
“SGLang.” [Online]. Available: https://github.com/sgl-project/sglang
-
[5]
When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch
J. Liu, Y. Li, Y. Fu, J. Wang, Q. Liu, and Z. Jiang, “When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch.” [Online]. Available: https://richardli.xyz/rl-collapse
-
[6]
FlashRL: 8Bit Rollouts, Full Power RL
L. Liu, F. Yao, D. Zhang, C. Dong, J. Shang, and J. Gao, “FlashRL: 8Bit Rollouts, Full Power RL.” [Online]. Available: https://fengyao.notion.site/flash-rl
-
[7]
Your Efficient RL Framework Secretly Brings You Off-Policy RL Training
F. Yao, L. Liu, D. Zhang, C. Dong, J. Shang, and J. Gao, “Your Efficient RL Framework Secretly Brings You Off-Policy RL Training.” [Online]. Available: https://fengyao.notion.site/off-policy-rl
-
[8]
DeepSeek-AI, “DeepSeek-V3 Technical Report,” CoRR, 2024
work page 2024
-
[9]
HybridFlow: A Flexible and Efficient RLHF Framework
G. Sheng et al. , “HybridFlow: A Flexible and Efficient RLHF Framework,” arXiv preprint arXiv: 2409.19256, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale,
Q. Yu et al., “DAPO: An Open-Source LLM Reinforcement Learning System at Scale,” CoRR, 2025
work page 2025
-
[11]
NeMo RL: A Scalable and Efficient Post-Training Library
“NeMo RL: A Scalable and Efficient Post-Training Library.” 2025
work page 2025
-
[12]
FP8 Formats for Deep Learning,
P. Micikevicius et al., “FP8 Formats for Deep Learning,” CoRR, 2022
work page 2022
-
[13]
FP8-LM: Training FP8 Large Language Models,
H. Peng et al., “FP8-LM: Training FP8 Large Language Models,” CoRR, 2023
work page 2023
-
[14]
Z. Yao et al., “DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.” [Online]. Available: https://arxiv.org/abs/2308.01320
-
[15]
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
J. Hu et al., “OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework,” arXiv preprint arXiv:2405.11143, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
W. Wang et al., “Reinforcement Learning Optimization for Large-Scale Learning: An Efficient and User- Friendly Scaling Library.” [Online]. Available: https://arxiv.org/abs/2506.06122
-
[17]
slime: An LLM post-training framework for RL Scaling
Z. Zhu, C. Xie, X. Lv, and slime Contributors, “slime: An LLM post-training framework for RL Scaling.” 2025
work page 2025
-
[18]
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
W. Fu et al. , “AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning.” [Online]. Available: https://arxiv.org/abs/2505.24298
work page internal anchor Pith review Pith/arXiv arXiv
-
[19]
Small Leak Can Sink a Great Ship–Boost RL Training on MoE with IcePop!
X. Zhao et al., “Small Leak Can Sink a Great Ship–Boost RL Training on MoE with IcePop!.” [Online]. Available: https://ringtech.notion.site/icepop
-
[20]
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
W. Ma et al. , “Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers.” [Online]. Available: https://arxiv.org/abs/2510.11370
-
[21]
vLLM and T. Teams, “No More Train-Inference Mismatch: Bitwise Consistent On-Policy Reinforcement Learning with vLLM and TorchTitan.” [Online]. Available: https://blog.vllm.ai/2025/11/10/bitwise- consistent-train-inference.html
work page 2025
-
[22]
Defeating the training-inference mismatch via fp16.arXiv preprint arXiv:2510.26788, 2025
P. Qi et al., “Defeating the Training-Inference Mismatch via FP16.” [Online]. Available: https://arxiv.org/ abs/2510.26788 A Appendix: FP8 W8A8 Linear Rollout Configuration This appendix provides detailed configuration instructions and usage examples for enabling FP8 W8A8 linear rollout in the veRL framework. A.1 Basic Configuration To enable FP8 quantiza...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.