
arxiv: 2601.18150 · v2 · submitted 2026-01-26 · 💻 cs.LG · cs.CL

FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

Pith reviewed 2026-05-16 11:04 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords FP8, LLM, reinforcement learning, rollout, quantization, importance sampling, throughput, KV cache

The pith

FP8 rollout with token-level importance sampling corrections delivers up to 44 percent throughput gains in LLM reinforcement learning while matching BF16 learning behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a complete low-precision stack for the generation phase of LLM reinforcement learning to cut memory traffic and compute during long sequence rollouts. It applies blockwise FP8 quantization to linear layers, extends FP8 to the KV cache with per-step QKV scale recalibration, and uses token-level importance sampling corrections to keep the rollout distribution close enough to the trainer's expectations. The stack integrates with common training backends and inference engines, and experiments on both dense and MoE models show up to 44 percent higher rollout throughput with learning behavior comparable to BF16 baselines. A sympathetic reader would care because rollout time currently dominates RL training cost; removing that bottleneck at scale would make policy optimization cheaper and faster.

Core claim

An FP8 rollout stack that combines blockwise W8A8 quantization, per-step QKV scale recalibration for the KV cache, and token-level importance sampling corrections (TIS/MIS) achieves up to 44 percent higher rollout throughput across dense and MoE models while producing learning behavior comparable to BF16 baselines.

What carries the argument

The FP8 rollout stack: blockwise FP8 quantization of linear-layer weights and activations (W8A8), per-step QKV scale recalibration for the KV cache, and token-level importance-sampling corrections (TIS/MIS) to offset train-inference mismatch.
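The blockwise quantization step can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the E4M3 FP8 format (largest finite value 448, 3 mantissa bits), a one-dimensional block layout, and a hypothetical `block_size` of 4. The function names are illustrative and do not correspond to any veRL, vLLM, or SGLang API.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3 (format choice is an assumption)


def round_to_e4m3(x: float) -> float:
    """Simulate rounding x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # floor(log2(mag)), clamped to the smallest normal exponent (-6)
    exp = max(math.frexp(mag)[1] - 1, -6)
    quantum = 2.0 ** (exp - 3)  # spacing between values with 3 mantissa bits
    return sign * round(mag / quantum) * quantum


def quantize_blockwise(values, block_size=4):
    """Quantize a flat list in blocks; each block gets its own scale."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0.0 else 1.0
        scales.append(scale)
        codes.extend(round_to_e4m3(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=4):
    """Map simulated FP8 codes back to real values using the per-block scales."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Hypothetical weights: one block of small values, one block of large values.
weights = [0.01, -0.02, 0.03, 0.04, 100.0, -50.0, 25.0, 12.5]
codes, scales = quantize_blockwise(weights)
print(dequantize_blockwise(codes, scales))
```

Giving each block its own scale keeps a block of small values from sharing a scale that is dominated by a large outlier elsewhere in the tensor.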

If this is right

  • Rollout generation time drops substantially for long output sequences without extra hardware.
  • Memory capacity for KV cache increases, allowing longer contexts or larger batch sizes during RL.
  • The same corrections can be reused when swapping between different inference engines.
  • MoE models see the same relative gains as dense models, broadening applicability.
  • Overall RL training wall-clock time decreases while policy quality stays the same.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to even lower precisions such as FP4 if the importance-sampling corrections are retuned.
  • Similar mismatch-correction logic could stabilize low-precision inference in other online learning loops beyond RL.
  • Because the stack is backend-agnostic, it could be dropped into existing RL pipelines with minimal code changes.
  • The per-step recalibration overhead might become negligible at very large batch sizes, further improving net gains.

Load-bearing premise

Token-level importance sampling fully removes any bias or instability introduced by FP8 rollouts so that the trainer still learns the intended policy.
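The premise can be stated as a standard importance-sampling identity. The block below is a sketch in illustrative notation that does not come from the paper: $\pi_{\mathrm{FP8}}$ is the rollout policy, $\pi_{\mathrm{BF16}}$ is the trainer's policy, $A$ is an advantage estimate, and the identity assumes $\pi_{\mathrm{FP8}}(a \mid s) > 0$ wherever $\pi_{\mathrm{BF16}}(a \mid s) > 0$.

```latex
\mathbb{E}_{a \sim \pi_{\mathrm{FP8}}(\cdot \mid s)}
\left[
  \frac{\pi_{\mathrm{BF16}}(a \mid s)}{\pi_{\mathrm{FP8}}(a \mid s)}
  \, \nabla_\theta \log \pi_{\mathrm{BF16}}(a \mid s) \, A(s, a)
\right]
=
\mathbb{E}_{a \sim \pi_{\mathrm{BF16}}(\cdot \mid s)}
\left[
  \nabla_\theta \log \pi_{\mathrm{BF16}}(a \mid s) \, A(s, a)
\right]
```

The identity holds for the untruncated ratio. Clipping the ratio at a threshold, as truncated variants do, generally trades a small bias for lower variance, so whether the correction "fully removes" bias depends on how often the clip is active.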

What would settle it

A side-by-side run on the same task, model size, and random seed where the FP8 setup produces a statistically different reward curve or final policy compared with the BF16 baseline.

Figures

Figures reproduced from arXiv: 2601.18150 by Jingqi Zhang, Jingyi Yang, Junjie Lai, Shuai Zhang, Shuang Yu, Xue Huang, Zhaopeng Qiu.

Figure 1: The RL workflow in veRL with FP8 W8A8 Linear quantization.
Figure 2: Training curves for Qwen3-8B-Base. Orange: BF16 baseline without TIS. Blue: FP8 W8A8
Figure 3: Rollout performance on Qwen3-8B-Base. Time-per-token (ms/token) during generation
Figure 4: Training curves for Qwen3-30B-A3B-Base MoE. Orange: BF16 with token-level TIS.
Figure 5: Rollout performance on Qwen3-30B-A3B-Base MoE. Time-per-token (ms/token) during
Figure 6: Two implementation strategies for FP8 KV Cache.
Figure 7: Training curves for Qwen3-8B-Base with KV-Cache FP8. Blue: BF16 baseline. Yellow:
Figure 8: Rollout speedup on Qwen3-8B-Base. Blue: BF16 baseline. Yellow: Linear W8A8 + TIS.
Figure 9: Training curves for end-to-end FP8 RL. Blue: BF16 training + BF16 rollout. Orange: BF16
Figure 10: Training curves for Qwen3-8B-Base with Trainer-Side KV-cache calibration (NeMo-RL).
Figure 11: Rollout performance on Qwen3-8B-Base with Trainer-Side calibration (NeMo-RL). Blue:
Original abstract

Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
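The abstract does not spell out how the QKV scales are recalibrated. The sketch below shows one plausible reading, under loudly stated assumptions: per-tensor scales are reset from the absolute maximum of freshly sampled Q, K, and V values after each policy update, and the target format is FP8 E4M3 (largest finite value 448). The name `recalibrate_qkv_scales` and the sample activations are hypothetical, not taken from the paper or from any inference engine.

```python
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (format choice is an assumption)


def amax_scale(values):
    """Scale that maps the largest magnitude in `values` onto E4M3_MAX."""
    amax = max((abs(v) for v in values), default=0.0)
    return amax / E4M3_MAX if amax > 0.0 else 1.0


def recalibrate_qkv_scales(q, k, v):
    """Recompute per-tensor Q/K/V scales from freshly sampled activations."""
    return {"q": amax_scale(q), "k": amax_scale(k), "v": amax_scale(v)}


# Hypothetical activations from one layer before and after a policy update.
step_0 = recalibrate_qkv_scales(q=[0.5, -2.0], k=[4.48, -1.0], v=[0.1, 0.2])
step_1 = recalibrate_qkv_scales(q=[0.5, -3.0], k=[8.96, -1.0], v=[0.1, 0.2])

# The K range doubled after the update, so reusing the step-0 scale
# would push the largest K values past the FP8 range.
print(step_0["k"], step_1["k"])
```

Because the policy weights change every step, scales computed once before training would drift out of date, which is the motivation the abstract gives for recalibrating per step.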

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FP8-RL, a practical low-precision stack for LLM reinforcement learning rollouts. It enables FP8 W8A8 linear layers via blockwise quantization, extends FP8 to the KV cache with per-step QKV scale recalibration, and applies token-level importance sampling corrections (TIS/MIS variants) to mitigate train-inference mismatch caused by dynamic weight changes and quantization. The central empirical claim is up to 44% rollout throughput improvement across dense and MoE models while preserving learning behavior comparable to BF16 baselines, implemented in the veRL ecosystem with support for FSDP/Megatron training and vLLM/SGLang inference.

Significance. If the importance-sampling corrections are shown to fully neutralize distribution shift without introducing bias, the work would provide a directly usable efficiency lever for scaling RL post-training of large models, where rollout time dominates. The engineering focus on repeated quantization, KV-cache handling, and backend integration is a concrete contribution that could be adopted in production pipelines.

major comments (2)
  1. [Experimental results] Experimental results section: the claim of 'comparable learning behavior' to BF16 baselines rests on learning curves without reported error bars, statistical tests, or ablations isolating the contribution of TIS/MIS corrections versus quantization noise. This leaves open whether the policy-gradient estimator remains unbiased when FP8 logits deviate from BF16 probabilities.
  2. [Method] Mismatch-correction description: the token-level TIS/MIS formulation is presented at a high level; it is unclear whether the importance weights exactly recover the BF16 policy probabilities given blockwise scale factors and per-step KV recalibration, or whether residual quantization error in the logits propagates into the trainer.
minor comments (2)
  1. [Abstract] The abstract states 'up to 44%' throughput gain; the main text should tabulate the precise model sizes, sequence lengths, and hardware configurations that achieve this number.
  2. [Method] Notation for TIS versus MIS should be defined explicitly with equations rather than left as acronyms.
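As an illustration of the explicit notation that minor comment 2 asks for, the definitions could take the following form. This assumes TIS stands for truncated importance sampling and MIS for masked importance sampling, as the terms are used in the off-policy RL work the paper builds on; the paper's own definitions may differ. The symbols $\rho_t$, $\pi_{\mathrm{BF16}}$, $\pi_{\mathrm{FP8}}$, and the threshold $C$ are illustrative names.

```latex
\rho_t = \frac{\pi_{\mathrm{BF16}}(a_t \mid s_t)}{\pi_{\mathrm{FP8}}(a_t \mid s_t)},
\qquad
w_t^{\mathrm{TIS}} = \min(\rho_t,\, C),
\qquad
w_t^{\mathrm{MIS}} = \rho_t \cdot \mathbb{1}\left[\rho_t \le C\right]
```

Under these definitions TIS caps a large ratio at $C$, while MIS drops the token's contribution entirely when the ratio exceeds $C$.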

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment below and will incorporate revisions to improve the clarity and rigor of the manuscript.

  1. Referee: [Experimental results] Experimental results section: the claim of 'comparable learning behavior' to BF16 baselines rests on learning curves without reported error bars, statistical tests, or ablations isolating the contribution of TIS/MIS corrections versus quantization noise. This leaves open whether the policy-gradient estimator remains unbiased when FP8 logits deviate from BF16 probabilities.

    Authors: We agree that additional statistical rigor would strengthen the empirical claims. In the revised manuscript we will add error bars computed across multiple random seeds for all learning curves and include statistical significance tests (e.g., paired t-tests on final performance metrics). We will also add ablation experiments that separate the contribution of the TIS/MIS corrections from that of quantization noise alone, demonstrating that the corrections are necessary to keep learning trajectories aligned with the BF16 baseline. On the unbiasedness question, the token-level importance weights are constructed to reweight the FP8-sampled trajectories back to the distribution induced by the BF16 policy; under standard importance-sampling assumptions the policy-gradient estimator remains unbiased, with quantization error primarily increasing variance rather than introducing systematic bias. We will make this derivation explicit in the revision.
    revision: yes

  2. Referee: [Method] Mismatch-correction description: the token-level TIS/MIS formulation is presented at a high level; it is unclear whether the importance weights exactly recover the BF16 policy probabilities given blockwise scale factors and per-step KV recalibration, or whether residual quantization error in the logits propagates into the trainer.

    Authors: We will expand the method section with the full mathematical formulation. The importance weights are computed as the ratio of the BF16 policy probability (obtained from the original high-precision logits) to the FP8 policy probability (obtained by dequantizing the FP8 logits using the blockwise scale factors). The per-step QKV recalibration updates the KV-cache scales at every generation step to minimize accumulated quantization error in the attention computation. Because the final logits still contain residual quantization noise, the recovered probabilities are approximate rather than exact; however, the importance-sampling correction ensures that the expectation of the gradient estimator matches the BF16 case. We will add a short discussion of how any remaining logit error affects variance (but not bias) and how the trainer absorbs it. These clarifications will be included in the revised manuscript.
    revision: yes
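The ratio described in this response can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes per-token log-probabilities are available from both the trainer (BF16) and the rollout engine (FP8), that TIS clips the ratio at a threshold, and that MIS zeroes tokens whose ratio exceeds it. The function name `token_is_weights`, the `clip` value of 2.0, and the probabilities are all hypothetical.

```python
import math


def token_is_weights(train_logprobs, rollout_logprobs, clip=2.0, mode="tis"):
    """Per-token importance weights pi_train / pi_rollout with a cap.

    mode="tis": truncate the ratio at `clip` (assumed meaning of TIS).
    mode="mis": zero out tokens whose ratio exceeds `clip` (assumed meaning of MIS).
    """
    if mode not in ("tis", "mis"):
        raise ValueError(f"unknown mode: {mode}")
    weights = []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)
        if mode == "tis":
            weights.append(min(ratio, clip))
        else:
            weights.append(ratio if ratio <= clip else 0.0)
    return weights


# Hypothetical probabilities of the sampled tokens under each policy.
bf16_lp = [math.log(0.5), math.log(0.9), math.log(0.6)]
fp8_lp = [math.log(0.5), math.log(0.3), math.log(0.75)]

tis = token_is_weights(bf16_lp, fp8_lp, mode="tis")
mis = token_is_weights(bf16_lp, fp8_lp, mode="mis")
print(tis, mis)
```

In a policy-gradient loss these weights would multiply each token's term, so tokens on which the FP8 and BF16 policies agree keep a weight near 1 while large disagreements are capped or dropped.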

Circularity Check

0 steps flagged

No circularity: empirical claims rest on direct measurements


The paper describes an engineering stack for FP8 rollouts in LLM RL, including blockwise quantization, KV-cache recalibration, and token-level importance sampling corrections. All central claims—up to 44% throughput gains and comparable learning behavior to BF16—are presented as outcomes of direct experimental comparisons rather than any derivation, prediction, or first-principles result. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the importance-sampling mitigation is introduced as an empirical fix and validated by measurement, leaving the argument chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that FP8 quantization plus importance sampling can keep rollout distributions close enough to BF16 for stable RL, plus standard engineering assumptions about blockwise scaling and per-step recalibration.

free parameters (1)
  • blockwise FP8 quantization scales
    Per-block scales for weights and activations are calibrated during rollout; their exact fitting procedure is not detailed in the abstract.
axioms (1)
  • domain assumption: FP8 linear layers and KV cache with per-step recalibration preserve sufficient numerical fidelity for RL rollouts when paired with importance sampling
    Invoked to justify the stability claim across dense and MoE models.

pith-pipeline@v0.9.0 · 5557 in / 1207 out tokens · 25911 ms · 2026-05-16T11:04:00.331898+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  2. FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

    cs.LG 2026-04 unverdicted novelty 6.0

    Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
