QaRL aligns quantized rollouts with training in LLM RL and uses TBPO with dual clipping to stabilize optimization, delivering +5.5 improvement over standard quantized-rollout baselines on Qwen3-30B math problems while retaining speed benefits.
InFindings of the As- sociation for Computational Linguistics: ACL 2024, pages 467–484
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
QaRL: Rollout-Aligned Quantization-Aware RL for Fast and Stable Training under Training--Inference Mismatch
QaRL aligns quantized rollouts with training in LLM RL and uses TBPO with dual clipping to stabilize optimization, delivering +5.5 improvement over standard quantized-rollout baselines on Qwen3-30B math problems while retaining speed benefits.