pith. machine review for the scientific record. sign in

arxiv: 2605.12380 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Alexander J. Smola, Murdock Aubry, Nicholas Stranges, Rasool Fakoor

Pith reviewed 2026-05-13 05:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learningpolicy optimizationadaptive clippingeffective sample sizetrust regionoff-policy learningRL post-trainingbatch adaptation
0
0 comments X

The pith

The normalized effective sample size of each batch's policy ratios adaptively replaces fixed clipping and removes multiple hyper-parameters in RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning post-training is fragile because updating the policy changes the data distribution, entangling two concerns that existing methods address with fixed hyper-parameters chosen before training begins: a trust-region concern about moving the policy too far and an off-policy concern about using unreliable older data. These fixed parameters make algorithms sensitive to configuration and require retuning for new tasks, model scales, or mismatches. The paper proposes a batch-adaptive objective that computes the normalized effective sample size of the policy-ratio distribution in the current batch and uses it in place of fixed clipping. This single statistic automatically caps the score-function weight and sets the strength of an off-policy regularizer, keeping updates close to standard on-policy behavior when ratios are uniform while tightening when data is stale or mismatched. Experiments across varied settings show the method matches or exceeds tuned baselines while introducing no new objective hyper-parameters and eliminating several existing ones.

Core claim

The normalized effective sample size of the policy-ratio distribution in each batch serves as a reliable data-driven proxy that replaces fixed clipping, caps the score-function weight, and controls the strength of an off-policy regularizer, allowing the update to stay close to the usual on-policy score-function update when ratios are nearly uniform and to tighten automatically when stale or mismatched data cause ratio concentration while retaining a nonzero learning signal on high-ratio tokens.

What carries the argument

the normalized effective sample size of the policy-ratio distribution in each batch, which replaces fixed clipping, caps the score-function weight, and sets the strength of an off-policy regularizer

If this is right

  • The update stays close to the usual on-policy score-function update when ratios are nearly uniform.
  • It tightens automatically when stale or mismatched data cause ratio concentration.
  • It retains a nonzero learning signal on high-ratio tokens.
  • The method matches or exceeds tuned baselines across a wide range of settings.
  • No new objective hyper-parameters are introduced and several existing ones are removed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce the tuning burden when scaling RL post-training to new tasks or larger models.
  • Batch-level statistics such as effective sample size could serve as a general mechanism for adapting other objectives that mix on- and off-policy signals.
  • Similar adaptive use of ratio distributions might improve stability in sequential decision algorithms outside standard RL post-training.

Load-bearing premise

The normalized effective sample size of the policy-ratio distribution in each batch reliably captures both the trust-region violation risk and the off-policy data reliability without introducing new instabilities or requiring hidden tuning.

What would settle it

A controlled experiment in a high-mismatch setting where the ESS-based adaptive method performs measurably worse than carefully tuned baselines that use fixed clipping and regularizers.

Figures

Figures reproduced from arXiv: 2605.12380 by Alexander J. Smola, Murdock Aubry, Nicholas Stranges, Rasool Fakoor.

Figure 1
Figure 1. Figure 1: Sensitivity of GRPO’s reward to the clip range ϵ ∈ {0.2, 0.4, 0.6} (shaded region: ±1 std over clip values) versus P3O run once with no clip hyperparameter. P3O’s ESS-driven cap (Eq. (11)) removes this tuning burden while matching or exceeding the best GRPO variant across both model families. same base objective (Sec. 2) but relies on a fixed clip range; both methods otherwise use identical hyper-parameter… view at source ↗
Figure 2
Figure 2. Figure 2: P3O is robust to off-policy data introduced through the varied sampling temperature of rollouts. Sampling rollouts at a temperature other than 1.0 introduces a distribution shift in the token-level log probabilities, creating off-policy data for Qwen3-4B-Thinking-2507. Corresponding Qwen2.5-1.5B results are deferred to [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: P3O is robust to off-policy data introduced through the BF16 Train + FP8 Rollout training scheme. As accuracy collapse is observed in longer rollout lengths [33], a rollout length of 16,384 tokens was used in this experiment. The demonstrated robustness of P3O to off-policy data allows for the use of faster rollout generation strategies. In contrast, GRPO’s performance degrades significantly under the same… view at source ↗
Figure 4
Figure 4. Figure 4: Pass@k averaged over all five held-out benchmarks (AIME24/25/26, AMO-Bench, AMC). Left: clip-ratio variants at 4K-token evaluation; GRPO is averaged over ϵ ∈ {0.2, 0.4, 0.6}. Ours matches or exceeds the averaged GRPO sweep without requiring a clip-ratio choice. Right: BF16-train + FP8-rollout variants at 16K-token evaluation. GRPO collapses by iter 30 (near-zero pass@k) while Ours retains strong performanc… view at source ↗
Figure 5
Figure 5. Figure 5: Asynchronous-training comparison between P3O and GRPO under one optimizer step per rollout epoch. Rollouts are generated by a stale policy while the learner continues updating, creating the off-policy lag discussed in the main text. In this one-step pipeline setting, P3O maintains a higher and more stable reward trajectory than GRPO across training. The corresponding two￾step pipeline produces the same qua… view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of GRPO and P3O with respect to the mixing of off-policy data. A rollout length of 4,096 tokens was used in this experiment. This experiment uses Qwen3-8B [30] to generate rollouts for training Qwen3-4B-Thinking-2507, as it is from a different model family. Data was mixed at a 50% ratio, meaning half of the rollouts were generated by the training model and half were generated by the separate pol… view at source ↗
Figure 7
Figure 7. Figure 7: Temperature-robustness results for Qwen2.5-1.5B, corresponding to [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Training curves for the two-anchor extension of P3O, P3O, and GRPO on Qwen3-4B￾Thinking. All runs use the DeepSeek dataset with temperature 1.2 and a rollout length of 4,096 tokens. The two-anchor extension peaks at a reward comparable to P3O and GRPO but undergoes a sharp collapse after step 16, dropping to near-zero reward by step 24 before the run terminates. P3O maintains stable, high reward throughout… view at source ↗
Figure 9
Figure 9. Figure 9: Training curves for the two-anchor extension of P3O, P3O, and GRPO (ϵ=0.4) on Qwen3-4B-Thinking at default temperature. All runs use the DeepSeek dataset with a rollout length of 4,096 tokens. Unlike the temperature-1.2 regime ( [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
read the original abstract

Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a batch-adaptive objective for RL post-training of large models that replaces fixed clipping and several hyperparameters with the normalized effective sample size (ESS) of the per-batch policy-ratio distribution; this single statistic is used both to cap score-function weights (enforcing a trust region) and to modulate an off-policy regularizer, so that the update remains close to standard on-policy score-function gradient when ratios are uniform and automatically tightens under ratio concentration from stale or mismatched data. Experiments across diverse settings are reported to match or exceed tuned baselines while introducing no new objective hyperparameters.

Significance. If the central claim holds, the approach would meaningfully reduce hyperparameter sensitivity in RL post-training for large models, where numerical and distributional mismatches between training and rollout systems are common. The public code release is a clear strength that supports reproducibility and further testing of the batch-adaptive mechanism.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method): the central claim that normalized ESS simultaneously replaces clipping and sets the off-policy regularizer without new hyperparameters rests on the assumption that ESS = (∑ r_i)^2 / ∑ r_i^2 (normalized by batch size) reliably proxies both trust-region violation risk and data reliability. No derivation or ablation is provided showing how this scalar is exactly inserted into the objective or normalized, leaving open the possibility that the construction reduces to a reparameterized form of existing clipping or regularization terms.
  2. [§4] §4 (experiments): the reported matches to tuned baselines lack error bars, details on how ESS is computed across batches, and ablations isolating the effect of the ESS-based capping versus the regularizer modulation. Without these, it is impossible to verify that the method avoids the instabilities the skeptic notes for heavy-tailed ratio regimes typical in LLM post-training.
minor comments (1)
  1. [Abstract] The abstract states that the code is available at the given GitHub link; please confirm the repository contains the exact scripts used for the reported experiments and any additional implementation details on ESS normalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications from the existing text and indicating where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method): the central claim that normalized ESS simultaneously replaces clipping and sets the off-policy regularizer without new hyperparameters rests on the assumption that ESS = (∑ r_i)^2 / ∑ r_i^2 (normalized by batch size) reliably proxies both trust-region violation risk and data reliability. No derivation or ablation is provided showing how this scalar is exactly inserted into the objective or normalized, leaving open the possibility that the construction reduces to a reparameterized form of existing clipping or regularization terms.

    Authors: Section 3 derives the normalized ESS explicitly as ESS = (1/N) * (∑ r_i)^2 / ∑ r_i^2, where r_i denotes the per-token policy ratio in the current batch and N is the batch size. This scalar is inserted into the objective by (i) capping the score-function importance weights at 1/ESS to enforce a per-batch trust region and (ii) scaling the coefficient of the off-policy regularizer by (1 - ESS). When the ratio distribution is uniform, ESS approaches 1 and the update recovers the standard on-policy score-function gradient; when ratios concentrate due to staleness or mismatch, ESS drops and both the cap and regularizer tighten automatically. The construction is not a reparameterization of fixed clipping because the threshold is computed from the empirical second moment of the batch ratios rather than a preset hyperparameter. We will add an explicit ablation that isolates the capping term from the regularizer modulation to further demonstrate their distinct contributions. revision: partial

  2. Referee: [§4] §4 (experiments): the reported matches to tuned baselines lack error bars, details on how ESS is computed across batches, and ablations isolating the effect of the ESS-based capping versus the regularizer modulation. Without these, it is impossible to verify that the method avoids the instabilities the skeptic notes for heavy-tailed ratio regimes typical in LLM post-training.

    Authors: We agree that error bars, explicit computation details, and component ablations are required for full verification. The revised manuscript will report mean performance with standard error bars over multiple random seeds for every experiment. We will add a dedicated paragraph in §4 describing the per-batch ESS computation (ratios are evaluated on the sampled tokens, then normalized ESS is obtained exactly as defined in §3). We will also include ablations that disable the ESS-based weight cap and the regularizer modulation independently. For heavy-tailed regimes, the adaptive tightening via low ESS is intended to mitigate instability; we will supplement the experiments with plots of per-batch ratio histograms and ESS values to illustrate this behavior under the conditions noted. revision: yes

Circularity Check

0 steps flagged

No circularity: objective defined directly from batch statistics

full rationale

The paper proposes an explicit batch-adaptive objective that computes normalized effective sample size (ESS) from the current batch's policy ratios and uses that scalar to cap score-function weights and modulate an off-policy regularizer. This is a direct construction from observable batch data rather than a fitted parameter or self-referential definition. No step claims a 'prediction' that reduces to the input by construction, no uniqueness theorem is imported via self-citation, and the provided text contains no load-bearing self-citations. The derivation chain remains self-contained: the method is presented as a heuristic that automatically tightens when ratios concentrate, with empirical validation against baselines, without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract, the method relies on standard RL assumptions such as the policy gradient theorem and the interpretation of effective sample size as a reliability measure; no new free parameters or invented entities are introduced.

axioms (2)
  • standard math Policy gradient theorem applies to the score-function updates used here
    Invoked implicitly when describing the on-policy score-function update
  • domain assumption Normalized effective sample size of policy ratios accurately reflects both trust-region and off-policy reliability
    Core premise that allows the single statistic to replace fixed hyperparameters

pith-pipeline@v0.9.0 · 5600 in / 1358 out tokens · 96034 ms · 2026-05-13T05:28:53.772247+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

  1. [1]

    Amo-bench: Large language models still struggle in high school math competitions, 2025

    Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, and Shuang Zhou. Amo-bench: Large language models still struggle in high school math competitions, 2025. 9

  2. [2]

    What matters for on-policy deep actor-critic methods? A large-scale study

    Marcin Andrychowicz, Anton Raichuk, Piotr Sta´nczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters for on-policy deep actor-critic methods? A large-scale study. InInternational Conference on Learning Representations (ICLR), 2021. 1

  3. [3]

    Bridging the training-inference gap in LLMs by leveraging self-generated tokens.Transactions on Machine Learning Research, 2025

    Zhepeng Cen, Yao Liu, Siliang Zeng, Pratik Chaudhari, Huzefa Rangwala, George Karypis, and Rasool Fakoor. Bridging the training-inference gap in LLMs by leveraging self-generated tokens.Transactions on Machine Learning Research, 2025. 4

  4. [4]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645:633–638, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645:633–638, 2025. 1

  5. [5]

    Implementation matters in deep policy gradients: A case study on PPO and TRPO

    Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep policy gradients: A case study on PPO and TRPO. InInternational Conference on Learning Representations (ICLR),

  6. [6]

    IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures

    Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, V olodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. InInternational Conference on Machine Learning, pages 1407–1416, 2018. 2

  7. [7]

    Rasool Fakoor, Pratik Chaudhari, and Alexander J. Smola. P3O: policy-on policy-off pol- icy optimization. InProceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, page 371, 2019. 2, 5

  8. [8]

    Ddpg++: Striving for simplicity in continuous-control off-policy reinforcement learning.arXiv:2006.15199, 2020

    Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. Ddpg++: Striving for simplicity in continuous-control off-policy reinforcement learning.arXiv:2006.15199, 2020. 1

  9. [9]

    Rasool Fakoor, Pratik Chaudhari, Stefano Soatto, and Alexander J. Smola. Meta-q-learning. In ICLR, 2020. 2, 4

  10. [10]

    Lipton, Pratik Chaudhari, and Alexander J

    Rasool Fakoor, Jonas Mueller, Zachary C. Lipton, Pratik Chaudhari, and Alexander J. Smola. Time-varying propensity score to bridge the gap between the past and present. InICLR, 2024. 2, 4

  11. [11]

    Continuous doubly constrained batch reinforcement learning.Advances in Neural Information Processing Systems, 34:11260–11273, 2021

    Rasool Fakoor, Jonas W Mueller, Kavosh Asadi, Pratik Chaudhari, and Alexander J Smola. Continuous doubly constrained batch reinforcement learning.Advances in Neural Information Processing Systems, 34:11260–11273, 2021. 2

  12. [12]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. InAAAI Conference on Artificial Intelligence,

  13. [13]

    Batch size-invariance for policy optimization

    Jacob Hilton, Karl Cobbe, and John Schulman. Batch size-invariance for policy optimization. InAdvances in Neural Information Processing Systems, volume 35, pages 17086–17098, 2022. 2, 6

  14. [14]

    The 37 implementation details of proximal policy optimization

    Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, and Weixun Wang. The 37 implementation details of proximal policy optimization. InICLR Blog Track,

  15. [15]

    A note on importance sampling using standardized weights.Technical Report 348, 1992

    Augustine Kong. A note on importance sampling using standardized weights.Technical Report 348, 1992. 2

  16. [16]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.arXiv:2005.01643, May 2020. 2 10

  17. [17]

    Budgeting counterfactual for offline rl

    Yao Liu, Pratik Chaudhari, and Rasool Fakoor. Budgeting counterfactual for offline rl. In Advances in Neural Information Processing Systems, volume 36, pages 5729–5751, 2023. 2

  18. [18]

    Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025

    Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog. 7

  19. [19]

    Stabilizing moe reinforcement learning by aligning training and inference routers.arXiv preprint arXiv:2510.11370, 2025

    Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing MoE reinforcement learning by aligning training and inference routers.arXiv preprint arXiv:2510.11370, 2025. 2

  20. [20]

    2023 american mathematics competitions (amc 10 and amc 12)

    Mathematical Association of America. 2023 american mathematics competitions (amc 10 and amc 12). https://huggingface.co/datasets/math-ai/amc23, 2023. Dataset curated by the Math-AI community. 9

  21. [21]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

  22. [22]

    post-mortem

    Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via FP16.arXiv preprint arXiv:2510.26788, 2025. 2

  23. [23]

    Springer Science & Business Media, 2013

    Sidney I Resnick.A probability path. Springer Science & Business Media, 2013. 4

  24. [24]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational Conference on Machine Learning (ICML), pages 1889–1897, 2015. 2, 3

  25. [25]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 2, 3

  26. [26]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 3, 5

  27. [27]

    Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY , USA, 2020. Curran Associates Inc. 1, 4

  28. [28]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018. 3

  29. [29]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. 7

  30. [30]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025. 7, 16

  31. [31]

    Fp8 quantization — vllm documentation

    vLLM Team. Fp8 quantization — vllm documentation. https://docs.vllm.ai/en/v0.5. 0.post1/quantization/fp8.html, 2024. Accessed: 2026-05-06. 8

  32. [32]

    Williams

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine Learning, 8(3):229–256, 1992. 3

  33. [33]

    Jet-rl: Enabling on-policy fp8 reinforcement learning with unified training and rollout precision flow, 2026

    Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, and Ligeng Zhu. Jet-rl: Enabling on-policy fp8 reinforcement learning with unified training and rollout precision flow, 2026. 8

  34. [34]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, 11 Jingjing Liu...

  35. [35]

    American invitational mathematics examination (aime) 2024,

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024,

  36. [36]

    American invitational mathematics examination (aime) 2025,

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025,

  37. [37]

    American invitational mathematics examination (aime) 2026,

    Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2026,

  38. [38]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025. 5, 7

  39. [39]

    Fine-Tuning Language Models from Human Preferences

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019. 1, 4 12 Algorithm 1GRPO Require:Policyπ θ; group sizeG; clip(ϵ ℓ, ϵh); entropy coefficientβ ent Rollout:for each prompt xp, sample G comple...