arxiv: 2605.12380 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Alexander J. Smola, Murdock Aubry, Nicholas Stranges, Rasool Fakoor

Pith reviewed 2026-05-13 05:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learningpolicy optimizationadaptive clippingeffective sample sizetrust regionoff-policy learningRL post-trainingbatch adaptation

0 comments

The pith

The normalized effective sample size of each batch's policy ratios adaptively replaces fixed clipping and removes multiple hyper-parameters in RL post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning post-training is fragile because updating the policy changes the data distribution, entangling two concerns that existing methods address with fixed hyper-parameters chosen before training begins: a trust-region concern about moving the policy too far and an off-policy concern about using unreliable older data. These fixed parameters make algorithms sensitive to configuration and require retuning for new tasks, model scales, or mismatches. The paper proposes a batch-adaptive objective that computes the normalized effective sample size of the policy-ratio distribution in the current batch and uses it in place of fixed clipping. This single statistic automatically caps the score-function weight and sets the strength of an off-policy regularizer, keeping updates close to standard on-policy behavior when ratios are uniform while tightening when data is stale or mismatched. Experiments across varied settings show the method matches or exceeds tuned baselines while introducing no new objective hyper-parameters and eliminating several existing ones.

Core claim

The normalized effective sample size of the policy-ratio distribution in each batch serves as a reliable data-driven proxy that replaces fixed clipping, caps the score-function weight, and controls the strength of an off-policy regularizer, allowing the update to stay close to the usual on-policy score-function update when ratios are nearly uniform and to tighten automatically when stale or mismatched data cause ratio concentration while retaining a nonzero learning signal on high-ratio tokens.

What carries the argument

the normalized effective sample size of the policy-ratio distribution in each batch, which replaces fixed clipping, caps the score-function weight, and sets the strength of an off-policy regularizer

If this is right

The update stays close to the usual on-policy score-function update when ratios are nearly uniform.
It tightens automatically when stale or mismatched data cause ratio concentration.
It retains a nonzero learning signal on high-ratio tokens.
The method matches or exceeds tuned baselines across a wide range of settings.
No new objective hyper-parameters are introduced and several existing ones are removed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may reduce the tuning burden when scaling RL post-training to new tasks or larger models.
Batch-level statistics such as effective sample size could serve as a general mechanism for adapting other objectives that mix on- and off-policy signals.
Similar adaptive use of ratio distributions might improve stability in sequential decision algorithms outside standard RL post-training.

Load-bearing premise

The normalized effective sample size of the policy-ratio distribution in each batch reliably captures both the trust-region violation risk and the off-policy data reliability without introducing new instabilities or requiring hidden tuning.

What would settle it

A controlled experiment in a high-mismatch setting where the ESS-based adaptive method performs measurably worse than carefully tuned baselines that use fixed clipping and regularizers.

Figures

Figures reproduced from arXiv: 2605.12380 by Alexander J. Smola, Murdock Aubry, Nicholas Stranges, Rasool Fakoor.

**Figure 1.** Figure 1: Sensitivity of GRPO’s reward to the clip range ϵ ∈ {0.2, 0.4, 0.6} (shaded region: ±1 std over clip values) versus P3O run once with no clip hyperparameter. P3O’s ESS-driven cap (Eq. (11)) removes this tuning burden while matching or exceeding the best GRPO variant across both model families. same base objective (Sec. 2) but relies on a fixed clip range; both methods otherwise use identical hyper-parameter… view at source ↗

**Figure 2.** Figure 2: P3O is robust to off-policy data introduced through the varied sampling temperature of rollouts. Sampling rollouts at a temperature other than 1.0 introduces a distribution shift in the token-level log probabilities, creating off-policy data for Qwen3-4B-Thinking-2507. Corresponding Qwen2.5-1.5B results are deferred to [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: P3O is robust to off-policy data introduced through the BF16 Train + FP8 Rollout training scheme. As accuracy collapse is observed in longer rollout lengths [33], a rollout length of 16,384 tokens was used in this experiment. The demonstrated robustness of P3O to off-policy data allows for the use of faster rollout generation strategies. In contrast, GRPO’s performance degrades significantly under the same… view at source ↗

**Figure 4.** Figure 4: Pass@k averaged over all five held-out benchmarks (AIME24/25/26, AMO-Bench, AMC). Left: clip-ratio variants at 4K-token evaluation; GRPO is averaged over ϵ ∈ {0.2, 0.4, 0.6}. Ours matches or exceeds the averaged GRPO sweep without requiring a clip-ratio choice. Right: BF16-train + FP8-rollout variants at 16K-token evaluation. GRPO collapses by iter 30 (near-zero pass@k) while Ours retains strong performanc… view at source ↗

**Figure 5.** Figure 5: Asynchronous-training comparison between P3O and GRPO under one optimizer step per rollout epoch. Rollouts are generated by a stale policy while the learner continues updating, creating the off-policy lag discussed in the main text. In this one-step pipeline setting, P3O maintains a higher and more stable reward trajectory than GRPO across training. The corresponding twostep pipeline produces the same qua… view at source ↗

**Figure 6.** Figure 6: Comparison of GRPO and P3O with respect to the mixing of off-policy data. A rollout length of 4,096 tokens was used in this experiment. This experiment uses Qwen3-8B [30] to generate rollouts for training Qwen3-4B-Thinking-2507, as it is from a different model family. Data was mixed at a 50% ratio, meaning half of the rollouts were generated by the training model and half were generated by the separate pol… view at source ↗

**Figure 7.** Figure 7: Temperature-robustness results for Qwen2.5-1.5B, corresponding to [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: Training curves for the two-anchor extension of P3O, P3O, and GRPO on Qwen3-4BThinking. All runs use the DeepSeek dataset with temperature 1.2 and a rollout length of 4,096 tokens. The two-anchor extension peaks at a reward comparable to P3O and GRPO but undergoes a sharp collapse after step 16, dropping to near-zero reward by step 24 before the run terminates. P3O maintains stable, high reward throughout… view at source ↗

**Figure 9.** Figure 9: Training curves for the two-anchor extension of P3O, P3O, and GRPO (ϵ=0.4) on Qwen3-4B-Thinking at default temperature. All runs use the DeepSeek dataset with a rollout length of 4,096 tokens. Unlike the temperature-1.2 regime ( [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

read the original abstract

Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sampling, and other implementation details. Existing methods manage this fragility by adding hyper-parameters to the training objective, which makes the algorithm more sensitive to its configuration and requires retuning whenever the task, model scale, or distribution mismatch changes. This fragility traces to two concerns that current objectives entangle through hyper-parameters set before training begins: a trust-region concern, that updates should not move the policy too far from its current value, and an off-policy concern, that data from older or different behavior policies should influence the update only to the extent that it remains reliable. Neither concern is a constant to set in advance, and their severity is reflected in the policy-ratio distribution of the current batch. We present a simple yet effective batch-adaptive objective that replaces fixed clipping with the normalized effective sample size of the policy ratios. The same statistic caps the score-function weight and sets the strength of an off-policy regularizer, so the update stays close to the usual on-policy score-function update when ratios are nearly uniform, and tightens automatically when stale or mismatched data cause ratio concentration, while retaining a nonzero learning signal on high-ratio tokens. Experiments across a wide range of settings show that our method matches or exceeds tuned baselines, introducing no new objective hyper-parameters and removing several existing ones. The code is available at https://github.com/FeynRL-project/FeynRL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper replaces fixed clipping and off-policy weights with normalized effective sample size from each batch's policy ratios, removing several hyperparameters while claiming to match tuned baselines.

read the letter

The core idea here is to compute the normalized effective sample size of the policy ratios in the current batch and use that single number to control both how much the update is allowed to deviate from the current policy and how strongly to regularize against off-policy data. When the ratios are fairly uniform the method stays close to a plain on-policy score-function update; when they concentrate it tightens automatically but still keeps a nonzero gradient on high-ratio tokens. That is the main technical move, and it is presented as a way to reduce the hyperparameter fragility that comes from rollout-training mismatches in large-model RL post-training. The experiments are described as covering a wide range of settings and showing performance at least as good as carefully tuned baselines, with no new objective hyperparameters introduced and several old ones removed. The code is released, which is useful for anyone who wants to check the implementation details. The soft spot is the choice of normalized ESS itself. Effective sample size is a variance proxy that is heavily influenced by the largest ratios, and it can change sharply even when the actual KL or total variation between policies stays moderate. In the heavy-tailed ratio distributions that often appear in LLM post-training, this could either suppress useful updates too aggressively or fail to regularize when a few moderate ratios inflate the statistic. The abstract gives no derivation of the exact normalization, no error bars, and no ablation on how the ESS is computed or thresholded, so the central claim that this statistic reliably handles both concerns at once remains plausible but unverified from the given information. This paper is for practitioners who run RL fine-tuning loops on large models and spend time retuning clipping and regularization coefficients whenever the rollout system or model scale changes. Readers who care about reducing hyperparameter sensitivity in policy optimization will find the most direct value. It is worth sending to peer review because the practical problem is real, the proposed fix is simple, and the reported results are positive, even though referees will need to see more analysis of the ESS statistic and its behavior under distribution shift.

Referee Report

2 major / 1 minor

Summary. The paper proposes a batch-adaptive objective for RL post-training of large models that replaces fixed clipping and several hyperparameters with the normalized effective sample size (ESS) of the per-batch policy-ratio distribution; this single statistic is used both to cap score-function weights (enforcing a trust region) and to modulate an off-policy regularizer, so that the update remains close to standard on-policy score-function gradient when ratios are uniform and automatically tightens under ratio concentration from stale or mismatched data. Experiments across diverse settings are reported to match or exceed tuned baselines while introducing no new objective hyperparameters.

Significance. If the central claim holds, the approach would meaningfully reduce hyperparameter sensitivity in RL post-training for large models, where numerical and distributional mismatches between training and rollout systems are common. The public code release is a clear strength that supports reproducibility and further testing of the batch-adaptive mechanism.

major comments (2)

[Abstract and §3] Abstract and §3 (method): the central claim that normalized ESS simultaneously replaces clipping and sets the off-policy regularizer without new hyperparameters rests on the assumption that ESS = (∑ r_i)^2 / ∑ r_i^2 (normalized by batch size) reliably proxies both trust-region violation risk and data reliability. No derivation or ablation is provided showing how this scalar is exactly inserted into the objective or normalized, leaving open the possibility that the construction reduces to a reparameterized form of existing clipping or regularization terms.
[§4] §4 (experiments): the reported matches to tuned baselines lack error bars, details on how ESS is computed across batches, and ablations isolating the effect of the ESS-based capping versus the regularizer modulation. Without these, it is impossible to verify that the method avoids the instabilities the skeptic notes for heavy-tailed ratio regimes typical in LLM post-training.

minor comments (1)

[Abstract] The abstract states that the code is available at the given GitHub link; please confirm the repository contains the exact scripts used for the reported experiments and any additional implementation details on ESS normalization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications from the existing text and indicating where revisions will strengthen the presentation.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method): the central claim that normalized ESS simultaneously replaces clipping and sets the off-policy regularizer without new hyperparameters rests on the assumption that ESS = (∑ r_i)^2 / ∑ r_i^2 (normalized by batch size) reliably proxies both trust-region violation risk and data reliability. No derivation or ablation is provided showing how this scalar is exactly inserted into the objective or normalized, leaving open the possibility that the construction reduces to a reparameterized form of existing clipping or regularization terms.

Authors: Section 3 derives the normalized ESS explicitly as ESS = (1/N) * (∑ r_i)^2 / ∑ r_i^2, where r_i denotes the per-token policy ratio in the current batch and N is the batch size. This scalar is inserted into the objective by (i) capping the score-function importance weights at 1/ESS to enforce a per-batch trust region and (ii) scaling the coefficient of the off-policy regularizer by (1 - ESS). When the ratio distribution is uniform, ESS approaches 1 and the update recovers the standard on-policy score-function gradient; when ratios concentrate due to staleness or mismatch, ESS drops and both the cap and regularizer tighten automatically. The construction is not a reparameterization of fixed clipping because the threshold is computed from the empirical second moment of the batch ratios rather than a preset hyperparameter. We will add an explicit ablation that isolates the capping term from the regularizer modulation to further demonstrate their distinct contributions. revision: partial
Referee: [§4] §4 (experiments): the reported matches to tuned baselines lack error bars, details on how ESS is computed across batches, and ablations isolating the effect of the ESS-based capping versus the regularizer modulation. Without these, it is impossible to verify that the method avoids the instabilities the skeptic notes for heavy-tailed ratio regimes typical in LLM post-training.

Authors: We agree that error bars, explicit computation details, and component ablations are required for full verification. The revised manuscript will report mean performance with standard error bars over multiple random seeds for every experiment. We will add a dedicated paragraph in §4 describing the per-batch ESS computation (ratios are evaluated on the sampled tokens, then normalized ESS is obtained exactly as defined in §3). We will also include ablations that disable the ESS-based weight cap and the regularizer modulation independently. For heavy-tailed regimes, the adaptive tightening via low ESS is intended to mitigate instability; we will supplement the experiments with plots of per-batch ratio histograms and ESS values to illustrate this behavior under the conditions noted. revision: yes

Circularity Check

0 steps flagged

No circularity: objective defined directly from batch statistics

full rationale

The paper proposes an explicit batch-adaptive objective that computes normalized effective sample size (ESS) from the current batch's policy ratios and uses that scalar to cap score-function weights and modulate an off-policy regularizer. This is a direct construction from observable batch data rather than a fitted parameter or self-referential definition. No step claims a 'prediction' that reduces to the input by construction, no uniqueness theorem is imported via self-citation, and the provided text contains no load-bearing self-citations. The derivation chain remains self-contained: the method is presented as a heuristic that automatically tightens when ratios concentrate, with empirical validation against baselines, without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract, the method relies on standard RL assumptions such as the policy gradient theorem and the interpretation of effective sample size as a reliability measure; no new free parameters or invented entities are introduced.

axioms (2)

standard math Policy gradient theorem applies to the score-function updates used here
Invoked implicitly when describing the on-policy score-function update
domain assumption Normalized effective sample size of policy ratios accurately reflects both trust-region and off-policy reliability
Core premise that allows the single statistic to replace fixed hyperparameters

pith-pipeline@v0.9.0 · 5600 in / 1358 out tokens · 96034 ms · 2026-05-13T05:28:53.772247+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes
P3O replaces the fixed clip in Eq. (4) with two terms whose strength is set by the batch ESS... LP3O(θ) = ... −sg(min{ρt, eB}) logπθ(yt|c<t)A + (1−eB) KL(πθ ∥ πb)
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection refines
The same statistic caps the score-function weight and sets the strength of an off-policy regularizer... introducing no new objective hyper-parameters

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 7 internal anchors

[1]

Amo-bench: Large language models still struggle in high school math competitions, 2025

Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, and Shuang Zhou. Amo-bench: Large language models still struggle in high school math competitions, 2025. 9

work page 2025
[2]

What matters for on-policy deep actor-critic methods? A large-scale study

Marcin Andrychowicz, Anton Raichuk, Piotr Sta´nczyk, Manu Orsini, Sertan Girgin, Raphaël Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters for on-policy deep actor-critic methods? A large-scale study. InInternational Conference on Learning Representations (ICLR), 2021. 1

work page 2021
[3]

Bridging the training-inference gap in LLMs by leveraging self-generated tokens.Transactions on Machine Learning Research, 2025

Zhepeng Cen, Yao Liu, Siliang Zeng, Pratik Chaudhari, Huzefa Rangwala, George Karypis, and Rasool Fakoor. Bridging the training-inference gap in LLMs by leveraging self-generated tokens.Transactions on Machine Learning Research, 2025. 4

work page 2025
[4]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645:633–638, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 645:633–638, 2025. 1

work page 2025
[5]

Implementation matters in deep policy gradients: A case study on PPO and TRPO

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep policy gradients: A case study on PPO and TRPO. InInternational Conference on Learning Representations (ICLR),

work page
[6]

IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures

Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, V olodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. InInternational Conference on Machine Learning, pages 1407–1416, 2018. 2

work page 2018
[7]

Rasool Fakoor, Pratik Chaudhari, and Alexander J. Smola. P3O: policy-on policy-off pol- icy optimization. InProceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, page 371, 2019. 2, 5

work page 2019
[8]

Ddpg++: Striving for simplicity in continuous-control off-policy reinforcement learning.arXiv:2006.15199, 2020

Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. Ddpg++: Striving for simplicity in continuous-control off-policy reinforcement learning.arXiv:2006.15199, 2020. 1

work page arXiv 2006
[9]

Rasool Fakoor, Pratik Chaudhari, Stefano Soatto, and Alexander J. Smola. Meta-q-learning. In ICLR, 2020. 2, 4

work page 2020
[10]

Lipton, Pratik Chaudhari, and Alexander J

Rasool Fakoor, Jonas Mueller, Zachary C. Lipton, Pratik Chaudhari, and Alexander J. Smola. Time-varying propensity score to bridge the gap between the past and present. InICLR, 2024. 2, 4

work page 2024
[11]

Continuous doubly constrained batch reinforcement learning.Advances in Neural Information Processing Systems, 34:11260–11273, 2021

Rasool Fakoor, Jonas W Mueller, Kavosh Asadi, Pratik Chaudhari, and Alexander J Smola. Continuous doubly constrained batch reinforcement learning.Advances in Neural Information Processing Systems, 34:11260–11273, 2021. 2

work page 2021
[12]

Deep reinforcement learning that matters

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. InAAAI Conference on Artificial Intelligence,

work page
[13]

Batch size-invariance for policy optimization

Jacob Hilton, Karl Cobbe, and John Schulman. Batch size-invariance for policy optimization. InAdvances in Neural Information Processing Systems, volume 35, pages 17086–17098, 2022. 2, 6

work page 2022
[14]

The 37 implementation details of proximal policy optimization

Shengyi Huang, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, and Weixun Wang. The 37 implementation details of proximal policy optimization. InICLR Blog Track,

work page
[15]

A note on importance sampling using standardized weights.Technical Report 348, 1992

Augustine Kong. A note on importance sampling using standardized weights.Technical Report 348, 1992. 2

work page 1992
[16]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.arXiv:2005.01643, May 2020. 2 10

work page internal anchor Pith review Pith/arXiv arXiv 2005
[17]

Budgeting counterfactual for offline rl

Yao Liu, Pratik Chaudhari, and Rasool Fakoor. Budgeting counterfactual for offline rl. In Advances in Neural Information Processing Systems, volume 36, pages 5729–5751, 2023. 2

work page 2023
[18]

Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog. 7

work page 2025
[19]

Stabilizing moe reinforcement learning by aligning training and inference routers.arXiv preprint arXiv:2510.11370, 2025

Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing MoE reinforcement learning by aligning training and inference routers.arXiv preprint arXiv:2510.11370, 2025. 2

work page arXiv 2025
[20]

2023 american mathematics competitions (amc 10 and amc 12)

Mathematical Association of America. 2023 american mathematics competitions (amc 10 and amc 12). https://huggingface.co/datasets/math-ai/amc23, 2023. Dataset curated by the Math-AI community. 9

work page 2023
[21]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedba...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

post-mortem

Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via FP16.arXiv preprint arXiv:2510.26788, 2025. 2

work page arXiv 2025
[23]

Springer Science & Business Media, 2013

Sidney I Resnick.A probability path. Springer Science & Business Media, 2013. 4

work page 2013
[24]

Trust region policy optimization

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. InInternational Conference on Machine Learning (ICML), pages 1889–1897, 2015. 2, 3

work page 2015
[25]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 2, 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea V oss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY , USA, 2020. Curran Associates Inc. 1, 4

work page 2020
[28]

Sutton and Andrew G

Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018. 3

work page 2018
[29]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024. 7

work page 2024
[30]

Qwen3 technical report, 2025

Qwen Team. Qwen3 technical report, 2025. 7, 16

work page 2025
[31]

Fp8 quantization — vllm documentation

vLLM Team. Fp8 quantization — vllm documentation. https://docs.vllm.ai/en/v0.5. 0.post1/quantization/fp8.html, 2024. Accessed: 2026-05-06. 8

work page 2024
[32]

Williams

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforce- ment learning.Machine Learning, 8(3):229–256, 1992. 3

work page 1992
[33]

Jet-rl: Enabling on-policy fp8 reinforcement learning with unified training and rollout precision flow, 2026

Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, and Ligeng Zhu. Jet-rl: Enabling on-policy fp8 reinforcement learning with unified training and rollout precision flow, 2026. 8

work page 2026
[34]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, 11 Jingjing Liu...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

American invitational mathematics examination (aime) 2024,

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2024,

work page 2024
[36]

American invitational mathematics examination (aime) 2025,

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025,

work page 2025
[37]

American invitational mathematics examination (aime) 2026,

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2026,

work page 2026
[38]

Group Sequence Policy Optimization

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025. 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Fine-Tuning Language Models from Human Preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019. 1, 4 12 Algorithm 1GRPO Require:Policyπ θ; group sizeG; clip(ϵ ℓ, ϵh); entropy coefficientβ ent Rollout:for each prompt xp, sample G comple...

work page internal anchor Pith review Pith/arXiv arXiv 1909