CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Fuzheng Zhang; Guorui Zhou; Kun Gai; Leiyu Pan; Minxuan Lv; Wenping Hu; Yuntao Li; Zhenpeng Su

arxiv: 2509.20712 · v5 · submitted 2025-09-25 · 💻 cs.LG · cs.CL

Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou

Pith reviewed 2026-05-18 15:02 UTC · model grok-4.3

keywords reinforcement learning · policy optimization · entropy dynamics · clipping mechanism · large language models · exploration exploitation · mathematical reasoning

The pith

CE-GPPO reintroduces bounded gradients from clipped tokens to stabilize entropy and improve the exploration-exploitation balance in RL for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard clipping in PPO discards gradient signals from low-probability tokens, which disrupts entropy dynamics and harms the balance between exploration and exploitation. CE-GPPO addresses this by gently reintroducing those gradients in a controlled way while keeping changes to the native PPO algorithm minimal. A sympathetic reader would care because entropy management is central to effective RL fine-tuning of language models on reasoning tasks, and the method shows consistent gains across model scales on mathematical benchmarks along with theoretical support for reduced instability.

Core claim

Analysis of entropy dynamics shows that clipped tokens play a critical, overlooked role in entropy regulation. CE-GPPO reintroduces their gradients in a gentle and bounded manner by controlling gradient magnitude outside the clipping interval, achieving a better exploration-exploitation trade-off. Theoretical justification and experiments on reasoning benchmarks confirm it mitigates entropy instability while outperforming strong baselines.
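The entropy-dynamics claim can be made concrete with a toy check. Further down, this page quotes the first-order relation "Entropy change ≈ −η Cov(log π, π·Â)". The sketch below is an illustration under stated assumptions, not the paper's code: it assumes a tabular softmax policy over four actions whose logits are updated by z_a += η·π(a)·Â(a), and the function names, logits, and advantages are all hypothetical. It compares the covariance estimate with the entropy change actually measured after the update.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_entropy_change(logits, advantages, lr):
    """First-order estimate: -lr * Cov_{a~pi}(log pi(a), pi(a) * A(a))."""
    probs = softmax(logits)
    log_probs = [math.log(p) for p in probs]
    weighted = [p * a for p, a in zip(probs, advantages)]
    mean_log = sum(p * lp for p, lp in zip(probs, log_probs))
    mean_w = sum(p * w for p, w in zip(probs, weighted))
    cov = sum(p * (lp - mean_log) * (w - mean_w)
              for p, lp, w in zip(probs, log_probs, weighted))
    return -lr * cov

def actual_entropy_change(logits, advantages, lr):
    """Apply the assumed update z_a += lr * pi(a) * A(a), then measure entropy."""
    probs = softmax(logits)
    new_logits = [z + lr * p * a for z, p, a in zip(logits, probs, advantages)]
    return entropy(softmax(new_logits)) - entropy(probs)

# Hypothetical toy policy: the most likely action has a positive advantage.
LOGITS = [2.0, 0.5, 0.0, -1.0]
ADVANTAGES = [1.0, -0.5, -0.5, 0.0]
predicted = predicted_entropy_change(LOGITS, ADVANTAGES, lr=0.01)
actual = actual_entropy_change(LOGITS, ADVANTAGES, lr=0.01)
print(f"predicted={predicted:.6f} actual={actual:.6f}")
```

In this toy setting, rewarding the already-likely action gives a negative estimate (entropy falls), while putting the positive advantage on the least likely action flips the sign. That is the qualitative pattern the page attributes to the different token types.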

What carries the argument

Gradient-preserving clipping, which reintroduces gradients from tokens outside the clipping interval in a bounded manner to coordinate entropy evolution without altering the core PPO update.
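A minimal sketch of what "bounded" means here, based on the objective quoted later on this page (β1·(1−ε)/sg(δ)·δ·Â and β2·(1+ε)/sg(δ)·δ·Â for out-of-clip tokens, with sg a stop-gradient and δ the importance ratio). Because the ratio is divided by its own stop-gradient copy, the coefficient that multiplies Â·∇log π becomes a constant rather than δ. The function names and default values are hypothetical, and the choice of which (ratio, advantage) combinations count as clipped follows the usual PPO min/clip behaviour, which is an assumption about the paper's exact formulation.

```python
def ppo_gradient_weight(ratio, advantage, eps=0.2):
    """Coefficient on A * grad(log pi) under standard PPO clipping."""
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # clipped on the left: gradient discarded
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0  # clipped on the right: gradient discarded
    return ratio    # inside the interval (or not binding): weight is the ratio

def ce_gppo_gradient_weight(ratio, advantage, eps=0.2, beta1=0.5, beta2=1.0):
    """Same coefficient with out-of-clip gradients kept at a bounded size.

    eps, beta1 and beta2 defaults are illustrative values, not recommendations.
    """
    if advantage < 0 and ratio < 1.0 - eps:
        return beta1 * (1.0 - eps)  # constant weight, independent of ratio
    if advantage > 0 and ratio > 1.0 + eps:
        return beta2 * (1.0 + eps)  # constant weight, independent of ratio
    return ratio

for ratio, adv in [(0.5, -1.0), (1.0, 1.0), (1.5, 1.0)]:
    print(ratio, adv, ppo_gradient_weight(ratio, adv),
          ce_gppo_gradient_weight(ratio, adv))
```

Setting beta1 = beta2 = 0 in this sketch recovers the standard PPO weights, which is one way to read the claim that the change to native PPO is small.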

Load-bearing premise

That reintroducing gradients from clipped tokens in a bounded manner will stabilize entropy evolution without creating new training instabilities or performance drops.

What would settle it

Run CE-GPPO and standard PPO on the same mathematical reasoning benchmarks and measure whether entropy variance decreases and final performance improves without new divergence or regression.
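One way to operationalise "entropy variance" in such a comparison is sketched below. It is an assumption about how the measurement could be done, not a procedure taken from the paper: each training step is represented by a list of per-token probability distributions, the step entropy is their mean Shannon entropy, and the run is summarised by the population variance of step entropies. The function names and the two toy runs are hypothetical.

```python
import math
from statistics import fmean, pvariance

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_step_entropy(step_distributions):
    """Average token entropy over all tokens logged at one training step."""
    return fmean([token_entropy(d) for d in step_distributions])

def entropy_trace_variance(run):
    """Population variance of mean step entropy across a training run."""
    return pvariance([mean_step_entropy(step) for step in run])

# Toy traces: one distribution per step, three steps per run.
stable_run = [[[0.5, 0.5]], [[0.5, 0.5]], [[0.5, 0.5]]]
collapsing_run = [[[0.5, 0.5]], [[0.9, 0.1]], [[0.99, 0.01]]]

print(entropy_trace_variance(stable_run))
print(entropy_trace_variance(collapsing_run))
```

A single variance number hides the direction of drift, so in practice it would be read next to the raw entropy curve and the benchmark accuracy.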

Figures

Figures reproduced from arXiv: 2509.20712 by Fuzheng Zhang, Guorui Zhou, Kun Gai, Leiyu Pan, Minxuan Lv, Wenping Hu, Yuntao Li, Zhenpeng Su.

**Figure 1.** Left: Importance sampling distribution of tokens with different probabilities. Based on the distribution, all tokens can be categorized into four types: PA&HP, NA&LP, PA&LP and NA&HP. Center: The effect of the four token types on entropy dynamics. The two categories shown at the top contribute to entropy reduction, while those at the bottom contribute to entropy increase. Green check marks indicate tokens …

**Figure 2.** Based on DeepSeek-R1-Distill-Qwen-7B, a comparison of GRPO, DAPO, and GPPO in terms of entropy dynamics and AIME25 benchmark accuracy.

… potential data contamination, the dataset has been further processed with 9-gram deduplication against the evaluation benchmarks. Training: We conducted training with CE-GPPO on two model sizes, DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B. The maximum trai…

**Figure 3.** Entropy dynamics and benchmark accuracy under different β1/β2 configurations.

… key finding that the choice of β1 and β2 directly governs the evolution of entropy. The underlying mechanism is that:
• A larger β1 amplifies gradients beyond the left clip boundary (mainly from NA&LP tokens). These gradients strengthen high-probability tokens, accelerating exploitation and thus causing entropy to collapse quick…

**Figure 4.** Comparison of KL divergence and gradient …

**Figure 5.** Comparison of CE-GPPO with other entropy collapse mitigation strategies. Native GRPO denotes the baseline without any mitigation strategy. α = 0.001/0.003 indicate the addition of an entropy loss term to the Native GRPO baseline, where α represents the entropy loss coefficient. DAPO refers to applying the Clip Higher strategy on Native GRPO baseline.

… balance between exploration and exploitation.
• Compa…

**Figure 6.** Entropy dynamics and benchmark accuracy under different β1/β2 configurations. For β1 = 0/β2 = 1, the setting is maintained consistently across 0–1000 steps. For β1 = 0/β2 = 1 → β1 = 0.5/β2 = 1 configuration, the transition occurs at step 585.

Appendix A.1, The Role of Entropy at Different Stages of Training: We further investigate the role of entropy at different stages of training. The results show that …

Original abstract

Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Coordinating Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

CE-GPPO reintroduces bounded gradients from clipped tokens to stabilize entropy in PPO for LLM reasoning, with solid empirical gains on math benchmarks but an open question on whether the change keeps the original trust-region guarantees.

The main point is that this paper modifies PPO by feeding back gradients from tokens that fall outside the clip range, but only after bounding their size. The goal is to prevent entropy from collapsing or swinging too wildly during RL training on reasoning tasks. They argue that standard clipping throws away useful signal that affects entropy dynamics, and their fix reintroduces it in a controlled way to improve the exploration-exploitation balance.

The experiments on mathematical reasoning benchmarks are the clearest strength: consistent gains over strong baselines across model scales, which matters for anyone running PPO-style post-training on LLMs. That part looks reproducible enough to be worth testing.

The softer area is the theory. The stress-test concern is reasonable: if the added bounded term lets the total policy step grow beyond what the original clipping enforces, the surrogate may lose its monotonic-improvement property or allow larger KL divergence than intended. The abstract claims theoretical justification, but without seeing the exact bounding argument or how the advantage-weighted contributions from out-of-clip tokens are kept in check, it is hard to know whether the entropy stability is robust or just an empirical side effect. Minor details like exact data filtering rules or error bars would also help, though those are fixable.

This is aimed at researchers who already use PPO variants for LLM reasoning and want a lightweight tweak for entropy control. A reader who cares about practical RLHF improvements will find the mechanism and results useful even if they end up adjusting the bound themselves. It deserves a serious referee because the targeted change addresses a real pain point and the benchmark evidence is relevant, even if the theoretical section needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces CE-GPPO as an extension to standard PPO for RL-based optimization of LLMs on reasoning tasks. It claims that clipped tokens in PPO's surrogate objective play an overlooked role in entropy dynamics; the proposed method reintroduces their gradients in a bounded manner to stabilize entropy, improve the exploration-exploitation trade-off, and yield better performance. The manuscript provides a theoretical analysis of the modified gradient and reports empirical gains on mathematical reasoning benchmarks across model scales.

Significance. If the bounded reintroduction of clipped-token gradients can be shown to preserve PPO's trust-region guarantees while demonstrably stabilizing entropy, the approach would offer a lightweight, interpretable improvement to existing RLHF pipelines for reasoning models. The empirical results on standard math benchmarks, if reproducible and properly controlled, would constitute a practical contribution even if the theoretical novelty is incremental.

major comments (2)

[Theoretical justification] Theoretical justification section: the derivation of the modified gradient term for tokens outside the clipping interval does not explicitly bound the total KL divergence or demonstrate that the advantage-weighted contribution from out-of-clip tokens remains dominated by the original clipped surrogate. Without this step, it is unclear whether the entropy-stability argument preserves the monotonic-improvement property of the PPO surrogate.
[Empirical evaluation] §4 (or equivalent empirical section), Table or Figure reporting main results: the manuscript does not detail the exact clipping threshold, the gradient-magnitude bound hyperparameter, or the data-exclusion rules used in the entropy-dynamics plots; these omissions make it impossible to verify that the reported entropy stabilization is not an artifact of the chosen bound or of selective token filtering.
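
The first major comment asks for a bound on KL divergence. A proof is outside the scope of a code sketch, but the quantity can at least be monitored empirically. The example below is a hedged illustration, not something the paper is known to use: it estimates KL(π_old ‖ π_new) from per-token log-probabilities of tokens sampled under the old policy, using the plain estimator E[−log r] and the lower-variance, always non-negative estimator E[(r − 1) − log r], where r = π_new/π_old. The function name and inputs are hypothetical.

```python
import math

def kl_estimates(old_logprobs, new_logprobs):
    """Monte Carlo estimates of KL(pi_old || pi_new).

    Both lists hold log-probabilities of the same tokens, which are assumed
    to have been sampled from pi_old. Returns (k1, k3).
    """
    if len(old_logprobs) != len(new_logprobs) or not old_logprobs:
        raise ValueError("need two non-empty lists of equal length")
    log_ratios = [new - old for old, new in zip(old_logprobs, new_logprobs)]
    n = len(log_ratios)
    k1 = sum(-lr for lr in log_ratios) / n                      # E[-log r]
    k3 = sum(math.exp(lr) - 1.0 - lr for lr in log_ratios) / n  # E[(r-1) - log r]
    return k1, k3

# Hypothetical log-probabilities for five sampled tokens.
old_lp = [-0.2, -1.5, -3.0, -0.7, -2.2]
new_lp = [-0.1, -1.9, -3.6, -0.6, -2.0]
print(kl_estimates(old_lp, new_lp))
```

Tracking such an estimate for runs with and without the out-of-clip gradient term would show whether the added term enlarges the effective policy step, which is the referee's concern.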

minor comments (2)

[Method] Notation for the bounded gradient term should be introduced with an explicit equation number and contrasted directly with the standard PPO clipping indicator.
[Abstract and introduction] The abstract states 'theoretical justification' but the main text should include a short lemma or corollary that isolates the entropy-control effect from the performance improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, indicating the specific revisions we will incorporate to strengthen the manuscript.

Referee: [Theoretical justification] Theoretical justification section: the derivation of the modified gradient term for tokens outside the clipping interval does not explicitly bound the total KL divergence or demonstrate that the advantage-weighted contribution from out-of-clip tokens remains dominated by the original clipped surrogate. Without this step, it is unclear whether the entropy-stability argument preserves the monotonic-improvement property of the PPO surrogate.

Authors: We thank the referee for identifying this gap in the theoretical presentation. Our current derivation bounds the per-token gradient contribution from out-of-clip tokens via a magnitude hyperparameter, which directly limits their influence on the policy update and thereby stabilizes entropy. To make the connection to PPO's trust-region guarantees explicit, we will revise the theoretical justification section to include a formal bound on the additional KL divergence induced by these terms and demonstrate that their advantage-weighted contribution remains strictly dominated by the clipped surrogate terms. This addition will confirm that the monotonic-improvement property is preserved under the bounded modification.
Revision: yes
Referee: [Empirical evaluation] §4 (or equivalent empirical section), Table or Figure reporting main results: the manuscript does not detail the exact clipping threshold, the gradient-magnitude bound hyperparameter, or the data-exclusion rules used in the entropy-dynamics plots; these omissions make it impossible to verify that the reported entropy stabilization is not an artifact of the chosen bound or of selective token filtering.

Authors: We agree that these details are necessary for full reproducibility and to rule out potential artifacts. In the revised manuscript we will explicitly report the clipping threshold (ε = 0.2), the gradient-magnitude coefficients β1 and β2 that bound the out-of-clip gradients, and the precise token-filtering criteria applied when generating the entropy-dynamics plots. These additions will allow independent verification that the observed stabilization arises from the proposed mechanism rather than from hyperparameter choice or selective data handling.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents CE-GPPO as an extension of standard PPO that adds a bounded gradient contribution from clipped tokens to stabilize entropy. The abstract and provided context describe an analysis of entropy dynamics followed by a proposed modification with theoretical justification and empirical validation on reasoning benchmarks. No equations or steps are shown that reduce the claimed entropy control or performance gains to a fitted parameter renamed as prediction, a self-referential definition, or a load-bearing self-citation whose validity depends on the current work. The derivation appears self-contained against external PPO baselines and benchmark results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the domain assumption that entropy dynamics are primarily driven by gradients from out-of-clip tokens and introduces at least one tunable magnitude control for those gradients.

free parameters (1)

gradient magnitude bound for clipped tokens
The paper states that CE-GPPO controls the magnitude of gradients from tokens outside the clipping interval, implying a tunable or chosen bound parameter.

axioms (1)

domain assumption: Clipped tokens play a critical yet overlooked role in regulating entropy evolution
This premise is stated as the result of the systematic analysis of entropy dynamics in existing PPO variants.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

CE-GPPO objective with β1·(1−ε)/sg(δ)·δ·Â and β2·(1+ε)/sg(δ)·δ·Â for out-of-clip tokens; gradient form F_{i,t}(θ) bounded by β·(1±ε)
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

Entropy change ≈ −η Cov(log π, π·Â); clipped low-probability tokens regulate collapse/explosion

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG · 2026-05 · unverdicted · novelty 7.0

Entropy polarity from a first-order entropy change approximation enables Polarity-Aware Policy Optimization (PAPO) that preserves complementary polarity branches and outperforms baselines on math and agentic RL fine-t...

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control
cs.LG · 2026-05 · unverdicted · novelty 6.0

Entropy polarity is a signed token-level quantity derived from a first-order approximation of entropy change that predicts whether RL updates expand or contract policy entropy in LLM fine-tuning, revealing an asymmetr...

Flexible Entropy Control in RLVR with a Gradient-Preserving Perspective
cs.LG · 2026-02 · unverdicted · novelty 6.0

Dynamic clipping strategies based on importance sampling regions enable precise entropy management in RLVR, mitigating collapse and improving benchmark performance.

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
cs.LG · 2025-12 · unverdicted · novelty 6.0

Entropy Ratio Clipping introduces a global entropy-ratio constraint that stabilizes RL policy updates in LLM post-training beyond local PPO clipping.

Revisiting Entropy in Reinforcement Learning for Large Reasoning Models
cs.CL · 2025-11 · unverdicted · novelty 6.0

Tokens with positive advantages primarily drive entropy collapse in RLVR training of LLMs, and reweighting their loss contributions regulates entropy while maintaining competitive performance.

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
cs.AI · 2026-04 · unverdicted · novelty 5.0

OGER adds an auxiliary exploration reward built from offline trajectories and model entropy to hybrid RL training, yielding gains on math reasoning benchmarks and out-of-domain generalization.

Targeted Exploration via Unified Entropy Control for Reinforcement Learning
cs.AI · 2026-04 · unverdicted · novelty 5.0

UEC-RL improves RL reasoning performance in LLMs and VLMs by activating exploration on hard prompts and stabilizing entropy, delivering a 37.9% relative gain over GRPO on Geometry3K.


[1] [1]

Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. 2019. http://proceedings.mlr.press/v97/ahmed19a.html Understanding the impact of entropy on policy optimization . In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA , volume 97 of Proceedings of Machine Learn...

work page 2019

[2] [2]

Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, Zhuofu Chen, Jialei Cui, Hao Ding, Mengnan Dong, Angang Du, Chenzhuang Du, Dikang Du, Yulun Du, Yu Fan, Yichen Feng, and 80 others. 2025. https://doi.org/10.48550/ARXIV.2507.20534 Kimi K2: open agentic intelligence . CoRR, abs/2507.20534

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.20534 2025

[3] [3]

Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2025. https://doi.org/10.48550/ARXIV.2505.16400 Acereason-nemotron: Advancing math and code reasoning through reinforcement learning . CoRR, abs/2505.16400

work page doi:10.48550/arxiv.2505.16400 2025

[4] [4]

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. 2025. https://doi.org/10.48550/ARXIV.2506.14758 Reasoning with exploration: An entropy perspective . CoRR, abs/2506.14758

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.14758 2025

[5] [5]

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Hao-Si Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. 2025 a . https://api.semanticscholar.org/CorpusID:278959427 The entropy mechanism of reinforcement learning for reasoning language models . ArXiv, abs/2505.22617

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. 2025 b . https://doi.org/10.48550/ARXIV.2505.22617 The entropy mechanism of reinforcement learning for reasoning language models . CoRR, abs/2505.22617

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.22617 2025

[7] [7]

DeepSeek - AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, and 81 others. 2025. https://doi.org/10.48550/ARXIV.2501.12948 Deepseek-r1: Incentivizing reasoning capability in llms via reinfor...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.12948 2025

[8] [8]

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. http://proceedings.mlr.press/v70/haarnoja17a.html Reinforcement learning with deep energy-based policies . In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 , volume 70 of Proceedings of Machine Learning Research...

[9]

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. http://proceedings.mlr.press/v80/haarnoja18b.html Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15,...

[10]

Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, and Yahui Zhou. 2025. https://doi.org/10.48550/ARXIV.2505.22312 Skywork open reasoner 1 technical report. CoRR, abs/2505.22312

[11]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, and 4 others. 2024. https://doi.org/10.48550/ARXIV.2411.15124 Tü...

[12]

Jia LI, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. 2024. Numinamath. [https://huggingface.co/AI-MO/NuminaMath-CoT](https://github.com/project-numina/aimo-progress-prize/blob/main/repo...

[13]

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. https://openreview.net/forum?id=v8L0pN6EOi Let's verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net

[14]

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. 2025. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. Notion Blog

[15]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. http://papers.nips.cc/paper_files/paper/2022/hash/b1efd...

[16]

John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. 2016. http://arxiv.org/abs/1506.02438 High-dimensional continuous control using generalized advantage estimation. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings

[17]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. https://arxiv.org/abs/1707.06347 Proximal policy optimization algorithms. CoRR, abs/1707.06347

[18]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. https://doi.org/10.48550/ARXIV.2402.03300 Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR, abs/2402.03300

[19]

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, Fuzheng Zhang, Kun Gai, and Guorui Zhou. 2025a. https://doi.org/10.48550/ARXIV.2508.07629 Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization. CoRR, abs/2508.07629

[20]

Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong, Jiaming Huang, Wenping Hu, and Guorui Zhou. 2025b. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization. arXiv preprint arXiv:2508.07629

[21]

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, and 1 others. 2024. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122

[22]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, and 16 others. 2025. https://doi.org/10.48550/ARXIV.2503.14476 DAPO: an open-source LLM reinforcement learning system at scale. ...

[23]

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. 2025. https://doi.org/10.48550/ARXIV.2507.18071 Group sequence policy optimization. CoRR, abs/2507.18071
