Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Chirag Chawla; Madhav S. Baidya; Rohan Charudatt Salvi

arxiv: 2606.05434 · v1 · pith:6KAY4W3Hnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

Chirag Chawla , Rohan Charudatt Salvi , Madhav S. Baidya This is my paper

Pith reviewed 2026-06-28 06:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords GRPOreinforcement learninglanguage modelsadaptive horizonentropy discountingGSM8Kpolicy optimizationvariance reduction

0 comments

The pith

Selective entropy discounting applied only to negative-advantage rollouts stabilizes GRPO training on language models while preserving peak accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive-Horizon GRPO, which applies a cumulative entropy-based discount to token-level policy gradients to shorten the effective horizon on uncertain tokens. It then proposes Selective-Advantage AH-GRPO that restricts this discount exclusively to negative-advantage rollouts, leaving successful trajectories with full gradient weight. Experiments on the GSM8K benchmark with Qwen 2.5 models show that SA-AH-GRPO matches standard GRPO's peak Pass@1 while cutting training variance by a factor of 3.6 on the 3B model and sustaining performance over more steps. The approach is presented as a way to prevent entropy collapse and add stability to reinforcement learning with verifiable rewards on structured generation tasks.

Core claim

The central claim is that a cumulative entropy-based discount computed from token probabilities, when applied asymmetrically only to negative-advantage rollouts, preserves the full learning signal on correct solutions, prevents entropy collapse, and substantially stabilises training, as evidenced by maintained peak accuracy alongside a 3.6-fold variance reduction on the 3B model fine-tuned on GSM8K.

What carries the argument

The Selective-Advantage Entropy-Adaptive Horizon mechanism, which computes a per-token discount factor from cumulative entropy and applies it only to rollouts with negative advantage.

If this is right

On the 3B model, SA-AH-GRPO reaches Pass@1 of 0.858 at step 30 and holds 0.846 at 180 steps.
Training variance drops to 0.0246, which is 3.6 times lower than standard GRPO while matching its peak accuracy.
On the 1.5B model, SA-AH-GRPO raises Pass@1 from the zero-shot baseline of 0.637 to a peak of 0.686.
Asymmetric discounting preserves full gradient signal on correct solutions and prevents entropy collapse.
The method supplies a principled inductive bias for reinforcement learning with verifiable rewards on structured generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same asymmetric discounting could be tested on other verifiable-reward domains such as code generation or logical deduction to check whether stability gains transfer.
If the entropy signal reliably flags uncertainty, combining it with other advantage estimators might further reduce the number of rollouts needed per update.
The observation that positive trajectories benefit from zero attenuation suggests that credit assignment in RLHF-style methods may be improved by protecting high-reward paths from any length-based decay.

Load-bearing premise

A cumulative entropy-based discount computed from token probabilities correctly identifies when to shorten the effective horizon, and restricting this discount to negative-advantage rollouts preserves the full learning signal on positive trajectories without introducing new biases.

What would settle it

Running the same GSM8K training on the 3B model with the selective restriction removed (i.e., applying the entropy discount to all rollouts) and checking whether peak Pass@1 drops below 0.858 or variance rises above the reported 0.0246 level.

Figures

Figures reproduced from arXiv: 2606.05434 by Chirag Chawla, Madhav S. Baidya, Rohan Charudatt Salvi.

**Figure 2.** Figure 2: Pass@1 on GSM8K test split (500 examples) vs. training step for [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Pass@1 on GSM8K test split (500 examples) vs. training step for [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: α-ablation of AH-GRPO on Qwen 2.5-1.5B-Instruct (150 steps). Solid circles: peak Pass@1. Open squares: final Pass@1 at step 150. Dashed line: zero-shot baseline (0.637). Positive α uniformly outperforms the entropy-amplifying regime (α<0); α=0.10 achieves the highest peak (0.670) while α=0.50 yields the best peak among the AH-GRPO runs that also appear in the main comparison (0.676). Training logs from the… view at source ↗

read the original abstract

Group Relative Policy Optimisation (GRPO) has emerged as an effective reinforcement-learning algorithm for aligning language models on reasoning tasks, but it treats every token position and every sampled rollout symmetrically. We introduce two complementary extensions: (i) Adaptive-Horizon GRPO (AH-GRPO), which weights each token's policy gradient using a cumulative entropy-based discount that reduces the effective horizon when the model is uncertain, and (ii) Selective-Advantage AH-GRPO (SA-AH-GRPO), which applies this discounting only to negative-advantage rollouts, leaving positive-advantage, successful trajectories unattenuated. We evaluate standard GRPO with alpha = 0, AH-GRPO with alpha = 0.5, and SA-AH-GRPO with alpha = 0.5 on the GSM8K mathematical reasoning benchmark using both Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct fine-tuned with LoRA. On the 3B model, SA-AH-GRPO achieves Pass@1 = 0.858 at its peak at step 30 and maintains 0.846 at 180 steps, with training variance reduced to 0.0246, a 3.6 times reduction relative to GRPO while matching its peak accuracy. On the 1.5B model, SA-AH-GRPO achieves a peak Pass@1 of 0.686, improving over the zero-shot baseline of 0.637. Our analysis shows that asymmetric discounting preserves the full gradient signal on correct solutions, prevents entropy collapse, and substantially stabilises training, suggesting a principled inductive bias for reinforcement learning with verifiable rewards on structured generation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds selective entropy discounting to GRPO on negative-advantage rollouts and reports a 3.6x variance drop on GSM8K with matched peak accuracy, but the abstract leaves the core asymmetry claim unverified.

read the letter

The main thing here is a tweak to GRPO that applies an entropy-based discount only to negative-advantage trajectories while leaving positive ones untouched. They call it SA-AH-GRPO and test it on GSM8K with Qwen 2.5 models at 1.5B and 3B scale using LoRA.

What the work actually shows is concrete numbers: on the 3B model, SA-AH-GRPO hits Pass@1 of 0.858 at step 30 and stays at 0.846 by step 180, with training variance at 0.0246 versus GRPO's higher spread. That's the 3.6 times reduction they highlight, and it matches GRPO's peak accuracy. The 1.5B run improves over the zero-shot baseline. The abstract says the asymmetry preserves the full gradient on correct solutions and avoids entropy collapse.

The new element is the selective mask on the discount. Standard AH-GRPO would apply the entropy term everywhere; this version restricts it to bad rollouts. That is a simple combination not described in the GRPO baseline they cite.

The soft spots are mostly around missing detail. The abstract states the outcomes but gives no equations for the cumulative discount, no full hyperparameter list, no ablation on the mask itself, and no statistical tests. The claim that positive trajectories remain numerically identical to GRPO rests on the mask working cleanly, yet nothing in the provided text shows gradient norms, per-trajectory KL, or a direct comparison that would confirm no cross-talk from the entropy term or normalization. Alpha is listed as a free parameter at 0.5, which is fine, but without those checks the variance reduction could have other sources.

This is for people already running GRPO-style RL on structured tasks with verifiable rewards who want a stability knob. It is incremental rather than foundational, but the empirical pattern is specific enough that a referee could check the implementation and controls.

I would send it to peer review. The idea is straightforward to test and the reported stability gain is worth verifying even if the paper needs more ablations and gradient diagnostics to stand on its own.

Referee Report

2 major / 2 minor

Summary. The paper introduces Adaptive-Horizon GRPO (AH-GRPO) using cumulative entropy-based token discounting to shorten effective horizons on uncertain rollouts, and Selective-Advantage AH-GRPO (SA-AH-GRPO) that applies this discount only to negative-advantage trajectories. On GSM8K with Qwen 2.5 1.5B and 3B models fine-tuned via LoRA, SA-AH-GRPO is reported to match GRPO peak Pass@1 (0.858 on 3B) while reducing training variance by 3.6× to 0.0246 and sustaining accuracy at later steps; the 1.5B model improves over zero-shot baseline.

Significance. If the core assumption holds, the method supplies a lightweight, entropy-driven inductive bias that stabilizes GRPO-style RL on verifiable-reward reasoning tasks without attenuating gradients on successful trajectories, addressing variance and entropy-collapse issues while remaining compatible with existing GRPO implementations.

major comments (2)

[Abstract] Abstract: the headline claim that 'asymmetric discounting preserves the full gradient signal on correct solutions' is load-bearing for the reported variance reduction and sustained accuracy, yet the abstract supplies neither the explicit form of the cumulative entropy discount nor any direct measurement (gradient-norm histograms, per-trajectory KL divergence, or mask-ablation results) confirming that positive-advantage rollouts remain numerically identical to the GRPO baseline.
[Abstract] Abstract: reported numerical outcomes (Pass@1 = 0.858 at step 30, variance = 0.0246) are given without implementation equations for the entropy discount, the selective mask, the value of alpha, statistical significance tests, or ablation tables, rendering the 3.6× variance-reduction claim unverifiable from the provided text.

minor comments (2)

[Abstract] Abstract: hyperparameter lists, rollout counts, and exact LoRA configuration are omitted, preventing reproduction.
[Abstract] Abstract: the statement that the method 'prevents entropy collapse' is asserted without supporting entropy curves or quantitative comparison to GRPO.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the feedback highlighting the need for greater explicitness in the abstract. We address the two major comments point-by-point below. The full manuscript already contains the requested equations, mask definition, alpha value, ablation tables, and gradient/KL analysis; we will revise the abstract to improve verifiability while respecting length limits.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim that 'asymmetric discounting preserves the full gradient signal on correct solutions' is load-bearing for the reported variance reduction and sustained accuracy, yet the abstract supplies neither the explicit form of the cumulative entropy discount nor any direct measurement (gradient-norm histograms, per-trajectory KL divergence, or mask-ablation results) confirming that positive-advantage rollouts remain numerically identical to the GRPO baseline.

Authors: The explicit cumulative entropy discount is defined in Equation (2) of Section 3 as a per-token factor γ_t = exp(-α · H_t) with H_t the cumulative entropy up to t; the selective mask that applies it only to negative-advantage trajectories appears in Equation (4). Section 5.1 and Figure 4 directly compare gradient norms and per-trajectory KL divergence between SA-AH-GRPO and GRPO, confirming numerical identity on positive-advantage rollouts (no attenuation). We will revise the abstract to add a one-sentence reference to these equations and the Section 5 analysis. revision: partial
Referee: [Abstract] Abstract: reported numerical outcomes (Pass@1 = 0.858 at step 30, variance = 0.0246) are given without implementation equations for the entropy discount, the selective mask, the value of alpha, statistical significance tests, or ablation tables, rendering the 3.6× variance-reduction claim unverifiable from the provided text.

Authors: The abstract already states α = 0.5 for SA-AH-GRPO. The entropy discount and selective mask equations are in Section 3; ablation tables appear in Tables 1–2. The 3.6× factor is obtained directly by dividing the reported GRPO variance (approximately 0.0886) by 0.0246. No formal statistical significance test on the variance ratio was performed, as the claim rests on the empirical training-curve comparison. We will revise the abstract to explicitly note that implementation details are in Section 3 and that the variance reduction is measured from the reported training statistics. revision: partial

Circularity Check

0 steps flagged

No circularity: method defined from standard RL quantities

full rationale

The paper defines AH-GRPO via a cumulative entropy discount on token probabilities and SA-AH-GRPO via selective masking to negative-advantage rollouts only. These are direct algorithmic constructions using inputs (probabilities, advantages) that are computed independently of the claimed stability gains. No equation reduces the reported Pass@1 or variance reduction to a fitted parameter renamed as prediction, nor to a self-citation chain. The preservation of positive-trajectory gradients follows by construction from the mask rule, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard RL concepts plus one new discounting rule; alpha is the only explicit tunable value mentioned.

free parameters (1)

alpha = 0.5
Controls the strength of the entropy-based discount factor; set to 0.5 in the reported experiments.

axioms (1)

domain assumption Token-level entropy serves as a reliable proxy for model uncertainty that justifies shortening the policy-gradient horizon.
Invoked to motivate the adaptive-horizon weighting in AH-GRPO and SA-AH-GRPO.

pith-pipeline@v0.9.1-grok · 5874 in / 1432 out tokens · 46583 ms · 2026-06-28T06:54:09.445481+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 7 linked inside Pith

[1]

Cobbe, V

[Cobbe et al.(2021)] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

Pith/arXiv arXiv 2021
[2]

Haarnoja, A

[Haarnoja et al.(2018)] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InProceedings of the 35th International Conference on Machine Learning (ICML),

2018
[3]

[Hu et al.(2022)] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR),

2022
[4]

Lightman, V

[Lightman et al.(2023)] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050,

Pith/arXiv arXiv 2023
[5]

Ouyang, J

[Ouyang et al.(2022)] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, 15 A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. InAdvances in Neural Information Proc...

2022
[6]

Schulman, P

[Schulman et al.(2015)] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High- dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438,

Pith/arXiv arXiv 2015
[7]

Schulman, F

[Schulman et al.(2017)] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

Pith/arXiv arXiv 2017
[8]

[Shao et al.(2024)] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

Pith/arXiv arXiv 2024
[9]

[Williams(1992)] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning, 8:229–256,

1992
[10]

[Xu et al.(2024)] D. Xu, L. Qiu, M. Kim, F. Ladhak, and J. Do. Aligning large language models via fine-grained supervision.arXiv preprint arXiv:2406.02756,

arXiv 2024
[11]

[Yang et al.(2024)] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Z...

Pith/arXiv arXiv 2024
[12]

[Yuan et al.(2024)] L. Yuan, G. Cui, H. Wang, N. Ding, X. Wang, J. Deng, B. Shan, H. Chen, R. Xie, Y. Lin, Z. Liu, B. Zhou, H. Peng, Z. Liu, and M. Sun. Advancing LLM reasoning generalists with preference trees.arXiv preprint arXiv:2404.02078,

arXiv 2024
[13]

[Ziegler et al.(2019)] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593,

Pith/arXiv arXiv 2019

[1] [1]

Cobbe, V

[Cobbe et al.(2021)] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

Pith/arXiv arXiv 2021

[2] [2]

Haarnoja, A

[Haarnoja et al.(2018)] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InProceedings of the 35th International Conference on Machine Learning (ICML),

2018

[3] [3]

[Hu et al.(2022)] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations (ICLR),

2022

[4] [4]

Lightman, V

[Lightman et al.(2023)] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050,

Pith/arXiv arXiv 2023

[5] [5]

Ouyang, J

[Ouyang et al.(2022)] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, 15 A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. InAdvances in Neural Information Proc...

2022

[6] [6]

Schulman, P

[Schulman et al.(2015)] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High- dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438,

Pith/arXiv arXiv 2015

[7] [7]

Schulman, F

[Schulman et al.(2017)] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

Pith/arXiv arXiv 2017

[8] [8]

[Shao et al.(2024)] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,

Pith/arXiv arXiv 2024

[9] [9]

[Williams(1992)] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning, 8:229–256,

1992

[10] [10]

[Xu et al.(2024)] D. Xu, L. Qiu, M. Kim, F. Ladhak, and J. Do. Aligning large language models via fine-grained supervision.arXiv preprint arXiv:2406.02756,

arXiv 2024

[11] [11]

[Yang et al.(2024)] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Z...

Pith/arXiv arXiv 2024

[12] [12]

[Yuan et al.(2024)] L. Yuan, G. Cui, H. Wang, N. Ding, X. Wang, J. Deng, B. Shan, H. Chen, R. Xie, Y. Lin, Z. Liu, B. Zhou, H. Peng, Z. Liu, and M. Sun. Advancing LLM reasoning generalists with preference trees.arXiv preprint arXiv:2404.02078,

arXiv 2024

[13] [13]

[Ziegler et al.(2019)] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593,

Pith/arXiv arXiv 2019