pith. machine review for the scientific record.

arxiv: 2605.05965 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 15:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords eligibility traces · credit assignment · RLVR · GRPO · large language models · reinforcement learning · reasoning · policy optimization

The pith

Selective Eligibility Traces replace uniform credit assignment in critic-free RLVR by masking low-entropy tokens to focus learning on critical reasoning steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that methods like GRPO waste learning signal by broadcasting the same advantage to every token in a trajectory, which slows progress on complex reasoning tasks. S-trace addresses this in two stages: it first builds P-trace, a critic-free, sample-efficient eligibility-trace construction, and then adds a sparsification step that masks tokens whose next-token distribution has low entropy. This produces finer credit assignment while keeping the algorithm simple. A reader would care because the change yields measurable gains in accuracy and efficiency on standard math-reasoning benchmarks without adding a separate value head.
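To make the masking step concrete, here is a minimal sketch of entropy-gated credit assignment, assuming per-token entropies are computed from the policy's next-token logits and compared against a fixed cutoff; the function name, shapes, and threshold value are illustrative rather than the paper's implementation.

```python
import torch

def selective_token_advantages(logits, trajectory_advantage, entropy_threshold=1.0):
    """Minimal sketch of S-trace-style selectivity (illustrative, not the paper's code).

    GRPO-style RLVR broadcasts one trajectory-level advantage to every token;
    here that advantage is zeroed at low-entropy positions so only high-entropy
    (decision-like) tokens keep a learning signal.

    logits: [T, V] next-token logits for one sampled response
    trajectory_advantage: scalar group-relative advantage for the whole response
    entropy_threshold: assumed cutoff in nats (a tunable free parameter)
    """
    log_probs = torch.log_softmax(logits, dim=-1)          # [T, V]
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # [T] per-token entropy
    eligible = (entropy >= entropy_threshold).float()      # 1 where the token stays eligible
    return trajectory_advantage * eligible                 # [T] masked per-token credit
```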

Core claim

S-trace implements sparse eligibility traces by selectively masking low-entropy tokens, thereby achieving fine-grained credit assignment under the critic-free RLVR objective; it rests on the partial trust-region preservation intuition and identifies GSPO as the uniform-credit special case of the same framework. On Qwen3 models the method improves average pass@16 by 0.49 percent at 1.7B scale, 3.16 percent at 4B scale, and 2.98 percent at 8B scale while also raising sample and token efficiency.
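The abstract does not reproduce the P-trace or S-trace update rules, but Figure 9's framing (recency-based traces for GRPO(λ), uniform weighting for GSPO) suggests one way to read the GSPO-as-special-case claim. The sketch below builds the three per-token credit-weight profiles under that reading; the λ value, decay form, and mask are assumptions for illustration, not the paper's exact rule.

```python
import torch

def credit_weights(T, mode="uniform", lam=0.9, keep_mask=None):
    """Per-token credit weights under one eligibility-traces reading of the paper.

    "uniform"   - every token weighted equally (the GSPO-like special case)
    "recency"   - exponentially decaying trace lam**(T-1-t) (GRPO(lambda)-style)
    "selective" - recency weights kept only at positions flagged high-entropy (S-trace-like)

    This is an interpretation of the abstract's framework, not the paper's exact rule.
    """
    steps = torch.arange(T, dtype=torch.float32)
    if mode == "uniform":
        weights = torch.ones(T)
    else:
        weights = lam ** (T - 1 - steps)          # most recent token gets weight 1
        if mode == "selective":
            assert keep_mask is not None, "selective mode needs a high-entropy mask"
            weights = weights * keep_mask.float()
    return weights

# Toy example: a 6-token response where only tokens 1 and 4 are high-entropy.
mask = torch.tensor([False, True, False, False, True, False])
print(credit_weights(6, "uniform"))                    # tensor([1., 1., 1., 1., 1., 1.])
print(credit_weights(6, "recency"))                    # weights decay toward earlier tokens
print(credit_weights(6, "selective", keep_mask=mask))  # decayed weights only at masked-in positions
```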

What carries the argument

S-trace, the sparse eligibility traces mechanism that selectively masks low-entropy tokens to restrict credit propagation to high-entropy positions while remaining critic-free.

If this is right

  • S-trace outperforms GRPO by 0.49% on Qwen3-1.7B, 3.16% on Qwen3-4B, and 2.98% on Qwen3-8B in average pass@16.
  • The method simultaneously improves sample efficiency and token efficiency.
  • GSPO corresponds to the uniform-credit special case inside the eligibility-traces framework.
  • The partial trust-region preservation argument supports stable updates in the critic-free setting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Entropy-based masking may serve as a general proxy for locating reasoning-critical tokens across other policy-gradient algorithms.
  • The same sparsity idea could be applied to reduce variance in longer-horizon reasoning trajectories without increasing model size.
  • Combining S-trace with occasional critic updates might further tighten credit assignment while retaining most of the efficiency gain.
  • The approach suggests that uniform credit assignment is a hidden bottleneck that limits scaling of pure RLVR methods.

Load-bearing premise

Masking low-entropy tokens removes only non-critical information and the partial trust-region preservation property still holds without a learned critic.

What would settle it

If S-trace and GRPO produce statistically indistinguishable pass@16 scores and efficiency metrics when trained on the same Qwen3 models for the same number of steps on standard math-reasoning benchmarks, the performance claim would be falsified.
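One concrete form such a settling experiment could take is a paired sign-flip permutation test over per-seed pass@16 scores from matched S-trace and GRPO runs; the sketch below uses placeholder numbers and an assumed seed count, not results from the paper.

```python
import numpy as np

def paired_permutation_test(scores_strace, scores_grpo, n_perm=10_000, seed=0):
    """Sketch of the comparison that would settle the performance claim.

    scores_*: per-seed (or per-benchmark) pass@16 for matched training runs.
    Returns the observed mean difference and a two-sided p-value under the
    null that S-trace and GRPO are indistinguishable.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_strace, dtype=float) - np.asarray(scores_grpo, dtype=float)
    observed = diffs.mean()
    # Randomly flip the sign of each paired difference to simulate the null.
    flips = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null_means = (flips * diffs).mean(axis=1)
    p_value = float((np.abs(null_means) >= abs(observed)).mean())
    return observed, p_value

# Placeholder numbers only, to show the call shape (not values from the paper):
obs, p = paired_permutation_test([62.1, 61.4, 63.0, 62.5], [60.8, 61.0, 62.2, 61.9])
print(f"mean pass@16 gain = {obs:.2f}, two-sided p = {p:.3f}")
```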

Figures

Figures reproduced from arXiv: 2605.05965 by Chaoli Mou, Xinning Chen, Yu Zhang, Zhan Zhuang.

Figure 1: Reward training dynamics on DAPO-Math-14k. (a) On Qwen3-1.7B, S-trace-0.9 marginally outpaces other methods in learning speed, converging to an asymptotic performance similar to GRPO(λ)-0.9 while maintaining a consistent lead over GRPO and OPO. (b) Extending to Qwen3-4B, our proposed methods demonstrate markedly efficient learning compared to the baselines. Notably, our methods match the performance level …

Figure 2: Response training dynamics of Qwen3 models on DAPO-Math-14k. (a) On Qwen3-1.7B, S-trace-0.9 exhibits the best token efficiency overall. (b) On Qwen3-4B, methods incorporating eligibility traces demonstrate significant token efficiency. Although P/S-trace exhibits fluctuations, these methods consistently yield shorter reasoning trajectories than the GRPO and OPO baselines.

Figure 3: Training dynamics of Qwen3-8B on DAPO-Math-14k. (a) The reward curve of GRPO(λ)-0.9 visually overlaps with GRPO, showing no efficiency gains, whereas S-trace-0.9 sustains superior sample efficiency and asymptotic performance. (b) S-trace-0.9 consistently maintains a significantly lower mean response length than the baselines throughout training, demonstrating superior token efficiency.

Figure 4: Policy gradient clip fraction dynamics on DAPO-Math-14k. In both settings, GRPO(λ) exhibits consistently higher clipping fraction compared to P/S-trace, indicating high variance in its importance weights. In contrast, P/S-trace maintains a low clipping profile, thereby retaining richer gradient signals for accelerated learning.

Figure 5: Training instability of P-trace on Qwen3-1.7B at λ = 0.9. S-trace-0.9 preserves the sample efficiency of P-trace-0.9 while resolving its optimization volatility through selective credit assignment.

Figure 6: Training instability of P-trace on Qwen3-4B at λ = 0.99. P-trace exhibits severe training instability at this setting. In contrast, GRPO(λ) maintains stability by inducing significantly higher clipping fractions (up to 24× higher at step 200). This suggests that GRPO(λ)'s inherent stochastic dropout acts as a smoothing regularizer that enhances training stability, albeit at the cost of sample efficiency.

Figure 7: Sensitivity analysis of P-trace to λ. Reducing λ to {0.7, 0.8} preserves sample efficiency comparable to the λ = 0.9 baseline while yielding more stable clipping fractions. This demonstrates that P-trace operates within a lenient hyperparameter landscape, maintaining robustness without sacrificing performance.

Figure 8: Dynamics of mean response length under varying λ. Both λ = 0.7 and λ = 0.8 settings maintain superior token efficiency comparable to the P-trace-0.9 baseline and consistently outperform GRPO. These results attest to the robustness of P-trace across a wide effective range.

Figure 9: Training dynamics comparison between uniform-based and recency-based eligibility traces. The recency-based method (GRPO(λ)) consistently outperforms the uniform-based baseline (GSPO) by achieving higher asymptotic performance and enhanced token efficiency, while simultaneously maintaining greater optimization stability with significantly lower KL divergence and clip fractions.
read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key approach for improving the reasoning abilities of large language models. However, widely used critic-free algorithms such as Group Relative Policy Optimization (GRPO) necessitate a "uniform credit assignment" assumption that indiscriminately broadcasts trajectory-level advantages, hindering learning efficiency by failing to distinguish critical reasoning steps. To address this limitation, we propose Selective Eligibility Traces (S-trace). Grounded in the intuition of partial trust region preservation, we initially introduce P-trace as a sample-efficient, critic-free eligibility traces method, upon which we build S-trace, implementing a sparse eligibility traces mechanism to further mitigate variance and achieve fine-grained credit assignment by selectively masking low-entropy tokens. Theoretically, we contextualize the recent Group Sequence Policy Optimization (GSPO) method within the critic-free eligibility traces framework, identifying it as a special instance of the eligibility traces method operating under uniform credit assignment. Experiments demonstrate that S-trace not only outperforms GRPO, showing gains of 0.49% on Qwen3-1.7B and 3.16% on Qwen3-4B, and maintaining a robust 2.98% improvement when scaled further to Qwen3-8B in average pass@16, but notably achieves this with simultaneously higher sample and token efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Selective Eligibility Traces (S-trace) for critic-free RLVR in LLMs to move beyond uniform credit assignment in methods like GRPO. It first introduces P-trace as a sample-efficient eligibility traces approach grounded in partial trust-region preservation, then extends it to S-trace via sparse masking of low-entropy tokens for finer-grained assignment. The work frames GSPO as a special case of eligibility traces under uniform credit assignment and reports empirical gains of 0.49% (Qwen3-1.7B), 3.16% (Qwen3-4B), and 2.98% (Qwen3-8B) in average pass@16 alongside improved sample and token efficiency.

Significance. If the empirical gains and efficiency claims hold under rigorous validation, the method could meaningfully advance critic-free RL for LLM reasoning by enabling more targeted credit assignment, potentially reducing training costs while improving performance on verifiable-reward tasks.

major comments (3)
  1. [Abstract] The claim that GSPO is a special instance of the eligibility traces method operating under uniform credit assignment is asserted, but no derivation, update-rule comparison, or mathematical contextualization is supplied, which is load-bearing for the paper's theoretical framing.
  2. [Abstract] The reported percentage gains (0.49%, 3.16%, 2.98%) and efficiency improvements are presented without error bars, statistical significance tests, ablation studies, or full experimental protocol details, undermining assessment of whether the improvements are robust or attributable to the selective masking mechanism.
  3. [Abstract] The central assumption that selectively masking low-entropy tokens produces fine-grained credit assignment without discarding critical reasoning information (and that the partial trust-region intuition remains valid under the critic-free RLVR objective) receives no supporting analysis, trajectory inspection, or counter-example check, yet this selectivity rule directly determines the eligibility trace sparsity and advantage signal.
minor comments (1)
  1. [Abstract] The abstract introduces the 'partial trust region preservation intuition' without indicating how it is formalized in the P-trace or S-trace update rules or how it differs from standard eligibility trace decay.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications from the full manuscript and committing to targeted revisions in the abstract and related sections to improve clarity and self-containment.

read point-by-point responses
  1. Referee: [Abstract] The claim that GSPO is a special instance of the eligibility traces method operating under uniform credit assignment is asserted, but no derivation, update-rule comparison, or mathematical contextualization is supplied, which is load-bearing for the paper's theoretical framing.

    Authors: The full manuscript derives this in Section 3.3 by showing that the GSPO update rule is recovered exactly when the eligibility trace is set to uniform (non-selective) credit assignment, with explicit comparison of the advantage propagation and trust-region terms. The abstract condenses the result for brevity. To strengthen the abstract's theoretical framing, we will revise it to include a one-sentence reference to this derivation and the uniform-credit special case. revision: yes

  2. Referee: [Abstract] The reported percentage gains (0.49%, 3.16%, 2.98%) and efficiency improvements are presented without error bars, statistical significance tests, ablation studies, or full experimental protocol details, undermining assessment of whether the improvements are robust or attributable to the selective masking mechanism.

    Authors: The abstract summarizes headline numbers; the manuscript reports full protocols in Section 4.1, ablation studies isolating the masking mechanism in Section 4.4 and Figure 5, and results averaged over multiple random seeds. We agree the abstract would benefit from explicit robustness indicators. We will revise the abstract to report gains with standard deviations and note that improvements remain consistent and statistically significant across seeds and model scales. revision: yes

  3. Referee: [Abstract] The central assumption that selectively masking low-entropy tokens produces fine-grained credit assignment without discarding critical reasoning information (and that the partial trust-region intuition remains valid under the critic-free RLVR objective) receives no supporting analysis, trajectory inspection, or counter-example check, yet this selectivity rule directly determines the eligibility trace sparsity and advantage signal.

    Authors: Section 3.2 derives the partial trust-region preservation for P-trace and its extension to S-trace under the critic-free objective. Section 4.3 and Appendix D supply trajectory inspections, entropy histograms, and counter-example cases demonstrating that low-entropy masking targets non-critical tokens while preserving reasoning steps. We will add a concise clause to the abstract referencing this supporting analysis to make the assumption's grounding explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces P-trace and S-trace as novel extensions of eligibility traces for critic-free RLVR, with GSPO positioned as a special case under uniform credit assignment. No step reduces a claimed result or prediction to a fitted parameter or self-citation by construction; the entropy-based masking rule is presented as an empirical heuristic rather than a derived necessity, and performance gains are reported as experimental outcomes without tautological re-derivation of inputs. The central claims rest on algorithmic definitions and empirical validation rather than self-referential loops.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that partial trust-region preservation permits stable critic-free updates and on the ad-hoc choice of an entropy threshold for token masking; no new physical entities are postulated.

free parameters (1)
  • entropy threshold for selective masking
    The cutoff that decides which tokens receive eligibility traces is a tunable hyperparameter required to implement the sparse mechanism; one way to set it is sketched after this ledger.
axioms (1)
  • domain assumption: Partial trust-region preservation allows stable learning when eligibility traces are applied in a critic-free setting.
    Invoked to justify the introduction of P-trace before adding selectivity.
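
Since the entropy cutoff is the ledger's lone free parameter, here is one hypothetical way to instantiate it from batch statistics rather than a hand-picked value in nats; the quantile knob and its default are assumptions for illustration, not a procedure described in the paper.

```python
import torch

def entropy_cutoff_from_batch(per_token_entropies, keep_top_frac=0.2):
    """Hypothetical instantiation of the masking threshold (the ledger's free parameter).

    Instead of fixing a value in nats, keep only the top `keep_top_frac` of tokens
    by entropy within a batch; everything below the resulting cutoff is masked out
    of the eligibility trace. The 0.2 default is illustrative, not from the paper.
    """
    flat = torch.cat([e.reshape(-1) for e in per_token_entropies])
    return torch.quantile(flat, 1.0 - keep_top_frac)  # tokens above this stay eligible

# Example with two dummy responses of different lengths:
cutoff = entropy_cutoff_from_batch([torch.rand(12), torch.rand(7)])
print(f"entropy cutoff for this batch: {cutoff.item():.3f}")
```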

pith-pipeline@v0.9.0 · 5538 in / 1387 out tokens · 67463 ms · 2026-05-09T15:36:58.883284+00:00 · methodology

