pith. machine review for the scientific record.

arxiv: 2604.16995 · v1 · submitted 2026-04-18 · 💻 cs.CL · cs.LG

Recognition: unknown

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

Yifu Huo, Chenglong Wang, Ziming Zhu, Shunjie Xing, Peinan Feng, Tongran Liu, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Zhengtao Yu, Tong Xiao


Pith reviewed 2026-05-10 06:54 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords reinforcement learning · inverse reinforcement learning · large language models · exploration · reasoning · Pass@k · trajectory distribution · probability squeezing

The pith

SPS interleaves standard RL with inverse RL on its own rollouts to counteract probability squeezing and expand exploration in reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard reinforcement learning improves single-answer accuracy but squeezes probability mass onto a narrow set of high-reward trajectories, limiting the diversity needed for multi-sample performance. The paper proposes Steering Probability Squeezing, which alternates conventional RL updates with inverse RL steps that treat the model's current on-policy rollouts as demonstrations. This reshaping step redistributes probability across a broader set of valid trajectories without any external labels or rewards. Experiments on five reasoning benchmarks show gains in Pass@k, and the work also identifies an empirical upper bound on attainable Pass@k under pure RL. The central mechanism therefore offers a self-supervised route to sustained exploration inside the same training loop.
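
The squeezing effect is easy to reproduce in a toy setting. The simulation below is an editorial illustration, not the authors' code: REINFORCE updates on a softmax policy over five discrete "trajectories", three of which earn identical reward, tend to collapse probability mass onto the initially favored one.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.8, 0.6, 0.0, 0.0])   # slight initial imbalance
reward = np.array([1.0, 1.0, 1.0, 0.0, 0.0])   # three equally valid answers
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(200):
    p = softmax(logits)
    i = rng.choice(len(p), p=p)        # sample one trajectory on-policy
    baseline = p @ reward              # expected reward as a baseline
    grad = -p                          # d log p_i / d logits = e_i - p
    grad[i] += 1.0
    logits += lr * (reward[i] - baseline) * grad

print(np.round(softmax(logits), 3))
# Rewarded mass concentrates far beyond the initial imbalance, typically
# mostly onto trajectory 0, even though trajectories 1 and 2 earn the
# same reward: the squeezing effect in miniature.
```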

Core claim

The limitation in RL for reasoning LLMs arises from a squeezing effect that concentrates probability mass on a few trajectories. SPS counters this by interleaving RL with IRL steps that use the model's own on-policy rollouts as demonstrations to explicitly reshape the induced trajectory distribution, thereby increasing exploration and raising Pass@k without external supervision.

What carries the argument

Steering Probability Squeezing (SPS), the interleaving of RL optimization with IRL steps that treat current on-policy rollouts as demonstrations to reshape the trajectory probability distribution.
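
As an editorial sketch of that interleaving, assuming a GRPO/PPO-style update behind `policy.update` and a reward model fit to rollouts in the IRL phase, the loop might look like the skeleton below. Every name here (`fit_irl_reward`, `policy.sample`, and so on) is a hypothetical placeholder; the actual objectives and schedule are specified in the paper's §3.

```python
# Hypothetical sketch of SPS-style interleaving (not the authors' code);
# the real update rules, IRL objective, and schedule are given in §3.

def fit_irl_reward(demos):
    """Crude stand-in for the IRL step: reward every demonstrated
    trajectory equally. The paper instead derives a reward from the
    rollouts; this placeholder only marks the demonstrated support."""
    demo_set = set(demos)
    return lambda traj: 1.0 if traj in demo_set else 0.0

def sps_training_loop(policy, prompts, rule_reward,
                      rounds=10, rl_steps=100, irl_steps=20):
    for _ in range(rounds):
        # Phase 1: conventional RL against the rule-based reward.
        for _ in range(rl_steps):
            rollouts = policy.sample(prompts)   # on-policy trajectories
            policy.update(rollouts, [rule_reward(t) for t in rollouts])

        # Phase 2: IRL on the model's own rollouts, treated as
        # demonstrations; training against the derived reward reshapes
        # the trajectory distribution rather than chasing the rule reward.
        reward_model = fit_irl_reward(policy.sample(prompts))
        for _ in range(irl_steps):
            rollouts = policy.sample(prompts)
            policy.update(rollouts, [reward_model(t) for t in rollouts])
    return policy
```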

If this is right

  • SPS raises Pass@k on standard reasoning benchmarks while preserving or improving Pass@1.
  • The approach requires no additional labeled data or external reward models beyond the original rule-based signals.
  • RL learning dynamics exhibit an empirical upper bound on Pass@k that pure RL training cannot exceed.
  • Alternating RL and IRL phases provides a practical pathway for extending exploration capacity in reasoning-oriented models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identified upper bound on Pass@k implies that some exploration ceilings may be intrinsic to the RL objective itself rather than fixable by hyperparameter tuning alone.
  • SPS-style alternation could be tested in non-reasoning RL settings where mode collapse similarly restricts output diversity.
  • If the reshaping step generalizes, it may reduce the need for separate exploration bonuses or entropy regularization in LLM RL pipelines.

Load-bearing premise

The primary obstacle to exploration is probability squeezing, and this squeezing can be reversed by applying IRL to the model's own rollouts without introducing new biases or supervision.

What would settle it

Apply SPS to a reasoning task and observe no increase in the number of distinct high-reward trajectories sampled, and no improvement in Pass@k, relative to standard RL training.
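
For concreteness, Pass@k is conventionally estimated with the unbiased estimator of Chen et al. (2021): with n samples per problem of which c are correct, Pass@k = 1 − C(n−c, k)/C(n, k). A minimal implementation:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples
    with c correct (Chen et al., 2021), computed as a stable product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(64, 5, 1), pass_at_k(64, 5, 16))   # ≈ 0.078, ≈ 0.775
```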

Figures

Figures reproduced from arXiv: 2604.16995 by Chenglong Wang, Jingbo Zhu, Peinan Feng, Qiaozhi He, Shunjie Xing, Tianhua Zhou, Tongran Liu, Tong Xiao, Xiaojia Chang, Yifu Huo, Zhengtao Yu, Ziming Zhu.

Figure 1. Illustration of the squeezing effect. y*_n denotes the sequence that dominates the output distribution (i.e., the sequence consistently sampled by greedy decoding). Subfigure (a) shows the normal RL case, where probability mass shifts along the gradient direction. Subfigure (b) shows that when the distribution is already imbalanced, the updates further concentrate probability mass into the dominant peak, a …

Figure 2. Partial results of the preliminary study. Subfigures (a) and (b) show the effect of GRPO on average …

Figure 3. An overview of our SPS approach. Our training pipeline follows an iterative loop consisting of two …

Figure 4. Impact of the sampling size on SPS performance …

Figure 5. Accuracy distribution variation of Qwen2.5-Math-1.5B …
original abstract

Reinforcement learning (RL) has emerged as a promising paradigm for training reasoning-oriented models by leveraging rule-based reward signals. However, RL training typically tends to improve single-sample success rates (i.e., Pass@1) while offering limited exploration of diverse reasoning trajectories, which is crucial for multi-sample performance (i.e., Pass@k). Our preliminary analysis reveals that this limitation stems from a fundamental squeezing effect, whereby probability mass is excessively concentrated on a narrow subset of high-reward trajectories, restricting genuine exploration and constraining attainable performance under RL training. To address this issue, in this work, we propose Steering Probability Squeezing (SPS), a training paradigm that interleaves conventional RL with inverse reinforcement learning (IRL). SPS treats on-policy rollouts as demonstrations and employs IRL to explicitly reshape the induced trajectory distribution, thereby enhancing exploration without introducing external supervision. Experiments on five commonly used reasoning benchmarks demonstrate that SPS can enable better exploration and improve Pass@k. Beyond algorithmic contributions, we provide an analysis of RL learning dynamics and identify an empirical upper bound on Pass@k, shedding light on intrinsic exploration limits in RL-based reasoning models. Our findings suggest that alternating between RL and IRL offers an effective pathway toward extending the exploration capacity of reasoning-oriented large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Steering Probability Squeezing (SPS), a training paradigm that interleaves standard RL with inverse reinforcement learning (IRL) for reasoning-oriented LLMs. It identifies a 'squeezing effect' in which RL concentrates probability mass on a narrow set of high-reward trajectories, limiting Pass@k. SPS treats the model's own on-policy rollouts as IRL demonstrations to reshape the induced trajectory distribution and improve exploration without external supervision. Experiments on five reasoning benchmarks report gains in Pass@k, accompanied by an analysis of RL dynamics and an empirical upper bound on attainable Pass@k.

Significance. If the central mechanism is shown to genuinely enlarge trajectory support rather than merely reweight within it, the work would offer a practical, supervision-free route to better exploration in RL for LLM reasoning. The empirical upper bound on Pass@k constitutes a concrete, falsifiable contribution that can guide future analyses of intrinsic limits. The interleaving approach and benchmark results, if robust, would be of interest to the RL-for-LLMs community.

major comments (2)
  1. [§3] §3 (SPS algorithm): The central claim that IRL applied to on-policy rollouts 'explicitly reshape[s] the induced trajectory distribution' to enhance exploration lacks a supporting argument or metric showing expansion of support. Because the demonstrations are drawn from the current (already squeezed) policy and the underlying reward is rule-based, standard IRL reduces to reweighting trajectories already present in the support; no divergence term, new sampling procedure, or proof of increased effective support is provided to counter the skeptic's concern that the method inherits the same exploration limit.
  2. [§5] §5 (Experiments): The reported Pass@k improvements are presented without statistical significance tests, variance across random seeds, or ablations that isolate the IRL component from the choice of interleaving frequency and weighting. Without these controls it is impossible to determine whether gains arise from the claimed distribution reshaping or from incidental effects of the interleaving schedule.
minor comments (2)
  1. [§2] The preliminary analysis of the squeezing effect (likely §2) would benefit from an explicit definition of the trajectory distribution and a quantitative measure (e.g., entropy or support size) before and after RL steps; a sketch of such a measure follows these comments.
  2. [§3] Notation for the IRL objective and the combined RL+IRL update rule should be unified with the RL objective to avoid ambiguity in how the two steps interact.
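
A minimal sketch of the diversity measure the first minor comment asks for: empirical entropy of the sampled trajectory distribution plus a count of distinct high-reward trajectories, computed over a batch of rollouts. Names and the reward threshold are illustrative, not from the paper.

```python
from collections import Counter
import math

def trajectory_metrics(rollouts, rewards, reward_threshold=1.0):
    """Empirical diversity of sampled rollouts: Shannon entropy of the
    sample distribution (nats) and the number of distinct high-reward
    trajectories, a crude proxy for effective support size."""
    counts = Counter(rollouts)          # rollouts: hashable, e.g. strings
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    distinct_hits = len({t for t, r in zip(rollouts, rewards)
                         if r >= reward_threshold})
    return entropy, distinct_hits

# Comparing these numbers before and after RL (or SPS) steps would make
# the squeezing claim quantitative, e.g.:
ent, hits = trajectory_metrics(["a", "a", "b", "c"], [1, 1, 1, 0])
print(ent, hits)   # ≈ 1.04 nats of entropy, 2 distinct high-reward paths
```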

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, offering clarifications on the SPS mechanism and committing to specific revisions that strengthen the empirical support without altering the core claims.

point-by-point responses
  1. Referee: [§3] §3 (SPS algorithm): The central claim that IRL applied to on-policy rollouts 'explicitly reshape[s] the induced trajectory distribution' to enhance exploration lacks a supporting argument or metric showing expansion of support. Because the demonstrations are drawn from the current (already squeezed) policy and the underlying reward is rule-based, standard IRL reduces to reweighting trajectories already present in the support; no divergence term, new sampling procedure, or proof of increased effective support is provided to counter the skeptic's concern that the method inherits the same exploration limit.

    Authors: We appreciate the referee's concern regarding the theoretical grounding of the reshaping effect. While the demonstrations come from the current policy, the IRL step derives a reward model from these trajectories that is then used to guide the subsequent RL phase; this alternation prevents rapid collapse by periodically re-emphasizing a broader set of demonstrated behaviors under the rule-based reward. The observed Pass@k gains across benchmarks provide indirect evidence of expanded effective support, as higher multi-sample performance requires successful trajectories beyond the narrow mode favored by pure RL. That said, we agree a direct metric (such as trajectory entropy or the count of distinct high-reward paths) would make the support-expansion claim more explicit. We will add this analysis and a clarifying paragraph on the interleaving dynamics in the revision. revision: partial

  2. Referee: [§5] §5 (Experiments): The reported Pass@k improvements are presented without statistical significance tests, variance across random seeds, or ablations that isolate the IRL component from the choice of interleaving frequency and weighting. Without these controls it is impossible to determine whether gains arise from the claimed distribution reshaping or from incidental effects of the interleaving schedule.

    Authors: We acknowledge that the current experimental section would benefit from greater statistical rigor and targeted controls. In the revised manuscript we will report means and standard deviations over multiple random seeds, include paired statistical significance tests (e.g., t-tests) on the Pass@k deltas, and add ablations that vary interleaving frequency and the RL/IRL weighting hyperparameter while holding other factors fixed. These additions will allow readers to isolate the contribution of the IRL reshaping step from the interleaving schedule itself. revision: yes
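
A minimal sketch of the seed-level comparison the authors commit to, using a paired t-test (scipy's `ttest_rel`) on per-seed Pass@k scores. The numbers below are illustrative placeholders, not results from the paper.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed Pass@64 scores (same seeds, same benchmark);
# placeholders for illustration only, not the paper's numbers.
baseline_rl = np.array([0.512, 0.498, 0.521, 0.505, 0.517])
sps         = np.array([0.547, 0.539, 0.552, 0.530, 0.545])

t, p = stats.ttest_rel(sps, baseline_rl)   # paired test across seeds
delta = sps - baseline_rl
print(f"mean delta = {delta.mean():.3f} ± {delta.std(ddof=1):.3f}, "
      f"t = {t:.2f}, p = {p:.4f}")
```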

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper identifies an empirical squeezing effect via preliminary analysis of RL dynamics, then proposes SPS as an interleaving of standard RL with IRL that treats on-policy rollouts as demonstrations to reshape trajectory distributions. This is presented as a new training paradigm and validated through experiments on five external reasoning benchmarks measuring Pass@k improvements. No load-bearing mathematical derivation, uniqueness theorem, or ansatz is shown to reduce by construction to the inputs; the central claim rests on algorithmic description plus independent empirical outcomes rather than self-definition, fitted inputs renamed as predictions, or self-citation chains. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that probability squeezing is the dominant cause of limited Pass@k and that IRL applied to on-policy data can reshape distributions productively. No explicit free parameters or invented entities are named in the abstract.

free parameters (1)
  • interleaving frequency and weighting between RL and IRL steps
    The method requires choosing how often and how strongly to apply IRL relative to RL; these choices are not detailed in the abstract.
axioms (1)
  • domain assumption: On-policy rollouts constitute valid demonstrations for IRL that can reshape the trajectory distribution without external supervision
    Invoked directly in the description of SPS.

pith-pipeline@v0.9.0 · 5564 in / 1190 out tokens · 34378 ms · 2026-05-10T06:54:15.157880+00:00 · methodology

discussion (0)

