pith. machine review for the scientific record.

arxiv: 2605.08873 · v1 · submitted 2026-05-09 · 💻 cs.LG · stat.AP · stat.ML

Recognition: no theorem link

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

Ananda Theertha Suresh, Himanshu Jain, Sanjiv Kumar, Soo Min Kwon, Ziteng Sun

Pith reviewed 2026-05-12 00:52 UTC · model grok-4.3

classification 💻 cs.LG · stat.AP · stat.ML
keywords CoDistill-GRPO · GRPO · co-distillation · knowledge distillation · policy optimization · language model reasoning · mathematical benchmarks · training efficiency

The pith

Co-distillation lets small and large language models train each other to improve math reasoning while cutting rollout costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoDistill-GRPO as a way to overcome the sparse-reward problem that prevents standard Group Relative Policy Optimization from helping small models on hard reasoning tasks. Instead of relying on a separate large oracle model, the method trains a large and a small model at the same time so that each supplies a useful signal to the other. The small model receives on-policy knowledge-distillation rewards drawn from the large model's distribution, while the large model receives updates from the small model's rollouts corrected by importance reweighting. Experiments show the small model gains more than 11.6 points over its base version and 6 points over plain GRPO on the Minerva dataset, while the large model nearly matches standard GRPO performance with roughly an 18 percent training speedup. If the mutual-training loop works reliably, it removes the need for an external teacher and makes strong reasoning training feasible for smaller models.

Core claim

CoDistill-GRPO simultaneously maximizes GRPO objectives for both models: the small model is updated with an on-policy knowledge-distillation reward that aligns it to the large model's output distribution, and the large model is updated on trajectories generated by the small model, with importance reweighting correcting for the distribution shift. This mutual loop produces a small Qwen2.5-Math-1.5B model whose accuracy rises by more than 11.6 points over the base model and 6 points over ordinary GRPO on Minerva, while the paired 7B model reaches nearly the same final performance as standard GRPO despite training only on small-model rollouts.

What carries the argument

The co-distillation loop that pairs an on-policy knowledge-distillation reward for the small model with importance-reweighted policy updates for the large model on small-model rollouts.
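
To make that loop concrete, here is a minimal toy sketch of the wiring in PyTorch. Two categorical policies over a tiny vocabulary stand in for the small and large language models, single-token samples stand in for reasoning rollouts, and the KD coefficient, clipping threshold, and variable names are illustrative assumptions rather than values from the paper.

  # Toy sketch, not the authors' implementation: categorical policies replace
  # LLMs, single tokens replace reasoning traces, hyper-parameters are made up.
  import torch

  torch.manual_seed(0)
  V, G, TARGET = 8, 6, 3          # toy vocabulary, group size, "correct" token

  small_logits = torch.zeros(V, requires_grad=True)   # small (student) policy
  large_logits = torch.zeros(V, requires_grad=True)   # large (teacher) policy
  opt = torch.optim.SGD([small_logits, large_logits], lr=0.5)

  for step in range(50):
      pi_s = torch.distributions.Categorical(logits=small_logits)
      pi_l = torch.distributions.Categorical(logits=large_logits)

      # Rollouts are generated only by the small model (the cheap sampler).
      y = pi_s.sample((G,))
      reward = (y == TARGET).float()      # sparse verifiable reward
      adv = reward - reward.mean()        # GRPO-style group-relative advantage

      # Small model: task advantage plus a crude on-policy KD reward that
      # pulls its distribution toward the large model on its own samples.
      logp_s = pi_s.log_prob(y)
      kd = (pi_l.log_prob(y) - logp_s).detach()
      loss_small = -((adv + 0.1 * kd) * logp_s).mean()

      # Large model: reuse the same small-model rollouts, corrected (and
      # clipped) by importance weights pi_L / pi_S for the distribution shift.
      logp_l = pi_l.log_prob(y)
      w = torch.exp(logp_l - logp_s).detach().clamp(max=5.0)
      loss_large = -(w * adv * logp_l).mean()

      opt.zero_grad()
      (loss_small + loss_large).backward()
      opt.step()

  print("small policy:", torch.softmax(small_logits, -1).detach())
  print("large policy:", torch.softmax(large_logits, -1).detach())

The point of the sketch is only how the two reward streams are wired together; it says nothing about the paper's actual sequence-level objectives, reward design, or hyper-parameters.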

If this is right

  • Small models achieve substantially higher accuracy on mathematical reasoning benchmarks without access to an external teacher model.
  • Large models reach nearly the same final performance while using cheaper rollouts generated by the smaller partner.
  • The same recipe works across different base families including Qwen and Llama variants.
  • Overall training time for the larger model drops by roughly 18 percent compared with running GRPO in isolation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be chained so that a tiny model guides a medium one that in turn guides a large one, further lowering rollout costs at each step.
  • If the importance reweighting remains stable, the same mutual-training pattern might apply to other policy-optimization methods such as direct preference optimization.
  • The method might generalize beyond mathematics to other sparse-reward domains like coding or multi-step planning where small models currently struggle.

Load-bearing premise

Importance reweighting fully corrects the distribution shift from small-model rollouts so the large model still converges to the same high-performing policy without added bias or instability.
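
For reference, the identity this premise leans on can be written out explicitly; the notation (π_S for the small policy, π_L for the large policy, A for the group-relative advantage, θ_L for the large model's parameters) is assumed here rather than taken from the paper:

  \mathbb{E}_{y \sim \pi_S(\cdot \mid x)}\!\left[\frac{\pi_L(y \mid x)}{\pi_S(y \mid x)}\, A(y)\, \nabla_{\theta_L} \log \pi_L(y \mid x)\right]
    \;=\; \mathbb{E}_{y \sim \pi_L(\cdot \mid x)}\!\left[A(y)\, \nabla_{\theta_L} \log \pi_L(y \mid x)\right]

The equality holds only when π_S puts probability on every trajectory π_L can produce and the weights are used untruncated; clipping the weights, or sampling from a small model that rarely visits the large model's preferred trajectories, reintroduces exactly the bias and variance the premise rules out.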

What would settle it

If the large model trained with CoDistill-GRPO on small-model rollouts ends up with clearly lower accuracy on math benchmarks than the same large model trained with standard GRPO using its own rollouts, the efficiency claim fails.

read the original abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes CoDistill-GRPO, a co-distillation algorithm for Group Relative Policy Optimization (GRPO) that jointly trains a large and small language model. The small model learns via an on-policy knowledge-distillation reward derived from the large model's distribution, while the large model is updated on rollouts generated by the small model using importance reweighting to reduce rollout computation. Empirical results on mathematical reasoning benchmarks (e.g., Minerva) report that Qwen2.5-Math-1.5B achieves +11.6 pp over the base model and +6.0 pp over standard GRPO, while the 7B model nearly matches standard GRPO performance with an approximate 18% training speedup.

Significance. If the empirical gains and the unbiasedness of the importance-reweighted updates hold under scrutiny, the method would provide a practical, lower-cost route to improving reasoning in small models without a separate oracle and would simultaneously accelerate large-model GRPO training. This could be of independent interest for efficient scaling of RL-based reasoning methods.

major comments (2)
  1. [Abstract] The headline claim that the 7B model 'nearly matches standard GRPO performance despite training on small-model rollouts' rests on the assertion that importance reweighting fully compensates for the off-policy distribution shift. No verification (weight histograms, effective sample size, gradient bias diagnostics, or an ablation removing reweighting) is described, leaving open the possibility that high-variance or clipped weights introduce bias or instability that would undermine the speedup claim; a minimal sketch of such diagnostics follows this list.
  2. [Abstract] Concrete accuracy lifts (+11.6 pp, +6.0 pp) and the 18% speedup are reported without accompanying implementation details, hyper-parameter settings for the reweighting, or ablation studies that isolate the contribution of co-distillation relative to standard GRPO. This absence prevents assessment of whether the reported gains are robust or reproducible.
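
The following is a minimal sketch of the kind of importance-weight diagnostics requested in comment 1, with assumed function and argument names (per-rollout sequence log-probabilities under each policy) and an illustrative clipping threshold; it is not code from the paper.

  import torch

  def importance_weight_diagnostics(logp_large, logp_small, clip=5.0):
      # logp_*: per-rollout sequence log-probabilities under each policy.
      w = torch.exp(logp_large - logp_small)
      clipped_frac = (w > clip).float().mean()    # how often clipping bites
      w = w.clamp(max=clip)
      ess = w.sum() ** 2 / (w ** 2).sum()         # effective sample size
      return {"ess": ess.item(),
              "ess_fraction": ess.item() / w.numel(),
              "clipped_fraction": clipped_frac.item(),
              "max_weight": w.max().item()}

  # Synthetic example for a group of 8 rollouts.
  logp_s = torch.log(torch.tensor([0.20, 0.15, 0.10, 0.10, 0.15, 0.10, 0.10, 0.10]))
  logp_l = torch.log(torch.tensor([0.30, 0.05, 0.10, 0.20, 0.05, 0.10, 0.10, 0.10]))
  print(importance_weight_diagnostics(logp_l, logp_s))

An effective-sample-size fraction near 1 and a near-zero clipped fraction describe the regime in which the speedup claim is cheap to believe; values far from that would put the burden back on the ablation the report asks for.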

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to incorporate the requested verifications, details, and ablations.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that the 7B model 'nearly matches standard GRPO performance despite training on small-model rollouts' rests on the assertion that importance reweighting fully compensates for the off-policy distribution shift. No verification (weight histograms, effective sample size, gradient bias diagnostics, or an ablation removing reweighting) is described, leaving open the possibility that high-variance or clipped weights introduce bias or instability that would undermine the speedup claim.

    Authors: We agree that explicit verification strengthens the claim. In the revised manuscript we have added weight histograms, effective sample size statistics, and gradient bias diagnostics to the appendix, along with an ablation that removes importance reweighting. These additions show that weights remain well-behaved (no extreme variance after clipping) and that reweighting is necessary to preserve the observed performance parity, supporting the reported speedup. revision: yes

  2. Referee: [Abstract] Concrete accuracy lifts (+11.6 pp, +6.0 pp) and the 18% speedup are reported without accompanying implementation details, hyper-parameter settings for the reweighting, or ablation studies that isolate the contribution of co-distillation relative to standard GRPO. This absence prevents assessment of whether the reported gains are robust or reproducible.

    Authors: We acknowledge the need for greater transparency. The revised manuscript now includes a dedicated implementation subsection with all hyper-parameter values for reweighting (clipping threshold, temperature, etc.) and new ablation studies in Section 4.4 that isolate co-distillation from standard GRPO. These changes make the source of the reported gains and the 18% speedup explicit and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction evaluated empirically on external benchmarks

full rationale

The paper introduces CoDistill-GRPO as a co-training procedure in which the small model receives an on-policy KD reward from the large model and the large model is updated on small-model rollouts via importance reweighting. All reported gains (e.g., +11.6 pp over base and +6.0 pp over GRPO on Minerva for Qwen2.5-Math-1.5B) are presented as measured outcomes on held-out mathematical benchmarks, not as quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are shown to reduce to the inputs; the importance-reweighting step is a standard off-policy correction whose stability is asserted to be confirmed by the final performance numbers rather than presupposed. The derivation chain therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters or invented entities; the approach appears to rest on standard RL assumptions about rollout sampling and advantage estimation.

axioms (1)
  • domain assumption Rollouts can be sampled from the small model and importance weights can be computed to produce unbiased updates for the large model.
    Implicit in the description of the large-model update step.

pith-pipeline@v0.9.0 · 5629 in / 1246 out tokens · 49306 ms · 2026-05-12T00:52:09.088539+00:00 · methodology

discussion (0)

