pith. machine review for the scientific record.

arxiv: 2605.08873 · v1 · submitted 2026-05-09 · 💻 cs.LG · stat.AP · stat.ML

Recognition: no theorem link

CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization

Ananda Theertha Suresh, Himanshu Jain, Sanjiv Kumar, Soo Min Kwon, Ziteng Sun

Pith reviewed 2026-05-12 00:52 UTC · model grok-4.3

classification 💻 cs.LG · stat.AP · stat.ML
keywords CoDistill-GRPO · GRPO · co-distillation · knowledge distillation · policy optimization · language model reasoning · mathematical benchmarks · training efficiency

The pith

Co-distillation lets small and large language models train each other to improve math reasoning while cutting rollout costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoDistill-GRPO as a way to overcome the sparse-reward problem that prevents standard Group Relative Policy Optimization from helping small models on hard reasoning tasks. Instead of relying on a separate large oracle model, the method trains a large and a small model at the same time so that each supplies a useful signal to the other. The small model receives on-policy knowledge-distillation rewards drawn from the large model's distribution, while the large model receives updates from the small model's rollouts corrected by importance reweighting. Experiments show the small model gains more than 11.6 points over its base version and 6 points over plain GRPO on the Minerva dataset, while the large model nearly matches standard GRPO performance with roughly an 18 percent training speedup. If the mutual-training loop works reliably, it removes the need for an external teacher and makes strong reasoning training feasible for smaller models.

Core claim

CoDistill-GRPO simultaneously maximizes GRPO objectives for both models: the small model is updated with an on-policy knowledge-distillation reward that aligns it to the large model's output distribution, and the large model is updated on trajectories generated by the small model, with importance reweighting correcting for the distribution shift. This mutual loop produces a small Qwen2.5-Math-1.5B model whose accuracy rises by more than 11.6 points over the base model and 6 points over ordinary GRPO on Minerva, while the paired 7B model reaches nearly the same final performance as standard GRPO despite training only on small-model rollouts.

What carries the argument

The co-distillation loop that pairs an on-policy knowledge-distillation reward for the small model with importance-reweighted policy updates for the large model on small-model rollouts.
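
To make that loop concrete, here is a minimal toy sketch of the wiring in PyTorch. Two categorical policies over a tiny vocabulary stand in for the small and large language models, single-token samples stand in for reasoning rollouts, and the KD coefficient, clipping threshold, and variable names are illustrative assumptions rather than values from the paper.

  # Toy sketch, not the authors' implementation: categorical policies replace
  # LLMs, single tokens replace reasoning traces, hyper-parameters are made up.
  import torch

  torch.manual_seed(0)
  V, G, TARGET = 8, 6, 3          # toy vocabulary, group size, "correct" token

  small_logits = torch.zeros(V, requires_grad=True)   # small (student) policy
  large_logits = torch.zeros(V, requires_grad=True)   # large (teacher) policy
  opt = torch.optim.SGD([small_logits, large_logits], lr=0.5)

  for step in range(50):
      pi_s = torch.distributions.Categorical(logits=small_logits)
      pi_l = torch.distributions.Categorical(logits=large_logits)

      # Rollouts are generated only by the small model (the cheap sampler).
      y = pi_s.sample((G,))
      reward = (y == TARGET).float()      # sparse verifiable reward
      adv = reward - reward.mean()        # GRPO-style group-relative advantage

      # Small model: task advantage plus a crude on-policy KD reward that
      # pulls its distribution toward the large model on its own samples.
      logp_s = pi_s.log_prob(y)
      kd = (pi_l.log_prob(y) - logp_s).detach()
      loss_small = -((adv + 0.1 * kd) * logp_s).mean()

      # Large model: reuse the same small-model rollouts, corrected (and
      # clipped) by importance weights pi_L / pi_S for the distribution shift.
      logp_l = pi_l.log_prob(y)
      w = torch.exp(logp_l - logp_s).detach().clamp(max=5.0)
      loss_large = -(w * adv * logp_l).mean()

      opt.zero_grad()
      (loss_small + loss_large).backward()
      opt.step()

  print("small policy:", torch.softmax(small_logits, -1).detach())
  print("large policy:", torch.softmax(large_logits, -1).detach())

The point of the sketch is only how the two reward streams are wired together; it says nothing about the paper's actual sequence-level objectives, reward design, or hyper-parameters.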

If this is right

  • Small models achieve substantially higher accuracy on mathematical reasoning benchmarks without access to an external teacher model.
  • Large models reach nearly the same final performance while using cheaper rollouts generated by the smaller partner.
  • The same recipe works across different base families including Qwen and Llama variants.
  • Overall training time for the larger model drops by roughly 18 percent compared with running GRPO in isolation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be chained so that a tiny model guides a medium one that in turn guides a large one, further lowering rollout costs at each step.
  • If the importance reweighting remains stable, the same mutual-training pattern might apply to other policy-optimization methods such as direct preference optimization.
  • The method might generalize beyond mathematics to other sparse-reward domains like coding or multi-step planning where small models currently struggle.

Load-bearing premise

Importance reweighting fully corrects the distribution shift from small-model rollouts so the large model still converges to the same high-performing policy without added bias or instability.
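
For reference, the identity this premise leans on can be written out explicitly; the notation (π_S for the small policy, π_L for the large policy, A for the group-relative advantage, θ_L for the large model's parameters) is assumed here rather than taken from the paper:

  \mathbb{E}_{y \sim \pi_S(\cdot \mid x)}\!\left[\frac{\pi_L(y \mid x)}{\pi_S(y \mid x)}\, A(y)\, \nabla_{\theta_L} \log \pi_L(y \mid x)\right]
    \;=\; \mathbb{E}_{y \sim \pi_L(\cdot \mid x)}\!\left[A(y)\, \nabla_{\theta_L} \log \pi_L(y \mid x)\right]

The equality holds only when π_S puts probability on every trajectory π_L can produce and the weights are used untruncated; clipping the weights, or sampling from a small model that rarely visits the large model's preferred trajectories, reintroduces exactly the bias and variance the premise rules out.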

What would settle it

If the large model trained with CoDistill-GRPO on small-model rollouts ends up with clearly lower accuracy on math benchmarks than the same large model trained with standard GRPO using its own rollouts, the efficiency claim fails.

read the original abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes CoDistill-GRPO, a co-distillation algorithm for Group Relative Policy Optimization (GRPO) that jointly trains a large and small language model. The small model learns via an on-policy knowledge-distillation reward derived from the large model's distribution, while the large model is updated on rollouts generated by the small model using importance reweighting to reduce rollout computation. Empirical results on mathematical reasoning benchmarks (e.g., Minerva) report that Qwen2.5-Math-1.5B achieves +11.6 pp over the base model and +6.0 pp over standard GRPO, while the 7B model nearly matches standard GRPO performance with an approximate 18% training speedup.

Significance. If the empirical gains and the unbiasedness of the importance-reweighted updates hold under scrutiny, the method would provide a practical, lower-cost route to improving reasoning in small models without a separate oracle and would simultaneously accelerate large-model GRPO training. This could be of independent interest for efficient scaling of RL-based reasoning methods.

major comments (2)
  1. [Abstract] The headline claim that the 7B model 'nearly matches standard GRPO performance despite training on small-model rollouts' rests on the assertion that importance reweighting fully compensates for the off-policy distribution shift. No verification (weight histograms, effective sample size, gradient bias diagnostics, or an ablation removing reweighting) is described, leaving open the possibility that high-variance or clipped weights introduce bias or instability that would undermine the speedup claim; a minimal sketch of such diagnostics follows this list.
  2. [Abstract] Concrete accuracy lifts (+11.6 pp, +6.0 pp) and the 18% speedup are reported without accompanying implementation details, hyper-parameter settings for the reweighting, or ablation studies that isolate the contribution of co-distillation relative to standard GRPO. This absence prevents assessment of whether the reported gains are robust or reproducible.
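
The following is a minimal sketch of the kind of importance-weight diagnostics requested in comment 1, with assumed function and argument names (per-rollout sequence log-probabilities under each policy) and an illustrative clipping threshold; it is not code from the paper.

  import torch

  def importance_weight_diagnostics(logp_large, logp_small, clip=5.0):
      # logp_*: per-rollout sequence log-probabilities under each policy.
      w = torch.exp(logp_large - logp_small)
      clipped_frac = (w > clip).float().mean()    # how often clipping bites
      w = w.clamp(max=clip)
      ess = w.sum() ** 2 / (w ** 2).sum()         # effective sample size
      return {"ess": ess.item(),
              "ess_fraction": ess.item() / w.numel(),
              "clipped_fraction": clipped_frac.item(),
              "max_weight": w.max().item()}

  # Synthetic example for a group of 8 rollouts.
  logp_s = torch.log(torch.tensor([0.20, 0.15, 0.10, 0.10, 0.15, 0.10, 0.10, 0.10]))
  logp_l = torch.log(torch.tensor([0.30, 0.05, 0.10, 0.20, 0.05, 0.10, 0.10, 0.10]))
  print(importance_weight_diagnostics(logp_l, logp_s))

An effective-sample-size fraction near 1 and a near-zero clipped fraction describe the regime in which the speedup claim is cheap to believe; values far from that would put the burden back on the ablation the report asks for.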

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to incorporate the requested verifications, details, and ablations.

read point-by-point responses
  1. Referee: [Abstract] The headline claim that the 7B model 'nearly matches standard GRPO performance despite training on small-model rollouts' rests on the assertion that importance reweighting fully compensates for the off-policy distribution shift. No verification (weight histograms, effective sample size, gradient bias diagnostics, or an ablation removing reweighting) is described, leaving open the possibility that high-variance or clipped weights introduce bias or instability that would undermine the speedup claim.

    Authors: We agree that explicit verification strengthens the claim. In the revised manuscript we have added weight histograms, effective sample size statistics, and gradient bias diagnostics to the appendix, along with an ablation that removes importance reweighting. These additions show that weights remain well-behaved (no extreme variance after clipping) and that reweighting is necessary to preserve the observed performance parity, supporting the reported speedup. revision: yes

  2. Referee: [Abstract] Concrete accuracy lifts (+11.6 pp, +6.0 pp) and the 18% speedup are reported without accompanying implementation details, hyper-parameter settings for the reweighting, or ablation studies that isolate the contribution of co-distillation relative to standard GRPO. This absence prevents assessment of whether the reported gains are robust or reproducible.

    Authors: We acknowledge the need for greater transparency. The revised manuscript now includes a dedicated implementation subsection with all hyper-parameter values for reweighting (clipping threshold, temperature, etc.) and new ablation studies in Section 4.4 that isolate co-distillation from standard GRPO. These changes make the source of the reported gains and the 18% speedup explicit and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction evaluated empirically on external benchmarks

full rationale

The paper introduces CoDistill-GRPO as a co-training procedure in which the small model receives an on-policy KD reward from the large model and the large model is updated on small-model rollouts via importance reweighting. All reported gains (e.g., +11.6 pp over base and +6.0 pp over GRPO on Minerva for Qwen2.5-Math-1.5B) are presented as measured outcomes on held-out mathematical benchmarks, not as quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes are shown to reduce to the inputs; the importance-reweighting step is a standard off-policy correction whose stability is asserted to be confirmed by the final performance numbers rather than presupposed. The derivation chain therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters or invented entities; the approach appears to rest on standard RL assumptions about rollout sampling and advantage estimation.

axioms (1)
  • domain assumption Rollouts can be sampled from the small model and importance weights can be computed to produce unbiased updates for the large model.
    Implicit in the description of the large-model update step.

pith-pipeline@v0.9.0 · 5629 in / 1246 out tokens · 49306 ms · 2026-05-12T00:52:09.088539+00:00 · methodology

discussion (0)

