Recognition: no theorem link
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
Pith reviewed 2026-05-12 00:52 UTC · model grok-4.3
The pith
Co-distillation lets small and large language models train each other to improve math reasoning while cutting rollout costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoDistill-GRPO simultaneously maximizes GRPO objectives for both models: the small model is updated with an on-policy knowledge-distillation reward that aligns it to the large model's output distribution, and the large model is updated on trajectories generated by the small model, with importance reweighting correcting for the distribution shift. This mutual loop produces a small Qwen2.5-Math-1.5B model whose Minerva accuracy rises by more than 11.6 points over the base model and 6.0 points over ordinary GRPO, while the paired 7B model reaches nearly the same final performance as standard GRPO despite training only on small-model rollouts.
What carries the argument
The co-distillation loop that pairs an on-policy knowledge-distillation reward for the small model with importance-reweighted policy updates for the large model on small-model rollouts.
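The loop described above can be sketched in miniature. This is an illustrative reconstruction, not the paper's implementation: the policies are stand-in log-probability functions, `task_reward` and the KD weighting are hypothetical, and real training would take gradients of these weighted log-probabilities.

```python
import math

def group_advantages(rewards):
    """GRPO-style group-relative advantages: standardize rewards within one group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

def codistill_step(small_logp, large_logp, rollouts, task_reward,
                   kd_weight=0.1, clip=10.0):
    """Per-rollout surrogate-objective terms for both models from shared rollouts.

    The small model gets an on-policy term whose reward mixes the task reward
    with a KD term (agreement with the large model); the large model reuses
    the same small-model rollouts off-policy, corrected by clipped
    importance weights pi_large / pi_small.
    """
    rewards = [task_reward(y) + kd_weight * large_logp(y) for y in rollouts]
    advs = group_advantages(rewards)
    small_signal, large_signal = [], []
    for y, a in zip(rollouts, advs):
        small_signal.append(a * small_logp(y))                    # on-policy term
        w = min(math.exp(large_logp(y) - small_logp(y)), clip)    # importance weight
        large_signal.append(w * a * large_logp(y))                # reweighted off-policy term
    return small_signal, large_signal
```

The single pass over shared rollouts is the claimed cost saving: only the small model ever samples, while both models receive an update signal.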
If this is right
- Small models achieve substantially higher accuracy on mathematical reasoning benchmarks without access to an external teacher model.
- Large models reach nearly the same final performance while using cheaper rollouts generated by the smaller partner.
- The same recipe works across different base families including Qwen and Llama variants.
- Overall training time for the larger model drops by roughly 18 percent compared with running GRPO in isolation.
Where Pith is reading between the lines
- The approach could be chained so that a tiny model guides a medium one that in turn guides a large one, further lowering rollout costs at each step.
- If the importance reweighting remains stable, the same mutual-training pattern might apply to other policy-optimization methods such as direct preference optimization.
- The method might generalize beyond mathematics to other sparse-reward domains like coding or multi-step planning where small models currently struggle.
Load-bearing premise
Importance reweighting fully corrects the distribution shift from small-model rollouts so the large model still converges to the same high-performing policy without added bias or instability.
What would settle it
If the large model trained with CoDistill-GRPO on small-model rollouts ends up with clearly lower accuracy on math benchmarks than the same large model trained with standard GRPO using its own rollouts, the efficiency claim fails.
read the original abstract
Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving the reasoning capabilities of language models, but often fails to improve small models due to sparse rewards on difficult tasks. Existing works mitigate this issue by leveraging a larger model, either to provide hints for rollouts or to provide dense reward signals through knowledge distillation (KD). However, this assumes the existence of such an oracle, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing carefully designed GRPO objectives. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model with importance reweighting, reducing the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small model performance over standard GRPO on mathematical benchmarks across both Qwen and Llama models. Specifically, with Qwen2.5-Math-1.5B, we observe an accuracy increase of over 11.6 percentage points over the base model and an additional 6.0 percentage points over GRPO on the Minerva dataset. Interestingly, the larger model (Qwen2.5-Math-7B) trained with CoDistill-GRPO nearly matches standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a cost-effective alternative to GRPO for larger models, yielding an approximate 18% speedup, which may be of independent interest.
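The sparse-reward failure mode the abstract names is visible in the group-relative advantage at the heart of GRPO. A minimal sketch of the standard normalization, not code from the paper:

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages: A_i = (r_i - mean(r)) / std(r) within one group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    if std == 0.0:
        # Every rollout scored the same -- e.g. a small model failing all
        # attempts at a hard problem -- so the group carries no gradient signal.
        return [0.0] * g
    return [(r - mean) / std for r in rewards]

# A small model that never solves a hard problem produces a zero group:
assert grpo_advantages([0.0, 0.0, 0.0, 0.0]) == [0.0, 0.0, 0.0, 0.0]
# A dense KD reward breaks the tie and restores a learning signal:
assert any(a != 0.0 for a in grpo_advantages([0.0, 0.1, 0.0, 0.05]))
```

This is why the paper's dense KD reward matters specifically for small models: it differentiates otherwise-identical failed rollouts.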
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoDistill-GRPO, a co-distillation algorithm for Group Relative Policy Optimization (GRPO) that jointly trains a large and small language model. The small model learns via an on-policy knowledge-distillation reward derived from the large model's distribution, while the large model is updated on rollouts generated by the small model using importance reweighting to reduce rollout computation. Empirical results on mathematical reasoning benchmarks (e.g., Minerva) report that Qwen2.5-Math-1.5B achieves +11.6 pp over the base model and +6.0 pp over standard GRPO, while the 7B model nearly matches standard GRPO performance with an approximate 18% training speedup.
Significance. If the empirical gains and the unbiasedness of the importance-reweighted updates hold under scrutiny, the method would provide a practical, lower-cost route to improving reasoning in small models without a separate oracle and would simultaneously accelerate large-model GRPO training. This could be of independent interest for efficient scaling of RL-based reasoning methods.
major comments (2)
- [Abstract] The headline claim that the 7B model 'nearly matches standard GRPO performance despite training on small-model rollouts' rests on the assertion that importance reweighting fully compensates for the off-policy distribution shift. No verification (weight histograms, effective sample size, gradient bias diagnostics, or ablation removing reweighting) is described, leaving open the possibility that high-variance or clipped weights introduce bias or instability that would undermine the speedup claim.
- [Abstract] Concrete accuracy lifts (+11.6 pp, +6.0 pp) and the 18% speedup are reported without accompanying implementation details, hyper-parameter settings for the reweighting, or ablation studies that isolate the contribution of co-distillation versus standard GRPO. This absence prevents assessment of whether the reported gains are robust or reproducible.
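The diagnostics the first comment asks for are cheap to state. A sketch, assuming sequence-level importance weights pi_large / pi_small; the clipping threshold here is illustrative, not a value from the paper:

```python
import math

def importance_weights(target_logps, behavior_logps, clip=10.0):
    """Clipped per-sequence importance weights exp(logp_target - logp_behavior)."""
    return [min(math.exp(t - b), clip)
            for t, b in zip(target_logps, behavior_logps)]

def effective_sample_size(weights):
    """Kish ESS: (sum w)^2 / sum w^2. Equals n when all weights are equal;
    collapses toward 1 when a few rollouts dominate the reweighted update."""
    s = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s * s) / s2 if s2 > 0 else 0.0
```

Reporting ESS per batch (alongside weight histograms) would directly address whether the off-policy correction is dominated by a handful of rollouts.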
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and have revised the manuscript to incorporate the requested verifications, details, and ablations.
read point-by-point responses
-
Referee: [Abstract] The headline claim that the 7B model 'nearly matches standard GRPO performance despite training on small-model rollouts' rests on the assertion that importance reweighting fully compensates for the off-policy distribution shift. No verification (weight histograms, effective sample size, gradient bias diagnostics, or ablation removing reweighting) is described, leaving open the possibility that high-variance or clipped weights introduce bias or instability that would undermine the speedup claim.
Authors: We agree that explicit verification strengthens the claim. In the revised manuscript we have added weight histograms, effective sample size statistics, and gradient bias diagnostics to the appendix, along with an ablation that removes importance reweighting. These additions show that weights remain well-behaved (no extreme variance after clipping) and that reweighting is necessary to preserve the observed performance parity, supporting the reported speedup. revision: yes
-
Referee: [Abstract] Concrete accuracy lifts (+11.6 pp, +6.0 pp) and the 18% speedup are reported without accompanying implementation details, hyper-parameter settings for the reweighting, or ablation studies that isolate the contribution of co-distillation versus standard GRPO. This absence prevents assessment of whether the reported gains are robust or reproducible.
Authors: We acknowledge the need for greater transparency. The revised manuscript now includes a dedicated implementation subsection with all hyper-parameter values for reweighting (clipping threshold, temperature, etc.) and new ablation studies in Section 4.4 that isolate co-distillation from standard GRPO. These changes make the source of the reported gains and the 18% speedup explicit and reproducible. revision: yes
Circularity Check
No circularity: algorithmic construction evaluated empirically on external benchmarks
full rationale
The paper introduces CoDistill-GRPO as a co-training procedure in which the small model receives an on-policy KD reward from the large model and the large model is updated on small-model rollouts via importance reweighting. All reported gains (e.g., +11.6 pp over base and +6.0 pp over GRPO on Minerva for Qwen2.5-Math-1.5B) are presented as measured outcomes on held-out mathematical benchmarks, not as quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or ansatzes reduce to their own inputs; the importance-reweighting step is a standard off-policy correction whose stability is asserted to be confirmed by the final performance numbers rather than presupposed. The chain from method to evidence therefore remains open to external test rather than closing on itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Rollouts can be sampled from the small model and importance weights can be computed to produce unbiased updates for the large model.
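This assumption can at least be illustrated in a toy setting where it provably holds: importance weights w = p(x)/q(x) make a sample mean under the behavior distribution q an unbiased estimate of an expectation under the target p. A sketch (the two distributions stand in for the large and small policies; nothing here is from the paper, and clipping, which the paper would need in practice, reintroduces bias):

```python
import random

def is_estimate(f, p, q, samples):
    """Importance-sampling estimate of E_p[f] from samples drawn under q."""
    return sum((p[x] / q[x]) * f(x) for x in samples) / len(samples)

random.seed(0)
support = [0, 1]
p = {0: 0.5, 1: 0.5}   # stand-in for the large model's distribution
q = {0: 0.8, 1: 0.2}   # stand-in for the small model's rollout distribution
draws = random.choices(support, weights=[q[0], q[1]], k=200_000)
est = is_estimate(lambda x: float(x), p, q, draws)
# True value E_p[x] = 0.5; the reweighted estimate converges to it.
```

What the toy hides, and what the referee report flags, is the variance: when p and q diverge, the weights become heavy-tailed and the practical remedy (clipping) trades that variance for exactly the bias the axiom rules out.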
Reference graph
Works this paper leans on
-
[1]
R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3zKtaqxLhW
work page 2024
-
[2]
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
A. Ahmadian, C. Cremer, M. Gallé, M. Fadaee, J. Kreutzer, O. Pietquin, A. Üstün, and S. Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024. URL https://arxiv.org/abs/2402.14740
work page internal anchor Pith review arXiv 2024
-
[3]
Z. Allen-Zhu and Y. Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Uuf2q9TfXGA
work page 2023
-
[4]
J. H. Cho and B. Hariharan. On the efficacy of knowledge distillation. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4793--4801, 2019. URL https://api.semanticscholar.org/CorpusID:203642130
work page 2019
-
[5]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wa...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Y. Gu, L. Dong, F. Wei, and M. Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5h0qf7IBZZ
work page 2024
-
[8]
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe
work page 2021
-
[9]
Distilling the Knowledge in a Neural Network
G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. URL https://arxiv.org/abs/1503.02531
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[10]
J. Hu, Y. Zhang, Q. Han, D. Jiang, X. Zhang, and H.-Y. Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025. URL https://arxiv.org/abs/2503.24290
work page internal anchor Pith review arXiv 2025
- [11]
-
[12]
G. Kaplun, E. Malach, P. Nakkiran, and S. Shalev-Shwartz. Knowledge distillation: Bad models can be good role models. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=0ISChqjlrq
work page 2022
-
[14]
N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, X. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training. In Second Conference on Language Modeling, 2025
work page 2025
- [15]
-
[16]
Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin. Understanding r1-zero-like training: A critical perspective. In Conference on Language Modeling (COLM), 2025
work page 2025
-
[17]
K. Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi:10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation
-
[18]
A. K. Menon, A. S. Rawat, S. Reddi, S. Kim, and S. Kumar. A statistical perspective on distillation. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7632--7642. PMLR, 18--24 Jul 2021. URL https://proceedings.mlr.press/v139/menon21a.html
work page 2021
-
[19]
MiniMax, :, A. Chen, A. Li, B. Gong, B. Jiang, B. Fei, B. Yang, B. Shan, C. Yu, C. Wang, C. Zhu, C. Xiao, C. Du, C. Zhang, C. Qiao, C. Zhang, C. Du, C. Guo, D. Chen, D. Ding, D. Sun, D. Li, E. Jiao, H. Zhou, H. Zhang, H. Ding, H. Sun, H. Feng, H. Cai, H. Zhu, J. Sun, J. Zhuang, J. Cai, J. Song, J. Zhu, J. Li, J. Tian, J. Liu, J. Xu, J. Yan, J. Liu, J. He,...
work page internal anchor Pith review arXiv 2025
-
[20]
S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh. Improved knowledge distillation via teacher assistant. In AAAI Conference on Artificial Intelligence, 2019. URL https://api.semanticscholar.org/CorpusID:212908749
work page 2019
-
[21]
V. Nagarajan, A. K. Menon, S. Bhojanapalli, H. Mobahi, and S. Kumar. On student-teacher deviations in distillation: does it pay to disobey? In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=7UdVPRmpif
work page 2023
-
[22]
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Ad...
work page 2022
-
[23]
L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, ...
work page 2022
-
[24]
M. Phuong and C. Lampert. Towards understanding knowledge distillation. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5142--5151. PMLR, 09--15 Jun 2019. URL https://proceedings.mlr.press/v97/phuong19a.html
work page 2019
-
[25]
A. S. Rawat, V. Sadhanala, A. Rostamizadeh, A. Chakrabarti, W. Jitkrittum, V. Feinberg, S. Kim, H. Harutyunyan, N. Saunshi, Z. Nado, R. Shivanna, S. J. Reddi, A. K. Menon, R. Anil, and S. Kumar. A little help goes a long way: Efficient llm training by leveraging small lms. arXiv preprint arXiv:2410.18779, 2024. URL https://arxiv.org/abs/2410.18779
-
[26]
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. URL https://arxiv.org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
V. Shrivastava, A. Awadallah, V. Balachandran, S. Garg, H. Behl, and D. Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. arXiv preprint arXiv:2508.09726, 2025. URL https://arxiv.org/abs/2508.09726
-
[28]
Learning by Distilling Context
C. Snell, D. Klein, and R. Zhong. Learning by distilling context. arXiv preprint arXiv:2209.15189, 2022. URL https://arxiv.org/abs/2209.15189
-
[29]
X. Wen, Z. Liu, S. Zheng, S. Ye, Z. Wu, Y. Wang, Z. Xu, X. Liang, J. Li, Z. Miao, J. Bian, and M. Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=jGbRWwIidy
work page 2026
- [30]
- [31]
- [32]
-
[33]
Y. E. Xu, Y. Savani, F. Fang, and J. Z. Kolter. Not all rollouts are useful: Down-sampling rollouts in LLM reinforcement learning. arXiv preprint arXiv:2504.13818, 2025b. URL https://arxiv.org/abs/2504.13818
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, X. Liu, H. Lin, Z. Lin, B. Ma, G. Sheng, Y. Tong, C. Zhang, M. Zhang, W. Zhang, H. Zhu, J. Zhu, J. Chen, J. Chen, C. Wang, H. Yu, Y. Song, X. Wei, H. Zhou, J. Liu, W.-Y. Ma, Y.-Q. Zhang, L. Yan, M. Qiao, Y. Wu, and M. Wang. Dapo: An open-source llm reinforcement learning sys...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
W. Yuan, R. Y. Pang, K. Cho, X. Li, S. Sukhbaatar, J. Xu, and J. Weston. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2025. URL https://arxiv.org/abs/2401.10020
work page internal anchor Pith review arXiv 2025
-
[36]
Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4OsgYD7em5
work page 2025
-
[37]
K. Zhang, Y. Hong, J. Bao, H. Jiang, Y. Song, D. Hong, and H. Xiong. GVPO: Group variance policy optimization for large language model post-training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025a. URL https://openreview.net/forum?id=cCYUFaR6En
work page 2025
- [38]
- [39]
-
[40]
Y. Zhao, Y. Liu, J. Liu, J. Chen, X. Wu, Y. Hao, T. Lv, S. Huang, L. Cui, Q. Ye, F. Wan, and F. Wei. Geometric-mean policy optimization. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=nCEs0tSwc2
work page 2026
-
[41]
Group Sequence Policy Optimization
C. Zheng, S. Liu, M. Li, X.-H. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, J. Zhou, and J. Lin. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025a. URL https://arxiv.org/abs/2507.18071
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
H. Zheng, Y. Zhou, B. R. Bartoldson, B. Kailkhura, F. Lai, J. Zhao, and B. Chen. Act only when it pays: Efficient reinforcement learning for LLM reasoning via selective rollouts. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025b. URL https://openreview.net/forum?id=x5lITYXmW2
work page 2025
-
[43]
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016
work page 2016
-
[46]
BNPO: Beta Normalization Policy Optimization. 2025
work page 2025
- [47]
-
[51]
X. Zhang, Z. Huang, Y. Li, C. Ni, J. Chen, and S. Oymak. 2025
work page 2025
-
[53]
KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning. 2025
work page 2025
-
[59]
A Survey on Knowledge Distillation of Large Language Models. 2024
work page 2024
-
[60]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025
work page 2025
-
[61]
X. Zhang, S. Wu, Y. Zhu, H. Tan, S. Yu, Z. He, and J. Jia. Scaf-. 2026
work page 2026
-
[62]
Knowledge Distillation from A Stronger Teacher. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[71]
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. 2025
work page 2025
-
[74]
Y. Kim and A. M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016. doi:10.18653/v1/D16-1139
- [76]
- [77]
-
[79]
SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts. 2026
work page 2026