Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Guo Yu; Han-Jia Ye; Hao-Xuan Ma; Jun-Peng Jiang; Wenlin Liu; Yulan Hu

arxiv: 2606.13657 · v1 · pith:AKJYG3H5new · submitted 2026-06-11 · 💻 cs.LG

Pith reviewed 2026-06-27 07:09 UTC · model grok-4.3

classification 💻 cs.LG

keywords on-policy distillation · parameter updates · sparsity · update geometry · language models · vision-language models · subnetwork training · post-training

The pith

On-policy distillation yields sparse coordinate updates that avoid principal weight directions even with dense teacher supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes parameter changes during on-policy distillation across language and vision-language models. It shows that updates remain small and sparse across coordinates, concentrated in feed-forward layers, and distributed through the network rather than becoming dense rewrites. A discovered subnetwork can be trained alone to recover nearly the same gains as full distillation. Geometrically, the changes are full-rank yet spectrally concentrated, lying mostly outside the principal singular directions of the original weights and landing disproportionately on coordinates where the original weights are near zero. These patterns indicate that dense supervision preserves distinctive signatures of on-policy post-training instead of erasing them.
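The two size measurements named above (how small an update is, and how sparse it is across coordinates) can be sketched as follows. This is a minimal illustration, not the paper's code: the function names, the threshold `eps`, and the toy tensors are assumptions, and the paper's exact sparsity definition is not visible on this page.

```python
import numpy as np

def relative_delta_norm(w_src, w_new):
    """r = ||W_new - W_src||_F / ||W_src||_F for a single weight tensor."""
    delta = w_new - w_src
    return float(np.linalg.norm(delta) / np.linalg.norm(w_src))

def coordinate_sparsity(w_src, w_new, eps=1e-5):
    """Fraction of coordinates whose absolute change is at most eps.

    eps is an assumed threshold chosen for illustration only.
    """
    return float(np.mean(np.abs(w_new - w_src) <= eps))

# Toy checkpoint pair: perturb roughly 5% of the coordinates by a small amount.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(64, 64))
w_new = w_src.copy()
idx = rng.choice(w_src.size, size=w_src.size // 20, replace=False)
w_new.flat[idx] += 1e-3 * rng.normal(size=idx.size)

r = relative_delta_norm(w_src, w_new)
s = coordinate_sparsity(w_src, w_new)
print(f"relative delta norm r = {r:.2e}, coordinate sparsity = {s:.3f}")
```

In practice the same two functions would be applied per parameter tensor to a source checkpoint and a post-trained checkpoint, which is one way to obtain a layerwise breakdown.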

Core claim

Across several language and vision-language model pairs, on-policy distillation produces coordinate-sparse updates that are distributed across layers and usually FFN-heavy. These updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. Training only the discovered subnetwork recovers nearly the same performance as full OPD. In the paper's optimizer ablation, the sparsity-inducing SGD optimizer underperforms AdamW, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where adaptive scaling remains useful.
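One way to check the claim that updates fall disproportionately on near-zero source coordinates is an enrichment ratio: the update rate among the smallest-magnitude source weights divided by the overall update rate. The sketch below is an assumed diagnostic, not the paper's metric; `near_zero_enrichment`, `eps`, and the 20% quantile cutoff are all hypothetical choices.

```python
import numpy as np

def near_zero_enrichment(w_src, delta, eps=1e-5, quantile=0.2):
    """Update rate on near-zero source coordinates relative to the overall rate.

    A value above 1 means updated coordinates are over-represented among the
    source weights with the smallest magnitudes.
    """
    updated = np.abs(delta) > eps                      # coordinates that visibly moved
    cutoff = np.quantile(np.abs(w_src), quantile)      # assumed "near zero" cutoff
    near_zero = np.abs(w_src) <= cutoff
    return float(updated[near_zero].mean() / updated.mean())

# Toy example: place the entire update on the 200 smallest-magnitude weights.
rng = np.random.default_rng(0)
w_src = rng.normal(size=(64, 64))
delta = np.zeros_like(w_src)
smallest = np.argsort(np.abs(w_src), axis=None)[:200]
delta.flat[smallest] = 1e-3

ratio = near_zero_enrichment(w_src, delta)
print(f"near-zero enrichment ratio = {ratio:.2f}")
```

A ratio near 1 would indicate no preference for near-zero coordinates; the toy construction above deliberately produces a ratio well above 1.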

What carries the argument

Coordinate sparsity combined with spectral concentration of the parameter updates away from principal singular subspaces.
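The geometric half of this statement can be probed with two quantities: how much of an update's energy lies in the top-k singular subspaces of the source weights, and how much of its energy sits in its own top singular directions. The sketch below assumes these definitions for illustration; the names and the choice of `k` are hypothetical, and the paper may measure this differently.

```python
import numpy as np

def principal_energy_fraction(w_src, delta, k):
    """Share of ||delta||_F^2 inside the top-k left/right singular subspaces of w_src."""
    u, _, vt = np.linalg.svd(w_src, full_matrices=False)
    u_k, v_k = u[:, :k], vt[:k, :].T
    projected = u_k @ (u_k.T @ delta @ v_k) @ v_k.T
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(delta) ** 2)

def spectral_concentration(delta, k):
    """Share of ||delta||_F^2 carried by delta's own top-k singular values."""
    s = np.linalg.svd(delta, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

rng = np.random.default_rng(0)
w_src = rng.normal(size=(32, 32))
u, _, vt = np.linalg.svd(w_src, full_matrices=False)

delta_on = np.outer(u[:, 0], vt[0])     # aligned with the top principal direction
delta_off = np.outer(u[:, -1], vt[-1])  # aligned with the weakest direction
on = principal_energy_fraction(w_src, delta_on, k=4)
off = principal_energy_fraction(w_src, delta_off, k=4)

# Full-rank yet spectrally concentrated: one dominant direction plus tiny dense noise.
delta_mixed = delta_off + 1e-3 * rng.normal(size=(32, 32))
rank = int(np.linalg.matrix_rank(delta_mixed))
conc = spectral_concentration(delta_mixed, k=1)
print(on, off, rank, conc)
```

The `delta_mixed` case shows why "numerically full-rank" and "spectrally concentrated" are compatible: every singular value is nonzero, yet almost all of the energy sits in a single direction.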

If this is right

Training only the sparse subnetwork identified by the update locations recovers nearly full OPD performance.
AdamW outperforms the sparsity-inducing SGD optimizer in the paper's ablation, likely because adaptive scaling handles the heterogeneous coordinate-wise gradient scales that dense supervision preserves.
Updates lie mostly away from the principal singular subspaces of the original weights.
Updates land disproportionately on coordinates where source weights are near zero.
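The subnetwork result in the first consequence above can be sketched on a toy problem: run a full update, read off which coordinates moved, then retrain from the source with every other coordinate frozen. Everything here is an assumption for illustration (a quadratic loss, plain gradient steps, the names `discover_mask` and `masked_step`); the paper's experiments use real models and its own training setup.

```python
import numpy as np

def discover_mask(w_src, w_trained, eps=1e-5):
    """Boolean mask of coordinates that moved by more than eps (assumed threshold)."""
    return np.abs(w_trained - w_src) > eps

def masked_step(w, grad, mask, lr=0.1):
    """Gradient step that only touches coordinates inside the mask."""
    return w - lr * grad * mask

rng = np.random.default_rng(0)
w_src = rng.normal(size=(8, 8))
target = w_src.copy()
target[0, :3] += 0.5  # toy task: only three coordinates need to change

def loss_grad(w):
    # Gradient of the toy loss 0.5 * ||w - target||_F^2.
    return w - target

# Full run: every coordinate is free to move.
w_full = w_src.copy()
for _ in range(200):
    w_full = w_full - 0.1 * loss_grad(w_full)

# Subnetwork run: restart from the source and train only the discovered mask.
mask = discover_mask(w_src, w_full)
w_sub = w_src.copy()
for _ in range(200):
    w_sub = masked_step(w_sub, loss_grad(w_sub), mask)

print(int(mask.sum()), float(np.abs(w_sub - w_full).max()))
```

On this separable toy loss the masked run matches the full run by construction, so it only illustrates the procedure. Whether a real model recovers full performance is the empirical question the paper addresses.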

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed sparsity pattern could be used to design cheaper post-training pipelines that update only a small fraction of parameters from the start.
Similar geometric signatures may appear in other forms of on-policy alignment or preference optimization beyond distillation.
Inspecting the magnitude and direction of early updates might allow early detection of whether a distillation run will succeed without completing the full schedule.

Load-bearing premise

The sparsity and geometric patterns observed in the selected language and vision-language model pairs generalize beyond the specific models, tasks, and training setups examined.

What would settle it

Observing coordinate-dense updates whose energy concentrates in the principal singular subspaces of the source weights, when the same OPD procedure is run on a new model family or task, would falsify the claim that OPD retains on-policy geometric signatures. Full rank alone would not, since the paper already reports the updates as numerically full-rank.

Figures

Figures reproduced from arXiv: 2606.13657 by Guo Yu, Han-Jia Ye, Hao-Xuan Ma, Jun-Peng Jiang, Wenlin Liu, Yulan Hu.

**Figure 1.** Checkpoint deltas show that OPD-style updates are small, coordinate-sparse, spectrally concentrated, and off-principal. Gray bars are reference runs rather than OPD runs: Qwen-Math Distill is an offline distillation-style contrast, and the remaining gray bars are RLVR references. Relative delta norm r = ‖ΔW‖_F / ‖W_src‖_F measures update size relative to the source tensor, while visible coordinate sparsity s_ε =…

**Figure 2.** Layerwise and per-parameter-matrix update sparsity for DS-Qwen OPD. Sparsity…

**Figure 3.** Comparison of OPD reasoning performance along the training process, reported as average validation accuracy over AIME24 and AIME25. Left: full OPD versus OPD trained with the OPD, RLVR, or random masks. Right: AdamW versus SGD in the JustRL-teacher OPD setting. Benchmark-wise curves are shown in Appendix A.7.

**Figure 4.** Benchmark-wise breakdown of the subnetwork-masked…

**Figure 5.** Benchmark-wise breakdown of the AdamW-versus-SGD experiment in Figure…

**Figure 6.** AdamW optimizer-state diagnostics for JustRL-teacher…

**Figure 7.** Qwen2.5-VL subnetwork-masked OPD on Geo3K. The full-OPD mask closely tracks the full run, while density-matched random masks and the smaller GRPO mask underperform.

Original abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
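The abstract's suggested explanation for the optimizer ablation rests on a standard property of Adam-style updates: per-coordinate normalization makes step sizes roughly independent of gradient scale, whereas plain SGD steps inherit that scale. The sketch below shows this with a hand-written first Adam step. It is a generic illustration with assumed gradient values, it omits the decoupled weight decay that distinguishes AdamW, and it does not reproduce the paper's experiment.

```python
import numpy as np

def sgd_update(grad, lr):
    """Plain SGD step: proportional to the raw gradient."""
    return -lr * grad

def adam_first_update(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step from zero-initialized moments, with bias correction."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)   # bias-corrected first moment
    v_hat = v / (1 - beta2)   # bias-corrected second moment
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

# Assumed gradient with coordinate scales spanning four orders of magnitude.
grad = np.array([1e-4, 1e-2, 1.0])
sgd = np.abs(sgd_update(grad, lr=1e-3))
adam = np.abs(adam_first_update(grad, lr=1e-3))

sgd_ratio = float(sgd.max() / sgd.min())
adam_ratio = float(adam.max() / adam.min())
print(f"SGD step ratio = {sgd_ratio:.1e}, Adam step ratio = {adam_ratio:.4f}")
```

Under SGD the smallest-gradient coordinate moves about four orders of magnitude less than the largest, while the Adam steps are nearly equal in size, which is the sense in which adaptive scaling can stay useful when gradient scales are heterogeneous.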

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OPD keeps coordinate-sparse, FFN-heavy, spectrally off-principal updates even with dense teacher signals, but the patterns are shown only on a narrow set of model pairs.

The main thing here is that on-policy distillation does not collapse into ordinary dense rewriting. The updates stay small and coordinate-sparse, land mostly in FFN layers, concentrate in a few spectral directions away from the source weight principal subspaces, and hit coordinates where the original weights are already near zero. These are presented as direct measurements across the language and vision-language pairs they tested.

The concrete observations on sparsity structure and geometry are new relative to the cited work, and the subnetwork experiment is a practical plus: freezing everything outside the discovered sparse set recovers nearly the same performance. The optimizer ablation is also direct: SGD loses to AdamW, which the authors attribute, as a likely explanation, to dense supervision keeping per-coordinate gradient magnitudes heterogeneous.

The soft spot is scope. The patterns are reported on several but still limited model pairs and tasks, with no numbers, error bars, or exclusion details visible in the abstract. If the sparsity and spectral concentration are artifacts of the chosen scales or data distributions, the claim that OPD retains distinct on-policy geometric signatures does not follow. Nothing in the abstract indicates whether the same structure appears at larger scales or in architectures beyond the language and vision-language pairs tested.

This is for readers already working on distillation or post-training analysis who want empirical handles on what changes in the weights. It is coherent on its own terms and engages the literature without circularity, so it clears the bar for a serious referee even if the experiments need expansion on generality.

Referee Report

2 major / 0 minor

Summary. The paper analyzes parameter updates under on-policy distillation (OPD) across several language and vision-language model pairs. It reports that OPD produces small, coordinate-sparse updates that are distributed across layers and FFN-heavy; these updates are numerically full-rank but spectrally concentrated, lie away from the principal singular subspaces of the source weights, and disproportionately affect coordinates where source weights are near zero. An optimizer ablation finds that the sparsity-inducing SGD underperforms AdamW. The central interpretation is that dense teacher supervision does not convert OPD into ordinary dense rewriting but preserves geometric signatures of on-policy post-training. Training only the discovered sparse subnetwork is claimed to recover nearly the same performance as full OPD.

Significance. If the reported sparsity patterns and geometric signatures hold and generalize, the work would provide a mechanistic explanation for OPD's behavior and could guide the design of sparse or subnetwork-based post-training procedures that retain on-policy benefits while reducing compute.

major comments (2)

[Abstract] The claim that 'training only the discovered subnetwork recovers nearly the same performance as full OPD' is presented without any quantitative metrics, error bars, dataset details, or exclusion criteria, so the degree of support for the operational usefulness of the sparse structure cannot be assessed.
[Abstract] The central interpretation that OPD 'retains important geometric signatures of on-policy post-training' rests on patterns observed in a handful of model pairs; no results or discussion address whether the coordinate-sparsity, FFN-heavy distribution, spectral concentration, or near-zero-weight preference persist under different scales, architectures, or task distributions, which is load-bearing for the claim that the patterns are not artifacts of the chosen setups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract] The claim that 'training only the discovered subnetwork recovers nearly the same performance as full OPD' is presented without any quantitative metrics, error bars, dataset details, or exclusion criteria, so the degree of support for the operational usefulness of the sparse structure cannot be assessed.

Authors: We agree that the abstract would benefit from additional quantitative detail to support this claim. The full manuscript reports specific performance metrics on the evaluated benchmarks and datasets, including comparisons between full OPD and subnetwork training. We will revise the abstract to incorporate key quantitative results (e.g., relative performance recovery percentages), reference the datasets, and note any error bars or exclusion criteria from our experiments.

Revision: yes

Referee: [Abstract] The central interpretation that OPD 'retains important geometric signatures of on-policy post-training' rests on patterns observed in a handful of model pairs; no results or discussion address whether the coordinate-sparsity, FFN-heavy distribution, spectral concentration, or near-zero-weight preference persist under different scales, architectures, or task distributions, which is load-bearing for the claim that the patterns are not artifacts of the chosen setups.

Authors: We acknowledge that our analysis covers a selection of model pairs and that we do not present explicit experiments testing persistence across arbitrary scales, architectures, or task distributions. The interpretation is based on the consistent patterns observed in the reported setups. We will add a discussion paragraph qualifying the empirical scope of the findings and noting that broader validation would strengthen the claim that the signatures are not setup-specific artifacts.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivation chain

The paper reports direct experimental observations on sparsity (coordinate-sparse, FFN-heavy updates) and geometry (spectrally concentrated, away from principal subspaces, on near-zero weights) of OPD updates in language and vision-language models. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation load-bearing claims appear in the provided text. All central claims rest on measured patterns from the experiments themselves, with no reduction to inputs by construction. This is a standard non-circular empirical analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical measurement study; the abstract introduces no new free parameters, mathematical axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5762 in / 1012 out tokens · 23988 ms · 2026-06-27T07:09:42.703787+00:00 · methodology
