Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Runyuan Cai; Xiaodong Zeng; Yiming Wang; Yu Lin

arxiv: 2605.28139 · v1 · pith:NVBI3BK5new · submitted 2026-05-27 · 💻 cs.AI

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Yu Lin , Yiming Wang , Runyuan Cai , Xiaodong Zeng This is my paper

Pith reviewed 2026-06-29 11:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords automatic speech recognitionon-policy distillationdata-efficient trainingmodel distillationMandarin ASREnglish ASRsupport overlap

0 comments

The pith

On-policy distillation from a larger teacher lets a 0.6B ASR model beat its same-scale baseline on four of five benchmarks after training on 100k hours of speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether on-policy distillation can transfer additional recognition capability from a strong Qwen-ASR teacher to a compact 0.6B audio-conditioned language model that has already seen 100k hours of speech. The combined recipe improves results over supervised fine-tuning by itself and exceeds the matching-scale baseline on four of five Mandarin and English evaluation sets. This occurs with far less supervised audio than the 20M hours reported for a competing larger encoder. A support-overlap diagnostic indicates that the teacher stage raises local compatibility between student and teacher outputs. If the claim holds, compact ASR models can reach competitive accuracy without relying on the massive audio collections used by some current systems.

Core claim

The authors claim that teacher-guided on-policy training substantially closes the performance gap for compact ASR models under a much smaller audio budget, with the proposed recipe improving over supervised fine-tuning alone across benchmarks and outperforming the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets while the 1.7B model remains stronger.

What carries the argument

On-policy distillation, where the student generates its own outputs and the teacher provides guidance on those outputs, together with a support-overlap diagnostic that measures local student-teacher compatibility.

If this is right

Compact models can narrow much of the accuracy gap to models three times larger while using orders of magnitude less supervised audio.
The support-overlap diagnostic can serve as a practical signal for deciding when distillation is likely to help.
ASR specialization and reproduction become feasible with far smaller data budgets than previously reported.
The same training pattern may allow repeated teacher-guided refinement without collecting new labeled audio each time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could be applied to other audio tasks such as speaker verification or spoken language understanding to test whether data reduction generalizes.
Similar on-policy guidance might lower data needs in related sequence tasks outside speech, such as text-to-speech or machine translation.
Holding the data budget fixed while varying student size could expose new scaling relationships between model capacity and distillation benefit.

Load-bearing premise

The performance gains are produced by the on-policy distillation step itself rather than by other details of data selection or hyperparameter choices.

What would settle it

Retrain the student using the identical supervised fine-tuning stage but without the on-policy distillation stage and check whether the reported advantage over the same-scale baseline disappears.

Figures

Figures reproduced from arXiv: 2605.28139 by Runyuan Cai, Xiaodong Zeng, Yiming Wang, Yu Lin.

**Figure 2.** Figure 2: Ark-ASR OPD training flow. The student generates transcripts on its own audio-conditioned [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

read the original abstract

Building competitive automatic speech recognition (ASR) models usually requires large-scale au- dio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B- parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of super- vised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

On-policy distillation gives a 0.6B ASR model some gains over fine-tuning and a same-size baseline with far less data, but the results do not isolate the distillation step from other recipe choices.

read the letter

The paper takes on-policy distillation, already used in other domains, and applies it to a 0.6B audio-conditioned LM for ASR. With 100k hours they report consistent lifts over supervised fine-tuning and better numbers than the matched Qwen3-ASR-0.6B baseline on four of five Mandarin and English sets. The support-overlap diagnostic is a reasonable addition to check when the teacher and student align locally.

What stands out is the data-efficiency angle: closing most of the gap to a model trained on 20M hours while staying at the same parameter count. That practical framing is useful for anyone trying to build specialized ASR without massive labeled sets.

The soft spot is the missing isolation. The abstract and summary give no ablations that hold data, hyperparameters, and architecture fixed while turning the distillation on or off. Without those, or without error bars and statistical tests, it is hard to credit the on-policy step itself rather than data selection or other details in the training recipe. The 100k-hour versus 20M-hour comparison also needs explicit checks on data quality and preprocessing to hold up.

This is for ASR groups that already run distillation pipelines and want a concrete data point on compact models. A reader looking for a new framework or first-principles result will not find it. The work is coherent on its own terms and shows honest engagement with the efficiency problem, so it deserves a serious referee to see the full methods, controls, and numbers.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Ark-ASR, a 0.6B-parameter audio-conditioned language model trained on 100k hours of speech. It examines on-policy distillation from a Qwen-ASR teacher and claims that the proposed training recipe consistently improves over supervised fine-tuning alone while outperforming the same-scale Qwen3-ASR-0.6B baseline on four of five Mandarin and English ASR benchmarks. This is achieved with far less data than the 20M hours reported for the Qwen3-Omni AuT encoder. A support-overlap diagnostic is introduced to indicate improved local student-teacher compatibility.

Significance. If the reported gains can be rigorously attributed to on-policy distillation, the work would demonstrate a practical route to data-efficient improvement of compact ASR models, narrowing the gap to larger systems under a much smaller audio budget. The support-overlap diagnostic could also supply a reusable tool for predicting when teacher-guided training is effective.

major comments (3)

[Abstract] Abstract and results presentation: performance improvements on benchmarks are stated without error bars, number of runs, statistical tests, ablation controls, or description of baseline matching, so the data-to-claim link cannot be evaluated.
[§4] Experiments section: no ablation studies hold data, hyperparameters, and architecture fixed while toggling only the on-policy distillation stage, leaving the attribution of WER gains to the distillation step (rather than data selection or other recipe details) unproven.
[§5] Support-overlap diagnostic: the claim that the metric identifies when the teacher-data stage improves compatibility is presented without quantitative validation, correlation analysis against actual WER improvements, or controls showing it outperforms simpler overlap measures.

minor comments (1)

[Abstract] The abstract contains a line-break hyphenation artifact ('au- dio').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in statistical rigor, ablation design, and validation of the diagnostic. We will revise the manuscript to address these points where possible, strengthening the attribution of gains and the support-overlap analysis. Responses to each major comment follow.

read point-by-point responses

Referee: [Abstract] Abstract and results presentation: performance improvements on benchmarks are stated without error bars, number of runs, statistical tests, ablation controls, or description of baseline matching, so the data-to-claim link cannot be evaluated.

Authors: We agree the current presentation lacks error bars, run counts, and statistical tests, weakening the evidential link. The manuscript reports single-run WERs on the five benchmarks. In revision we will add: (i) a clear description of baseline matching (identical 0.6B architecture, same 100k-hour training distribution, and identical decoding settings for the SFT and on-policy variants); (ii) error bars computed from three independent runs for the key comparisons; and (iii) a brief note on the absence of formal significance testing due to compute limits, while still reporting the observed deltas. Ablation controls will be expanded in §4 rather than the abstract. revision: partial
Referee: [§4] Experiments section: no ablation studies hold data, hyperparameters, and architecture fixed while toggling only the on-policy distillation stage, leaving the attribution of WER gains to the distillation step (rather than data selection or other recipe details) unproven.

Authors: The referee is correct that the existing comparison (SFT vs. full recipe) does not isolate the distillation stage while freezing data, hyperparameters, and architecture. The manuscript therefore cannot yet rigorously attribute gains solely to on-policy distillation. We will add a controlled ablation in the revised §4 that trains two models on identical data and hyperparameters, differing only in the presence of the on-policy distillation objective, and report the resulting WER deltas on the same five benchmarks. revision: yes
Referee: [§5] Support-overlap diagnostic: the claim that the metric identifies when the teacher-data stage improves compatibility is presented without quantitative validation, correlation analysis against actual WER improvements, or controls showing it outperforms simpler overlap measures.

Authors: We acknowledge that the support-overlap diagnostic is introduced without the requested quantitative validation. The current text offers only a qualitative interpretation. In revision we will add: (i) a correlation plot and coefficient between support-overlap scores and per-benchmark WER reductions across multiple teacher-student configurations; (ii) a direct comparison against simpler baselines such as token-level or n-gram overlap; and (iii) a short discussion of whether the metric provides predictive value beyond those simpler measures. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results with no derivation chain

full rationale

The paper reports empirical ASR training outcomes using on-policy distillation on 100k hours of data, with benchmark WER comparisons to baselines. No equations, fitted parameters, or derivations are presented that reduce to inputs by construction. Claims rest on observed performance deltas rather than any self-definitional, fitted-input, or self-citation load-bearing steps. The support-overlap diagnostic is described as suggestive but not used as a mathematical reduction. This matches the expected non-finding for an empirical methods paper whose central results are externally falsifiable via replication on the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete information on free parameters, background axioms, or newly postulated entities; a full manuscript would be required to audit these items.

pith-pipeline@v0.9.1-grok · 5727 in / 1203 out tokens · 32226 ms · 2026-06-29T11:59:31.340883+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 13 canonical work pages · 9 internal anchors

[1]

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition.arXiv preprint arXiv:1412.5567, 2014. doi: 10.48550/arXiv.1412.5567. URLhttps://arxiv.org/abs/1412.5567

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1412.5567 2014
[2]

Conformer: Convolution-augmented transformer 7 for speech recognition.arXiv preprint arXiv:2005.08100, 2020

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer 7 for speech recognition.arXiv preprint arXiv:2005.08100, 2020. doi: 10.48550/arXiv.2005.08100. URL https://arxiv.org/abs/2005.08100

work page doi:10.48550/arxiv.2005.08100 2005
[3]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision.arXiv preprint arXiv:2212.04356, 2022. doi: 10.48550/arXiv.2212.04356. URLhttps://arxiv.org/abs/2212.04356

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.04356 2022
[4]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025. doi: 10.48550/arXiv.2509.17765. URLhttps://arxiv.org/abs/2509.17765

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.17765 2025
[5]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. doi: 10.48550/arXiv.1503.02531. URL https://arxiv.org/abs/ 1503.02531

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1503.02531 2015
[6]

Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush. Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling.arXiv preprint arXiv:2311.00430, 2023. doi: 10.48550/ arXiv.2311.00430. URLhttps://arxiv.org/abs/2311.00430

work page arXiv 2023
[7]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026. doi: 10.48550/arXiv.2604.13016. URLhttps://arxiv.org/abs/2604.13016

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.13016 2026
[8]

AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline

Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline.arXiv preprint arXiv:1709.05522, 2017. doi: 10.48550/arXiv.1709.05522. URLhttps://arxiv.org/abs/1709.05522

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1709.05522 2017
[9]

Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition.arXiv preprint arXiv:2110.03370, 2022

Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, and Zhendong Peng. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition.arXiv preprint arXiv:2110.03370, 2022. doi: 10.48550/arXiv.2110.03370. URLhttps://arxiv.org/abs/2110.03370

work page doi:10.48550/arxiv.2110.03370 2022
[10]

Librispeech: An asr corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964. URL https://doi.org/10.1109/ICASSP.2015.7178964

work page doi:10.1109/icassp.2015.7178964 2015
[11]

In: Medical Imaging with Deep Learning (MIDL)

Xun Gong, Zhikai Zhou, and Yanmin Qian. Knowledge transfer and distillation from autoregressive to non-autoregressive speech recognition.arXiv preprint arXiv:2207.10600, 2022. doi: 10.48550/arXiv. 2207.10600. URLhttps://arxiv.org/abs/2207.10600

work page internal anchor Pith review doi:10.48550/arxiv 2022
[12]

KL for a KL: On-Policy Distillation with Control Variate Baseline

Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, and Yohan Jo. Kl for a kl: On-policy distillation with control variate baseline.arXiv preprint arXiv:2605.07865, 2026. doi: 10.48550/arXiv.2605.07865. URLhttps://arxiv.org/abs/2605.07865

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.07865 2026
[13]

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, and Jieping Ye. Prefix teach, suffix fade: Local teachability collapse in strong-to-weak on-policy distillation.arXiv preprint arXiv:2605.13643, 2026. doi: 10.48550/arXiv.2605.13643. URL https://arxiv.org/abs/2605. 13643

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.13643 2026
[14]

GLM-ASR: A robust, open-source speech recognition model

Z.ai. GLM-ASR: A robust, open-source speech recognition model. GitHub repository, 2025. URL https://github.com/zai-org/GLM-ASR. 8

2025
[15]

GlmAsr model documentation

Hugging Face. GlmAsr model documentation. Transformers documentation, 2025. URL https: //huggingface.co/docs/transformers/model_doc/glmasr. 9

2025

[1] [1]

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep speech: Scaling up end-to-end speech recognition.arXiv preprint arXiv:1412.5567, 2014. doi: 10.48550/arXiv.1412.5567. URLhttps://arxiv.org/abs/1412.5567

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1412.5567 2014

[2] [2]

Conformer: Convolution-augmented transformer 7 for speech recognition.arXiv preprint arXiv:2005.08100, 2020

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. Conformer: Convolution-augmented transformer 7 for speech recognition.arXiv preprint arXiv:2005.08100, 2020. doi: 10.48550/arXiv.2005.08100. URL https://arxiv.org/abs/2005.08100

work page doi:10.48550/arxiv.2005.08100 2005

[3] [3]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision.arXiv preprint arXiv:2212.04356, 2022. doi: 10.48550/arXiv.2212.04356. URLhttps://arxiv.org/abs/2212.04356

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.04356 2022

[4] [4]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025. doi: 10.48550/arXiv.2509.17765. URLhttps://arxiv.org/abs/2509.17765

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.17765 2025

[5] [5]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. doi: 10.48550/arXiv.1503.02531. URL https://arxiv.org/abs/ 1503.02531

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1503.02531 2015

[6] [6]

Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush. Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling.arXiv preprint arXiv:2311.00430, 2023. doi: 10.48550/ arXiv.2311.00430. URLhttps://arxiv.org/abs/2311.00430

work page arXiv 2023

[7] [7]

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026. doi: 10.48550/arXiv.2604.13016. URLhttps://arxiv.org/abs/2604.13016

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.13016 2026

[8] [8]

AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline

Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline.arXiv preprint arXiv:1709.05522, 2017. doi: 10.48550/arXiv.1709.05522. URLhttps://arxiv.org/abs/1709.05522

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1709.05522 2017

[9] [9]

Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition.arXiv preprint arXiv:2110.03370, 2022

Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, and Zhendong Peng. Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition.arXiv preprint arXiv:2110.03370, 2022. doi: 10.48550/arXiv.2110.03370. URLhttps://arxiv.org/abs/2110.03370

work page doi:10.48550/arxiv.2110.03370 2022

[10] [10]

Librispeech: An asr corpus based on public domain audio books

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An asr corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964. URL https://doi.org/10.1109/ICASSP.2015.7178964

work page doi:10.1109/icassp.2015.7178964 2015

[11] [11]

In: Medical Imaging with Deep Learning (MIDL)

Xun Gong, Zhikai Zhou, and Yanmin Qian. Knowledge transfer and distillation from autoregressive to non-autoregressive speech recognition.arXiv preprint arXiv:2207.10600, 2022. doi: 10.48550/arXiv. 2207.10600. URLhttps://arxiv.org/abs/2207.10600

work page internal anchor Pith review doi:10.48550/arxiv 2022

[12] [12]

KL for a KL: On-Policy Distillation with Control Variate Baseline

Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, and Yohan Jo. Kl for a kl: On-policy distillation with control variate baseline.arXiv preprint arXiv:2605.07865, 2026. doi: 10.48550/arXiv.2605.07865. URLhttps://arxiv.org/abs/2605.07865

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.07865 2026

[13] [13]

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, and Jieping Ye. Prefix teach, suffix fade: Local teachability collapse in strong-to-weak on-policy distillation.arXiv preprint arXiv:2605.13643, 2026. doi: 10.48550/arXiv.2605.13643. URL https://arxiv.org/abs/2605. 13643

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.13643 2026

[14] [14]

GLM-ASR: A robust, open-source speech recognition model

Z.ai. GLM-ASR: A robust, open-source speech recognition model. GitHub repository, 2025. URL https://github.com/zai-org/GLM-ASR. 8

2025

[15] [15]

GlmAsr model documentation

Hugging Face. GlmAsr model documentation. Transformers documentation, 2025. URL https: //huggingface.co/docs/transformers/model_doc/glmasr. 9

2025