Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
Pith reviewed 2026-05-08 14:15 UTC · model grok-4.3
The pith
Asynchronous generation with selective filtering lets on-policy distillation run at SFT speeds while staying near the student policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Near-Policy Distillation reformulates on-policy distillation as an asynchronous process: the student generates tokens independently, the data is packed for efficient supervised training, and two lightweight safeguards—sparse student refreshes and Δ-IFD sample filtering—keep the optimization inside a proximal zone around the current policy. The resulting procedure delivers the efficiency of SFT while preserving the distribution-matching benefits that on-policy methods normally obtain only through costly RL loops.
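To make the decoupling concrete, here is a minimal sketch of what such a loop could look like, built from the abstract's description and nothing more. The function names (generate, sft_step, keep_sample, refresh_snapshot) and the default knob values are illustrative stand-ins, not the paper's API.

```python
import queue
import threading

def run_npd(generate, sft_step, keep_sample, refresh_snapshot,
            prompts, num_steps=10_000, batch_size=64, refresh_every=512):
    """Asynchronous near-policy distillation loop (sketch).

    generate(prompt) -> sample     decoding with a frozen student snapshot
    sft_step(batch)                one packed, SFT-style distillation update
    keep_sample(sample) -> bool    the Delta-IFD filter (sketched below)
    refresh_snapshot()             copy current student weights into the
                                   generator, i.e. the sparse student refresh
    """
    q = queue.Queue(maxsize=4096)

    def producer():
        # Generation runs independently of training and never waits for it.
        for p in prompts:
            q.put(generate(p))

    threading.Thread(target=producer, daemon=True).start()

    for step in range(1, num_steps + 1):
        batch = [q.get() for _ in range(batch_size)]
        batch = [s for s in batch if keep_sample(s)]   # Delta-IFD filtering
        if batch:
            sft_step(batch)                            # packed SFT update
        if step % refresh_every == 0:
            refresh_snapshot()                         # bound policy lag
```

Note the design choice the abstract emphasizes: the training side looks exactly like SFT, and all near-policy behavior is delegated to keep_sample and refresh_snapshot.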
What carries the argument
The Δ-IFD filtering mechanism, which selects training samples whose importance-weighted deviation stays within a safe threshold, paired with asynchronous student generation and periodic sparse updates to counteract policy lag.
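The paper does not publish the Δ-IFD formula (a point the referee raises below), so the following is only one plausible reconstruction. The IFD score of Li et al. (2024) is the ratio of a model's conditional response loss to its unconditional response loss; Δ-IFD is read here as the student-teacher gap in that score. The threshold tau and the Hugging-Face-style model interface are both assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ifd_score(model, prompt_ids, response_ids):
    """IFD (Li et al., 2024): conditional response loss divided by
    unconditional response loss. Higher means harder to follow.
    prompt_ids, response_ids: 1-D LongTensors of token ids."""
    def mean_nll(input_ids, target_start):
        logits = model(input_ids.unsqueeze(0)).logits[0]
        preds = logits[target_start - 1:-1]      # next-token predictions
        targets = input_ids[target_start:]
        return F.cross_entropy(preds, targets)
    cond = mean_nll(torch.cat([prompt_ids, response_ids]), len(prompt_ids))
    uncond = mean_nll(response_ids, 1)
    return (cond / uncond).item()

def delta_ifd_keep(student, teacher, prompt_ids, response_ids, tau=0.5):
    """Hypothetical Delta-IFD filter: keep a sample only while student and
    teacher IFD scores stay within tau of each other. The paper's exact
    definition and threshold are not published; tau would be tuned."""
    gap = abs(ifd_score(student, prompt_ids, response_ids)
              - ifd_score(teacher, prompt_ids, response_ids))
    return gap <= tau
```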
If this is right
- Training throughput rises roughly 8× because sequence packing and SFT-style loops replace per-step RL rollouts (see the packing sketch after this list).
- A 1B model trained this way can surpass a 1.7B baseline on the reported benchmark, showing that a narrower exploration space translates into better final performance.
- The method supplies a cheap warm-start policy that subsequent RL stages can refine without starting from a fully off-policy SFT checkpoint.
- Because the safeguards are heuristic and lightweight, they can be added to existing distillation pipelines with minimal code change.
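The first bullet above leans on sequence packing; a generic first-fit sketch follows, assuming tokenized samples as lists of token ids. The paper does not specify its packing scheme, and a production version would also need block-diagonal attention masks and per-segment position ids so packed samples do not attend to each other.

```python
def pack_sequences(samples, block_len=4096, pad_id=0):
    """Greedy first-fit packing: concatenate tokenized samples into
    fixed-length rows so almost no compute is spent on padding.
    `samples` is an iterable of token-id lists (an assumed format)."""
    rows, cur = [], []
    for ids in samples:
        ids = ids[:block_len]                 # truncate oversized samples
        if len(cur) + len(ids) > block_len:   # row full: flush it
            rows.append(cur + [pad_id] * (block_len - len(cur)))
            cur = []
        cur.extend(ids)
    if cur:
        rows.append(cur + [pad_id] * (block_len - len(cur)))
    return rows
```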
Where Pith is reading between the lines
- The same decoupling pattern could be applied to other on-policy algorithms that currently pay a heavy rollout cost, such as certain preference optimization loops.
- If the Δ-IFD threshold can be tuned automatically rather than by hand, the approach might generalize across model scales without per-task retuning.
- The observed narrowing of the exploration space suggests the filtered data acts as a soft curriculum; testing whether this effect compounds over multiple distillation rounds would be a direct next experiment.
Load-bearing premise
Sparse student refreshes combined with Δ-IFD filtering are enough to keep the training data close to the current student policy, so that optimization remains beneficial rather than noisy or biased.
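One direct way to probe this premise is to measure how far a generated sample has drifted from the live student. A hedged diagnostic, not the paper's metric: the mean per-token log-probability gap between the frozen snapshot that generated the sample and the current student, which should stay near zero while training remains near-policy.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def policy_lag(current_student, generator_snapshot, input_ids, response_start):
    """Mean per-token log-prob gap between the snapshot that generated a
    sample and the current student. Near zero: still near-policy; large
    and growing: the sample is drifting off-policy. Diagnostic sketch only;
    assumes a Hugging-Face-style causal LM interface."""
    def mean_logp(model):
        logits = model(input_ids.unsqueeze(0)).logits[0]
        preds = logits[response_start - 1:-1]
        targets = input_ids[response_start:]
        return -F.cross_entropy(preds, targets)
    return (mean_logp(generator_snapshot) - mean_logp(current_student)).item()
```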
What would settle it
Run the same downstream tasks with an unfiltered asynchronous baseline and measure whether performance collapses below the on-policy reference or whether the filtered version still matches the reference within statistical noise.
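A minimal sketch of that check, assuming per-seed benchmark scores are available for each condition; the bootstrap approach and the placeholder numbers below are illustrative assumptions, not reported results.

```python
import random

def bootstrap_gap(a, b, n_boot=10_000, seed=0):
    """95% CI on the mean score difference between two conditions."""
    rng = random.Random(seed)
    diffs = sorted(
        sum(rng.choice(a) - rng.choice(b) for _ in range(len(a))) / len(a)
        for _ in range(n_boot)
    )
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Placeholder per-seed scores, for illustration only (not from the paper).
filtered   = [68.7, 68.5, 68.9]   # NPD with Delta-IFD
unfiltered = [65.1, 64.2, 66.0]   # asynchronous, no filtering
on_policy  = [68.8, 68.6, 69.0]   # expensive on-policy reference

lo, hi = bootstrap_gap(filtered, on_policy)
print(f"filtered vs on-policy gap 95% CI: [{lo:.2f}, {hi:.2f}]")
# If this interval straddles zero while unfiltered-vs-on-policy sits clearly
# below it, the filtering is doing the work the paper claims.
```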
Original abstract
Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and the $\Delta$-IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, $\Delta$-IFD prevents noise from dominating the gradients, ensuring updates remain within a safe proximal learning zone. Empirically, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Codes will be released soon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Near-Policy Distillation (NPD), an asynchronous framework for knowledge distillation of autoregressive models. It decouples student generation from training to enable efficient SFT with sequence packing, while using sparse student updates and a Δ-IFD filtering heuristic to counteract policy lag and sample noise, thereby keeping optimization near-policy. The central empirical claims are an 8.1× speedup over on-policy baselines, an 8.09% improvement over standard SFT, and a new SOTA score of 68.73% on openPangu-Embedded-1B that surpasses the larger Qwen3-1.7B model.
Significance. If the stabilization mechanisms prove robust, the work could meaningfully accelerate on-policy distillation pipelines that currently rely on expensive RL frameworks, making them more practical for large-scale language model training. The reported ability to narrow the exploration space for subsequent RL stages and achieve SOTA with a 1B model is potentially impactful for efficient model development.
Major comments (2)
- [Abstract and §4 (Experiments)] The reported 8.1× speedup and 8.09% gain over SFT are presented without any mention of the number of runs, random seeds, variance, statistical significance tests, or precise baseline implementations. This absence directly undermines evaluation of whether the sparse-update + Δ-IFD combination reliably prevents off-policy drift.
- [§3.2 (Δ-IFD Filtering)] The mechanism is introduced as a heuristic that filters extreme out-of-distribution samples to keep updates in a 'safe proximal learning zone,' yet no explicit formula, threshold derivation, or ablation isolating its contribution versus simple sparse updates is provided. Without this, the claim that it prevents selection bias while narrowing the exploration space remains unverified.
Minor comments (1)
- [Abstract] The abstract states 'Codes will be released soon' but provides no link or repository placeholder; this should be clarified or removed for a camera-ready version.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The points raised highlight important aspects of reproducibility and methodological clarity that we will address in the revised manuscript.
Point-by-point responses
- Referee [Abstract and §4 (Experiments)]: the reported 8.1× speedup and 8.09% gain over SFT are presented without any mention of the number of runs, random seeds, variance, statistical significance tests, or precise baseline implementations. This absence directly undermines evaluation of whether the sparse-update + Δ-IFD combination reliably prevents off-policy drift.
- Authors: We agree that the absence of these details limits the ability to assess reliability. In the revised version, we will report all main results as averages over three independent random seeds, including mean and standard deviation. We will also document the precise baseline implementations (including any re-implementation choices and hyperparameters) and note that formal statistical significance tests were omitted due to computational constraints, while variance metrics are provided to support evaluation of the stabilization mechanisms. (Revision: yes)
- Referee [§3.2 (Δ-IFD Filtering)]: the mechanism is introduced as a heuristic that filters extreme out-of-distribution samples to keep updates in a 'safe proximal learning zone,' yet no explicit formula, threshold derivation, or ablation isolating its contribution versus simple sparse updates is provided. Without this, the claim that it prevents selection bias while narrowing the exploration space remains unverified.
- Authors: We acknowledge that the current presentation of Δ-IFD is insufficiently detailed. The revised manuscript will include the explicit mathematical formula for the Δ-IFD score, explain the threshold selection process based on the empirical distribution of IFD values across generated batches, and add a dedicated ablation study in §4 comparing performance with and without Δ-IFD (as well as against sparse updates alone). These additions will directly verify the mechanism's contribution to maintaining near-policy behavior and mitigating selection bias. (Revision: yes)
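If the threshold really is set from the empirical distribution of IFD values across generated batches, one plausible implementation is a rolling percentile cutoff; the percentile q and window size here are assumptions, not the authors' published procedure.

```python
from collections import deque

class PercentileThreshold:
    """Rolling percentile cutoff over recent Delta-IFD scores: a sample is
    kept when its score falls below the q-th percentile of a sliding
    window, so the threshold adapts per batch rather than being hand-set.
    An assumed scheme, not the paper's published Delta-IFD procedure."""
    def __init__(self, q=0.9, window=2048):
        self.q = q
        self.scores = deque(maxlen=window)

    def keep(self, delta_ifd: float) -> bool:
        self.scores.append(delta_ifd)
        ranked = sorted(self.scores)
        cutoff = ranked[int(self.q * (len(ranked) - 1))]
        return delta_ifd <= cutoff
```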
Circularity Check
No significant circularity; purely empirical engineering claims
Full rationale
The paper presents NPD as an asynchronous engineering reformulation of on-policy distillation, relying on sparse updates and the heuristic Δ-IFD filter to maintain near-policy behavior. No equations, derivations, fitted parameters, or uniqueness theorems appear in the provided abstract or referenced full text. All central claims (8.1× speedup, +8.09% over SFT, SOTA score) are framed as experimental outcomes rather than reductions to self-defined inputs or self-citations. The method is self-contained against external benchmarks with no load-bearing step that collapses by construction.
Axiom & Free-Parameter Ledger
Invented entities (2)
- Δ-IFD filtering mechanism (no independent evidence)
- Near-Policy Distillation (NPD) (no independent evidence)