MiniLLM: On-Policy Distillation of Large Language Models
Pith reviewed 2026-05-12 17:34 UTC · model grok-4.3
The pith
MiniLLM distills large language models using reverse KL divergence to create smaller models with superior generation quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Switching the distillation objective to reverse Kullback-Leibler divergence and optimizing it on-policy lets a smaller student language model match the teacher's generative distribution more closely than standard forward-KL methods, avoiding overestimation of low-probability tokens and yielding higher-quality, better-calibrated outputs.
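For orientation, the two objectives the claim contrasts, written for a teacher distribution p and student q_θ over responses y to a prompt x (notation ours, not taken from the paper):

```latex
\begin{aligned}
\text{forward KLD:}\quad & \mathrm{KL}(p \,\|\, q_\theta)
  = \mathbb{E}_{y \sim p(\cdot \mid x)}\bigl[\log p(y \mid x) - \log q_\theta(y \mid x)\bigr],\\
\text{reverse KLD:}\quad & \mathrm{KL}(q_\theta \,\|\, p)
  = \mathbb{E}_{y \sim q_\theta(\cdot \mid x)}\bigl[\log q_\theta(y \mid x) - \log p(y \mid x)\bigr].
\end{aligned}
```

The forward direction penalizes the student wherever the teacher has mass, pushing it to cover the teacher's full distribution including its low-probability tail; the reverse direction penalizes the student only where it places mass the teacher does not, which is the mode-seeking behavior behind the claimed avoidance of low-probability overestimation.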
What carries the argument
The on-policy optimization procedure derived to minimize the reverse KLD from the student to the teacher's autoregressive distribution, estimated on sequences sampled from the student itself.
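A minimal sketch of the kind of on-policy estimator this refers to, reduced to a single categorical step so it runs standalone; the vocabulary size, sample count, and variable names are illustrative assumptions, not the paper's implementation:

```python
# Score-function (REINFORCE-style) estimate of the reverse-KL gradient
#   grad_theta KL(q_theta || p)
#     = E_{y ~ q_theta}[(log q_theta(y) - log p(y)) * grad_theta log q_theta(y)],
# with samples drawn from the student -- the "on-policy" part.
import torch

torch.manual_seed(0)
vocab = 8
teacher_logits = torch.randn(vocab)                       # fixed teacher p
student_logits = torch.randn(vocab, requires_grad=True)   # student q_theta

log_p = torch.log_softmax(teacher_logits, dim=-1)
log_q = torch.log_softmax(student_logits, dim=-1)

n = 4096
y = torch.multinomial(log_q.detach().exp(), n, replacement=True)  # y ~ q_theta

# Detach the per-sample "reward" so gradients flow only through the
# log-probability of the sampled token (the score function).
advantage = (log_q - log_p).detach()[y]
loss = (advantage * log_q[y]).mean()
loss.backward()

# Sanity check against autograd on the closed-form KL for this toy case.
ref_logits = student_logits.detach().clone().requires_grad_(True)
ref_log_q = torch.log_softmax(ref_logits, dim=-1)
(ref_log_q.exp() * (ref_log_q - log_p)).sum().backward()
print("Monte Carlo grad:", student_logits.grad)
print("exact grad:      ", ref_logits.grad)
```

At sequence level the same idea is applied token by token on student rollouts, typically with variance-reduction machinery (baselines, length normalization) that this sketch omits.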
If this is right
- MiniLLM produces more precise responses with higher overall quality than standard KD baselines.
- The student models show lower exposure bias and better calibration on generative tasks.
- Long-text generation performance improves relative to forward-KL distillation.
- The approach scales across model families with sizes from 120M to 13B parameters.
Where Pith is reading between the lines
- The same reverse-KL on-policy approach could be tested on non-instruction tasks such as code completion or summarization to check if the quality gains generalize.
- Applying the method to models larger than 13B would test whether the reported stability holds at greater scale.
- Reverse KL distillation may reduce the need for heavy prompt engineering in student models because it discourages overgeneration of low-probability continuations.
Load-bearing premise
The on-policy optimization of reverse KLD remains stable and effective as student sizes and teacher families change, without new instabilities appearing.
What would settle it
Training a 120M student from a 13B teacher outside the tested families produces divergence or no gain in response precision compared with forward-KL baselines.
Original abstract
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective on-policy optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MiniLLM, a knowledge distillation method for white-box LLMs that replaces the standard forward KL divergence objective with reverse KL divergence to avoid overestimating low-probability regions of the teacher, then derives an on-policy optimization procedure that samples from the student to estimate the expectation. In instruction-following experiments, the resulting student models (120M–13B parameters) are reported to outperform baselines on precision, overall quality, exposure bias, calibration, and long-text generation, with claims of scalability across model families.
Significance. If the on-policy reverse-KLD estimator proves stable and the reported gains are robust, the work would offer a principled and practical route to distilling open-source LLMs, directly addressing exposure bias and calibration issues that standard KD methods struggle with in generative settings.
major comments (2)
- [§3.2] §3.2 (on-policy estimator derivation): the unbiasedness of the reverse-KLD estimator obtained by sampling from the student holds only under sufficient support overlap with the teacher; the manuscript provides no variance bounds, bias analysis, or failure-mode discussion for the 10–100× size gaps (e.g., 120M vs. 13B) that are central to the scalability claim.
- [§4] §4 (experimental validation): the reported gains rest on a limited set of student–teacher size/family pairs without a systematic sweep of the size gap or cross-family transfer; this leaves open the possibility that the observed stability and metric improvements are regime-specific rather than generally reliable.
minor comments (2)
- [Table 1] Table 1 and Figure 3: error bars or statistical significance tests are absent, making it difficult to assess whether the reported improvements are reliable across random seeds.
- [§2.2] §2.2: the motivation for preferring reverse over forward KL is clear, but a short comparison table of the two objectives under the same on-policy estimator would help readers see the concrete difference.
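To make the contrast in the second minor comment concrete, here is a toy sketch (ours, not the paper's): a single-Gaussian student fitted to a bimodal Gaussian-mixture teacher under each objective, with both divergences estimated by Monte Carlo. The mixture, sample counts, and optimizer settings are assumptions made for illustration.

```python
# Forward vs reverse KL on a capacity-limited student:
# mass-covering (forward) vs mode-seeking (reverse) behavior.
import torch
from torch.distributions import Categorical, MixtureSameFamily, Normal

torch.manual_seed(0)
teacher = MixtureSameFamily(
    Categorical(probs=torch.tensor([0.5, 0.5])),
    Normal(torch.tensor([-4.0, 4.0]), torch.tensor([1.0, 1.0])),
)

def fit(reverse: bool, steps: int = 3000, n: int = 512):
    mu = torch.tensor(0.5, requires_grad=True)          # slightly asymmetric init
    log_sigma = torch.tensor(0.0, requires_grad=True)
    opt = torch.optim.Adam([mu, log_sigma], lr=0.02)
    for _ in range(steps):
        student = Normal(mu, log_sigma.exp())
        if reverse:
            # KL(q || p): sample from the student (reparameterized).
            y = student.rsample((n,))
            loss = (student.log_prob(y) - teacher.log_prob(y)).mean()
        else:
            # KL(p || q): sample from the teacher; only the cross-entropy
            # term depends on the student's parameters.
            y = teacher.sample((n,))
            loss = -student.log_prob(y).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mu.item(), log_sigma.exp().item()

print("forward KL  (mean, std):", fit(reverse=False))   # broad, straddles both modes
print("reverse KL  (mean, std):", fit(reverse=True))    # commits to one mode
```

Forward KL yields a wide student straddling both modes; reverse KL commits to one mode, mirroring the over- versus under-estimation of low-probability regions at issue in §2.2.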
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the theoretical foundations and experimental scope of MiniLLM. We address each major comment below with honest assessment of the manuscript's current state and planned revisions.
Point-by-point responses
-
Referee: [§3.2] §3.2 (on-policy estimator derivation): the unbiasedness of the reverse-KLD estimator obtained by sampling from the student holds only under sufficient support overlap with the teacher; the manuscript provides no variance bounds, bias analysis, or failure-mode discussion for the 10–100× size gaps (e.g., 120M vs. 13B) that are central to the scalability claim.
Authors: We agree that practical stability of the on-policy reverse-KL estimator depends on adequate support overlap, even though the Monte Carlo estimator itself is unbiased for the expectation under the student distribution. The manuscript did not provide variance bounds, bias analysis, or explicit failure-mode discussion for large capacity gaps. In revision we will add a dedicated paragraph to §3.2 that (i) recalls the full-support property of softmax-based language models, (ii) reports observed gradient variance from our training runs, and (iii) notes the risk of mode collapse or high variance when the student support becomes too narrow relative to the teacher. Deriving general theoretical bounds remains an open question and is acknowledged as a limitation. revision: partial
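A hedged toy probe of the point both sides raise (our construction, not the paper's analysis): the single-sample term log q(y) - log p(y) for y drawn from the student grows in both mean and spread as the student's mass drifts away from the teacher's, which is what inflates gradient variance when support overlap is poor. The vocabulary size and shift scheme are illustrative.

```python
# Single-sample reverse-KL terms as student-teacher overlap shrinks.
import torch

torch.manual_seed(0)
vocab = 100
idx = torch.arange(vocab, dtype=torch.float32)

# Teacher: a soft bump centered on token 20 (width ~5 tokens).
log_p = torch.log_softmax(-0.5 * (idx - 20.0) ** 2 / 25.0, dim=-1)

for shift in (0, 10, 20, 30):
    # Student: the same bump shifted progressively away from the teacher.
    log_q = torch.log_softmax(-0.5 * (idx - 20.0 - shift) ** 2 / 25.0, dim=-1)
    y = torch.multinomial(log_q.exp(), 50_000, replacement=True)   # y ~ q
    term = (log_q - log_p)[y]
    print(f"shift={shift:2d}  reverse-KL~{term.mean().item():6.2f}  "
          f"per-sample std={term.std().item():5.2f}")
```

The widening per-sample spread in the shifted regimes is the regime a variance or failure-mode discussion in §3.2 would need to characterize.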
-
Referee: [§4] §4 (experimental validation): the reported gains rest on a limited set of student–teacher size/family pairs without a systematic sweep of the size gap or cross-family transfer; this leaves open the possibility that the observed stability and metric improvements are regime-specific rather than generally reliable.
Authors: We acknowledge that the main experiments focus on a modest number of LLaMA-family pairs plus a few GPT-2 variants, without an exhaustive sweep across all size gaps or cross-family transfers. To strengthen the scalability claim we will add two new results to §4: distillation from a 7B LLaMA teacher to a 120M student (larger gap) and one cross-family transfer using an OPT teacher. These additions provide further evidence of robustness, although a fully exhaustive sweep is computationally prohibitive and is noted as future work. revision: yes
Circularity Check
The derivation of the reverse-KLD on-policy estimator is self-contained from first principles.
Full rationale
The paper replaces forward KLD with reverse KLD and derives an on-policy sampling estimator directly from the objective; the resulting student models are evaluated on held-out instruction-following metrics that are not algebraically identical to the fitted parameters or the sampling procedure itself. No equations reduce claimed gains (precision, calibration, long-text quality) to quantities defined inside the same optimization loop, and no load-bearing self-citation chain or ansatz smuggling is required to reach the central construction. The scalability claim across 120M–13B models is presented as an empirical observation rather than a tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: reverse KL divergence is a valid and preferable objective for matching a student generative distribution to a teacher distribution.
Forward citations
Cited by 27 Pith papers
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
-
Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access
Top-K logit censoring bounds the total-variation diameter of compatible teacher distributions by U_K but permits substantial capability transfer via distillation even when KL divergence is near zero.
-
KL for a KL: On-Policy Distillation with Control Variate Baseline
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...
-
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
-
Fine-Tuning Small Reasoning Models for Quantum Field Theory
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Teacher-Guided Policy Optimization for LLM Distillation
TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
-
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
When Less is Enough: Efficient Inference via Collaborative Reasoning
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
Hybrid Policy Distillation for LLMs
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...
-
Temporally Extended Mixture-of-Experts Models
Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.
-
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
ExecTune: Effective Steering of Black-Box LLMs with Guide Models
ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...
-
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
-
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
-
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.
-
Multilingual Safety Alignment via Self-Distillation
MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
-
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.
-
Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey
The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co...