pith. machine review for the scientific record.

arxiv: 2306.08543 · v6 · submitted 2023-06-14 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

MiniLLM: On-Policy Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, Minlie Huang

Pith reviewed 2026-05-12 17:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords knowledge distillation · large language models · reverse KL divergence · on-policy optimization · model compression · instruction following · exposure bias

The pith

MiniLLM distills large language models using reverse KL divergence to create smaller models with superior generation quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make knowledge distillation practical for white-box large language models by changing the training objective from forward to reverse Kullback-Leibler divergence. Forward KL tends to make the student assign too much probability mass to regions the teacher barely uses, while reverse KL focuses the student on the teacher's high-probability outputs. The authors derive an on-policy optimization procedure that lets the student sample its own generations during training to minimize this reverse objective. Experiments on instruction-following tasks show the resulting MiniLLM models produce more accurate responses, suffer less exposure bias, calibrate better, and handle longer texts more effectively than prior distillation baselines. The method works across model families and sizes from 120 million to 13 billion parameters.

Core claim

Switching the distillation objective to reverse Kullback-Leibler divergence and optimizing it on-policy lets a smaller student language model match the teacher's generative distribution more closely than standard forward-KL methods, avoiding overestimation of low-probability tokens and yielding higher-quality, better-calibrated outputs.
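
The mode-seeking versus mass-covering distinction at the heart of this claim can be made concrete with a toy calculation. The sketch below (illustrative numbers, not from the paper) compares the two divergences for a bimodal "teacher" over a four-token vocabulary:

```python
import math

def kl(q, p):
    """KL(q || p) for two discrete distributions over the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Hypothetical toy setup: a bimodal teacher and two candidate students.
teacher       = [0.45, 0.45, 0.05, 0.05]   # two high-probability modes
mode_seeking  = [0.85, 0.05, 0.05, 0.05]   # commits to one teacher mode
mass_covering = [0.25, 0.25, 0.25, 0.25]   # spreads mass everywhere

# Reverse KL(student || teacher) prefers the mode-seeking student ...
assert kl(mode_seeking, teacher) < kl(mass_covering, teacher)
# ... while forward KL(teacher || student) prefers the mass-covering one,
# which is the overestimation of low-probability regions the paper argues against.
assert kl(teacher, mass_covering) < kl(teacher, mode_seeking)
```

The asymmetry is the whole argument in miniature: forward KL punishes the student for missing any teacher mode, so a capacity-limited student smears probability; reverse KL punishes the student for sampling where the teacher would not.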

What carries the argument

The on-policy optimization derived to minimize reverse KLD between the teacher and student autoregressive distributions.
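
The on-policy idea is that the expectation in KL(q‖p) = E_{y∼q}[log q(y) − log p(y)] is taken under the student distribution q, so it can be estimated from the student's own samples. A minimal single-step sketch (illustrative distributions, not the paper's full sequence-level algorithm):

```python
import math
import random

random.seed(0)

# Hypothetical teacher p and student q over a small vocabulary.
p = [0.50, 0.30, 0.15, 0.05]
q = [0.60, 0.20, 0.15, 0.05]

# Exact reverse KL(q || p) by summing over the vocabulary.
exact = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# On-policy Monte Carlo estimate: draw tokens from the *student* and
# average log q(y) - log p(y) over the draws.
n = 200_000
samples = random.choices(range(len(q)), weights=q, k=n)
estimate = sum(math.log(q[y] / p[y]) for y in samples) / n

assert abs(estimate - exact) < 0.01
```

For real autoregressive models the sum over all sequences is intractable, which is why sampling the student's own generations, as the paper does, is the practical route to this objective.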

If this is right

  • MiniLLM produces more precise responses with higher overall quality than standard KD baselines.
  • The student models show lower exposure bias and better calibration on generative tasks.
  • Long-text generation performance improves relative to forward-KL distillation.
  • The approach scales across model families with sizes from 120M to 13B parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reverse-KL on-policy approach could be tested on non-instruction tasks such as code completion or summarization to check if the quality gains generalize.
  • Applying the method to models larger than 13B would test whether the reported stability holds at greater scale.
  • Reverse KL distillation may reduce the need for heavy prompt engineering in student models because it discourages overgeneration of low-probability continuations.

Load-bearing premise

The on-policy optimization for reverse KLD remains stable and effective as student sizes and teacher families change, without introducing new instabilities.

What would settle it

Training a 120M student from a 13B teacher outside the tested families produces divergence or no gain in response precision compared with forward-KL baselines.

read the original abstract

Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs). However, previous KD methods are primarily applied to white-box classification models or training small models to imitate black-box model APIs like ChatGPT. How to effectively distill the knowledge of white-box LLMs into small models is still under-explored, which becomes more important with the prosperity of open-source LLMs. In this work, we propose a KD approach that distills LLMs into smaller language models. We first replace the forward Kullback-Leibler divergence (KLD) objective in the standard KD approaches with reverse KLD, which is more suitable for KD on generative language models, to prevent the student model from overestimating the low-probability regions of the teacher distribution. Then, we derive an effective on-policy optimization approach to learn this objective. The student models are named MiniLLM. Extensive experiments in the instruction-following setting show that MiniLLM generates more precise responses with higher overall quality, lower exposure bias, better calibration, and higher long-text generation performance than the baselines. Our method is scalable for different model families with 120M to 13B parameters. Our code, data, and model checkpoints can be found in https://github.com/microsoft/LMOps/tree/main/minillm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MiniLLM, a knowledge distillation method for white-box LLMs that replaces the standard forward KL divergence objective with reverse KL divergence to avoid overestimating low-probability regions of the teacher, then derives an on-policy optimization procedure that samples from the student to estimate the expectation. In instruction-following experiments, the resulting student models (120M–13B parameters) are reported to outperform baselines on precision, overall quality, exposure bias, calibration, and long-text generation, with claims of scalability across model families.

Significance. If the on-policy reverse-KLD estimator proves stable and the reported gains are robust, the work would offer a principled and practical route to distilling open-source LLMs, directly addressing exposure bias and calibration issues that standard KD methods struggle with in generative settings.

major comments (2)
  1. [§3.2] §3.2 (on-policy estimator derivation): the unbiasedness of the reverse-KLD estimator obtained by sampling from the student holds only under sufficient support overlap with the teacher; the manuscript provides no variance bounds, bias analysis, or failure-mode discussion for the 10–100× size gaps (e.g., 120M vs. 13B) that are central to the scalability claim.
  2. [§4] §4 (experimental validation): the reported gains rest on a limited set of student–teacher size/family pairs without a systematic sweep of the size gap or cross-family transfer; this leaves open the possibility that the observed stability and metric improvements are regime-specific rather than generally reliable.
minor comments (2)
  1. [Table 1] Table 1 and Figure 3: error bars or statistical significance tests are absent, making it difficult to assess whether the reported improvements are reliable across random seeds.
  2. [§2.2] §2.2: the motivation for preferring reverse over forward KL is clear, but a short comparison table of the two objectives under the same on-policy estimator would help readers see the concrete difference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the theoretical foundations and experimental scope of MiniLLM. We address each major comment below with honest assessment of the manuscript's current state and planned revisions.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (on-policy estimator derivation): the unbiasedness of the reverse-KLD estimator obtained by sampling from the student holds only under sufficient support overlap with the teacher; the manuscript provides no variance bounds, bias analysis, or failure-mode discussion for the 10–100× size gaps (e.g., 120M vs. 13B) that are central to the scalability claim.

    Authors: We agree that practical stability of the on-policy reverse-KL estimator depends on adequate support overlap, even though the Monte Carlo estimator itself is unbiased for the expectation under the student distribution. The manuscript did not provide variance bounds, bias analysis, or explicit failure-mode discussion for large capacity gaps. In revision we will add a dedicated paragraph to §3.2 that (i) recalls the full-support property of softmax-based language models, (ii) reports observed gradient variance from our training runs, and (iii) notes the risk of mode collapse or high variance when the student support becomes too narrow relative to the teacher. Deriving general theoretical bounds remains an open question and is acknowledged as a limitation. revision: partial

  2. Referee: [§4] §4 (experimental validation): the reported gains rest on a limited set of student–teacher size/family pairs without a systematic sweep of the size gap or cross-family transfer; this leaves open the possibility that the observed stability and metric improvements are regime-specific rather than generally reliable.

    Authors: We acknowledge that the main experiments focus on a modest number of LLaMA-family pairs plus a few GPT-2 variants, without an exhaustive sweep across all size gaps or cross-family transfers. To strengthen the scalability claim we will add two new results to §4: distillation from a 7B LLaMA teacher to a 120M student (larger gap) and one cross-family transfer using an OPT teacher. These additions provide further evidence of robustness, although a fully exhaustive sweep is computationally prohibitive and is noted as future work. revision: yes
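
The support-overlap concern in the exchange above admits a small numerical illustration: when the teacher assigns very little mass to a token the student still samples, the per-sample log-ratio term log q(y)/p(y) becomes large, inflating the variance of the on-policy estimator. A sketch with hypothetical numbers (not from the paper):

```python
import math

def reverse_kl_moments(q, p):
    """Mean and variance of the per-sample term log q(y)/p(y), y ~ q."""
    terms = [math.log(qi / pi) for qi, pi in zip(q, p)]
    mean = sum(qi * t for qi, t in zip(q, terms))
    var = sum(qi * (t - mean) ** 2 for qi, t in zip(q, terms))
    return mean, var

# Hypothetical student distribution (illustrative numbers).
q = [0.70, 0.20, 0.05, 0.05]

# A teacher with good support overlap vs. one whose mass barely covers
# a token the student still samples.
p_overlap = [0.60, 0.25, 0.10, 0.05]
p_narrow  = [0.64, 0.25, 0.10, 0.01]   # token 3 nearly outside teacher support

_, var_good = reverse_kl_moments(q, p_overlap)
_, var_bad  = reverse_kl_moments(q, p_narrow)

# Shrinking overlap inflates the per-sample variance of the estimator,
# so more samples are needed for the same gradient-signal quality.
assert var_bad > var_good
```

This is exactly why the referee's request for variance reporting at large student-teacher capacity gaps is load-bearing for the scalability claim.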

Circularity Check

0 steps flagged

Derivation of reverse-KLD on-policy estimator is self-contained from first principles

full rationale

The paper replaces forward KLD with reverse KLD and derives an on-policy sampling estimator directly from the objective; the resulting student models are evaluated on held-out instruction-following metrics that are not algebraically identical to the fitted parameters or the sampling procedure itself. No equations reduce claimed gains (precision, calibration, long-text quality) to quantities defined inside the same optimization loop, and no load-bearing self-citation chain or ansatz smuggling is required to reach the central construction. The scalability claim across 120M–13B models is presented as an empirical observation rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard properties of KL divergence and the assumption that on-policy sampling from the student produces a usable gradient signal for the reverse KL objective. No new entities are postulated.

axioms (1)
  • domain assumption Reverse KL divergence is a valid and preferable objective for matching a student generative distribution to a teacher distribution.
    Invoked when the authors replace forward KLD with reverse KLD to avoid overestimation of low-probability regions.

pith-pipeline@v0.9.0 · 5541 in / 1259 out tokens · 25320 ms · 2026-05-12T17:34:55.254327+00:00 · methodology

discussion (0)


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why

    cs.LG 2026-05 unverdicted novelty 7.0

    Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.

  2. Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access

    cs.LG 2026-05 unverdicted novelty 7.0

    Top-K logit censoring bounds the total-variation diameter of compatible teacher distributions by U_K but permits substantial capability transfer via distillation even when KL divergence is near zero.

  3. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  4. Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...

  5. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  6. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  7. UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

  8. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  9. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  10. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  11. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  12. Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.

  13. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  14. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  15. When Less is Enough: Efficient Inference via Collaborative Reasoning

    cs.LG 2026-05 conditional novelty 6.0

    A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.

  16. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  17. Hybrid Policy Distillation for LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...

  18. Temporally Extended Mixture-of-Experts Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Temporally extended MoE layers using the option-critic framework with deliberation costs cut switching rates below 5% while retaining most capability on MATH, MMLU, and MMMLU.

  19. Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

  20. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    cs.LG 2026-04 unverdicted novelty 6.0

    On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

  21. ExecTune: Effective Steering of Black-Box LLMs with Guide Models

    cs.LG 2026-04 unverdicted novelty 6.0

    ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on...

  22. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  23. Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

    cs.CL 2026-04 conditional novelty 6.0

    Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.

  24. Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    cs.LG 2026-05 unverdicted novelty 5.0

    NPD accelerates on-policy distillation 8.1 times faster than baselines by using asynchronous SFT with Δ-IFD filtering, outperforming standard SFT and enabling a 1B model to achieve 68.73% SOTA score.

  25. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

  26. Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

    cs.CL 2026-04 conditional novelty 5.0

    Widthwise pruning of LVLM language backbones combined with supervised finetuning and hidden-state distillation recovers over 95% performance using just 5% of data across 3B-7B models.

  27. Networking-Aware Energy Efficiency in Agentic AI Inference: A Survey

    eess.SY 2026-04 unverdicted novelty 4.0

    The paper surveys energy efficiency strategies for Agentic AI inference by proposing a new accounting framework and taxonomy that spans model simplification, computation control, input optimization, and cross-layer co...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 25 Pith papers · 15 internal anchors

  1. [1]

    PaLM 2 Technical Report

    [ADF+23] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

  2. [2]

    GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

    [AVS+23] Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. GKD: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649, 2023.

  3. [3]

    On the Opportunities and Risks of Foundation Models

    [BHA+21] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

  4. [4]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    [BJN+22] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

  5. [5]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    [BSA+23] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023.

  6. [6]

    Scaling Instruction-Finetuned Language Models

    [CHL+22] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

  7. [7]

    PaLM: Scaling Language Modeling with Pathways

    [CND+22] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

  8. [8]

    The False Promise of Imitating Proprietary LLMs

    [GWS+23] Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary LLMs. arXiv preprint arXiv:2305.15717, 2023.

  9. [9]

    How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?

    [Hus15] Ferenc Huszár. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.

  10. [10]

    Distilling the Knowledge in a Neural Network

    [HVD15] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

  11. [11]

    Scaling Laws for Neural Language Models

    [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  12. [12]

    ROUGE: A Package for Automatic Evaluation of Summaries

    [Lin04] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Text Summarization Branches Out (ACL 2004), 2004.

  13. [13]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    [LKTF20] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

  14. [14]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    [LOG+19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

  15. [15]

    Instruction Tuning with GPT-4

    [PLH+23] Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023.

  16. [16]

    Policy Distillation

    [RCG+15] Andrei A. Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

  17. [17]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    [SDCW19] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

  18. [18]

    A Deep Reinforcement Learning Chatbot

    [SSG+17] Iulian V. Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, et al. A deep reinforcement learning chatbot. arXiv preprint arXiv:1709.02349, 2017.

  19. [19]

    LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning

    [SST+20] Kaitao Song, Hao Sun, Xu Tan, Tao Qin, Jianfeng Lu, Hongzhi Liu, and Tie-Yan Liu. LightPAFF: A two-stage distillation framework for pre-training and fine-tuning. arXiv preprint arXiv:2004.12817, 2020.

  20. [20]

    Proximal Policy Optimization Algorithms

    [SWD+17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  21. [21]

    LaMDA: Language Models for Dialog Applications

    [TDFH+22] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.

  22. [22]

    LLaMA: Open and Efficient Foundation Language Models

    [TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

  23. [23]

    LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions

    [WWZ+23] Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. LaMini-LM: A diverse herd of distilled models from large-scale instructions. arXiv preprint arXiv:2304.14402, 2023.

  24. [24]

    OPT: Open Pre-trained Transformer Language Models

    [ZRG+22] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

  25. [25]

    Do Not Blindly Imitate the Teacher: Using Perturbed Loss for Knowledge Distillation

    [ZSL+23] Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Jialu Liu, Michael Bendersky, Marc Najork, and Chao Zhang. Do not blindly imitate the teacher: Using perturbed loss for knowledge distillation. arXiv preprint arXiv:2305.05010, 2023.