arxiv: 2402.13116 · v4 · pith:ODXWRQYNnew · submitted 2024-02-20 · 💻 cs.CL

A Survey on Knowledge Distillation of Large Language Models

Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou

Pith reviewed 2026-05-17 23:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords knowledge distillation, large language models, model compression, data augmentation, self-improvement, open-source LLMs, survey

The pith

Knowledge distillation transfers advanced capabilities from proprietary LLMs like GPT-4 to open-source models such as LLaMA and Mistral.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey positions knowledge distillation as a central technique for moving sophisticated abilities from closed large language models to accessible open-source versions while also supporting compression and self-improvement loops. It organizes existing work into three main pillars covering distillation algorithms, targeted skill improvements, and domain-specific applications. The survey further examines how data augmentation generates richer training examples inside the distillation process, allowing smaller models to approach the contextual understanding and alignment seen in larger proprietary systems. A reader would care because the approach provides concrete routes to deploy powerful language capabilities without full-scale training resources or direct access to closed models.

Core claim

What carries the argument

The three foundational pillars of algorithm, skill, and verticalization together with the use of data augmentation to create context-rich training data inside the KD framework.

If this is right

Open-source models gain the ability to approximate contextual adeptness and ethical alignment of proprietary models through data-augmented KD.
Large models can be compressed for more efficient deployment while retaining core capabilities.
Models achieve self-improvement by distilling knowledge from their own generated outputs.
KD techniques become practical across diverse application fields through verticalization.
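
The compression route listed above is often realized with a soft-target objective. The sketch below shows one common formulation, a temperature-softened KL divergence between teacher and student next-token distributions, written in plain Python. It is an illustration under stated assumptions, not a method taken from the survey: the function names, the 4-token vocabulary, and the logit values are all hypothetical, and the approach assumes access to teacher logits, which proprietary APIs typically do not expose.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical next-token logits over a 4-token vocabulary.
teacher = [4.0, 1.5, 0.5, -1.0]
student_far = [0.0, 0.0, 0.0, 0.0]      # uninformed student
student_near = [3.8, 1.4, 0.6, -0.9]    # student close to the teacher

loss_far = distillation_loss(teacher, student_far)
loss_near = distillation_loss(teacher, student_near)
loss_same = distillation_loss(teacher, teacher)
```

The loss shrinks as the student's distribution approaches the teacher's and vanishes when the two coincide. Distilling from a closed model usually replaces this logit-matching term with ordinary fine-tuning on teacher-generated text.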

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stronger data augmentation strategies could accelerate closing the performance gap between open and closed models beyond what scaling laws alone predict.
The same distillation-plus-augmentation pattern may transfer to non-language domains such as vision or multimodal systems.
Legal and ethical compliance requirements noted in the survey point toward needed auditing methods for models that inherit behaviors through distillation.

Load-bearing premise

That data augmentation within the KD framework can reliably enable open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights of proprietary models.

What would settle it

Controlled experiments comparing open-source model performance on semantic depth and ethical alignment benchmarks when trained with versus without data-augmented KD, checking whether the approximated capabilities consistently appear.
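
One way such a controlled comparison could be analyzed is with a paired test over benchmarks, where each benchmark is scored once with and once without data-augmented KD. The sketch below uses an exact sign-flip permutation test on the paired differences. Everything in it is an assumption made for illustration: the function name is hypothetical, the scores are invented placeholders rather than results from any paper, and a real study would also need to fix seeds, data budgets, and evaluation protocols.

```python
from itertools import product


def paired_sign_flip_test(with_kd, without_kd):
    """Exact two-sided sign-flip permutation test on paired score differences."""
    diffs = [a - b for a, b in zip(with_kd, without_kd)]
    n = len(diffs)
    observed = sum(diffs) / n
    extreme = 0
    total = 0
    # Under the null of no effect, each difference is equally likely to be +d or -d,
    # so enumerate every sign assignment and count those at least as extreme.
    for signs in product((1, -1), repeat=n):
        mean = sum(s * d for s, d in zip(signs, diffs)) / n
        total += 1
        if abs(mean) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / total


# Placeholder per-benchmark scores, invented purely for illustration.
scores_with_da_kd = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69]
scores_without = [0.55, 0.57, 0.64, 0.61, 0.58, 0.63]

mean_gain, p_value = paired_sign_flip_test(scores_with_da_kd, scores_without)
```

The exact enumeration is only practical for a small number of paired benchmarks; with many pairs, a sampled permutation test or a bootstrap would be the usual substitute.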

Original abstract

In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models, and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLM, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization, providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in KD and proposing future research directions. Importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated GitHub repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A useful organizing survey on KD for LLMs that groups work by algorithms, skills, and applications while stressing data augmentation, but its claims about reliable transfer of ethical alignment and semantic depth rest on thin critical review.

This survey pulls together the literature on knowledge distillation for large language models and gives it a workable three-part frame: the algorithms themselves, the specific skills being passed on, and the domains where the technique gets used. That structure, plus the repeated focus on data augmentation as a way to create richer training signals, is the main organizational contribution. It covers the standard uses (shrinking big models and letting open-source ones like LLaMA learn from GPT-4-style teachers) without unnecessary repetition, and the linked GitHub repository makes the references easy to follow up on. The note on staying within legal and ethical bounds is also straightforward and appropriate for the topic. The paper does a clean job of mapping an active area so that someone new to it can see the main threads quickly.

Where it is softer is in the treatment of data augmentation inside the KD loop. The text presents DA-augmented distillation as a route that lets smaller models approximate the contextual feel, ethical alignment, and deeper insights of the proprietary teachers. Yet it does not spend much space on documented cases where synthetic data distorts reasoning chains, amplifies biases, or fails to match real user distributions. Those gaps are mentioned in passing in some of the cited work but not weighed against the positive narrative. A reader looking for balanced guidance on when the transfer actually holds up will have to do extra digging.

The piece is aimed at practitioners who want a compact overview before trying to compress or fine-tune open models, and at researchers who need a quick map of current methods and open questions. It is not breaking new technical ground, but the organization is solid enough that a serious editor should send it out for review rather than desk-reject it. I would flag the need for a stronger limitations section on transfer fidelity before publication.

Referee Report

1 major / 2 minor

Summary. The paper surveys Knowledge Distillation (KD) techniques applied to Large Language Models (LLMs). It positions KD as central for transferring capabilities from proprietary models (e.g., GPT-4) to open-source counterparts (e.g., LLaMA, Mistral), for model compression, and for self-improvement via teacher-student setups. The survey is organized around three pillars—algorithm, skill, and verticalization—and stresses the synergy with data augmentation (DA) to generate synthetic data that lets smaller models approximate proprietary models' contextual adeptness, ethical alignment, and semantic insights. It includes a GitHub repository and calls for ethical compliance.

Significance. A well-executed survey in this area would be useful for researchers seeking an organized overview of KD methods, compression strategies, and application domains for LLMs. The explicit linkage of DA to KD and the provision of a curated repository constitute concrete strengths that could accelerate follow-on work if the coverage is balanced.

major comments (1)

[Abstract / DA-KD section] Abstract and the DA-KD interplay discussion: the claim that DA-augmented KD enables open-source models to 'approximate the contextual adeptness, ethical alignment, and deep semantic insights' of proprietary models is presented without a dedicated critical review of transfer-fidelity risks (distribution shift between teacher outputs and downstream user contexts, bias amplification in synthetic data, or loss of implicit chain-of-thought structure). Because this approximation is the central justification for asserting that KD 'transcends traditional boundaries,' the absence of such analysis is load-bearing.
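
The distribution-shift risk named in this comment can be given a rough numerical handle. The sketch below computes the Jensen-Shannon divergence between unigram distributions of teacher-generated text and real user queries. It is a deliberately crude proxy offered as an assumption-laden illustration: the function names are hypothetical, the example strings are invented, whitespace tokenization ignores most of what matters in practice, and the survey itself does not prescribe this measure.

```python
import math
from collections import Counter


def unigram_dist(texts):
    """Relative frequency of lowercase whitespace tokens across a list of strings."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    vocab = set(p) | set(q)
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # KL(a || mix); tokens absent from a contribute nothing.
        return sum(a[t] * math.log2(a[t] / mix[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Invented examples standing in for synthetic teacher data and real user traffic.
teacher_outputs = ["certainly here is a detailed explanation", "certainly here is a summary"]
user_queries = ["fix my code pls", "why is my code slow"]

shift = js_divergence(unigram_dist(teacher_outputs), unigram_dist(user_queries))
same = js_divergence(unigram_dist(user_queries), unigram_dist(user_queries))
```

A value near 1 bit signals almost no lexical overlap between the synthetic and real corpora; a value near 0 says only that surface vocabulary matches, not that reasoning structure or bias profiles do.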

minor comments (2)

The three-pillar structure (algorithm, skill, verticalization) is announced but the manuscript would benefit from an explicit mapping table or section numbers that link each cited work to one or more pillars.
The GitHub link is given; the survey should state the last update date and the criteria used for inclusion/exclusion of papers to improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for recognizing the survey's organization, the linkage between data augmentation and knowledge distillation, and the value of the accompanying repository. We address the single major comment below.

Referee: [Abstract / DA-KD section] Abstract and the DA-KD interplay discussion: the claim that DA-augmented KD enables open-source models to 'approximate the contextual adeptness, ethical alignment, and deep semantic insights' of proprietary models is presented without a dedicated critical review of transfer-fidelity risks (distribution shift between teacher outputs and downstream user contexts, bias amplification in synthetic data, or loss of implicit chain-of-thought structure). Because this approximation is the central justification for asserting that KD 'transcends traditional boundaries,' the absence of such analysis is load-bearing.

Authors: We agree that the current presentation would be strengthened by an explicit discussion of the risks and limitations of DA-augmented KD. In the revised manuscript we will insert a dedicated subsection on challenges within the DA-KD interplay section. This subsection will address distribution shift between teacher-generated outputs and downstream user contexts, the potential for bias amplification in synthetic data, and the risk of losing implicit reasoning structures such as chain-of-thought. The added analysis will qualify the claim that KD 'transcends traditional boundaries' and provide readers with a more balanced view of transfer fidelity. We view this as a substantive improvement that directly responds to the load-bearing nature of the point.

revision: yes

Circularity Check

0 steps flagged

No circularity in this literature survey on LLM knowledge distillation

This paper is a survey reviewing existing KD methods for LLMs, structured around algorithm, skill, and verticalization pillars with discussion of data augmentation interplay. It contains no new mathematical derivations, equations, fitted parameters, or predictions that could reduce to inputs by construction. All claims summarize cited prior literature without self-referential load-bearing steps or uniqueness theorems imported from the authors' own work. The central narrative on DA-augmented KD enabling approximation of proprietary capabilities is presented as a synthesis of external research rather than an internally derived result, making the work self-contained as a review with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey, the paper introduces no free parameters, axioms, or invented entities; it aggregates existing research without postulating new mechanisms.

pith-pipeline@v0.9.0 · 5631 in / 977 out tokens · 38344 ms · 2026-05-17T23:27:17.705659+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear

KD emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral, while also enabling model compression and self-improvement.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-Rollout On-Policy Distillation via Peer Successes and Failures
cs.LG 2026-05 unverdicted novelty 7.0

MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
cs.LG 2026-05 unverdicted novelty 7.0

CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
cs.LG 2026-05 unverdicted novelty 7.0

AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...
Logic-Regularized Verifier Elicits Reasoning from LLMs
cs.CL 2026-05 unverdicted novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
cs.CL 2026-04 unverdicted novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
Training a Student Expert via Semi-Supervised Foundation Model Distillation
cs.CV 2026-04 conditional novelty 7.0

A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
cs.SE 2025-10 conditional novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...
CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
cs.CL 2025-02 unverdicted novelty 7.0

CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than ...
Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
cs.LG 2026-05 unverdicted novelty 6.0

PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
cs.CV 2026-04 unverdicted novelty 6.0

DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning
cs.LG 2026-01 unverdicted novelty 6.0

LLM agents iteratively generate and optimize data processing strategies for fine-tuning, delivering over 80% win rates versus unprocessed data and 65% versus LLM-based AutoML baselines while cutting search time by up to 10x.
ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
cs.CL 2026-05 unverdicted novelty 5.0

UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
cs.LG 2026-05 unverdicted novelty 5.0

Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
cs.LG 2026-05 unverdicted novelty 5.0

Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
cs.CL 2026-04 accept novelty 5.0

LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
cs.CV 2026-04 unverdicted novelty 5.0

DeCo-DETR constructs a hierarchical semantic prototype space from LVLM-generated descriptions aligned via CLIP and uses decoupled training streams to separate semantic reasoning from detection, yielding efficient open...
Knowledge Distillation Must Account for What It Loses
cs.LG 2026-04 unverdicted novelty 4.0

Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
Knowledge Distillation Must Account for What It Loses
cs.LG 2026-04 unverdicted novelty 4.0

Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 18 Pith papers · 27 internal anchors

[1]

Advances in Neural Information Processing Systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in Neural Information Processing Systems , volume=

work page
[2]

arXiv preprint arXiv:2304.14233 , year=

Large Language Models are Strong Zero-Shot Retriever , author=. arXiv preprint arXiv:2304.14233 , year=

work page arXiv
[3]

arXiv preprint arXiv:2305.07402 , year=

Knowledge Refinement via Interaction Between Search Engines and Large Language Models , author=. arXiv preprint arXiv:2305.07402 , year=

work page arXiv
[4]

arXiv preprint arXiv:2212.10192 , year=

Adam: Dense Retrieval Distillation with Adaptive Dark Examples , author=. arXiv preprint arXiv:2212.10192 , year=

work page arXiv
[5]

The Eleventh International Conference on Learning Representations , year=

HypeR: Multitask Hyper-Prompted Training Enables Large-Scale Retrieval Generalization , author=. The Eleventh International Conference on Learning Representations , year=

work page
[6]

arXiv preprint arXiv:2401.00797 , year=

Distillation is All You Need for Practically Using Different Pre-trained Recommendation Models , author=. arXiv preprint arXiv:2401.00797 , year=

work page arXiv
[7]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Self-instruct: Aligning language model with self generated instructions , author=. arXiv preprint arXiv:2212.10560 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Wizardlm: Empowering large language models to follow complex instructions , author=. arXiv preprint arXiv:2304.12244 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes , booktitle =

Cheng. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes , booktitle =

work page
[10]

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Orca: Progressive learning from complex explanation traces of gpt-4 , author=. arXiv preprint arXiv:2306.02707 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

2023 , eprint=

Orca 2: Teaching Small Language Models How to Reason , author=. 2023 , eprint=

work page 2023
[12]

Zephyr: Direct Distillation of LM Alignment

Zephyr: Direct distillation of lm alignment , author=. arXiv preprint arXiv:2310.16944 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[13]

UltraFeedback: Boosting Language Models with Scaled AI Feedback

Ultrafeedback: Boosting language models with high-quality feedback , author=. arXiv preprint arXiv:2310.01377 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[14]

McAuley , title =

Canwen Xu and Daya Guo and Nan Duan and Julian J. McAuley , title =

work page
[15]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

Camel: Communicative agents for" mind" exploration of large scale language model society , author=. arXiv preprint arXiv:2303.17760 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Sequence-Level Knowledge Distillation

Sequence-level knowledge distillation , author=. arXiv preprint arXiv:1606.07947 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Yuxian Gu and Li Dong and Furu Wei and Minlie Huang , booktitle=. Mini. 2024 , url=

work page 2024
[18]

International Conference on Machine Learning , pages=

Less is more: Task-aware layer-wise distillation for language model compression , author=. International Conference on Machine Learning , pages=. 2023 , organization=

work page 2023
[19]

The Eleventh International Conference on Learning Representations,

Chen Liang and Haoming Jiang and Zheng Li and Xianfeng Tang and Bing Yin and Tuo Zhao , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

work page 2023
[20]

arXiv preprint arXiv:2203.10705 , year=

Compression of generative pre-trained language models via quantization , author=. arXiv preprint arXiv:2203.10705 , year=

work page arXiv
[21]

arXiv preprint arXiv:2305.17888 , year=

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models , author=. arXiv preprint arXiv:2305.17888 , year=

work page arXiv
[22]

Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

Timiryasov, Inar and Tastet, Jean-Loup. Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty. Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning. 2023. doi:10.18653/v1/2023.conll-babylm.24

work page doi:10.18653/v1/2023.conll-babylm.24 2023
[23]

arXiv preprint arXiv:2306.09306 , year=

Propagating Knowledge Updates to LMs Through Distillation , author=. arXiv preprint arXiv:2306.09306 , year=

work page arXiv
[24]

arXiv preprint arXiv:2212.10670 , year=

In-context Learning Distillation: Transferring Few-shot Learning Ability of Pre-trained Language Models , author=. arXiv preprint arXiv:2212.10670 , year=

work page arXiv
[25]

Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou , title =

work page
[26]

and Stoica, Ion and Xing, Eric P

Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

work page
[27]

arXiv preprint arXiv:2305.18395 , year=

Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks , author=. arXiv preprint arXiv:2305.18395 , year=

work page arXiv
[28]

arXiv preprint arXiv:2305.15225 , year=

SAIL: Search-Augmented Instruction Learning , author=. arXiv preprint arXiv:2305.15225 , year=

work page arXiv
[29]

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection , author=. arXiv preprint arXiv:2310.11511 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[30]

arXiv preprint arXiv:2304.11116 , year=

Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT , author=. arXiv preprint arXiv:2304.11116 , year=

work page arXiv
[31]

arXiv preprint arXiv:2312.07000 , year=

Alignment for Honesty , author=. arXiv preprint arXiv:2312.07000 , year=

work page arXiv
[32]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Rlaif: Scaling reinforcement learning from human feedback with ai feedback , author=. arXiv preprint arXiv:2309.00267 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[33]

2022 , eprint=

Constitutional AI: Harmlessness from AI Feedback , author=. 2022 , eprint=

work page 2022
[34]

arXiv preprint arXiv:2312.10665 , year=

Silkie: Preference Distillation for Large Visual Language Models , author=. arXiv preprint arXiv:2312.10665 , year=

work page arXiv
[35]

Conference on Robot Learning , pages=

Scaling up and distilling down: Language-guided robot skill acquisition , author=. Conference on Robot Learning , pages=. 2023 , organization=

work page 2023
[36]

Findings of the Association for Computational Linguistics: ACL 2023 , pages=

Distilling reasoning capabilities into smaller language models , author=. Findings of the Association for Computational Linguistics: ACL 2023 , pages=

work page 2023
[37]

MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

Mammoth: Building math generalist models through hybrid instruction tuning , author=. arXiv preprint arXiv:2309.05653 , year=

work page internal anchor Pith review arXiv
[38]

2023 , eprint=

Textbooks Are All You Need , author=. 2023 , eprint=

work page 2023
[39]

Textbooks Are All You Need II: phi-1.5 technical report

Textbooks are all you need ii: phi-1.5 technical report , author=. arXiv preprint arXiv:2309.05463 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Phi-2: The surprising power of small language models , author =

work page
[41]

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct , author=. arXiv preprint arXiv:2308.09583 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[42]

2024 , url=

Kevin Yang and Dan Klein and Asli Celikyilmaz and Nanyun Peng and Yuandong Tian , booktitle=. 2024 , url=

work page 2024
[43]

Jiang, C

Lion: Adversarial Distillation of Closed-Source Large Language Model , author=. arXiv preprint arXiv:2305.12870 , year=

work page arXiv
[44]

arXiv preprint arXiv:2307.11769 , year=

Domain Knowledge Distillation from Large Language Model: An Empirical Study in the Autonomous Driving Domain , author=. arXiv preprint arXiv:2307.11769 , year=

work page arXiv
[45]

arXiv preprint arXiv:2304.06975 , year=

Huatuo: Tuning llama model with chinese medical knowledge , author=. arXiv preprint arXiv:2304.06975 , year=

work page arXiv
[46]

arXiv preprint arXiv:2303.04360 , year=

Does synthetic data generation of llms help clinical text mining? , author=. arXiv preprint arXiv:2303.04360 , year=

work page arXiv
[47]

arXiv preprint arXiv:2305.15062 , year=

Lawyer LLaMA Technical Report , author=. arXiv preprint arXiv:2305.15062 , year=

work page arXiv
[48]

GitHub repository , howpublished =

Hongcheng Liu, Yusheng Liao, Yutong Meng, Yuhao Wang , title =. GitHub repository , howpublished =. 2023 , publisher =

work page 2023
[49]

International Journal of Computer Vision , volume=

Knowledge distillation: A survey , author=. International Journal of Computer Vision , volume=. 2021 , publisher=

work page 2021
[50]

CoRR abs/2307.12966(2023)

Aligning large language models with human: A survey , author=. arXiv preprint arXiv:2307.12966 , year=

work page arXiv
[51]

Retrieval-Augmented Generation for Large Language Models: A Survey

Retrieval-Augmented Generation for Large Language Models: A Survey , author=. arXiv preprint arXiv:2312.10997 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub repository, 2023.
[53]

LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions. 2023.
[54]

WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568.
[55]

Xinyang Geng, Arnav Gudibande, Hao Liu, Eric Wallace, Pieter Abbeel, Sergey Levine, and Dawn Song. 2023.
[56]

The False Promise of Imitating Proprietary LLMs. arXiv preprint arXiv:2305.15717.
[57]

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. arXiv preprint arXiv:2301.13688.
[58]

Seonghyeon Ye, Yongrae Jo, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, and Minjoon Seo. 2023.
[59]

Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. September.
[60]

Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks. 2023.
[61]

Mixed Distillation Helps Smaller Language Model Better Reasoning. 2023.
[62]

Explanations from Large Language Models Make Small Reasoners Better. 2022.
[63]

Namgyu Ho, Laura Schmid, and Se. Large Language Models Are Reasoning Teachers.
[64]

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching Small Language Models to Reason. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023. doi:10.18653/v1/2023.acl-short.151
[65]

Specializing Smaller Language Models towards Multi-Step Reasoning. 2023.
[66]

Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Advances in Neural Information Processing Systems.
[67]

Sahil Chaudhary. GitHub repository, 2023.
[68]

Qingyi Si, Tong Wang, Zheng Lin, Xu Zhang, Yanan Cao, and Weiping Wang.
[69]

GPT4All: Training an Assistant-Style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo. GitHub.
[70]

Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases. 2023.
[71]

CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment. arXiv preprint arXiv:2310.16271.
[72]

Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems.
[73]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023.
[74]

On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. The Twelfth International Conference on Learning Representations.
[75]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
[76]

Language Models are Unsupervised Multitask Learners. OpenAI blog.
[77]

Token-Scaled Logit Distillation for Ternary Weight Generative Language Models. arXiv preprint arXiv:2308.06744.
[78]

Yuqiao Wen, Zichao Li, Wenyu Du, and Lili Mou. f-Divergence Minimization for Sequence-Level Knowledge Distillation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.605
[79]

Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large Language Models Can Self-Improve. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. doi:10.18653/v1/2023.emnlp-main.67
[80]

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. NeurIPS.

Showing first 80 references.