A Survey on Knowledge Distillation of Large Language Models
Pith reviewed 2026-05-17 23:27 UTC · model grok-4.3
The pith
Knowledge distillation transfers advanced capabilities from proprietary LLMs like GPT-4 to open-source models such as LLaMA and Mistral.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to open-source counterparts like LLaMA and Mistral. As open-source LLMs flourish, KD also plays a crucial role both in compressing these models and in facilitating their self-improvement, with a model serving as its own teacher. The survey structures its examination around three foundational pillars (algorithm, skill, and verticalization) and highlights the interplay between data augmentation and KD, which bolsters LLM performance by generating context-rich, skill-specific training data.
What carries the argument
The three foundational pillars of algorithm, skill, and verticalization together with the use of data augmentation to create context-rich training data inside the KD framework.
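A minimal sketch of how data augmentation can sit inside a distillation pipeline, assuming a black-box teacher that is reachable only as a text-in, text-out callable. The names `toy_teacher`, `SKILL_TEMPLATES`, and `build_distillation_set` are hypothetical and do not come from the survey; a real teacher would be a call to a model such as GPT-4, subject to its terms of use.

```python
from typing import Callable, Dict, List

# Hypothetical skill-specific prompt templates (illustrative only).
SKILL_TEMPLATES: Dict[str, str] = {
    "reasoning": "Solve step by step: {seed}",
    "summarization": "Summarize in one sentence: {seed}",
}


def toy_teacher(prompt: str) -> str:
    """Stand-in for a proprietary teacher model; it only echoes the prompt."""
    return f"[teacher answer to] {prompt}"


def build_distillation_set(
    seeds: List[str],
    teacher: Callable[[str], str],
    templates: Dict[str, str],
) -> List[Dict[str, str]]:
    """Expand each seed into one prompt per skill and label it with the teacher."""
    examples = []
    for seed in seeds:
        for skill, template in templates.items():
            prompt = template.format(seed=seed)
            examples.append(
                {"skill": skill, "prompt": prompt, "response": teacher(prompt)}
            )
    return examples


dataset = build_distillation_set(["2 + 2 * 3"], toy_teacher, SKILL_TEMPLATES)
for row in dataset:
    print(row["skill"], "->", row["response"])
```

In this sketch the resulting prompt and response pairs would serve as supervised fine-tuning data for the student; the fine-tuning step itself is left out.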
If this is right
- Open-source models gain the ability to approximate contextual adeptness and ethical alignment of proprietary models through data-augmented KD.
- Large models can be compressed for more efficient deployment while retaining core capabilities.
- Models achieve self-improvement by distilling knowledge from their own generated outputs.
- KD techniques become practical across diverse application fields through verticalization.
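For the compression case in the list above, one common formulation is the classic temperature-softened KL objective between teacher and student token distributions. The sketch below is an assumption about what such an objective can look like, not a method attributed to the survey, and it presumes a white-box teacher whose logits are accessible (which is generally not the case for proprietary models). The function names `softmax` and `kd_loss` are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# A student that matches the teacher incurs zero loss; a uniform student does not.
print(kd_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0]))
print(kd_loss([2.0, 0.0, -1.0], [0.0, 0.0, 0.0]))
```

The T^2 factor keeps gradient magnitudes comparable across temperatures when this term is mixed with a hard-label loss.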
Where Pith is reading between the lines
- Stronger data augmentation strategies could accelerate closing the performance gap between open and closed models beyond what scaling laws alone predict.
- The same distillation-plus-augmentation pattern may transfer to non-language domains such as vision or multimodal systems.
- Legal and ethical compliance requirements noted in the survey point toward needed auditing methods for models that inherit behaviors through distillation.
Load-bearing premise
That data augmentation within the KD framework can reliably enable open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights of proprietary models.
What would settle it
Controlled experiments comparing open-source model performance on semantic depth and ethical alignment benchmarks when trained with versus without data-augmented KD, checking whether the approximated capabilities consistently appear.
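One way such a comparison could be analyzed is a paired bootstrap over per-item benchmark scores. The sketch below assumes that both student variants are scored on the same items; the function name `paired_bootstrap_ci` is hypothetical, and the scores are made-up placeholders rather than results from any paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    with_dakd: List[float],
    without_dakd: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean gap, CI lower, CI upper) for per-item score differences."""
    if len(with_dakd) != len(without_dakd):
        raise ValueError("score lists must be paired item by item")
    diffs = [a - b for a, b in zip(with_dakd, without_dakd)]
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Placeholder per-item scores for illustration only.
scores_with = [0.8, 0.7, 0.9, 0.6]
scores_without = [0.5, 0.6, 0.7, 0.4]
mean_gap, ci_low, ci_high = paired_bootstrap_ci(scores_with, scores_without)
print(mean_gap, ci_low, ci_high)
```

An interval that excludes zero would suggest a consistent gap on that benchmark. It would not by itself show that semantic depth or ethical alignment has been transferred.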
Original abstract
In the era of Large Language Models (LLMs), Knowledge Distillation (KD) emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral. Additionally, as open-source LLMs flourish, KD plays a crucial role in both compressing these models, and facilitating their self-improvement by employing themselves as teachers. This paper presents a comprehensive survey of KD's role within the realm of LLM, highlighting its critical function in imparting advanced knowledge to smaller models and its utility in model compression and self-improvement. Our survey is meticulously structured around three foundational pillars: algorithm, skill, and verticalization, providing a comprehensive examination of KD mechanisms, the enhancement of specific cognitive abilities, and their practical implications across diverse fields. Crucially, the survey navigates the intricate interplay between data augmentation (DA) and KD, illustrating how DA emerges as a powerful paradigm within the KD framework to bolster LLMs' performance. By leveraging DA to generate context-rich, skill-specific training data, KD transcends traditional boundaries, enabling open-source models to approximate the contextual adeptness, ethical alignment, and deep semantic insights characteristic of their proprietary counterparts. This work aims to provide an insightful guide for researchers and practitioners, offering a detailed overview of current methodologies in KD and proposing future research directions. Importantly, we firmly advocate for compliance with the legal terms that regulate the use of LLMs, ensuring ethical and lawful application of KD of LLMs. An associated GitHub repository is available at https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys Knowledge Distillation (KD) techniques applied to Large Language Models (LLMs). It positions KD as central for transferring capabilities from proprietary models (e.g., GPT-4) to open-source counterparts (e.g., LLaMA, Mistral), for model compression, and for self-improvement via teacher-student setups. The survey is organized around three pillars—algorithm, skill, and verticalization—and stresses the synergy with data augmentation (DA) to generate synthetic data that lets smaller models approximate proprietary models' contextual adeptness, ethical alignment, and semantic insights. It includes a GitHub repository and calls for ethical compliance.
Significance. A well-executed survey in this area would be useful for researchers seeking an organized overview of KD methods, compression strategies, and application domains for LLMs. The explicit linkage of DA to KD and the provision of a curated repository constitute concrete strengths that could accelerate follow-on work if the coverage is balanced.
major comments (1)
- [Abstract / DA-KD section] Abstract and the DA-KD interplay discussion: the claim that DA-augmented KD enables open-source models to 'approximate the contextual adeptness, ethical alignment, and deep semantic insights' of proprietary models is presented without a dedicated critical review of transfer-fidelity risks (distribution shift between teacher outputs and downstream user contexts, bias amplification in synthetic data, or loss of implicit chain-of-thought structure). Because this approximation is the central justification for asserting that KD 'transcends traditional boundaries,' the absence of such analysis is load-bearing.
minor comments (2)
- The three-pillar structure (algorithm, skill, verticalization) is announced but the manuscript would benefit from an explicit mapping table or section numbers that link each cited work to one or more pillars.
- The GitHub link is given; the survey should state the last update date and the criteria used for inclusion/exclusion of papers to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the survey's organization, the linkage between data augmentation and knowledge distillation, and the value of the accompanying repository. We address the single major comment below.
Referee: [Abstract / DA-KD section] Abstract and the DA-KD interplay discussion: the claim that DA-augmented KD enables open-source models to 'approximate the contextual adeptness, ethical alignment, and deep semantic insights' of proprietary models is presented without a dedicated critical review of transfer-fidelity risks (distribution shift between teacher outputs and downstream user contexts, bias amplification in synthetic data, or loss of implicit chain-of-thought structure). Because this approximation is the central justification for asserting that KD 'transcends traditional boundaries,' the absence of such analysis is load-bearing.
Authors: We agree that the current presentation would be strengthened by an explicit discussion of the risks and limitations of DA-augmented KD. In the revised manuscript we will insert a dedicated subsection on challenges within the DA-KD interplay section. This subsection will address distribution shift between teacher-generated outputs and downstream user contexts, the potential for bias amplification in synthetic data, and the risk of losing implicit reasoning structures such as chain-of-thought. The added analysis will qualify the claim that KD 'transcends traditional boundaries' and provide readers with a more balanced view of transfer fidelity. We view this as a substantive improvement that directly responds to the load-bearing nature of the point.
Revision: yes
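As a rough illustration of how the distribution-shift risk named above might be quantified, the sketch below computes the Jensen-Shannon divergence between word-frequency distributions of teacher-generated text and downstream user text. This is a crude lexical proxy chosen for illustration, not an analysis proposed by the referee or the authors; `unigram_dist` and `js_divergence` are hypothetical helper names, and the example strings are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def unigram_dist(texts: List[str]) -> Dict[str, float]:
    """Relative frequency of whitespace-separated, lowercased tokens."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint support."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a: Dict[str, float]) -> float:
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Invented example strings; real inputs would be corpora, not single sentences.
teacher_texts = ["the model explains each step of the proof"]
user_texts = ["pls fix my code it crashes on startup"]
print(js_divergence(unigram_dist(teacher_texts), unigram_dist(user_texts)))
```

A value near 1 bit would indicate little lexical overlap between the two sources. Bias amplification and loss of reasoning structure would need separate, task-specific measures.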
Circularity Check
No circularity in this literature survey on LLM knowledge distillation
This paper is a survey reviewing existing KD methods for LLMs, structured around algorithm, skill, and verticalization pillars with discussion of data augmentation interplay. It contains no new mathematical derivations, equations, fitted parameters, or predictions that could reduce to inputs by construction. All claims summarize cited prior literature without self-referential load-bearing steps or uniqueness theorems imported from the authors' own work. The central narrative on DA-augmented KD enabling approximation of proprietary capabilities is presented as a synthesis of external research rather than an internally derived result, making the work self-contained as a review with no circularity.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  KD emerges as a pivotal methodology for transferring advanced capabilities from leading proprietary LLMs, such as GPT-4, to their open-source counterparts like LLaMA and Mistral, while also enabling model compression and self-improvement.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- Multi-Rollout On-Policy Distillation via Peer Successes and Failures
  MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
- CoDistill-GRPO: A Co-Distillation Recipe for Efficient Group Relative Policy Optimization
  CoDistill-GRPO lets small and large models mutually improve via co-distillation in GRPO, raising small-model math accuracy by over 11 points while cutting large-model training time by about 18%.
- Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
  Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...
- Logic-Regularized Verifier Elicits Reasoning from LLMs
  LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
- ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
  ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
- Training a Student Expert via Semi-Supervised Foundation Model Distillation
  A semi-supervised framework distills vision foundation models into compact instance segmentation experts that outperform their teachers by up to 11.9 AP on Cityscapes and 8.6 AP on ADE20K while being 11 times smaller.
- CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
  CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...
- CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
  CODI compresses explicit CoT into continuous space via self-distillation and is the first implicit method to match explicit CoT performance on GSM8k at GPT-2 scale with 3.1x compression and 28.2% higher accuracy than ...
- Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression
  PARSE trains a prompt-aware linear router on dense-model outputs to select dynamic SVD ranks, improving accuracy up to 10% at 0.6 compression ratio on LLaMA-7B while delivering 2.5x prefill and 2.4x decode speedups.
- SOD: Step-wise On-policy Distillation for Small Language Model Agents
  SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
  SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
  DeCo-DETR builds hierarchical semantic prototypes offline and uses decoupled training streams to deliver competitive zero-shot open-vocabulary detection with improved inference speed.
- LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning
  LLM agents iteratively generate and optimize data processing strategies for fine-tuning, delivering over 80% win rates versus unprocessed data and 65% versus LLM-based AutoML baselines while cutting search time by up to 10x.
- ReAD: Reinforcement-Guided Capability Distillation for Large Language Models
  ReAD applies a contextual bandit to allocate fixed-token distillation budget across interdependent LLM capabilities, yielding higher task utility and fewer negative spillovers than standard methods.
- UniSD: Towards a Unified Self-Distillation Framework for Large Language Models
  UniSD unifies complementary self-distillation mechanisms for autoregressive LLMs and achieves up to +5.4 point gains over base models and +2.8 over baselines across six benchmarks and six models.
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...
- Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level
  Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.
- Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
  LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
- DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
  DeCo-DETR constructs a hierarchical semantic prototype space from LVLM-generated descriptions aligned via CLIP and uses decoupled training streams to separate semantic reasoning from detection, yielding efficient open...
- Knowledge Distillation Must Account for What It Loses
  Knowledge distillation evaluations must report lost teacher capabilities via a Distillation Loss Statement rather than relying solely on task scores.
- Knowledge Distillation Must Account for What It Loses
  Knowledge distillation should be reframed as a lossy projection and evaluated with a taxonomy of off-metric losses plus a Distillation Loss Statement reporting preserved and lost capabilities.