Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, Jeff Dean · 2015 · stat.ML · arXiv 1503.02531

Canonical reference. 79% of citing Pith papers cite this work as background.

470 Pith papers citing it

Background 79% of classified citations

abstract

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
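As a rough illustration of the idea in the abstract, the sketch below averages the predictions of a small "ensemble" and scores a single student model against those averaged soft targets. The abstract does not spell out the mechanics; the temperature-softened softmax is taken from the body of the paper, and every name here (`softmax`, `ensemble_soft_targets`, `distillation_loss`, the example logits) is hypothetical and chosen only for illustration. It is a minimal pure-Python sketch, not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_soft_targets(ensemble_logits, temperature):
    """Average the softened class probabilities of several models (assumed ensemble)."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / n_models for k in range(n_classes)]

def distillation_loss(student_logits, soft_targets, temperature):
    """Cross-entropy between the soft targets and the student's softened output."""
    q = softmax(student_logits, temperature)
    return -sum(p * math.log(qk) for p, qk in zip(soft_targets, q))

# Hypothetical logits for one input from two ensemble members and one student.
teacher_logits = [[5.0, 2.0, 0.1], [4.0, 3.0, 0.2]]
targets = ensemble_soft_targets(teacher_logits, temperature=4.0)
loss = distillation_loss([3.0, 1.0, 0.5], targets, temperature=4.0)
```

In practice the student would be trained by minimising this loss over a dataset, often combined with an ordinary cross-entropy term on the true labels; the weighting between the two is a tuning choice and is left out of this sketch.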

citation-role summary

background: 68 · method: 14 · other: 2 · dataset: 1

citation-polarity summary

background: 67 · use method: 13 · unclear: 3 · support: 1 · use dataset: 1

authors

Geoffrey Hinton, Oriol Vinyals, Jeff Dean

representative citing papers

PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs

cs.AI · 2026-05-11 · conditional · novelty 8.0

PrimeKG-CL supplies the first continual graph learning benchmark using authentic temporal snapshots from nine biomedical databases, showing strong interactions between embedding decoders and learning strategies plus limits of standard metrics on retention versus forgetting.

Inference-Time Refinement Closes the Synthetic-Real Gap in Tabular Diffusion

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Inference-time refinement of pre-trained tabular diffusion models via Bidirectional Chamfer Refinement achieves median 8.6% better downstream performance than real data across 15 benchmarks while preserving fidelity and privacy.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Emerging Properties in Self-Supervised Vision Transformers

cs.CV · 2021-04-29 · conditional · novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Learning Through Noise: Why Subliminal Learning Works and When It Fails

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Subliminal learning occurs via compatible auxiliary and class output heads on task-unrelated inputs, even with random hidden layers or architecture changes, with theory and upper bounds on failure.

Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

Slimmable ConvNeXt adapts ConvNeXt for width-adaptive inference using LayerNorm and inverted bottlenecks, reaching 80.8% top-1 at 4.5 GMACs and outperforming HydraViT, MatFormer, and SortedNet on ImageNet-1k.

Visual-Advantage On-Policy Distillation for Vision-Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

VA-OPD improves VLM performance over standard on-policy distillation by reweighting rollouts and separating KL terms according to token-level visual advantage on math and visual benchmarks.

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

cs.LG · 2026-05-20 · conditional · novelty 7.0

X-Token proposes projection-guided P-KL and H-KL losses to fix uncommon-token suppression and over-conservative matching in logit-based cross-tokenizer distillation, yielding gains over GOLD on Llama-3.2-1B.

Layer-wise Token Compression for Efficient Document Reranking

cs.IR · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Proposes weighted aggregation of clusters and self-distillation-driven token pruning to improve both accuracy and efficiency in ViT-based visual place recognition.

Code Generation by Differential Test Time Scaling

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.

When Does Model Collapse Occur in Structured Interactive Learning?

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Model collapse occurs in structured interactive learning if and only if the directed interaction graph satisfies a specific topological condition, with finite-sample guarantees for linear regression and asymptotic results for M-estimators.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Self-Distillation is Optimal Among Spectral Shrinkage Estimators in Spiked Covariance Models

math.ST · 2026-05-18 · unverdicted · novelty 7.0

s-step self-distillation is optimal among spectral shrinkage estimators for s-spiked covariance matrices and necessary for optimality.

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.

When Bits Break Recourse: Counterfactual-Faithful Quantization

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

CFQ trains quantizer parameters and mixed-precision allocation to preserve counterfactual recourse validity, cost, and direction on Adult, German Credit, and COMPAS while matching accuracy of standard quantizers.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Continual Learning of Domain-Invariant Representations

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Introduces replay-based continual learning with sequential invariance alignment to learn domain-invariant representations, outperforming baselines on generalization to unseen domains across six datasets in vision, medicine, manufacturing, and ecology.

DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems

eess.IV · 2026-05-14 · unverdicted · novelty 7.0

DIPA learns preconditioning operators via distillation from a teacher with a better sensing matrix to improve reconstruction quality for the student's physically constrained matrix in imaging inverse problems.

TILT: Target-induced loss tilting under covariate shift

cs.LG · 2026-05-14 · conditional · novelty 7.0

TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.

Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Genetic programming evolves heterogeneous layer-specific scalar functions to approximate layer normalization in pre-trained ViTs, capturing 91.6% variance versus 70.2% for uniform baselines and recovering 84.25% ImageNet Top-1 accuracy after 20 epochs of adaptation.

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.

citing papers explorer

Showing 50 of 81 citing papers after filters.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English? cs.CL · 2023-05-12 · conditional · none · ref 10 · internal anchor
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
Language Models are Few-Shot Learners cs.CL · 2020-05-28 · accept · none · ref 24 · internal anchor
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 13 · internal anchor
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 82 · internal anchor
Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding better performance than scratch training.
MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate cs.CL · 2026-05-02 · unverdicted · none · ref 16 · internal anchor
MAD-OPD recasts on-policy distillation teachers as a debating collective to supply better supervision, lifting agentic and code performance over single-teacher OPD across multiple model sizes.
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors cs.CL · 2026-04-23 · unverdicted · none · ref 3 · internal anchor
New RPS and AGS metrics show within-family distilled LLM agents have 5.9 pp higher tool-use graph similarity than cross-family pairs, with some models exceeding their teachers.
RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian cs.CL · 2026-04-21 · unverdicted · none · ref 8 · internal anchor
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
MemDLM: Memory-Enhanced DLM Training cs.CL · 2026-03-23 · unverdicted · none · ref 32 · internal anchor
MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective cs.CL · 2026-02-03 · unverdicted · none · ref 4 · internal anchor
A learned transformation matrix minimizes CMI in teacher logits to degrade distillation performance while preserving task accuracy.
Vocab Diet: Reshaping the Vocabulary of LLMs via Vector Arithmetic cs.CL · 2025-10-19 · conditional · none · ref 5 · internal anchor
LLMs can compose surface-form tokens from base embeddings plus learned transformation vectors, freeing 10-40% of vocabulary slots while expanding coverage and preserving downstream performance across five languages.
Federated Co-tuning Framework for Large and Small Language Models cs.CL · 2024-11-18 · unverdicted · none · ref 8 · internal anchor
FedCoLLM is a parameter-efficient federated co-tuning framework that improves client SLMs via server LLMs and enriches LLMs with client domain insights using adapters on NLP text generation tasks.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 19 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence cs.CL · 2026-05-20 · unverdicted · none · ref 5 · internal anchor
A 194M-parameter spiking dual-path model trained on 3B Chinese-English tokens achieves held-out PPL 8.88-8.93 at >89% per-element sparsity, trailing GPT-2 201M by 7.7% while showing that LIF temporal integration outperforms simple top-k masking at matched sparsity.
From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents cs.CL · 2026-05-14 · unverdicted · none · ref 86 · internal anchor
A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.
Distribution Corrected Offline Data Distillation for Large Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 15 · internal anchor
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head cs.CL · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
PRISM supplies a geometric upper bound on LLM variant risk that splits drift into scale, shape, and head axes and doubles as a differentiable regularizer against forgetting.
SOMA: Efficient Multi-turn LLM Serving via Small Language Model cs.CL · 2026-05-11 · unverdicted · none · ref 21 · internal anchor
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 27 · 2 links · internal anchor
MELT decouples reasoning depth from memory in looped language models by sharing a single gated KV cache per layer and training it via chunk-wise distillation from Ouro starting models.
SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation cs.CL · 2026-05-08 · unverdicted · none · ref 1 · 2 links · internal anchor
SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts cs.CL · 2026-05-08 · conditional · none · ref 5 · internal anchor
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text cs.CL · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID leaderboard.
UniSD: Towards a Unified Self-Distillation Framework for Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 7 · 2 links · internal anchor
UniSD unifies self-distillation components for autoregressive LLMs and its full integrated version improves base models by 5.4 points and baselines by 2.8 points across six benchmarks.
RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization cs.CL · 2026-04-26 · unverdicted · none · ref 1 · internal anchor
RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high quality on benchmarks.
Hybrid Policy Distillation for LLMs cs.CL · 2026-04-22 · unverdicted · none · ref 2 · internal anchor
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve stability, efficiency, and performance.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing cs.CL · 2026-04-21 · unverdicted · none · ref 42 · internal anchor
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
Why Fine-Tuning Encourages Hallucinations and How to Fix It cs.CL · 2026-04-16 · unverdicted · none · ref 3 · internal anchor
Supervised fine-tuning increases LLM hallucinations via interference among overlapping semantic representations; self-distillation mitigates this by regularizing output-distribution drift while freezing parameters preserves performance when new facts are unnecessary.
TEMPER: Testing Emotional Perturbation in Quantitative Reasoning cs.CL · 2026-04-09 · unverdicted · none · ref 5 · internal anchor
Emotional framing in quantitative reasoning problems reduces LLM accuracy by 2-10 percentage points, recoverable by neutralization, unlike neutral paraphrases.
Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion cs.CL · 2026-04-07 · conditional · none · ref 13 · internal anchor
Attention Editing converts pre-trained LLMs to new attention architectures through layer-wise teacher-forced optimization and model-level distillation, preserving performance with efficiency gains.
Content Fuzzing for Escaping Information Cocoons on Digital Social Media cs.CL · 2026-04-07 · unverdicted · none · ref 36 · internal anchor
ContentFuzz rewrites posts with LLM guidance from stance model confidence to flip machine labels without altering human intent, tested across four models and three datasets in two languages.
Attention to Mamba: A Recipe for Cross-Architecture Distillation cs.CL · 2026-04-01 · unverdicted · none · ref 14 · internal anchor
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
Dual-Space Knowledge Distillation with Key-Query Matching for Large Language Models with Vocabulary Mismatch cs.CL · 2026-03-23 · unverdicted · none · ref 15 · internal anchor
The authors introduce DSKD-CMA-GA using generative adversarial learning to fix key-query distribution mismatches in cross-tokenizer knowledge distillation, reporting modest average ROUGE-L gains of 0.37 especially on out-of-distribution data.
Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation cs.CL · 2026-02-24 · unverdicted · none · ref 6 · internal anchor
A modified divergence decouples top-K teacher probabilities from the distribution tail during distillation, yielding competitive performance on decoder models with standard compute.
Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models cs.CL · 2026-01-08 · unverdicted · none · ref 7 · internal anchor
Reasoning distillation via SFT induces functional alignment collapse, dropping correlation with human difficulty scaling from 0.64 to 0.34 and often causing negative transfer.
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety cs.CL · 2025-12-08 · unverdicted · none · ref 19 · internal anchor
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations cs.CL · 2025-11-09 · conditional · none · ref 23 · internal anchor
TAQ estimates per-layer importance from hidden representations and output sensitivity on task calibration data to allocate mixed precision in a training-free PTQ setting, outperforming task-agnostic baselines on accuracy-memory ratio across benchmarks.
MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding cs.CL · 2025-10-09 · unverdicted · none · ref 46 · internal anchor
MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning cs.CL · 2025-09-26 · conditional · none · ref 4 · internal anchor
CoSpaDi introduces a training-free sparse dictionary learning framework for post-training LLM compression that optimizes functional reconstruction error via activation-derived orthonormalization and achieves improved accuracy-compression trade-offs over SVD and pruning baselines.
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining cs.CL · 2025-09-08 · unverdicted · none · ref 12 · internal anchor
MachineLearningLM uses continued pretraining on SCM-synthesized ML tasks with random-forest distillation to give LLMs robust many-shot in-context learning on tabular classification, reaching random-forest accuracy levels while preserving general chat performance.
WhisperRT -- Turning Whisper into a Causal Streaming Model cs.CL · 2025-08-17 · conditional · none · ref 10 · internal anchor
WhisperRT converts Whisper to a causal streaming ASR model via encoder causality, decoder synchronization on partial states, and fine-tuning, achieving better performance than non-fine-tuned streaming methods on sub-300ms chunks with lower complexity.
LIMO: Less is More for Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 125 · internal anchor
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models cs.CL · 2023-12-10 · unverdicted · none · ref 9 · internal anchor
ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs cs.CL · 2023-10-03 · conditional · none · ref 38 · internal anchor
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving cs.CL · 2023-09-29 · conditional · none · ref 13 · internal anchor
ToRA trains language models on interactive tool-use trajectories with imitation learning and output shaping to integrate reasoning and external tools, yielding 13-19% gains on math datasets and new highs like 44.6% on MATH for a 7B model.
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models cs.CL · 2023-09-21 · conditional · none · ref 24 · internal anchor
Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.
MiniLLM: On-Policy Distillation of Large Language Models cs.CL · 2023-06-14 · conditional · none · ref 10 · internal anchor
MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints cs.CL · 2023-05-22 · unverdicted · none · ref 46 · internal anchor
Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes cs.CL · 2023-05-03 · conditional · none · ref 76 · internal anchor
Distilling step-by-step uses LLM-generated rationales as additional supervision in a multi-task framework so that 770M-parameter models outperform 540B-parameter models on NLP benchmarks with only 80% of the data.
Large Language Models Can Self-Improve cs.CL · 2022-10-20 · unverdicted · none · ref 7 · internal anchor
A 540B-parameter LLM improves reasoning performance on GSM8K, DROP, OpenBookQA, and ANLI-A3 by fine-tuning on self-generated high-confidence CoT solutions from unlabeled data.
ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 27 · internal anchor
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
