pith. machine review for the scientific record.

arxiv: 1910.01108 · v4 · submitted 2019-10-02 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 04:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords: DistilBERT · knowledge distillation · BERT · model compression · pre-training · language models · natural language processing · on-device computation
0 comments

The pith

DistilBERT is a 40% smaller version of BERT that retains 97% of its language understanding while running 60% faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that knowledge distillation can be applied during the pre-training phase to create a compact general-purpose language model from BERT. By training the smaller student model with a triple loss that combines language modeling, distillation from the teacher, and a cosine-distance term, the authors transfer enough knowledge to preserve most of the teacher's capabilities. This matters because large pre-trained models are difficult to deploy under tight compute budgets on edge devices or in constrained environments. The resulting DistilBERT can later be fine-tuned on downstream tasks without requiring the capacity of the full-size original model.

Core claim

We introduce DistilBERT, a smaller general-purpose language representation model pre-trained using knowledge distillation, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

What carries the argument

Triple loss combining language modeling, distillation, and cosine-distance losses during pre-training to transfer knowledge to the smaller student model.
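
A minimal sketch of how such a triple objective could be combined, written here in PyTorch; the temperature, loss weights, and tensor shapes are illustrative assumptions rather than the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden,
                temperature=2.0, w_mlm=1.0, w_kd=1.0, w_cos=1.0):
    """Illustrative combination of MLM, distillation, and cosine losses.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels:                         (batch, seq_len), -100 on unmasked positions
    student_hidden, teacher_hidden: (batch, seq_len, hidden_size)
    """
    vocab_size = student_logits.size(-1)
    s_flat = student_logits.reshape(-1, vocab_size)
    t_flat = teacher_logits.reshape(-1, vocab_size)

    # Masked language modeling loss against the ground-truth tokens.
    mlm = F.cross_entropy(s_flat, labels.reshape(-1), ignore_index=-100)

    # Distillation loss: KL divergence between temperature-softened
    # teacher and student token distributions.
    kd = F.kl_div(F.log_softmax(s_flat / temperature, dim=-1),
                  F.softmax(t_flat / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # Cosine-embedding loss pulling student hidden states toward the
    # direction of the teacher's hidden states (target = 1 means "be similar").
    hidden_size = student_hidden.size(-1)
    target = torch.ones(student_hidden.numel() // hidden_size,
                        device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden.reshape(-1, hidden_size),
                                  teacher_hidden.reshape(-1, hidden_size),
                                  target)

    return w_mlm * mlm + w_kd * kd + w_cos * cos
```

The distillation term acts on temperature-softened token distributions rather than hard labels, which is what lets the student absorb the teacher's inductive biases beyond the single correct token at each masked position.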

Load-bearing premise

The combination of language modeling, distillation, and cosine-distance losses transfers enough knowledge from the full BERT teacher to the smaller student without requiring the full model capacity or additional task-specific supervision.

What would settle it

If DistilBERT's fine-tuned performance on standard NLP benchmarks falls below 97% of BERT's scores or if measured inference speed gains are less than 60% in direct side-by-side tests, the central claims would not hold.
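
One way such a side-by-side test could be run, sketched with the Hugging Face transformers library and the public bert-base-uncased and distilbert-base-uncased checkpoints; the batch contents, CPU-only timing, and thresholds are stand-in assumptions, and "60% faster" is read here as roughly a 1.6x latency ratio.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

def mean_cpu_latency(model_name, texts, n_runs=20):
    """Average forward-pass latency in seconds for one padded batch on CPU."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model(**batch)                        # warm-up pass
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**batch)
    return (time.perf_counter() - start) / n_runs

def relative_retention(student_scores, teacher_scores):
    """Macro-average of per-task scores, student divided by teacher."""
    s = sum(student_scores.values()) / len(student_scores)
    t = sum(teacher_scores.values()) / len(teacher_scores)
    return s / t

texts = ["a short example sentence for timing"] * 8
speedup = (mean_cpu_latency("bert-base-uncased", texts)
           / mean_cpu_latency("distilbert-base-uncased", texts))
print(f"latency ratio teacher/student: {speedup:.2f}x  (claim: >= ~1.6x)")

# The retention check needs fine-tuned dev-set scores per task, e.g.
# relative_retention({"MNLI": s1, "SST-2": s2}, {"MNLI": t1, "SST-2": t2}) >= 0.97
```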

read the original abstract

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces DistilBERT, a 6-layer distilled version of BERT-base (66M parameters) pre-trained with a triple loss combining masked language modeling, knowledge distillation, and cosine-distance embedding losses. It claims a 40% size reduction while retaining 97% of BERT's performance on language understanding tasks, 60% faster inference, and suitability for on-device use, supported by evaluations on GLUE (average 97% relative score), SQuAD, IMDB, loss ablations, a from-scratch 6-layer baseline comparison, and CPU/GPU speed measurements with reported batch sizes.

Significance. If the empirical results hold, the work offers a practical pre-training distillation method that enables smaller general-purpose language models without task-specific supervision, directly addressing deployment constraints. Strengths include the ablation evidence in §3.3 showing each loss term's contribution, the underperformance of the non-distilled baseline, and concrete inference timings, which together provide reproducible support for the central efficiency claims.

minor comments (2)
  1. [Abstract] Performance claims (97% retention, 60% speedup) are stated without reference to the specific downstream tasks or variance; while §3 and tables provide these details, a one-sentence qualifier on evaluation scope would improve standalone readability.
  2. [§3.3] Ablation results demonstrate the value of each loss component, but the table does not report run-to-run variance or number of seeds; adding this would make the contribution of the cosine term more robust.
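
The second minor comment asks for run-to-run variance; a minimal illustration of that kind of reporting for a single ablation cell, using hypothetical seed scores, could look like this.

```python
import statistics

# Hypothetical dev-set scores for one ablation setting (e.g. "no cosine loss"),
# keyed by random seed; the values are illustrative only.
scores_by_seed = {0: 76.8, 1: 77.3, 2: 76.5, 3: 77.0, 4: 76.9}

mean = statistics.mean(scores_by_seed.values())
std = statistics.stdev(scores_by_seed.values())
print(f"n_seeds={len(scores_by_seed)}  score = {mean:.2f} ± {std:.2f}")
```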

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the DistilBERT paper, recognition of its practical contributions to model compression, and recommendation for minor revision. We are pleased that the ablation evidence, baseline comparisons, and concrete speed measurements were noted as providing reproducible support for the claims.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central contribution is an empirical training procedure: a 6-layer student model is pre-trained on the same corpus as BERT-base using a composite loss (MLM + distillation + cosine embedding) and then evaluated on GLUE, SQuAD, and IMDB. All reported performance numbers (97% relative GLUE score, 60% speed-up, 40% size reduction) are obtained by direct measurement after training; no equation or prediction is shown to be mathematically identical to a fitted parameter or to a self-citation chain. Ablations in §3.3 and the from-scratch baseline comparison further demonstrate that the result is not forced by construction. The work therefore rests on externally verifiable experimental outcomes rather than on any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The method implicitly assumes standard transformer architecture and knowledge-distillation transferability, which are treated as background rather than paper-specific inventions.

pith-pipeline@v0.9.0 · 5484 in / 1007 out tokens · 24625 ms · 2026-05-11T04:58:48.987359+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses.

  • Foundation.DAlembert.Inevitability bilinear_family_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    DistilBERT (6 layers, 66M params) is compared to BERT-base on GLUE (avg. 97% relative score), SQuAD, and IMDB; ablations in §3.3 show each term of the triple loss (MLM + distillation + cosine) contributes

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  2. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  3. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  4. When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity

    cs.LG 2026-05 unverdicted novelty 7.0

    Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.

  5. Switchcraft: AI Model Router for Agentic Tool Calling

    cs.AI 2026-05 unverdicted novelty 7.0

    Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.

  6. TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models

    stat.ML 2026-05 unverdicted novelty 7.0

    TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.

  7. A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

    cs.CL 2026-05 unverdicted novelty 7.0

    Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

  8. DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

    cs.SE 2026-04 unverdicted novelty 7.0

    DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...

  9. VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

    cs.CR 2026-04 unverdicted novelty 7.0

    VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

  10. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  11. AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

    cs.AI 2026-04 conditional novelty 7.0

    AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...

  12. Adaptive Head Budgeting for Efficient Multi-Head Attention

    cs.LG 2026-04 unverdicted novelty 7.0

    BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.

  13. RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

    cs.CL 2026-04 unverdicted novelty 7.0

    RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.

  14. GuardPhish: Securing Open-Source LLMs from Phishing Abuse

    cs.CR 2026-04 unverdicted novelty 7.0

    Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.

  15. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  16. SecureRouter: Encrypted Routing for Efficient Secure Inference

    cs.CR 2026-04 unverdicted novelty 7.0

    SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

  17. Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

    cs.CL 2026-04 conditional novelty 7.0

    Synthetic data of 1M+ multi-label samples across 23 languages trains models that match or exceed English-only specialists on zero-shot benchmarks for emotion classification.

  18. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  19. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.

  20. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.

  21. A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.

  22. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  23. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  24. Accelerating Large Language Model Decoding with Speculative Sampling

    cs.CL 2023-02 accept novelty 7.0

    Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

  25. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  26. On the Burden of Achieving Fairness in Conformal Prediction

    stat.ML 2026-05 unverdicted novelty 6.0

    Pooled conformal calibration incurs irreducible group-wise coverage distortion set by cross-group quantile heterogeneity, and Equalized Coverage and Equalized Set Size are in fundamental tension.

  27. Distribution Corrected Offline Data Distillation for Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

  28. N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

  29. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  30. BoolXLLM: LLM-Assisted Explainability for Boolean Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.

  31. Unified Approach for Weakly Supervised Multicalibration

    stat.ML 2026-05 unverdicted novelty 6.0

    A unified framework uses contamination-matrix risk rewrites and witness-based calibration constraints to estimate and correct multicalibration under weak supervision, providing finite-sample guarantees and the WLMC po...

  32. Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

    cs.CL 2026-05 conditional novelty 6.0

    Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.

  33. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  34. Patch-Effect Graph Kernels for LLM Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape desc...

  35. DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.

  36. LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.

  37. Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

  38. Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

    cs.CL 2026-04 unverdicted novelty 6.0

    TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.

  39. PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search

    cs.DB 2026-04 unverdicted novelty 6.0

    PiLLar is the first LLM-guided Monte-Carlo Tree Search framework for joint schema-value matching on pivot tables, achieving 87.94% average accuracy on a new benchmark PTbench derived from real-world domains.

  40. ImproBR: Bug Report Improver Using LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    ImproBR combines a hybrid detector with GPT-4o mini and RAG to raise bug report structural completeness from 7.9% to 96.4% and executable steps from 28.8% to 67.6% on 139 Mojira reports.

  41. IAM: Identity-Aware Human Motion and Shape Joint Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.

  42. ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...

  43. RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high qu...

  44. GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...

  45. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    cs.LG 2026-04 unverdicted novelty 6.0

    On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

  46. Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...

  47. A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need

    cs.LG 2026-04 unverdicted novelty 6.0

    Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.

  48. LOLGORITHM: Funny Comment Generation Agent For Short Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    LOLGORITHM is a modular multi-agent system for generating stylized funny comments on short videos that achieves 80-84% human preference over baselines on YouTube and Douyin datasets.

  49. A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection

    cs.SE 2026-04 unverdicted novelty 6.0

    QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.

  50. LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics

    cs.CL 2026-04 unverdicted novelty 6.0

    A framework converts interpretable facial and acoustic features into language descriptions, feeds them to a pretrained LM for semantic embeddings, and uses those embeddings as priors to improve valence and arousal cha...

  51. Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge

    cs.DC 2026-04 unverdicted novelty 6.0

    ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, imp...

  52. ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

    cs.RO 2026-04 unverdicted novelty 6.0

    ExpressMM integrates high-level language-guided planning with low-level vision-language-action policies to enable expressive and interruptible mobile manipulation behaviors in human-robot collaboration, shown effectiv...

  53. CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation

    cs.CV 2026-03 unverdicted novelty 6.0

    CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over ...

  54. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  55. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  56. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  57. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  58. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO 2024-03 accept novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  59. MiniLLM: On-Policy Distillation of Large Language Models

    cs.CL 2023-06 conditional novelty 6.0

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  60. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    cs.LG 2023-05 accept novelty 6.0

    FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 84 Pith papers · 2 internal anchors
