DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Julien Chaumond, Lysandre Debut, Thomas Wolf, Victor Sanh · 2019 · cs.CL · arXiv 1910.01108

Mixed citation behavior. Most common role is background (65%).

233 Pith papers citing it

abstract

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
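The triple loss named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The function names, the equal loss weights, the temperature of 2.0, and the toy numbers are all assumptions made for illustration. The abstract states only that language modeling, distillation, and cosine-distance losses are combined.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def language_modeling_loss(student_logits, target_index):
    """Negative log-likelihood of the true token under the student."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_distance_loss(student_hidden, teacher_hidden):
    """One minus the cosine similarity of student and teacher hidden vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                w_distill=1.0, w_lm=1.0, w_cos=1.0, temperature=2.0):
    """Weighted sum of the three terms; the weights are illustrative assumptions."""
    return (w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_lm * language_modeling_loss(student_logits, target_index)
            + w_cos * cosine_distance_loss(student_hidden, teacher_hidden))


if __name__ == "__main__":
    # Toy single-token example over a 3-word vocabulary (made-up numbers).
    loss = triple_loss(
        student_logits=[2.0, 0.5, -1.0],
        teacher_logits=[2.5, 0.3, -1.2],
        target_index=0,
        student_hidden=[0.2, 0.9, -0.1],
        teacher_hidden=[0.25, 0.8, 0.0],
    )
    print(f"triple loss: {loss:.4f}")
```

In a real training setup each term would be averaged over the tokens of a batch and computed with a tensor library. The sketch only shows how the three terms are combined for a single token position.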

citation-role summary

background 20 · method 11

citation-polarity summary

background 20 · use method 11

representative citing papers

Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

cs.LG · 2026-06-29 · unverdicted · novelty 8.0

Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

cs.AI · 2026-06-09 · conditional · novelty 8.0

Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.

Canonical Regularisation of Wide Feature-Learning Neural Networks

stat.ML · 2026-05-18 · unverdicted · novelty 8.0

Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

The Anatomy of the CTC Oracle Gap: Acoustic Exhaustion and Linguistic Recovery

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

CTC-internal scores saturate for hypothesis selection on LibriSpeech, but RoBERTa-based MBR decoding cuts WER by 0.535 pp while MWER training fails near convergence.

Toward Calibrated Mixture-of-Experts Under Distribution Shift

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Expert calibration suffices for MoE calibration under distribution shifts in hard-routed models but not soft-routed ones; adversarial reweighting improves the accuracy-calibration tradeoff across models and shifts.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

Otters++ realizes TTFS via measured device decay in optical synapses, uses hybrid QNN-equivalent training with noise awareness, and reports 84.17% average GLUE score with energy gains over prior spiking transformers.

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.

Leveraging Machine-Learned Advice in Strategic Interactions with No-Regret Learners

cs.GT · 2026-06-09 · unverdicted · novelty 7.0

Introduces a pseudo-metric to quantify advice usefulness and shows reliable advice enables efficient approximate Stackelberg strategies while unreliable advice blocks simultaneous near-Stackelberg and no-regret guarantees but permits weak dominance in some correlated equilibria.

OPRD: On-Policy Representation Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

MATCHA: Matching Text via Contrastive Semantic Alignment

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

MATCHA introduces a dual-view contrastive metric measuring proximity to gold text and distance from adversarial contradictions, outperforming ROUGE and BERTScore by up to 20% on TruthfulQA and other NLP benchmarks.

Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging

hep-ex · 2026-05-20 · unverdicted · novelty 7.0

PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.

Distribution-free root cause analysis

stat.ME · 2026-05-20 · unverdicted · novelty 7.0

CROC constructs finite-sample valid confidence sets for the root-cause index in multi-stream change detection using conformal p-values under independence and exchangeability assumptions.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

AIGaitor is claimed to be the first end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.

Layer-wise Token Compression for Efficient Document Reranking

cs.IR · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

cs.OS · 2026-05-18 · unverdicted · novelty 7.0

TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.

citing papers explorer

Showing 20 of 20 citing papers after filters.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · none · ref 12 · internal anchor

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

cs.CV · 2026-05-20 · unverdicted · none · ref 86 · 2 links · internal anchor

AIGaitor is claimed to be the first end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.

Depth Adaptive Efficient Visual Autoregressive Modeling

cs.CV · 2026-04-19 · unverdicted · none · ref 45 · internal anchor

DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

cs.CV · 2026-04-03 · unverdicted · none · ref 50 · internal anchor

Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · none · ref 122 · internal anchor

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

cs.CV · 2025-03-10 · unverdicted · none · ref 26 · internal anchor

DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.

ECoSim: Data Efficient Fine-Tuning for Controllable Traffic Simulation

cs.CV · 2026-07-01 · unverdicted · none · ref 21 · internal anchor

ECoSim adds multi-modal controllability to pretrained diffusion and autoregressive traffic models via identity-initialized FiLM layers while using less than 1% paired control data on Waymo Open Sim Agents Challenge.

Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation

cs.CV · 2026-06-28 · unverdicted · none · ref 57 · internal anchor

RAHA applies rank-aware hyperbolic alignment to vision-language dataset distillation by enforcing geodesic alignment in the shared low-rank range and regularizing the residual subspace for improved transfer.

Multimodal Distribution Matching for Vision-Language Dataset Distillation

cs.CV · 2026-05-22 · unverdicted · none · ref 54 · internal anchor

MDM distills vision-language datasets via joint embedding clustering, weight-space model interpolation, and geometry-aware distribution matching on the unit hypersphere.

IAM: Identity-Aware Human Motion and Shape Joint Generation

cs.CV · 2026-04-28 · unverdicted · none · ref 25 · internal anchor

IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

cs.CV · 2026-04-14 · unverdicted · none · ref 56 · internal anchor

CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.

LOLGORITHM: Funny Comment Generation Agent For Short Videos

cs.CV · 2026-04-09 · unverdicted · none · ref 12 · internal anchor

LOLGORITHM is a modular multi-agent system for generating stylized funny comments on short videos that achieves 80-84% human preference over baselines on YouTube and Douyin datasets.

CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation

cs.CV · 2026-03-26 · unverdicted · none · ref 37 · internal anchor

CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over prior distillation methods.

Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

cs.CV · 2026-03-11 · unverdicted · none · ref 20 · internal anchor

AssistMimic is the first multi-agent RL method that successfully tracks assistive human-human interaction motions in simulation by using partner-aware policies, single-agent initialization, dynamic reference retargeting, and contact-promoting rewards.

CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

cs.CV · 2025-03-27 · unverdicted · none · ref 53 · internal anchor

CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.

Geometric Foundation Model Distillation for Efficient Lunar 3D Reconstruction

cs.CV · 2026-07-02 · unverdicted · none · ref 20 · internal anchor

Distillation of a 688M-parameter MASt3R teacher yields up to 7x smaller students that retain most lunar reconstruction accuracy and outperform sparse-supervised baselines.

Vitality-Aware Compression for Efficient Image-to-Shape Diffusion Transformers

cs.CV · 2026-07-01 · unverdicted · none · ref 35 · internal anchor

Introduces vitality-aware compression for image-to-3D DiT models via structured pruning, adaptive quantization, and fine-tuning, claiming 66% size reduction with comparable fidelity.

PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition

cs.CV · 2026-05-08 · unverdicted · none · ref 36 · internal anchor

PRIMED improves referring audio-visual segmentation by using a modality prior decoder and competition-aware fusion to adaptively suppress irrelevant modalities.

ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting

cs.CV · 2026-04-20 · unverdicted · none · ref 40 · internal anchor

ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.

ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement

cs.CV · 2026-05-28 · unverdicted · none · ref 10 · internal anchor

ViASNet applies a 3D U-Net architecture augmented with audio and semantic inputs to predict dynamic saliency in video ads and uses frame-wise entropy to diagnose low-engagement scenes on eye-tracked data from 151 ads.
