Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Mixed citation behavior. Most common role is background (62%).
abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
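The triple loss named in the abstract can be illustrated with a minimal sketch. The abstract only states that the loss combines language modeling, distillation and cosine-distance terms; everything else here is an assumption made for illustration: the function names, the equal default weights, the temperature value, and the use of single vectors rather than batched tensors. It is not the authors' implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_norm for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's softened distribution (soft targets)."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def language_modeling_loss(student_logits, target_index):
    """Standard cross-entropy against the true (hard) token label."""
    return -log_softmax(student_logits)[target_index]


def cosine_distance(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                w_distill=1.0, w_lm=1.0, w_cos=1.0, temperature=2.0):
    """Weighted sum of the three terms. The weights are hypothetical
    defaults, not values taken from the paper."""
    return (w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_lm * language_modeling_loss(student_logits, target_index)
            + w_cos * cosine_distance(student_hidden, teacher_hidden))


if __name__ == "__main__":
    loss = triple_loss(
        student_logits=[1.0, 2.0, 0.5],
        teacher_logits=[0.8, 2.5, 0.3],
        target_index=1,
        student_hidden=[0.2, 0.9, -0.1],
        teacher_hidden=[0.3, 1.0, 0.0],
    )
    print(f"combined loss: {loss:.4f}")
```

In this sketch the distillation term pulls the student's output distribution toward the teacher's, the language modeling term keeps it anchored to the true tokens, and the cosine term aligns the direction of the hidden representations. A real training setup would compute these over batches of token positions with a tensor library.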
representative citing papers
Memory augmentation in LLMs amplifies sycophancy up to 25x compared to in-context baselines due to lossy memory extraction, with two lightweight mitigations that reduce the effect while preserving recall.
Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
Otters++ realizes TTFS via measured device decay in optical synapses, uses hybrid QNN-equivalent training with noise awareness, and reports 84.17% average GLUE score with energy gains over prior spiking transformers.
AR-OPD disentangles privileged supervision via anchored residual guidance to reduce hindsight leakage in on-policy distillation, reporting gains of 2.3 points over full privileged OPD and 7.9 over SFT on reasoning tasks.
Introduces a pseudo-metric to quantify advice usefulness and shows reliable advice enables efficient approximate Stackelberg strategies while unreliable advice blocks simultaneous near-Stackelberg and no-regret guarantees but permits weak dominance in some correlated equilibria.
OPRD performs distillation in hidden-state space on on-policy data for deterministic gradients and better math benchmark performance, plus OPRD-Bridge for cross-architecture transfer via low-rank projectors.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.
MATCHA introduces a dual-view contrastive metric measuring proximity to gold text and distance from adversarial contradictions, outperforming ROUGE and BERTScore by up to 20% on TruthfulQA and other NLP benchmarks.
PHAT-JeT combines geometric message-passing with hierarchical patch attention to reach state-of-the-art accuracy and background rejection among resource-constrained jet tagging models on four benchmarks.
CROC constructs finite-sample valid confidence sets for the root-cause index in multi-stream change detection using conformal p-values under independence and exchangeability assumptions.
AIGaitor is the first claimed end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.
Layer-wise Token Compression applies adaptive token pooling at middle transformer layers for cross-encoder rerankers, preserving MS MARCO ranking quality while raising QPS up to 25% on passages and 116% on documents, with added gains on listwise LLM rerankers and a regularizer effect for long inputs.
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.
DMP-MH clips degrees to control triangle sensitivity, synthesizes an edge-DP graph with Noisy Mirror Descent, and distills it into dual-stream hash networks, beating private baselines by up to 11.4 mAP on MIRFlickr-25K and NUS-WIDE while keeping 92.5% of non-private performance.
Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.
citing papers explorer
- Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI
  MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.
- AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
  AIGaitor is the first claimed end-to-end on-device monocular motion-capture and deep-learning gait analysis pipeline demonstrated on consumer smartphones.
- Depth Adaptive Efficient Visual Autoregressive Modeling
  DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
- A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
  Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.
- SAM 3: Segment Anything with Concepts
  SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
- A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
  DBAC is a new directional metric for bias amplification in image captions that is less sensitive to sentence encoders and more accurate than LIC, validated on COCO gender and race attributes.
- ECoSim: Data Efficient Fine-Tuning for Controllable Traffic Simulation
  ECoSim adds multi-modal controllability to pretrained diffusion and autoregressive traffic models via identity-initialized FiLM layers while using less than 1% paired control data on Waymo Open Sim Agents Challenge.
- Rank-Aware Hyperbolic Alignment for Vision-Language Dataset Distillation
  RAHA applies rank-aware hyperbolic alignment to vision-language dataset distillation by enforcing geodesic alignment in the shared low-rank range and regularizing the residual subspace for improved transfer.
- Multimodal Distribution Matching for Vision-Language Dataset Distillation
  MDM distills vision-language datasets via joint embedding clustering, weight-space model interpolation, and geometry-aware distribution matching on the unit hypersphere.
- IAM: Identity-Aware Human Motion and Shape Joint Generation
  IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.
- Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
  CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.
- LOLGORITHM: Funny Comment Generation Agent For Short Videos
  LOLGORITHM is a modular multi-agent system for generating stylized funny comments on short videos that achieves 80-84% human preference over baselines on YouTube and Douyin datasets.
- CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation
  CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over prior distillation methods.
- Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
  AssistMimic is the first multi-agent RL method that successfully tracks assistive human-human interaction motions in simulation by using partner-aware policies, single-agent initialization, dynamic reference retargeting, and contact-promoting rewards.
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
  CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
- Geometric Foundation Model Distillation for Efficient Lunar 3D Reconstruction
  Distillation of a 688M-parameter MASt3R teacher yields up to 7x smaller students that retain most lunar reconstruction accuracy and outperform sparse-supervised baselines.
- Vitality-Aware Compression for Efficient Image-to-Shape Diffusion Transformers
  Introduces vitality-aware compression for image-to-3D DiT models via structured pruning, adaptive quantization, and fine-tuning, claiming 66% size reduction with comparable fidelity.
- PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
  PRIMED improves referring audio-visual segmentation by using a modality prior decoder and competition-aware fusion to adaptively suppress irrelevant modalities.
- ESsEN: Training Compact Discriminative Vision-Language Transformers in a Low-Resource Setting
  ESsEN is a parameter-efficient two-tower vision-language transformer that matches larger models on discriminative tasks after training end-to-end with limited data and resources.
- ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement
  ViASNet applies a 3D U-Net architecture augmented with audio and semantic inputs to predict dynamic saliency in video ads and uses frame-wise entropy to diagnose low-engagement scenes on eye-tracked data from 151 ads.