TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
hub
T iny BERT : Distilling BERT for Natural Language Understanding
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
A ReLU-catalyzed abstraction method yields tighter bounds for transformer verification by converting dot-product constraints into ReLU forms that leverage standard convex relaxations.
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.
Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.
LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
Gopher, a 280 billion parameter language model, achieves state-of-the-art performance on the majority of 152 tasks with largest gains in reading comprehension, fact-checking, and toxic language detection.
Sparsity-guided distillation enables replacing attention layers in ViTs with simpler sequential modules, with sparser layers showing smaller performance drops.
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
citing papers explorer
-
TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
-
Validity-Calibrated Reasoning Distillation
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
-
Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
-
Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement
A ReLU-catalyzed abstraction method yields tighter bounds for transformer verification by converting dot-product constraints into ReLU forms that leverage standard convex relaxations.
-
Distribution Corrected Offline Data Distillation for Large Language Models
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
-
LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.
-
Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding
Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.
-
LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
-
LIMO: Less is More for Reasoning
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
-
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Gopher, a 280 billion parameter language model, achieves state-of-the-art performance on the majority of 152 tasks with largest gains in reading comprehension, fact-checking, and toxic language detection.
-
From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation
Sparsity-guided distillation enables replacing attention layers in ViTs with simpler sequential modules, with sparser layers showing smaller performance drops.
-
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
-
A Survey on Knowledge Distillation of Large Language Models
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.