hub

T iny BERT : Distilling BERT for Natural Language Understanding

Jiao, X · 2020 · DOI 10.18653/v1/2020.findings-emnlp.372

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

open at publisher browse 13 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

cs.OS · 2026-05-18 · unverdicted · novelty 7.0

TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

Validity-Calibrated Reasoning Distillation

cs.LG · 2026-04-14 · unverdicted · novelty 7.0

Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.

Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.

Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

A ReLU-catalyzed abstraction method yields tighter bounds for transformer verification by converting dot-product constraints into ReLU forms that leverage standard convex relaxations.

Distribution Corrected Offline Data Distillation for Large Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.

Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

cs.LG · 2026-05-01 · unverdicted · novelty 6.0

Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations

cs.IR · 2025-09-16 · conditional · novelty 6.0

LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.

LIMO: Less is More for Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

cs.CL · 2021-12-08 · unverdicted · novelty 6.0

Gopher, a 280 billion parameter language model, achieves state-of-the-art performance on the majority of 152 tasks with largest gains in reading comprehension, fact-checking, and toxic language detection.

From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

Sparsity-guided distillation enables replacing attention layers in ViTs with simpler sequential modules, with sparser layers showing smaller performance drops.

Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection

cs.AI · 2026-05-04 · unverdicted · novelty 5.0

Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.

A Survey on Knowledge Distillation of Large Language Models

cs.CL · 2024-02-20 · accept · novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

citing papers explorer

Showing 13 of 13 citing papers.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics cs.OS · 2026-05-18 · unverdicted · none · ref 34
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
Validity-Calibrated Reasoning Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 75
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization cs.LG · 2026-05-19 · unverdicted · none · ref 35
Provides sufficient conditions for successful distillation of combinatorial optimization tasks into DP-aligned graph neural networks under the linear representation hypothesis for the source model.
Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement cs.AI · 2026-05-14 · unverdicted · none · ref 14
A ReLU-catalyzed abstraction method yields tighter bounds for transformer verification by converting dot-product constraints into ReLU forms that leverage standard convex relaxations.
Distribution Corrected Offline Data Distillation for Large Language Models cs.CL · 2026-05-13 · unverdicted · none · ref 18
A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.
LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference cs.LG · 2026-05-01 · unverdicted · none · ref 14
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.
Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding cs.LG · 2026-05-01 · unverdicted · none · ref 21
Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.
LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations cs.IR · 2025-09-16 · conditional · none · ref 13
LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
LIMO: Less is More for Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 202
LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.
Scaling Language Models: Methods, Analysis & Insights from Training Gopher cs.CL · 2021-12-08 · unverdicted · none · ref 4
Gopher, a 280 billion parameter language model, achieves state-of-the-art performance on the majority of 152 tasks with largest gains in reading comprehension, fact-checking, and toxic language detection.
From Sparsity to Simplicity: Enabling Simpler Sequential Replacements via Sparse Attention Distillation cs.LG · 2026-05-15 · unverdicted · none · ref 12
Sparsity-guided distillation enables replacing attention layers in ViTs with simpler sequential modules, with sparser layers showing smaller performance drops.
Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection cs.AI · 2026-05-04 · unverdicted · none · ref 24
Reasoning-oriented knowledge distillation from DeepSeek-R1 plus response stabilization improves reliability and often performance of compact models for cross-language code clone detection on pairs like Python-Java and Rust-Java.
A Survey on Knowledge Distillation of Large Language Models cs.CL · 2024-02-20 · accept · none · ref 85
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

T iny BERT : Distilling BERT for Natural Language Understanding

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer