pith. machine review for the scientific record.

arxiv: 1910.01108 · v4 · submitted 2019-10-02 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 04:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords: DistilBERT · knowledge distillation · BERT · model compression · pre-training · language models · natural language processing · on-device computation
0 comments

The pith

DistilBERT is a 40% smaller version of BERT that retains 97% of its language understanding while running 60% faster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that knowledge distillation can be applied during the pre-training phase to create a compact general-purpose language model from BERT. By training the smaller student model with a triple loss that combines language modeling, distillation from the teacher, and a cosine-distance term, the authors transfer enough knowledge to preserve most of the teacher's capabilities. This matters because large pre-trained models are difficult to deploy under tight compute budgets on edge devices or in constrained environments. The resulting DistilBERT can later be fine-tuned on downstream tasks without requiring the capacity of the full-size original model.

Core claim

We introduce DistilBERT, a smaller general-purpose language representation model pre-trained using knowledge distillation, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

What carries the argument

Triple loss combining language modeling, distillation, and cosine-distance losses during pre-training to transfer knowledge to the smaller student model.
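
A minimal sketch of how such a triple objective could be combined, written here in PyTorch; the temperature, loss weights, and tensor shapes are illustrative assumptions rather than the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def triple_loss(student_logits, teacher_logits, labels,
                student_hidden, teacher_hidden,
                temperature=2.0, w_mlm=1.0, w_kd=1.0, w_cos=1.0):
    """Illustrative combination of MLM, distillation, and cosine losses.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels:                         (batch, seq_len), -100 on unmasked positions
    student_hidden, teacher_hidden: (batch, seq_len, hidden_size)
    """
    vocab_size = student_logits.size(-1)
    s_flat = student_logits.reshape(-1, vocab_size)
    t_flat = teacher_logits.reshape(-1, vocab_size)

    # Masked language modeling loss against the ground-truth tokens.
    mlm = F.cross_entropy(s_flat, labels.reshape(-1), ignore_index=-100)

    # Distillation loss: KL divergence between temperature-softened
    # teacher and student token distributions.
    kd = F.kl_div(F.log_softmax(s_flat / temperature, dim=-1),
                  F.softmax(t_flat / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # Cosine-embedding loss pulling student hidden states toward the
    # direction of the teacher's hidden states (target = 1 means "be similar").
    hidden_size = student_hidden.size(-1)
    target = torch.ones(student_hidden.numel() // hidden_size,
                        device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden.reshape(-1, hidden_size),
                                  teacher_hidden.reshape(-1, hidden_size),
                                  target)

    return w_mlm * mlm + w_kd * kd + w_cos * cos
```

The distillation term acts on temperature-softened token distributions rather than hard labels, which is what lets the student absorb the teacher's inductive biases beyond the single correct token at each masked position.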

Load-bearing premise

The combination of language modeling, distillation, and cosine-distance losses transfers enough knowledge from the full BERT teacher to the smaller student without requiring the full model capacity or additional task-specific supervision.

What would settle it

If DistilBERT's fine-tuned performance on standard NLP benchmarks falls below 97% of BERT's scores or if measured inference speed gains are less than 60% in direct side-by-side tests, the central claims would not hold.
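
One way such a side-by-side test could be run, sketched with the Hugging Face transformers library and the public bert-base-uncased and distilbert-base-uncased checkpoints; the batch contents, CPU-only timing, and thresholds are stand-in assumptions, and "60% faster" is read here as roughly a 1.6x latency ratio.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

def mean_cpu_latency(model_name, texts, n_runs=20):
    """Average forward-pass latency in seconds for one padded batch on CPU."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model(**batch)                        # warm-up pass
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**batch)
    return (time.perf_counter() - start) / n_runs

def relative_retention(student_scores, teacher_scores):
    """Macro-average of per-task scores, student divided by teacher."""
    s = sum(student_scores.values()) / len(student_scores)
    t = sum(teacher_scores.values()) / len(teacher_scores)
    return s / t

texts = ["a short example sentence for timing"] * 8
speedup = (mean_cpu_latency("bert-base-uncased", texts)
           / mean_cpu_latency("distilbert-base-uncased", texts))
print(f"latency ratio teacher/student: {speedup:.2f}x  (claim: >= ~1.6x)")

# The retention check needs fine-tuned dev-set scores per task, e.g.
# relative_retention({"MNLI": s1, "SST-2": s2}, {"MNLI": t1, "SST-2": t2}) >= 0.97
```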

read the original abstract

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces DistilBERT, a 6-layer distilled version of BERT-base (66M parameters) pre-trained with a triple loss combining masked language modeling, knowledge distillation, and cosine-distance embedding losses. It claims a 40% size reduction while retaining 97% of BERT's performance on language understanding tasks, 60% faster inference, and suitability for on-device use, supported by evaluations on GLUE (average 97% relative score), SQuAD, IMDB, loss ablations, a from-scratch 6-layer baseline comparison, and CPU/GPU speed measurements with reported batch sizes.

Significance. If the empirical results hold, the work offers a practical pre-training distillation method that enables smaller general-purpose language models without task-specific supervision, directly addressing deployment constraints. Strengths include the ablation evidence in §3.3 showing each loss term's contribution, the underperformance of the non-distilled baseline, and concrete inference timings, which together provide reproducible support for the central efficiency claims.

minor comments (2)
  1. [Abstract] Performance claims (97% retention, 60% speedup) are stated without reference to the specific downstream tasks or variance; while §3 and tables provide these details, a one-sentence qualifier on evaluation scope would improve standalone readability.
  2. [§3.3] Ablation results demonstrate the value of each loss component, but the table does not report run-to-run variance or number of seeds; adding this would make the contribution of the cosine term more robust.
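
The second minor comment asks for run-to-run variance; a minimal illustration of that kind of reporting for a single ablation cell, using hypothetical seed scores, could look like this.

```python
import statistics

# Hypothetical dev-set scores for one ablation setting (e.g. "no cosine loss"),
# keyed by random seed; the values are illustrative only.
scores_by_seed = {0: 76.8, 1: 77.3, 2: 76.5, 3: 77.0, 4: 76.9}

mean = statistics.mean(scores_by_seed.values())
std = statistics.stdev(scores_by_seed.values())
print(f"n_seeds={len(scores_by_seed)}  score = {mean:.2f} ± {std:.2f}")
```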

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the DistilBERT paper, recognition of its practical contributions to model compression, and recommendation for minor revision. We are pleased that the ablation evidence, baseline comparisons, and concrete speed measurements were noted as providing reproducible support for the claims.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central contribution is an empirical training procedure: a 6-layer student model is pre-trained on the same corpus as BERT-base using a composite loss (MLM + distillation + cosine embedding) and then evaluated on GLUE, SQuAD, and IMDB. All reported performance numbers (97% relative GLUE score, 60% speed-up, 40% size reduction) are obtained by direct measurement after training; no equation or prediction is shown to be mathematically identical to a fitted parameter or to a self-citation chain. Ablations in §3.3 and the from-scratch baseline comparison further demonstrate that the result is not forced by construction. The work therefore rests on externally verifiable experimental outcomes rather than on any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The method implicitly assumes standard transformer architecture and knowledge-distillation transferability, which are treated as background rather than paper-specific inventions.

pith-pipeline@v0.9.0 · 5484 in / 1007 out tokens · 24625 ms · 2026-05-11T04:58:48.987359+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses.

  • Foundation.DAlembert.Inevitability bilinear_family_forced · unclear

    Relation between the paper passage and the cited Recognition theorem.

    DistilBERT (6 layers, 66M params) is compared to BERT-base on GLUE (avg. 97% relative score), SQuAD, and IMDB; ablations in §3.3 show each term of the triple loss (MLM + distillation + cosine) contributes

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...

  2. Learning the Signature of Memorization in Autoregressive Language Models

    cs.CL 2026-04 accept novelty 8.0

    A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

  3. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  4. When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity

    cs.LG 2026-05 unverdicted novelty 7.0

    Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.

  5. Switchcraft: AI Model Router for Agentic Tool Calling

    cs.AI 2026-05 unverdicted novelty 7.0

    Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.

  6. TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models

    stat.ML 2026-05 unverdicted novelty 7.0

    TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.

  7. A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

    cs.CL 2026-05 unverdicted novelty 7.0

    Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

  8. DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures

    cs.SE 2026-04 unverdicted novelty 7.0

    DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.9...

  9. VOW: Verifiable and Oblivious Watermark Detection for Large Language Models

    cs.CR 2026-04 unverdicted novelty 7.0

    VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.

  10. Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning

    astro-ph.GA 2026-04 unverdicted novelty 7.0

    A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.

  11. AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

    cs.AI 2026-04 conditional novelty 7.0

    AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment pre...

  12. Adaptive Head Budgeting for Efficient Multi-Head Attention

    cs.LG 2026-04 unverdicted novelty 7.0

    BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.

  13. RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

    cs.CL 2026-04 unverdicted novelty 7.0

    RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.

  14. GuardPhish: Securing Open-Source LLMs from Phishing Abuse

    cs.CR 2026-04 unverdicted novelty 7.0

    Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.

  15. Depth Adaptive Efficient Visual Autoregressive Modeling

    cs.CV 2026-04 unverdicted novelty 7.0

    DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.

  16. SecureRouter: Encrypted Routing for Efficient Secure Inference

    cs.CR 2026-04 unverdicted novelty 7.0

    SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.

  17. Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data

    cs.CL 2026-04 conditional novelty 7.0

    Synthetic data of 1M+ multi-label samples across 23 languages trains models that match or exceed English-only specialists on zero-shot benchmarks for emotion classification.

  18. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  19. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.

  20. Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention

    cs.CL 2026-04 unverdicted novelty 7.0

    Kathleen uses recurrent oscillator banks, an efficient wavetable encoder, and phase harmonics to classify text at the byte level with high accuracy and low parameter count.

  21. A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.

  22. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  23. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  24. Accelerating Large Language Model Decoding with Speculative Sampling

    cs.CL 2023-02 accept novelty 7.0

    Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.

  25. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  26. On the Burden of Achieving Fairness in Conformal Prediction

    stat.ML 2026-05 unverdicted novelty 6.0

    Pooled conformal calibration incurs irreducible group-wise coverage distortion set by cross-group quantile heterogeneity, and Equalized Coverage and Equalized Set Size are in fundamental tension.

  27. Distribution Corrected Offline Data Distillation for Large Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    A distribution-correction framework for offline LLM reasoning distillation improves accuracy on math benchmarks by adaptively aligning teacher supervision with the student's inference-time distribution.

  28. N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

  29. Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout pe...

  30. BoolXLLM: LLM-Assisted Explainability for Boolean Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.

  31. Unified Approach for Weakly Supervised Multicalibration

    stat.ML 2026-05 unverdicted novelty 6.0

    A unified framework uses contamination-matrix risk rewrites and witness-based calibration constraints to estimate and correct multicalibration under weak supervision, providing finite-sample guarantees and the WLMC po...

  32. Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

    cs.CL 2026-05 conditional novelty 6.0

    Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.

  33. Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...

  34. Patch-Effect Graph Kernels for LLM Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape desc...

  35. DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.

  36. LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.

  37. Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

  38. Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

    cs.CL 2026-04 unverdicted novelty 6.0

    TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.

  39. PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search

    cs.DB 2026-04 unverdicted novelty 6.0

    PiLLar is the first LLM-guided Monte-Carlo Tree Search framework for joint schema-value matching on pivot tables, achieving 87.94% average accuracy on a new benchmark PTbench derived from real-world domains.

  40. ImproBR: Bug Report Improver Using LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    ImproBR combines a hybrid detector with GPT-4o mini and RAG to raise bug report structural completeness from 7.9% to 96.4% and executable steps from 28.8% to 67.6% on 139 Mojira reports.

  41. IAM: Identity-Aware Human Motion and Shape Joint Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.

  42. ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while...

  43. RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high qu...

  44. GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...

  45. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    cs.LG 2026-04 unverdicted novelty 6.0

    On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.

  46. Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to...

  47. A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need

    cs.LG 2026-04 unverdicted novelty 6.0

    Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.

  48. LOLGORITHM: Funny Comment Generation Agent For Short Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    LOLGORITHM is a modular multi-agent system for generating stylized funny comments on short videos that achieves 80-84% human preference over baselines on YouTube and Douyin datasets.

  49. A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection

    cs.SE 2026-04 unverdicted novelty 6.0

    QTyBERT matches or exceeds BERT-based log anomaly detection effectiveness while reducing embedding generation time to near static word embedding levels.

  50. LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics

    cs.CL 2026-04 unverdicted novelty 6.0

    A framework converts interpretable facial and acoustic features into language descriptions, feeds them to a pretrained LM for semantic embeddings, and uses those embeddings as priors to improve valence and arousal cha...

  51. Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge

    cs.DC 2026-04 unverdicted novelty 6.0

    ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, imp...

  52. ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions

    cs.RO 2026-04 unverdicted novelty 6.0

    ExpressMM integrates high-level language-guided planning with low-level vision-language-action policies to enable expressive and interruptible mobile manipulation behaviors in human-robot collaboration, shown effectiv...

  53. CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation

    cs.CV 2026-03 unverdicted novelty 6.0

    CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over ...

  54. Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    cs.RO 2025-02 accept novelty 6.0

    OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.

  55. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  56. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    cs.LG 2024-10 unverdicted novelty 6.0

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  57. OpenVLA: An Open-Source Vision-Language-Action Model

    cs.RO 2024-06 unverdicted novelty 6.0

    OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.

  58. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    cs.RO 2024-03 accept novelty 6.0

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  59. MiniLLM: On-Policy Distillation of Large Language Models

    cs.CL 2023-06 conditional novelty 6.0

    MiniLLM distills large language models into smaller ones via reverse KL divergence and on-policy optimization, yielding higher-quality responses with lower exposure bias than standard KD baselines.

  60. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    cs.LG 2023-05 accept novelty 6.0

    FrugalGPT learns query-specific cascades across heterogeneous LLM APIs to match or exceed top-model accuracy at far lower cost.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 84 Pith papers · 2 internal anchors
