DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
80 papers on Pith cite this work.
abstract
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
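The triple loss mentioned in the abstract combines a masked language modeling term, a soft-target distillation term, and a cosine-distance term over hidden states. Below is a minimal PyTorch-style sketch of how such a combination is typically wired together; the loss weights, temperature, and tensor shapes are illustrative assumptions rather than the paper's actual settings.

```python
import torch
import torch.nn.functional as F

def distillation_triple_loss(student_logits, teacher_logits, student_hidden,
                             teacher_hidden, labels, temperature=2.0,
                             w_mlm=1.0, w_kd=1.0, w_cos=1.0):
    """Sketch of a masked-LM + distillation + cosine-distance objective.

    Illustrative shapes: logits are (batch, seq, vocab), hidden states are
    (batch, seq, dim), labels are (batch, seq) with -100 on unmasked tokens.
    Weights and temperature are assumptions, not the paper's values.
    """
    # 1) Masked language modeling loss against the true tokens.
    mlm = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten(),
                          ignore_index=-100)

    # 2) Distillation loss: soften both distributions with a temperature and
    #    match the student to the teacher with KL divergence.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # 3) Cosine-distance loss aligning student and teacher hidden states.
    target = torch.ones(student_hidden.shape[:-1], device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden.flatten(0, 1),
                                  teacher_hidden.flatten(0, 1),
                                  target.flatten())

    return w_mlm * mlm + w_kd * kd + w_cos * cos
```

The sketch only shows the shape of the objective, not the DistilBERT training recipe or hyperparameters.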
citing papers explorer
-
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
-
Learning the Signature of Memorization in Autoregressive Language Models
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity
Foundation model priors amplify worst-client disparity under extreme federated heterogeneity, creating a fairness paradox where larger models perform worse for disadvantaged clients.
-
Switchcraft: AI Model Router for Agentic Tool Calling
Switchcraft routes agentic tool-calling queries to the lowest-cost model that preserves correctness, reaching 82.9% accuracy and 84% cost reduction on five benchmarks.
-
TRACE: Transport Alignment Conformal Prediction via Diffusion and Flow Matching Models
TRACE creates valid conformal prediction sets for complex generative models by scoring outputs via averaged denoising or velocity errors along stochastic transport paths instead of likelihoods.
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
DEFault++ delivers automated hierarchical fault detection, categorization into 12 transformer-specific types, and root-cause diagnosis among 45 mechanisms on a new benchmark of 3,739 mutated instances, with AUROC >0.96 and Macro-F1 0.85, plus improved developer repair accuracy in a user study.
-
VOW: Verifiable and Oblivious Watermark Detection for Large Language Models
VOW formulates LLM watermark detection as a secure two-party computation using a Verifiable Oblivious Pseudorandom Function to achieve private and cryptographically verifiable detection.
-
Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
-
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
AgentPulse is a continuous multi-signal framework that scores AI agents on benchmark performance, adoption, sentiment and ecosystem health, showing these factors are complementary and that benchmark-plus-sentiment predicts external adoption metrics.
-
Adaptive Head Budgeting for Efficient Multi-Head Attention
BudgetFormer adaptively budgets the number and selection of attention heads per input in Transformers, reducing FLOPs and memory on text classification while matching or exceeding standard multi-head performance.
-
RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
RoLegalGEC is the first Romanian legal-domain dataset for grammatical error detection and correction, consisting of 350,000 examples, with evaluations of several neural models.
-
GuardPhish: Securing Open-Source LLMs from Phishing Abuse
Open-source LLMs detect phishing intent at high rates but still generate actionable phishing content, and GuardPhish supplies a dataset plus modular classifiers to close the gap.
-
Depth Adaptive Efficient Visual Autoregressive Modeling
DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
-
SecureRouter: Encrypted Routing for Efficient Secure Inference
SecureRouter accelerates secure transformer inference by 1.95x via an encrypted router that selects input-adaptive models from an MPC-optimized pool with negligible accuracy loss.
-
Multilingual Multi-Label Emotion Classification at Scale with Synthetic Data
Synthetic data of 1M+ multi-label samples across 23 languages trains models that match or exceed English-only specialists on zero-shot benchmarks for emotion classification.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Kathleen: Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
Kathleen performs byte-level text classification via recurrent oscillator banks, FFT wavetable encoding, and phase harmonics, matching pretrained baselines on standard benchmarks with 36% fewer parameters.
-
A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.
-
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better token efficiency.
-
Eliciting Latent Predictions from Transformers with the Tuned Lens
Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
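Reading the entry above as the published tuned-lens method, the mechanism is a small affine "translator" per layer, trained so that an intermediate hidden state decoded through the model's own final layer norm and unembedding matches the final-layer distribution. A minimal sketch follows, assuming a GPT-style model that exposes a final layer norm `ln_f` and an unembedding module; the names and the KL training signal shown are illustrative.

```python
import torch
import torch.nn.functional as F

class TunedLensProbe(torch.nn.Module):
    """Affine translator for one layer: h -> A h + b, initialized as the identity."""
    def __init__(self, d_model):
        super().__init__()
        self.affine = torch.nn.Linear(d_model, d_model)
        torch.nn.init.eye_(self.affine.weight)
        torch.nn.init.zeros_(self.affine.bias)

    def forward(self, h):
        return self.affine(h)

def lens_logits(probe, hidden, ln_f, unembed):
    """Decode an intermediate hidden state through the model's own head."""
    return unembed(ln_f(probe(hidden)))

def probe_loss(probe, hidden_l, final_logits, ln_f, unembed):
    """KL between the lens prediction at layer l and the model's final logits."""
    pred = lens_logits(probe, hidden_l, ln_f, unembed)
    return F.kl_div(F.log_softmax(pred, dim=-1),
                    F.softmax(final_logits, dim=-1),
                    reduction="batchmean")
```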
-
Accelerating Large Language Model Decoding with Speculative Sampling
Speculative sampling accelerates LLM decoding 2-2.5x by letting a draft model propose short sequences that the target model scores in parallel, then applies modified rejection sampling to keep the exact target distribution.
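The rejection rule summarized above can be sketched for a single draft token; `p` and `q` below stand for the target and draft next-token probability vectors, and the batched multi-token lookahead of the full algorithm is omitted.

```python
import torch

def speculative_accept(x, p, q, generator=None):
    """Accept or replace one draft token x (sketch).

    p: target-model next-token probabilities, shape (vocab,)
    q: draft-model next-token probabilities, shape (vocab,)
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    u = torch.rand(1, generator=generator).item()
    if u < min(1.0, (p[x] / q[x]).item()):
        return x
    # Otherwise resample from the renormalized residual max(p - q, 0).
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return torch.multinomial(residual, 1, generator=generator).item()
```

Because accepted tokens follow min(1, p/q) and rejections resample from the normalized residual, the emitted token is distributed exactly according to the target model's distribution, which is why the speedup does not change outputs.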
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
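The text-to-text casting in the entry above amounts to serializing every task as an (input string, target string) pair with a task prefix, so a single encoder-decoder model handles them all. The translation and summarization prefixes below follow T5's published convention; the concrete sentences and the classification label wording are illustrative.

```python
# Every task becomes (input text, target text); one seq-to-seq model handles all of them.
examples = [
    # Translation: task prefix + source sentence -> target-language sentence.
    ("translate English to German: The house is wonderful.",
     "Das Haus ist wunderbar."),
    # Summarization: prefix + document -> summary text.
    ("summarize: state authorities dispatched emergency crews on Tuesday ...",
     "six people hospitalized after a storm in the county."),
    # Classification: prefix + sentence -> the label spelled out as text.
    ("sst2 sentence: it confirms the director's status as a filmmaker.",
     "positive"),
]
```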
-
N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
-
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation
RESD turns failure trajectories into token-level supervision via retrospective reflections and a persistent global playbook, enabling faster improvement than standard self-distillation or GRPO with only one rollout per prompt.
-
BoolXLLM: LLM-Assisted Explainability for Boolean Models
BoolXLLM augments an existing Boolean rule learner with LLMs for feature selection, discretization thresholds, and natural-language rule translation to improve interpretability while preserving accuracy.
-
Unified Approach for Weakly Supervised Multicalibration
A unified framework uses contamination-matrix risk rewrites and witness-based calibration constraints to estimate and correct multicalibration under weak supervision, providing finite-sample guarantees and the WLMC post-hoc recalibration algorithm.
-
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
-
Patch-Effect Graph Kernels for LLM Interpretability
Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.
-
DiBA: Diagonal and Binary Matrix Approximation for Neural Network Weight Compression
DiBA factors weight matrices into diagonal-binary-diagonal-binary-diagonal form to cut matrix-vector multiplies from mn to m+k+n operations and improves accuracy on DistilBERT and audio transformer tasks after replacement.
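Going only by the one-line summary above, the multiply count works out as follows: if an m x n weight matrix is replaced by diag(d_m) · B1 · diag(d_k) · B2 · diag(d_n), only the three diagonal scalings need multiplications, since products with binary (here assumed ±1) matrices reduce to additions and subtractions, giving m + k + n multiplies per matrix-vector product instead of m·n. The sketch below illustrates that structure; it is not the paper's actual factorization or fitting procedure.

```python
import numpy as np

def diba_matvec(d_m, B1, d_k, B2, d_n, x):
    """y = diag(d_m) @ B1 @ diag(d_k) @ B2 @ diag(d_n) @ x  (illustrative sketch).

    d_m, d_k, d_n: diagonal entries of shape (m,), (k,), (n,).
    B1: (m, k) and B2: (k, n) with entries in {-1, +1} (assumed), so those two
    products need only additions and subtractions.
    Multiplications: n + k + m, versus m * n for a dense matrix-vector product.
    """
    v = d_n * x          # n multiplies
    v = B2 @ v           # additions only (binary entries)
    v = d_k * v          # k multiplies
    v = B1 @ v           # additions only
    return d_m * v       # m multiplies
```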
-
LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
LEAP adds a layer-wise exit-aware constraint to standard distillation, reconciling it with early-exit mechanisms and delivering 1.61x wall-clock speedup on MiniLM at 0.95 threshold with 91.9% early exits by layer 7.
-
Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding
Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.
-
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding
TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.
-
PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search
PiLLar is the first LLM-guided Monte-Carlo Tree Search framework for joint schema-value matching on pivot tables, achieving 87.94% average accuracy on a new benchmark PTbench derived from real-world domains.
-
ImproBR: Bug Report Improver Using LLMs
ImproBR combines a hybrid detector with GPT-4o mini and RAG to raise bug report structural completeness from 7.9% to 96.4% and executable steps from 28.8% to 67.6% on 139 Mojira reports.
-
IAM: Identity-Aware Human Motion and Shape Joint Generation
IAM jointly synthesizes motion sequences and body shape parameters conditioned on multimodal identity signals to achieve more realistic and identity-consistent human motions.
-
ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models
ADE scales multi-anchor word representations to transformers via Vocabulary Projection, Grouped Positional Encoding, and context-aware reweighting, achieving 98.7% fewer trainable parameters than DeBERTa-v3-base while matching or exceeding it on two text-classification benchmarks.
-
RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization
RouteNLP is a closed-loop LLM routing framework using conformal cascading and distillation co-optimization that cut inference costs by 58% in an 8-week enterprise deployment while preserving 91% acceptance and high quality on benchmarks.
-
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.
-
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
-
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.
-
A Little Rank Goes a Long Way: Random Scaffolds with LoRA Adapters Are All You Need
Frozen random backbones with low-rank LoRA adapters recover 96-100% of fully trained performance on diverse architectures while training only 0.5-40% of parameters.
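The adapter mechanics behind the entry above are standard LoRA: the backbone weight stays frozen and only a rank-r update B·A is trained. The sketch below uses the usual alpha/r scaling and zero-initialized up-projection from the original LoRA formulation; nothing in it is specific to the random-scaffold setup of this citing paper.

```python
import torch

class LoRALinear(torch.nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, d_in, d_out, r=8, alpha=16.0):
        super().__init__()
        self.base = torch.nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                     # backbone stays frozen
        self.A = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)   # down-projection
        self.B = torch.nn.Parameter(torch.zeros(d_out, r))         # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x):
        # Base output plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```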
-
LOLGORITHM: Funny Comment Generation Agent For Short Videos
LOLGORITHM is a modular multi-agent system for generating stylized funny comments on short videos that achieves 80-84% human preference over baselines on YouTube and Douyin datasets.
-
A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection
QTyBERT matches or exceeds the effectiveness of BERT-based log anomaly detection while reducing embedding-generation time to nearly that of static word embeddings.
-
LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics
A framework converts interpretable facial and acoustic features into language descriptions, feeds them to a pretrained LM for semantic embeddings, and uses those embeddings as priors to improve valence and arousal change prediction on Aff-Wild2 and SEWA while remaining transparent.
-
Beyond End-to-End: Dynamic Chain Optimization for Private LLM Adaptation on the Edge
ChainFed achieves memory-efficient private LLM fine-tuning on edge devices through sequential layer-by-layer adapter training with dynamic co-tuning, perceptive optimization, and adaptive starting point selection, improving accuracy by up to 46.46%.
-
ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions
ExpressMM integrates high-level language-guided planning with low-level vision-language-action policies to enable expressive and interruptible mobile manipulation behaviors in human-robot collaboration, shown effective in an assembly task via audience evaluations.
-
CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation
CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over prior distillation methods.