Parameter-Efficient Transfer Learning for NLP

Andrea Gesmundo; Andrei Giurgiu; Bruna Morrone; Mona Attariyan; Neil Houlsby; Quentin de Laroussilhe; Stanislaw Jastrzebski; Sylvain Gelly

arxiv: 1902.00751 · v2 · pith:QD3VZYO4new · submitted 2019-02-02 · 💻 cs.LG · cs.CL · stat.ML

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, Sylvain Gelly

classification 💻 cs.LG · cs.CL · stat.ML

keywords parameters · task · fine-tuning · transfer · adapter · model · only · tasks

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate adapters' effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
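
The abstract's two central claims (a small trainable module added to a frozen network, and a per-task cost of a few percent instead of 100%) can be illustrated with a minimal sketch. The abstract itself does not describe the module's internals; the bottleneck-with-skip-connection shape, the ReLU nonlinearity, the zero initialization, the sizes `d` and `m`, and the count of 10 tasks are all assumptions made here for illustration, and every function name is hypothetical. Biases are omitted for brevity.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def adapter_forward(h, w_down, w_up):
    """Assumed adapter shape: h + W_up @ relu(W_down @ h)."""
    z = [max(0.0, v) for v in matvec(w_down, h)]  # project d -> m, then ReLU
    delta = matvec(w_up, z)                       # project m -> d
    return [hi + di for hi, di in zip(h, delta)]  # skip connection


def adapter_param_count(d, m):
    """Trainable weights in one adapter (two projections, no biases)."""
    return 2 * d * m


def relative_storage(num_tasks, per_task_fraction):
    """Total stored parameters, in units of one base model.

    One shared frozen base model plus a small per-task addition.
    """
    return 1.0 + num_tasks * per_task_fraction


d, m = 8, 2  # hypothetical hidden size and bottleneck size
random.seed(0)
w_down = [[random.gauss(0.0, 0.01) for _ in range(d)] for _ in range(m)]
w_up = [[0.0] * m for _ in range(d)]  # zero init: the adapter starts as identity
h = [random.gauss(0.0, 1.0) for _ in range(d)]

out = adapter_forward(h, w_down, w_up)

num_tasks = 10  # hypothetical task count
adapters_total = relative_storage(num_tasks, 0.036)  # 3.6% per task (abstract)
fine_tuning_total = float(num_tasks)                  # one full copy per task

print(adapter_param_count(d, m))
print(f"adapters: {adapters_total:.2f}x base, fine-tuning: {fine_tuning_total:.0f}x base")
```

With the up-projection initialized to zero, the adapter leaves its input unchanged, so inserting it into a frozen network does not alter the pre-trained behavior until training begins. The storage comparison shows why the per-task fraction matters: the adapter total grows by 0.036 of a base model per task, while full fine-tuning grows by one whole model per task.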

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CARD: Cluster-level Adaptation with Reward-guided Decoding for Personalized Text Generation
cs.AI 2026-01 unverdicted novelty 7.0

CARD uses style-based user clustering and implicit preference contrasts to enable efficient personalized text generation via lightweight decoding adjustments on frozen LLMs.
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
cs.LG 2025-10 unverdicted novelty 7.0

One of the Q, K or V weights in transformer self-attention is redundant and replaceable by the identity matrix under mild assumptions, reducing parameters by 25 percent with no loss in small-model performance.
Residual Feature Integration is Sufficient to Prevent Negative Transfer
cs.LG 2025-05 unverdicted novelty 7.0

Residual feature integration with a trainable target-side encoder provably prevents negative transfer, achieving convergence rates no worse than training from scratch under informative target distributions.
LoRA: Low-Rank Adaptation of Large Language Models
cs.CL 2021-06 accept novelty 7.0

Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
Empirical Bayes Conformal Prediction for Vision and Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Empirical Bayes conformal prediction converts score variability into r-value nonconformity scores that preserve target coverage while reducing inclusion of high-variance false candidates in image classification, CLIP ...
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA
cs.CL 2026-05 unverdicted novelty 6.0

Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggrega...
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
cs.CL 2026-05 unverdicted novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
Do Masked Autoencoders Improve Downhole Prediction? An Empirical Study on Real Well Drilling Data
cs.LG 2026-04 unverdicted novelty 6.0

Masked autoencoder pretraining on 3.5 million timesteps of real drilling telemetry reduces total mud volume prediction error by 19.8% versus supervised GRU but trails LSTM by 6.4% on Utah FORGE wells.
AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
cs.LG 2026-04 unverdicted novelty 6.0

AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...
LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering
cs.CR 2026-04 unverdicted novelty 6.0

LLM4CodeRE adapts LLMs with multi-adapter and seq2seq fine-tuning for accurate assembly-to-source decompilation and reverse translation in code reverse engineering.
HyperAdapt: Simple High-Rank Adaptation
cs.LG 2025-09 unverdicted novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation
cs.CV 2026-05 unverdicted novelty 5.0

BiomedAP improves robustness of biomedical VLMs to prompt variations using gated cross-modal fusion and dual-anchor constraints, outperforming baselines on 11 benchmarks.
Mesh Based Simulations with Spatial and Temporal awareness
cs.LG 2026-05 unverdicted novelty 5.0

A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention corre...
SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning
cs.DC 2026-04 unverdicted novelty 5.0

SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.
Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation
cs.LG 2026-04 conditional novelty 5.0

A temporal extension of TabDDPM generates coherent synthetic time-series sequences on the WISDM dataset that match real distributions and support downstream classification with macro F1 of 0.64.
Assessing the Potential of Masked Autoencoder Foundation Models in Predicting Downhole Metrics from Surface Drilling Data
cs.LG 2026-04 unverdicted novelty 4.0

A literature review of thirteen papers finds that masked autoencoders have not been applied to downhole metric prediction from surface drilling data despite their advantages for unlabeled time-series modeling.
From Weights to Activations: Is Steering the Next Frontier of Adaptation?
cs.CL 2026-04 unverdicted novelty 4.0

Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.
AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control
cs.LG 2025-12 unverdicted novelty 4.0

AdaFRUGAL automates FRUGAL's static hyperparameters with linear decay on subspace ratio and loss-aware update frequency, delivering competitive accuracy with lower memory and faster training on C4, VietVault, and GLUE.
Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector
cs.CL 2025-09 unverdicted novelty 3.0

Fine-tuned LLaMA 3.1-8B variants for the energy sector outperform the base model on domain QA benchmarks, with LoRA delivering similar gains at lower training cost.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security
cs.HC 2024-01 unverdicted novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
A Transfer Learning Evaluation of Deep Neural Networks for Image Classification
cs.CV 2026-05 unverdicted novelty 2.0

Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.
Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)
cs.CL 2025-01 unverdicted novelty 2.0

A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.