pith. machine review for the scientific record.

arxiv: 1902.00751 · v2 · submitted 2019-02-02 · 💻 cs.LG · cs.CL · stat.ML

Recognition: unknown

Parameter-Efficient Transfer Learning for NLP

Authors on Pith: no claims yet
classification 💻 cs.LG · cs.CL · stat.ML
keywords: parameters, task, fine-tuning, transfer, adapter, model, only, tasks
Original abstract

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate adapter's effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
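The abstract describes adapters as small trainable modules inserted into a frozen pre-trained network. A minimal sketch of that idea, assuming a PyTorch-style bottleneck with a residual connection; the hidden size, bottleneck width, and the freezing helper are illustrative assumptions, not details taken from this page:

```python
# Illustrative sketch only: a bottleneck adapter with a residual connection.
# Sizes, activation, and placement inside the Transformer are assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down to a small bottleneck
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up to the model width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection means a near-zero adapter leaves the frozen model's output unchanged.
        return x + self.up(self.act(self.down(x)))

def train_adapters_only(model: nn.Module) -> None:
    # Freeze everything except adapter weights, so each new task adds only a few parameters.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()
```

With a hidden size of 768 and a bottleneck of 64, each such adapter holds roughly 2 × 768 × 64 ≈ 0.1M weights, which is why the per-task addition stays at a few percent of the full model, in line with the 3.6% figure quoted in the abstract.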

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LoRA: Low-Rank Adaptation of Large Language Models

    cs.CL 2021-06 accept novelty 7.0

    Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency; an illustrative sketch of this low-rank update follows the list below.

  2. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification, and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

  3. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  4. Do Masked Autoencoders Improve Downhole Prediction? An Empirical Study on Real Well Drilling Data

    cs.LG 2026-04 unverdicted novelty 6.0

    Masked autoencoder pretraining on 3.5 million timesteps of real drilling telemetry reduces total mud volume prediction error by 19.8% versus supervised GRU but trails LSTM by 6.4% on Utah FORGE wells.

  5. AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling

    cs.LG 2026-04 unverdicted novelty 6.0

    AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...

  6. LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering

    cs.CR 2026-04 unverdicted novelty 6.0

    LLM4CodeRE adapts LLMs with multi-adapter and seq2seq fine-tuning for accurate assembly-to-source decompilation and reverse translation in code reverse engineering.

  7. Mesh Based Simulations with Spatial and Temporal awareness

    cs.LG 2026-05 unverdicted novelty 5.0

    A unified training framework for mesh-based ML surrogates in CFD improves accuracy and long-horizon stability by enforcing spatial derivative consistency via multi-node prediction, using temporal cross-attention corre...

  8. SplitFT: An Adaptive Federated Split Learning System For LLMs Fine-Tuning

    cs.DC 2026-04 unverdicted novelty 5.0

    SplitFT adapts cut-layer selection and reduces LoRA rank per client in federated split learning to improve efficiency and performance when fine-tuning LLMs on heterogeneous devices and data.

  9. Extending Tabular Denoising Diffusion Probabilistic Models for Time-Series Data Generation

    cs.LG 2026-04 conditional novelty 5.0

    A temporal extension of TabDDPM generates coherent synthetic time-series sequences on the WISDM dataset that match real distributions and support downstream classification with macro F1 of 0.64.

  10. Assessing the Potential of Masked Autoencoder Foundation Models in Predicting Downhole Metrics from Surface Drilling Data

    cs.LG 2026-04 unverdicted novelty 4.0

    A literature review of thirteen papers finds that masked autoencoders have not been applied to downhole metric prediction from surface drilling data despite their advantages for unlabeled time-series modeling.

  11. From Weights to Activations: Is Steering the Next Frontier of Adaptation?

    cs.CL 2026-04 unverdicted novelty 4.0

    Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.

  12. A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

    cs.CV 2026-05 unverdicted novelty 2.0

    Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics.
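As referenced in item 1 above, here is a minimal sketch of the LoRA-style low-rank update: a frozen linear layer W plus a trainable decomposition B·A. The rank, scaling, and initialisation below are illustrative assumptions, not details from the cited paper.

```python
# Illustrative sketch only: LoRA-style low-rank adaptation of a frozen linear layer.
# Rank, alpha scaling, and initialisation are assumptions for demonstration.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)  # stands in for the frozen pre-trained W
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(out_features, rank))        # zero init: update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output is x W^T plus the scaled low-rank correction x (BA)^T; only A and B get gradients.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Because the learned update is a plain matrix product, B @ A can be folded into the frozen weight after training, which is how the "no inference latency" claim in the summary is realised.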