Universal Language Model Fine-tuning for Text Classification

Jeremy Howard, Sebastian Ruder · 2018 · cs.CL · arXiv 1801.06146

Canonical reference. 38 Pith papers cite this work; 83% of classified citations cite it as background.

abstract

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open-source our pretrained models and code.

citation-role summary

background 5 · method 1

citation-polarity summary

background 5 · use method 1

representative citing papers

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

A Hormone-inspired Emotion Layer for Transformer language models (HELT)

cs.NE · 2026-04-13 · unverdicted · novelty 7.0

HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

FinBERT: Financial Sentiment Analysis with Pre-trained Language Models

cs.CL · 2019-08-27 · unverdicted · novelty 7.0

FinBERT adapts BERT to the financial domain and outperforms prior state-of-the-art methods on financial sentiment analysis tasks.

XLNet: Generalized Autoregressive Pretraining for Language Understanding

cs.CL · 2019-06-19 · accept · novelty 7.0

XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.

Reconstructing conformal field theoretical compositions with Transformers

hep-th · 2026-05-01 · unverdicted · novelty 7.0

Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

cs.CL · 2019-09-26 · accept · novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

Fine-Tuning Language Models from Human Preferences

cs.CL · 2019-09-18 · unverdicted · novelty 7.0

Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.

Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

DG-Hard uses Donoho-Gavish hard thresholding on the fine-tuning weight delta to separate task-aligned signal from noise-like residual, recovering damaged capabilities while preserving target-task gains.

PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability

cs.LG · 2026-04-08 · unverdicted · novelty 6.0 · 2 refs

A GNN-ODE surrogate forecasts reactor thermal-hydraulics under partial observability, achieving low MAE on held-out transients, fast inference, and recovery of a physical Reynolds-number exponent after fine-tuning on limited experimental data.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

TabTransformer: Tabular Data Modeling Using Contextual Embeddings

cs.LG · 2020-12-11 · unverdicted · novelty 6.0

TabTransformer uses Transformer self-attention to generate contextual embeddings from categorical features in tabular data, outperforming prior deep learning methods by at least 1% mean AUC and matching tree-based ensembles on 15 public datasets while showing robustness to missing and noisy features.

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

cs.CL · 2020-02-10 · accept · novelty 6.0

Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.

CTRL: A Conditional Transformer Language Model for Controllable Generation

cs.CL · 2019-09-11 · unverdicted · novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis

cs.CL · 2019-06-24 · unverdicted · novelty 6.0

Authors release a new 800-sentence gender-balanced profession dataset and use it to test occupational gender stereotypes in three sentiment analysis models.

Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents

cs.AI · 2026-04-15 · conditional · novelty 6.0

Cross-domain memory transfer in coding agents boosts average performance by 3.7% via meta-knowledge like validation routines, with high-level abstractions transferring better than low-level traces.

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

cs.CL · 2024-04-09 · conditional · novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

citing papers explorer

Showing 38 of 38 citing papers.

Language Models are Few-Shot Learners cs.CL · 2020-05-28 · accept · none · ref 23
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
On the Geometry of On-Policy Distillation cs.LG · 2026-06-05 · unverdicted · none · ref 6 · internal anchor
OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.
A Hormone-inspired Emotion Layer for Transformer language models (HELT) cs.NE · 2026-04-13 · unverdicted · none · ref 30 · internal anchor
HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.
Massive Activations in Large Language Models cs.CL · 2024-02-27 · unverdicted · none · ref 43 · internal anchor
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
FinBERT: Financial Sentiment Analysis with Pre-trained Language Models cs.CL · 2019-08-27 · unverdicted · none · ref 5 · internal anchor
FinBERT adapts BERT to the financial domain and outperforms prior state-of-the-art methods on financial sentiment analysis tasks.
XLNet: Generalized Autoregressive Pretraining for Language Understanding cs.CL · 2019-06-19 · accept · none · ref 14 · internal anchor
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
Reconstructing conformal field theoretical compositions with Transformers hep-th · 2026-05-01 · unverdicted · none · ref 42
Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 124
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 45
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 28
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations cs.CL · 2019-09-26 · accept · none · ref 17
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Fine-Tuning Language Models from Human Preferences cs.CL · 2019-09-18 · unverdicted · none · ref 7
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining cs.LG · 2026-05-19 · unverdicted · none · ref 5 · internal anchor
DG-Hard uses Donoho-Gavish hard thresholding on the fine-tuning weight delta to separate task-aligned signal from noise-like residual, recovering damaged capabilities while preserving target-task gains.
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts cs.CL · 2026-05-13 · unverdicted · none · ref 79 · internal anchor
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability cs.LG · 2026-04-08 · unverdicted · none · ref 44 · 2 links · internal anchor
A GNN-ODE surrogate forecasts reactor thermal-hydraulics under partial observability, achieving low MAE on held-out transients, fast inference, and recovery of a physical Reynolds-number exponent after fine-tuning on limited experimental data.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 176 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
TabTransformer: Tabular Data Modeling Using Contextual Embeddings cs.LG · 2020-12-11 · unverdicted · none · ref 79 · internal anchor
TabTransformer uses Transformer self-attention to generate contextual embeddings from categorical features in tabular data, outperforming prior deep learning methods by at least 1% mean AUC and matching tree-based ensembles on 15 public datasets while showing robustness to missing and noisy features.
How Much Knowledge Can You Pack Into the Parameters of a Language Model? cs.CL · 2020-02-10 · accept · none · ref 49 · internal anchor
Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.
CTRL: A Conditional Transformer Language Model for Controllable Generation cs.CL · 2019-09-11 · unverdicted · none · ref 17 · internal anchor
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
Good Secretaries, Bad Truck Drivers? Occupational Gender Stereotypes in Sentiment Analysis cs.CL · 2019-06-24 · unverdicted · none · ref 19 · internal anchor
Authors release a new 800-sentence gender-balanced profession dataset and use it to test occupational gender stereotypes in three sentiment analysis models.
Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents cs.AI · 2026-04-15 · conditional · none · ref 2
Cross-domain memory transfer in coding agents boosts average performance by 3.7% via meta-knowledge like validation routines, with high-level abstractions transferring better than low-level traces.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 23
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 98
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Ethical and social risks of harm from Language Models cs.CL · 2021-12-08 · accept · none · ref 118
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 43
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Towards Imputation of Pre-Trained Language Model Metadata using Semantic Fingerprinting cs.SE · 2026-06-19 · unverdicted · none · ref 9 · internal anchor
SemFin combines model configuration files with repository tags to impute missing metadata across 317k PTLMs, outperforming propagation baselines by up to 31.4% and expanding reuse and license lineage chains on 167k models.
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment cs.AI · 2023-08-10 · accept · none · ref 33 · internal anchor
Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
Quantifying the Carbon Emissions of Machine Learning cs.CY · 2019-10-21 · unverdicted · none · ref 9 · internal anchor
Presents a calculator tool for estimating carbon emissions from ML model training along with mitigation actions.
FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation cs.LG · 2026-05-06 · unverdicted · none · ref 10 · 2 links
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.
Uncertainty-Aware Transformers: Conformal Prediction for Language Models cs.LG · 2026-04-10 · unverdicted · none · ref 5
CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.
RoBERTa: A Robustly Optimized BERT Pretraining Approach cs.CL · 2019-07-26 · accept · none · ref 17
With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison cs.AI · 2026-06-03 · unverdicted · none · ref 27 · internal anchor
Headache specialists preferred their own literature summaries over those from Sonnet, GPT-4o, and Llama 3.1 in a blinded evaluation, though AI summaries were sometimes indistinguishable.
Transfer Learning for Risk Classification of Social Media Posts: Model Evaluation Study cs.CL · 2019-07-04 · unverdicted · none · ref 15 · internal anchor
Fine-tuning GPT-1 on 150,000 unlabeled Reachout.com posts and then feeding the features into AutoML yields a new state-of-the-art macro F1 of 0.572 for triaging risk in 1,588 labeled CLPsych 2017 posts without metadata or history.
Enhancing PIO Element Detection in Medical Text Using Contextualized Embedding cs.CL · 2019-06-26 · unverdicted · none · ref 6 · internal anchor
Builds an improved PIO dataset and reports performance gains from domain-specific BERT embeddings plus ensembles in multi-label PIO classification.
Domain Fine-Tuning FinBERT on Finnish Histopathological Reports: Train-Time Signals and Downstream Correlations cs.CL · 2026-04-16 · unverdicted · none · ref 1
Fine-tuning FinBERT on Finnish medical text produces embedding geometry shifts whose correlation with downstream performance the authors attempt to measure as a potential early signal for domain adaptation benefit.
Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction cs.CL · 2019-07-06 · unverdicted · none · ref 6 · internal anchor
A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.
A Scalable Framework for Multilevel Streaming Data Analytics using Deep Learning eess.SY · 2019-07-15 · unverdicted · none · ref 21 · internal anchor
Describes a multilevel streaming text analytics framework combining Spark streaming, LSTM models, and SQL processing for real-time sentiment analysis demonstrated on a business use case.
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF cs.CL · 2026-05-05 · unverdicted · none · ref 37
The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.
