GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
hub Canonical reference
Universal Language Model Fine-tuning for Text Classification
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open-source our pretrained models and code.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.
HormoneT5 augments T5 with a hormone-inspired block that predicts six continuous emotion values and uses them to modulate responses, reporting over 85% per-hormone accuracy and human preference for emotional quality.
Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
FinBERT adapts BERT to the financial domain and outperforms prior state-of-the-art methods on financial sentiment analysis tasks.
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Language models fine-tuned via RL on 5k-60k human preference comparisons produce stylistically better text continuations and human-preferred summaries that sometimes copy input sentences.
DG-Hard uses Donoho-Gavish hard thresholding on the fine-tuning weight delta to separate task-aligned signal from noise-like residual, recovering damaged capabilities while preserving target-task gains.
PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
A GNN-ODE surrogate forecasts reactor thermal-hydraulics under partial observability, achieving low MAE on held-out transients, fast inference, and recovery of a physical Reynolds-number exponent after fine-tuning on limited experimental data.
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
TabTransformer uses Transformer self-attention to generate contextual embeddings from categorical features in tabular data, outperforming prior deep learning methods by at least 1% mean AUC and matching tree-based ensembles on 15 public datasets while showing robustness to missing and noisy features
Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
Authors release a new 800-sentence gender-balanced profession dataset and use it to test occupational gender stereotypes in three sentiment analysis models.
Cross-domain memory transfer in coding agents boosts average performance by 3.7% via meta-knowledge like validation routines, with high-level abstractions transferring better than low-level traces.
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
citing papers explorer
-
Towards Imputation of Pre-Trained Language Model Metadata using Semantic Fingerprinting
SemFin combines model configuration files with repository tags to impute missing metadata across 317k PTLMs, outperforming propagation baselines by up to 31.4% and expanding reuse and license lineage chains on 167k models.