super hub
Language Models are Few-Shot Learners
148 Pith papers cite this work. Polarity classification is still indexing.
abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
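To make the few-shot setting concrete: the task and a handful of demonstrations are written into a single text prompt, and the model simply continues the text, with no gradient updates or fine-tuning. The sketch below is a minimal illustration, not the paper's evaluation harness; the prompt format is an assumption, and `complete` is a hypothetical stand-in for whatever text-completion interface is available.

```python
# Minimal sketch of the few-shot setting: the task is specified purely via
# text, and the model is conditioned on the prompt with no gradient updates.
# `complete` below is a hypothetical stand-in for a text-completion API and
# is intentionally left undefined.

def build_few_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, k worked examples, and a new query."""
    lines = [instruction]
    for question, answer in demonstrations:
        lines.append(f"Q: {question}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")  # the model continues from here
    return "\n\n".join(lines)

# Example: 3-digit addition, one of the tasks mentioned in the abstract.
demos = [
    ("What is 348 + 217?", "565"),
    ("What is 905 + 164?", "1069"),
    ("What is 412 + 388?", "800"),
]
prompt = build_few_shot_prompt("Add the two numbers.", demos, "What is 276 + 549?")
print(prompt)
# answer = complete(prompt)  # hypothetical model call; no weights are updated
```

Varying the number of demonstrations from zero to a few is exactly the zero-shot, one-shot, and few-shot distinction the paper evaluates.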
hub tools
citation-role summary
citation-polarity summary
claims ledger
authors
co-cited works
roles
background 1
polarities
unclear 1
representative citing papers
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
LLM tasks are supported by multiple distinct circuits rather than unique mechanisms, demonstrated via Overlap-Aware Sheaf Repulsion and the Distributive Dense Circuit Hypothesis.
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
Inkheart SGD and M4 use bidirectional compression to achieve time complexities in distributed SGD that improve with worker count n and surpass prior lower bounds under a necessary structural assumption.
Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.
Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.
Transformers reconstruct the constituent RCFTs in tensor-product theories from low-energy spectra, reaching 98% accuracy on WZW models and generalizing to larger central charges with few out-of-domain examples.
E-MIA converts document details into four types of exam questions and aggregates the RAG's answers into a membership score that separates member and non-member documents better than prior similarity-based or probe-based attacks.
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
Agentic Witnessing enables privacy-preserving auditing of semantic properties in private data by running an LLM auditor in a TEE that answers binary queries and produces cryptographic transcripts of its reasoning.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Language models frequently violate temporal scope stability in multi-turn dialogues by drifting toward present-day assumptions even when they possess the correct facts.
KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.
A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership problems.
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
ProtoCycle improves text-guided protein design by coupling an LLM planner with tool feedback and reflection to achieve better language alignment and foldability than direct generation.
RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on subreddit data.
LiveGesture introduces the first fully streamable zero-lookahead co-speech full-body gesture generation model using a causal vector-quantized tokenizer and hierarchical autoregressive transformers that matches offline SOTA on BEAT2.
citing papers explorer
- Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
- Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
- The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
- From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public dataset.
- In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
- Diffusion Models Beat GANs on Image Synthesis
Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
- Scaling Laws for Autoregressive Generative Modeling
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- State-Space NTK Collapse Near Bifurcations
Bifurcations cause sNTK to reduce to a dominant rank-one channel matching normal forms, collapsing effective rank and funneling gradient descent into critical dynamical directions.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
- Spectral Transformer Neural Processes
STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.
- Ensemble Distributionally Robust Bayesian Optimisation
A tractable ensemble distributionally robust Bayesian optimization method achieves improved sublinear regret bounds under context uncertainty.
- Coupling Models for One-Step Discrete Generation
Coupling Models enable single-step discrete sequence generation via learned couplings to Gaussian latents and outperform prior one-step baselines on text perplexity, biological FBD, and image FID metrics.
- Query-efficient model evaluation using cached responses
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
- Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
FoE restructures MoE blocks into per-KV-head clusters with sum-based synchronization, removing all-to-all communication in single-node settings and limiting it to intra-node in multi-node settings for up to 5.2x faster inference with comparable quality.
- A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
- Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Controllable Image Generation with Composed Parallel Token Prediction
A new formulation for composing discrete generative processes enables precise control over novel condition combinations in image generation, cutting error rates by 63% and speeding up inference.
- HybridFlow: A Flexible and Efficient RLHF Framework
HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
- Representation Engineering: A Top-Down Approach to AI Transparency
Representation engineering uses population-level representations in deep neural networks to monitor and manipulate cognitive phenomena like honesty and harmlessness, providing simple effective baselines for LLM safety.
- Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
Geometric deep learning provides a unified mathematical framework based on grids, groups, graphs, geodesics, and gauges to explain and extend neural network architectures by incorporating physical regularities.
- Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
- When Context Sticks: Studying Interference in In-Context Learning
In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.
- Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
- Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, diagonal Jacobian, and a new efficient GPU kernel.
- FLeX: Fourier-based Low-rank EXpansion for multilingual transfer
LoRA fine-tuning of Code Llama with Fourier regularization raises Java pass@1 from 34.2% to 42.1% while using a small high-quality dataset.
- The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge
A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.