Language Models are Few-Shot Learners

Benjamin Mann, Jared Kaplan, Melanie Subbiah, Nick Ryder, Prafulla Dhariwal, Tom B. Brown · 2020 · cs.CL · arXiv 2005.14165

Canonical reference. 76% of citing Pith papers cite this work as background.

408 Pith papers citing it

Background 76% of classified citations

abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

citation-role summary

background: 70 · method: 9 · dataset: 2 · baseline: 1

citation-polarity summary

background: 62 · use method: 9 · unclear: 7 · use dataset: 2 · baseline: 1 · support: 1

representative citing papers

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

cs.LG · 2026-06-18 · unverdicted · novelty 8.0

StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

cs.CL · 2026-06-01 · unverdicted · novelty 8.0

NLP papers commonly report annotator recruitment, expertise, and volume but frequently omit training, compensation, socio-demographics, adjudication, and agreement metrics, with reporting improving over time yet remaining uneven across tasks and venues.

Are Flat Minima an Illusion?

cs.LG · 2026-03-24 · unverdicted · novelty 8.0

Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Show Your Work: Scratchpads for Intermediate Computation with Language Models

cs.LG · 2021-11-30 · unverdicted · novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Generative Language Modeling for Automated Theorem Proving

cs.LG · 2020-09-07 · unverdicted · novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

HCMS: Head-Chunked Multi-Stream Pipeline for Communication-Computation Overlap in Long-Sequence Parallel Attention

cs.DC · 2026-07-02 · unverdicted · novelty 7.0

HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.

Masked Language Flow Models

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

MLFMs combine masking with continuous flows to scale flow-based language models to reasoning and instruction-following tasks on GSM8K and MT-Bench.

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

cs.LG · 2026-06-10 · unverdicted · novelty 7.0

ICL in LLMs shows a sharp ceiling on categorical distributions for high-cardinality tabular data, failing to reproduce rare classes despite examples, while numerical fidelity improves.

The Power of Test-Time Training for Approximate Sampling

cs.DS · 2026-06-09 · unverdicted · novelty 7.0

Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0 · 2 refs

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

RL with chrF reward trains LLMs to better utilize in-context linguistic knowledge for zero-shot translation of unseen languages, outperforming ICL and SFT.

Toward Calibrated, Fair, and accurate Deepfake Detection

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.

Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

The authors introduce a three-level formality spectrum (informal, casual, formal) and the 3LF dataset to correct supervision misalignment in formality transfer, reporting large gains in informal-to-formal performance on models including GPT variants.

citing papers explorer

Showing 41 of 41 citing papers after filters.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · none · ref 2 · internal anchor

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport

cs.AI · 2026-04-14 · unverdicted · none · ref 5 · internal anchor

GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on subreddit data.

Measuring Faithfulness in Chain-of-Thought Reasoning

cs.AI · 2023-07-17 · conditional · none · ref 4 · internal anchor

Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.

DrugBench: Evaluating AI Control Protocols for Medication Harm Mitigation

cs.AI · 2026-06-10 · unverdicted · none · ref 3 · internal anchor

DrugBench evaluates AI control protocols on 3,671 medical conversations for four medication harm types and finds existing protocols subvertible, proposing severity-based monitoring instead.

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

cs.AI · 2026-06-02 · conditional · none · ref 4 · 2 links · internal anchor

Empirical evaluation across 25 LLMs shows contamination detection methods achieve correct outcomes in only 201 of 335 cases, exposing failure modes from distribution shift and benchmark scale.

When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach

cs.AI · 2026-05-19 · unverdicted · none · ref 95 · internal anchor

The paper proposes Strategic Prior-data Fitted Network (SPN), an inference-time framework that adapts pretrained tabular foundation models (PFNs) to strategic manipulation by aligning predictions with approximated post-manipulation distributions via strategic in-context examples.

Reasoning Can Be Restored by Correcting a Few Decision Tokens

cs.AI · 2026-05-16 · conditional · none · ref 1 · internal anchor

Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

cs.AI · 2026-05-14 · unverdicted · none · ref 6 · internal anchor

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · none · ref 37 · internal anchor

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach

cs.AI · 2026-04-10 · unverdicted · none · ref 70 · internal anchor

A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

cs.AI · 2026-04-07 · unverdicted · none · ref 10 · internal anchor

Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.

Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents

cs.AI · 2026-04-05 · conditional · none · ref 15 · internal anchor

Persistent memory is necessary and sufficient for LLM poker agents to reach ToM levels 3-5 and use strategic deception, while agents without memory stay at level 0.

Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model

cs.AI · 2025-10-20 · unverdicted · none · ref 2 · internal anchor

Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.

Large Language Models for Market Research: A Data-augmentation Approach

cs.AI · 2024-12-26 · unverdicted · none · ref 7 · internal anchor

A data-augmentation framework for conjoint analysis integrates LLM-generated data with human responses to yield consistent, asymptotically normal estimators and reported cost savings of 24.9-79.8% in two empirical studies.

Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

cs.AI · 2024-12-03 · unverdicted · none · ref 8 · internal anchor

PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · none · ref 17 · internal anchor

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Revealing Safety-Critical Scenarios for UTM via Transformer

cs.AI · 2026-06-30 · unverdicted · none · ref 14 · internal anchor

Transformer RL with a Policy Model and Action Sampler finds UTM safety vulnerabilities 8x more efficiently than expert testing in 700-hour simulations.

PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement

cs.AI · 2026-06-21 · unverdicted · none · ref 3 · internal anchor

PAPERCLAW is a multi-agent system for end-to-end autonomous research paper generation from literature to output, with human refinement and LLM-judge evaluation showing strong results.

Nothing from Something: Can a Language Model Discover 0?

cs.AI · 2026-06-15 · unverdicted · none · ref 17 · internal anchor

Language models require explicit examples to learn zero in arithmetic but language pretraining halves the examples needed.

Q-Delta: Beyond Key-Value Associative State Evolution

cs.AI · 2026-06-07 · unverdicted · none · ref 53 · internal anchor

Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

cs.AI · 2026-06-05 · unverdicted · none · ref 56 · internal anchor

LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

cs.AI · 2026-05-28 · unverdicted · none · ref 3 · internal anchor

DenseSteer is an inference-time steering framework that improves small LLMs' accuracy on math reasoning by modulating representations toward dense reasoning patterns with fewer but higher-density steps.

From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

cs.AI · 2026-05-19 · unverdicted · none · ref 8 · internal anchor

Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but enables predictive hazard reasoning and stable corrections on BDD-X subsets.

Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

cs.AI · 2026-05-19 · unverdicted · none · ref 7 · internal anchor

A PMT-constrained LLM framework with A-TLM configuration outperforms classical imputation methods on RMSE and bias for block-wise missing disaster survey data.

Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models

cs.AI · 2026-05-17 · unverdicted · none · ref 11 · internal anchor

Authors release the multimodal WJoconde knowledge graph for French cultural heritage and a LLM-VLM pipeline that extracts and validates new triples from unstructured text and images to extend the graph.

AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

cs.AI · 2026-05-11 · unverdicted · none · ref 83 · internal anchor

Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.

Alignment as Jurisprudence

cs.AI · 2026-05-08 · unverdicted · none · ref 3 · internal anchor

Jurisprudence and AI alignment share core structures in predicting and shaping decisions by powerful actors through language specification and interpretation, enabling mutual insights.

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

cs.AI · 2026-05-07 · unverdicted · none · ref 23 · internal anchor

Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.

Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechical Systems

cs.AI · 2026-04-22 · unverdicted · none · ref 4 · internal anchor

Generative AI must be evaluated as recursive pluralist sociotechnical systems via MaSH Loops and distributional World Values Benchmarks instead of static functionalist or prescriptive tests.

Automatic Generation of Executable BPMN Models from Medical Guidelines

cs.AI · 2026-04-09 · unverdicted · none · ref 3 · internal anchor

LLM-based pipeline converts medical guidelines into executable BPMN models with over 92% per-patient decision agreement and an entropy detector for policy ambiguity.

The Cartesian Cut in Agentic AI

cs.AI · 2026-04-09 · unverdicted · none · ref 13 · internal anchor

LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

An Analysis of Artificial Intelligence Adoption in NIH-Funded Research

cs.AI · 2026-04-08 · unverdicted · none · ref 16 · internal anchor

AI makes up 15.9% of NIH-funded biomedical projects in 2025 with a 13.4% funding premium, yet 79% stay in research stages, only 14.7% reach clinical deployment, and health disparities work is just 5.7% of AI projects.

What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline

cs.AI · 2026-03-17 · unverdicted · none · ref 55 · internal anchor

The thesis presents Pino, an end-to-end pipeline that supervises reinforcement learning agents with argumentation-based normative advisors, introduces an algorithm for automatic argument extraction, and defines a mitigation strategy for norm avoidance.

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

cs.AI · 2026-05-29 · unverdicted · none · ref 3 · internal anchor

Proposes a state-space constrained emulation framework for pluralistic AI evaluation using synthetic cognitive profiles and reports instability in persona coherence under sequential and perturbed inference.

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

cs.AI · 2026-05-16 · conditional · none · ref 4 · internal anchor

MADP multi-agent pipeline with human-in-the-loop achieves 97% full automation on 955 real documents, 98.5% accuracy on ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support

cs.AI · 2026-04-20 · unverdicted · none · ref 24 · internal anchor

A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.

Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition

cs.AI · 2026-04-19 · conditional · none · ref 2 · internal anchor

Fine-tuned LLaMA3 with LoRA reaches 81.24% F1 on 18-category fine-grained medical entity recognition, beating zero-shot by 63.11% and few-shot by 35.63%.

Large Language Model-Brained GUI Agents: A Survey

cs.AI · 2024-11-27 · unverdicted · none · ref 78 · internal anchor

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

A Survey of Hallucination in Large Foundation Models

cs.AI · 2023-09-12 · accept · none · ref 113 · internal anchor

A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.

Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support

cs.AI · 2026-05-31 · unverdicted · none · ref 5 · internal anchor

A survey synthesizing LLM and MM-LLM uses in transportation operations, mobility services, and decision support while noting challenges like data heterogeneity and real-time needs.

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

cs.AI · 2025-12-14 · unreviewed · ref 2 · internal anchor
