Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
Language Models are Few-Shot Learners
Canonical reference. 76% of citing Pith papers cite this work as background.
abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
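The few-shot setting described in the abstract, where the task and its demonstrations are given purely as text with no gradient updates, can be illustrated with a minimal sketch. The function name `build_few_shot_prompt`, the `Input:`/`Output:` layout, and the example words are assumptions made for illustration; they are not the prompt format used in the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task_description: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble a plain-text few-shot prompt (hypothetical layout)."""
    lines = [task_description, ""]
    # Each demonstration is a solved example shown to the model as text.
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final query is left open so the model completes the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Unscramble the letters to form an English word.",
    [("pplea", "apple"), ("nabana", "banana")],
    "rgaep",
)
print(prompt)
```

The resulting string would be sent to a language model as-is; the model's weights stay fixed, and the only task-specific signal is the text of the prompt.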
representative citing papers
StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.
NLP papers commonly report annotator recruitment, expertise, and volume but frequently omit training, compensation, socio-demographics, adjudication, and agreement metrics, with reporting improving over time yet remaining uneven across tasks and venues.
Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
HCMS partitions multi-head attention into chunks and pipelines them across dual CUDA streams to overlap communication and computation, delivering 10-17.5% speedup over Ulysses for 31K-56K token sequences.
MLFMs combine masking with continuous flows to scale flow-based language models to reasoning and instruction-following tasks on GSM8K and MT-Bench.
ICL in LLMs shows a sharp ceiling on categorical distributions for high-cardinality tabular data, failing to reproduce rare classes despite examples, while numerical fidelity improves.
Establishes a quadratic lower bound on query complexity for sampling from large classes of distributions given approximate density oracles, answers an open question on optimality of random walks, and shows circumvention for bounded classes as an abstraction of TTT.
APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.
RL with chrF reward trains LLMs to better utilize in-context linguistic knowledge for zero-shot translation of unseen languages, outperforming ICL and SFT.
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
The authors introduce a three-level formality spectrum (informal, casual, formal) and the 3LF dataset to correct supervision misalignment in formality transfer, reporting large gains in informal-to-formal performance on models including GPT variants.
citing papers explorer
- Evaluating Large Language Models in Scientific Discovery
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
- Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on subreddit data.
- Measuring Faithfulness in Chain-of-Thought Reasoning
Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.
- DrugBench: Evaluating AI Control Protocols for Medication Harm Mitigation
DrugBench evaluates AI control protocols on 3,671 medical conversations for four medication harm types and finds existing protocols subvertible, proposing severity-based monitoring instead.
- The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Empirical evaluation across 25 LLMs shows contamination detection methods achieve correct outcomes in only 201 of 335 cases, exposing failure modes from distribution shift and benchmark scale.
- When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach
The paper proposes Strategic Prior-data Fitted Network (SPN), an inference-time framework that adapts pretrained tabular foundation models (PFNs) to strategic manipulation by aligning predictions with approximated post-manipulation distributions via strategic in-context examples.
- Reasoning Can Be Restored by Correcting a Few Decision Tokens
Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.
- Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
- ZAYA1-8B Technical Report
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
- In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach
A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
- When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
Lightweight numerical bandits on text embeddings match or exceed LLM accuracy in contextual bandits at a fraction of the cost, with an embedding-based diagnostic to choose between them.
- Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents
Persistent memory is necessary and sufficient for LLM poker agents to reach ToM levels 3-5 and use strategic deception, while agents without memory stay at level 0.
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
Saber improves both speed and accuracy of diffusion language models on code generation by dynamically adjusting unmasking steps and reverting low-confidence tokens via backtracking.
- Large Language Models for Market Research: A Data-augmentation Approach
A data-augmentation framework for conjoint analysis integrates LLM-generated data with human responses to yield consistent, asymptotically normal estimators and reported cost savings of 24.9-79.8% in two empirical studies.
- Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies
PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.
- A Roadmap to Pluralistic Alignment
The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.
- Revealing Safety-Critical Scenarios for UTM via Transformer
Transformer RL with a Policy Model and Action Sampler finds UTM safety vulnerabilities 8x more efficiently than expert testing in 700-hour simulations.
- PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement
PAPERCLAW is a multi-agent system for end-to-end autonomous research paper generation from literature to output, with human refinement and LLM-judge evaluation showing strong results.
- Nothing from Something: Can a Language Model Discover 0?
Language models require explicit examples to learn zero in arithmetic but language pretraining halves the examples needed.
- Q-Delta: Beyond Key-Value Associative State Evolution
Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.
- Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.
- DenseSteer: Steering Small Language Models towards Dense Math Reasoning
DenseSteer is an inference-time steering framework that improves small LLMs' accuracy on math reasoning by modulating representations toward dense reasoning patterns with fewer but higher-density steps.
- From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning
Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but enables predictive hazard reasoning and stable corrections on BDD-X subsets.
- Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses
A PMT-constrained LLM framework with A-TLM configuration outperforms classical imputation methods on RMSE and bias for block-wise missing disaster survey data.
- Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models
The authors release the multimodal WJoconde knowledge graph for French cultural heritage and an LLM-VLM pipeline that extracts and validates new triples from unstructured text and images to extend the graph.
- AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
- Alignment as Jurisprudence
Jurisprudence and AI alignment share core structures in predicting and shaping decisions by powerful actors through language specification and interpretation, enabling mutual insights.
- Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning
Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning benchmarks.
- Measuring the Machine: Evaluating Generative AI as Pluralist Sociotechical Systems
Generative AI must be evaluated as recursive pluralist sociotechnical systems via MaSH Loops and distributional World Values Benchmarks instead of static functionalist or prescriptive tests.
- Automatic Generation of Executable BPMN Models from Medical Guidelines
LLM-based pipeline converts medical guidelines into executable BPMN models with over 92% per-patient decision agreement and an entropy detector for policy ambiguity.
- The Cartesian Cut in Agentic AI
LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
- An Analysis of Artificial Intelligence Adoption in NIH-Funded Research
AI makes up 15.9% of NIH-funded biomedical projects in 2025 with a 13.4% funding premium, yet 79% stay in research stages, only 14.7% reach clinical deployment, and health disparities work is just 5.7% of AI projects.
- What if Pinocchio Were a Reinforcement Learning Agent: A Normative End-to-End Pipeline
The thesis presents Pino, an end-to-end pipeline that supervises reinforcement learning agents with argumentation-based normative advisors, introduces an algorithm for automatic argument extraction, and defines a mitigation strategy for norm avoidance.
- A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI
Proposes a state-space constrained emulation framework for pluralistic AI evaluation using synthetic cognitive profiles and reports instability in persona coherence under sequential and perturbed inference.
- MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop
MADP multi-agent pipeline with human-in-the-loop achieves 97% full automation on 955 real documents, 98.5% accuracy on ablation set, and 69-70% reductions in FTE, energy, and emissions versus manual processing.
- Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support
A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.
- Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition
Fine-tuned LLaMA3 with LoRA reaches 81.24% F1 on 18-category fine-grained medical entity recognition, beating zero-shot by 63.11% and few-shot by 35.63%.
- Large Language Model-Brained GUI Agents: A Survey
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
- A Survey of Hallucination in Large Foundation Models
A survey classifying hallucination phenomena specific to large foundation models, establishing evaluation criteria, examining mitigation strategies, and discussing future directions.
- Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support
A survey synthesizing LLM and MM-LLM uses in transportation operations, mobility services, and decision support while noting challenges like data heterogeneity and real-time needs.
- MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents