hub

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al · 1901

26 Pith papers cite this work. Polarity classification is still indexing.

26 Pith papers citing it

browse 26 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 4

citation-polarity summary

background 3 unclear 1

representative citing papers

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

cs.LG · 2025-06-12 · unverdicted · novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.

AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

cs.HC · 2024-05-13 · conditional · novelty 8.0

AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Soft Head Selection for Injecting ICL-Derived Task Embeddings

cs.CL · 2025-07-28 · conditional · novelty 7.0

SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B parameters.

LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion

cs.LG · 2025-03-04 · unverdicted · novelty 7.0

LLM-TabLogic extracts inter-column logical constraints using LLMs and conditions a score-based latent diffusion model on them to generate synthetic tabular data that preserves those relationships.

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

cs.AI · 2024-10-06 · unverdicted · novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

Vector Retrieval with Similarity and Diversity: How Hard Is It?

cs.IR · 2024-07-05 · unverdicted · novelty 7.0

VRSD is defined by maximizing query-to-sum similarity, proven NP-complete, with a parameter-free heuristic outperforming MMR and DPP baselines.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

cs.CR · 2024-04-11 · unverdicted · novelty 7.0

GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

LRM: Large Reconstruction Model for Single Image to 3D

cs.CV · 2023-11-08 · conditional · novelty 7.0

LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

The Curse of Recursion: Training on Generated Data Makes Models Forget

cs.LG · 2023-05-27 · conditional · novelty 7.0

Use of model-generated content in training causes irreversible loss of distribution tails, termed model collapse, in VAEs, GMMs, and LLMs.

CodeT: Code Generation with Generated Tests

cs.CL · 2022-07-21 · conditional · novelty 7.0

CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.

Language models recognize dropout and Gaussian noise applied to their activations

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.

MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs

cs.LG · 2025-06-15 · unverdicted · novelty 6.0

MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.

Understanding and Mitigating Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks

cs.LG · 2025-02-06 · unverdicted · novelty 6.0

Empirical study across 10 tasks showing bias inheritance from LLM-augmented data harms related downstream performance, with three misalignment factors and three mitigation strategies identified.

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

cs.CV · 2024-08-19 · unverdicted · novelty 6.0

LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

cs.CL · 2023-12-10 · unverdicted · novelty 6.0

ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.

GPT-Driver: Learning to Drive with GPT

cs.CV · 2023-10-02 · conditional · novelty 6.0

GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

cs.CL · 2023-08-14 · conditional · novelty 6.0

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

cs.CV · 2023-06-26 · accept · novelty 6.0

A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.

Large Language Models Are Human-Level Prompt Engineers

cs.LG · 2022-11-03 · unverdicted · novelty 6.0

APE generates instruction candidates via LLM and selects the best by zero-shot performance of a second LLM, matching or beating human prompts on 19 of 24 NLP tasks.

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

cs.CL · 2022-10-17 · accept · novelty 6.0

Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

Steered Generation via Gradient-Based Optimization on Sparse Query Features

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

Prototype-Based Sparse Steering decomposes query activations with SAEs and optimizes sparse features via gradients to steer LLM outputs toward specific behaviors.

Interactive Critique-Revision Training for Reliable Structured LLM Generation

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

DPA-GRPO trains a generator-verifier pair via group-relative policy optimization on paired counterfactual actions, improving structured output accuracy on TaxCalcBench over zero-shot and generator-only baselines.

citing papers explorer

Showing 26 of 26 citing papers.

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods cs.LG · 2025-06-12 · unverdicted · none · ref 3
Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments cs.HC · 2024-05-13 · conditional · none · ref 2
AgentClinic is a multimodal agent benchmark demonstrating that LLM diagnostic accuracy on MedQA drops to below one-tenth in sequential clinical simulations, with Claude-3.5 leading and large tool-use differences across models.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding cs.LG · 2026-05-19 · unverdicted · none · ref 2
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
Soft Head Selection for Injecting ICL-Derived Task Embeddings cs.CL · 2025-07-28 · conditional · none · ref 1
SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B parameters.
LLM-TabLogic: Preserving Inter-Column Logical Relationships in Synthetic Tabular Data via Prompt-Guided Latent Diffusion cs.LG · 2025-03-04 · unverdicted · none · ref 4
LLM-TabLogic extracts inter-column logical constraints using LLMs and conditions a score-based latent diffusion model on them to generate synthetic tabular data that preserves those relationships.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark cs.AI · 2024-10-06 · unverdicted · none · ref 7
PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
Vector Retrieval with Similarity and Diversity: How Hard Is It? cs.IR · 2024-07-05 · unverdicted · none · ref 2
VRSD is defined by maximizing query-to-sum similarity, proven NP-complete, with a parameter-free heuristic outperforming MMR and DPP baselines.
LLM Agents can Autonomously Exploit One-day Vulnerabilities cs.CR · 2024-04-11 · unverdicted · none · ref 4
GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention cs.CL · 2024-04-10 · conditional · none · ref 4
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
LRM: Large Reconstruction Model for Single Image to 3D cs.CV · 2023-11-08 · conditional · none · ref 3
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
The Curse of Recursion: Training on Generated Data Makes Models Forget cs.LG · 2023-05-27 · conditional · none · ref 2
Use of model-generated content in training causes irreversible loss of distribution tails, termed model collapse, in VAEs, GMMs, and LLMs.
CodeT: Code Generation with Generated Tests cs.CL · 2022-07-21 · conditional · none · ref 2
CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.
Language models recognize dropout and Gaussian noise applied to their activations cs.AI · 2026-04-19 · unverdicted · none · ref 2
Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on LLMs cs.LG · 2025-06-15 · unverdicted · none · ref 4
MaskPro learns categorical distributions over groups of M weights to generate exact (N:M) sparsity via N-way sampling without replacement and stabilizes training with a moving average tracker of loss residuals.
Understanding and Mitigating Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks cs.LG · 2025-02-06 · unverdicted · none · ref 7
Empirical study across 10 tasks showing bias inheritance from LLM-augmented data harms related downstream performance, with three misalignment factors and three mitigation strategies identified.
LongVILA: Scaling Long-Context Visual Language Models for Long Videos cs.CV · 2024-08-19 · unverdicted · none · ref 4
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models cs.CL · 2023-12-10 · unverdicted · none · ref 2
ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.
GPT-Driver: Learning to Drive with GPT cs.CV · 2023-10-02 · conditional · none · ref 4
GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate cs.CL · 2023-08-14 · conditional · none · ref 2
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning cs.CV · 2023-06-26 · accept · none · ref 3
A new dataset of 400k visual instructions including negative examples at three semantic levels reduces hallucinations in models like MiniGPT-4 when used for fine-tuning while improving benchmark performance.
Large Language Models Are Human-Level Prompt Engineers cs.LG · 2022-11-03 · unverdicted · none · ref 7
APE generates instruction candidates via LLM and selects the best by zero-shot performance of a second LLM, matching or beating human prompts on 19 of 24 NLP tasks.
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them cs.CL · 2022-10-17 · accept · none · ref 2
Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
Steered Generation via Gradient-Based Optimization on Sparse Query Features cs.LG · 2026-05-21 · unverdicted · none · ref 4
Prototype-Based Sparse Steering decomposes query activations with SAEs and optimizes sparse features via gradients to steer LLM outputs toward specific behaviors.
Interactive Critique-Revision Training for Reliable Structured LLM Generation cs.LG · 2026-05-08 · unverdicted · none · ref 1
DPA-GRPO trains a generator-verifier pair via group-relative policy optimization on paired counterfactual actions, improving structured output accuracy on TaxCalcBench over zero-shot and generator-only baselines.
RAP: Runtime Adaptive Pruning for LLM Inference cs.LG · 2025-05-22 · unverdicted · none · ref 5
RAP is a reinforcement learning framework for runtime-adaptive pruning of LLMs that jointly optimizes model weights and KV-cache usage under varying memory budgets.
MUR: Momentum Uncertainty guided Reasoning for Large Language Models cs.CL · 2025-07-20 · unreviewed · ref 2

Language models are few-shot learners

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer