A Survey on In-context Learning
Pith reviewed 2026-05-12 12:53 UTC · model grok-4.3
The pith
In-context learning allows large language models to make predictions from prompts that include a few task examples without parameter updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In-context learning has emerged as a new paradigm for natural language processing in which large language models make predictions based on contexts augmented with a few examples, and this survey organizes the techniques, applications, and open problems to facilitate further research into how such models extrapolate abilities from limited context.
What carries the argument
The formal definition of in-context learning as the process by which large language models generate outputs conditioned on a prompt that contains a small number of demonstration examples.
Load-bearing premise
The rapidly expanding body of ICL literature can be usefully organized and summarized within the scope and selection criteria of a single survey paper.
What would settle it
Publication of a substantial set of subsequent papers that introduce major ICL techniques or challenges absent from the survey's categories would show the organization does not capture the field's current state.
read the original abstract
With the increasing capabilities of large language models (LLMs), in-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP), where LLMs make predictions based on contexts augmented with a few examples. It has been a significant trend to explore ICL to evaluate and extrapolate the ability of LLMs. In this paper, we aim to survey and summarize the progress and challenges of ICL. We first present a formal definition of ICL and clarify its correlation to related studies. Then, we organize and discuss advanced techniques, including training strategies, prompt designing strategies, and related analysis. Additionally, we explore various ICL application scenarios, such as data engineering and knowledge updating. Finally, we address the challenges of ICL and suggest potential directions for further research. We hope that our work can encourage more research on uncovering how ICL works and improving ICL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys in-context learning (ICL) as an emerging paradigm for large language models (LLMs) in NLP. It begins with a formal definition of ICL and its relation to related concepts, then organizes advanced techniques (training strategies, prompt design, and analyses), reviews applications (e.g., data engineering and knowledge updating), and concludes by addressing challenges and suggesting future research directions.
Significance. A structured survey on ICL would be moderately significant given the field's rapid expansion, as it could consolidate literature on techniques, applications, and open problems to guide researchers. The logical organization from definition through techniques and applications to challenges supports its utility as a reference, though its impact depends on the depth and balance of coverage across the cited works.
minor comments (3)
- [Introduction/Abstract] The abstract states that the paper clarifies ICL's correlation to related studies, but the provided outline does not specify how overlaps with few-shot prompting or meta-learning are delineated; adding a dedicated subsection or table comparing these would improve clarity.
- The discussion of prompt designing strategies is listed as a core technique area, yet without explicit criteria for paper selection or inclusion of quantitative benchmarks across methods, the summary risks appearing selective; a methods section detailing search terms and coverage would strengthen the survey.
- [Challenges and Future Directions] In the challenges section, the suggestions for future directions are high-level; grounding each with references to specific recent papers or open benchmarks would make the recommendations more actionable.
Simulated Author's Rebuttal
We thank the referee for their review and recommendation of minor revision. The positive assessment of the survey's logical organization, coverage of definitions, techniques, applications, and challenges is appreciated. As no specific major comments were provided, we have no point-by-point revisions to address.
Circularity Check
No significant circularity: literature survey without derivations or predictions
full rationale
This is a survey paper whose sole claim is to organize and summarize the existing ICL literature. It presents a formal definition of ICL and discusses techniques, applications, and challenges drawn from cited works, but advances no original equations, fitted parameters, predictions, or derivations. No load-bearing step reduces to self-definition, self-citation chains, or renaming of results. Self-citations (if any) support only descriptive coverage and are not invoked to justify uniqueness theorems or force technical conclusions. The paper is self-contained as an editorial synthesis against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/LawOfExistence.leanlaw_of_existence unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
The key idea of in-context learning is to learn from analogy. Figure 1 gives an example that describes how language models make decisions via ICL.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 60 Pith papers
-
From Context to Skills: Can Language Models Learn from Context Skillfully?
Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.
-
Gradient-Based Program Synthesis with Neurally Interpreted Languages
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
-
Pro$^2$Assist: Continuous Step-Aware Proactive Assistance with Multimodal Egocentric Perception for Long-Horizon Procedural Tasks
Pro²Assist uses multimodal egocentric perception from AR glasses to track fine-grained progress in long-horizon procedural tasks and deliver timely proactive assistance, outperforming baselines by over 21% in action u...
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
Deformation-based In-Context Learning for Point Cloud Understanding
DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.
-
Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings
A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
-
ProAgent: Harnessing On-Demand Sensory Contexts for Proactive LLM Agent Systems in the Wild
ProAgent uses on-demand tiered perception and context-aware LLM reasoning to deliver proactive assistance on AR glasses, achieving up to 27.7% higher prediction accuracy and 20.5% lower false detections than baselines.
-
When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models
CREST-Search is a red-teaming framework that crafts seemingly benign search queries to induce unsafe citations from web-augmented LLMs, backed by a new WebSearch-Harm dataset for fine-tuning a specialized attacker model.
-
Soft Head Selection for Injecting ICL-Derived Task Embeddings
SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B pa...
-
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
-
CodeMind: Evaluating Large Language Models for Code Reasoning
CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
Experiments reveal that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
-
Personal Visual Context Learning in Large Multimodal Models
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
-
Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging
WALDO improves zero-shot anomaly localization in medical imaging by selecting reference distributions via entropy-weighted Sliced Wasserstein distances and Goldilocks zone sampling, yielding a 19% relative gain on bra...
-
Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.
-
RAQG-QPP: Query Performance Prediction with Retrieved Query Variants and Retrieval Augmented Query Generation
Retrieved query variants from logs combined with LLM-augmented generation improve unsupervised QPP accuracy by up to 30% for neural rankers on TREC DL'19 and DL'20.
-
Dual-Enhancement Product Bundling: Bridging Interactive Graph and Large Language Model
A graph-to-text paradigm with Dynamic Concept Binding Mechanism integrates interactive graphs and LLMs to recommend product bundles, yielding 6.3%-26.5% gains over baselines on POG, POG_dense, and Steam datasets.
-
Learning to Adapt: In-Context Learning Beyond Stationarity
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
-
Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset
OpenCEM is the first open-source digital twin that integrates unstructured contextual information with quantitative microgrid dynamics to enable context-aware energy management.
-
Measuring Representation Robustness in Large Language Models for Geometry
LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
-
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
-
A Rule-Aware Prompt Framework for Structured Numeric Reasoning in Cyber-Physical Systems
A rule-aware modular prompt framework enables LLMs to perform structured numeric reasoning on power grid data by separating rules from normalized deviations, improving anomaly detection consistency and reducing token ...
-
SnapAudit: Active Auditing of Differentially Private In-Context Learning via Snapshot-Based Simulation
SnapAudit decomposes DP-ICL into a deterministic snapshot stage and a stochastic noise stage, using bootstrap simulation to achieve 80-200x faster auditing and exposing privacy bound violations in existing Gaussian an...
-
Localizing Task Recognition and Task Learning in In-Context Learning via Attention Head Analysis
A new framework using Task Subspace Logit Attribution localizes attention heads specialized for task recognition and task learning in in-context learning, showing they align and rotate hidden states within a task subspace.
-
Artificial Phantasia: Emergent Mental Imagery in Large Language Models
LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
BugScope: Learn to Find Bugs Like Human
BugScope structures LLM bug detection into three human-mirroring steps and distills guidelines from examples, reaching 0.87 F1 on 33 real bugs while outperforming Claude and Cursor tools and uncovering 184 new issues ...
-
In-depth Analysis of Graph-based RAG in a Unified Framework
A unified framework and large-scale comparison of graph-based RAG methods on QA tasks yields new high-performing variants obtained by recombining existing components.
-
Judge a Book by its Cover: Investigating Multi-Modal LLMs for Multi-Page Handwritten Document Transcription
Introduces OCR+PAGE-1 and OCR+PAGE-N prompting strategies that improve zero-shot multi-page handwritten document transcription by sharing context across pages.
-
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datase...
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Large Language Models are not Fair Evaluators
LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better matc...
-
LLMs with in-context learning for Algorithmic Theoretical Physics
Frontier LLMs with in-context learning and CAS integration solve most algorithmic tasks in theoretical physics when supplied with worked examples.
-
When Context Sticks: Studying Interference in In-Context Learning
In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.
-
Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)
SSAS improves LLM sentiment prediction consistency and data quality by up to 30% on three review datasets via syntactic and semantic context assessment summarization.
-
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.
-
Context-Guided Decompilation: A Step Towards Re-executability
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.
-
Online In-Context Distillation for Low-Resource Vision Language Models
Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
-
SSA: Improving Performance With a Better Scoring Function
Replacing Softmax with Scaled Signed Averaging in transformer attention improves generalization under distribution shifts for in-context learning and boosts results on NLP benchmarks.
-
Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question
Question interpretation diversity outperforms model diversity for LLM ensembling on binary QA tasks using majority voting.
-
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models
LENS is a new multi-level benchmark dataset for evaluating MLLMs on perception-to-reasoning tasks using the same images across all levels with recent social media content.
-
Exploring Cross-lingual Latent Transplantation: Mutual Opportunities and Open Challenges
XTransplant empirically shows that cross-lingual latent transplantation yields mutual benefits for multilingual capability and cultural adaptability in LLMs, especially low-resource ones, while revealing underutilized...
-
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.
-
The Pedagogy of AI Mistakes: Fostering Higher-Order Thinking
AI mistakes can be structured into course activities to foster higher-order thinking, metacognition, and AI literacy in higher education.
-
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
-
Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition
Fine-tuned LLaMA3 with LoRA reaches 81.24% F1 on 18-category fine-grained medical entity recognition, beating zero-shot by 63.11% and few-shot by 35.63%.
-
Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs
wSSAS is a two-phase deterministic framework that uses hierarchical text organization and SNR-based feature prioritization to improve clustering integrity, categorization accuracy, and reproducibility when applying LL...
-
LLMs Underperform Graph-Based Parsers on Supervised Relation Extraction for Complex Graphs
Graph-based parsers outperform LLMs on supervised relation extraction as linguistic graph complexity grows with more relations per document.
-
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
-
Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption
LLM watermarking adoption is limited by misaligned stakeholder incentives; incentive-aligned approaches such as in-context watermarking can enable practical use in targeted domains like education and peer review.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Large Language Model-Brained GUI Agents: A Survey
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
-
Agent AI: Surveying the Horizons of Multimodal Interaction
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.
-
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
-
The Rise and Potential of Large Language Model Based Agents: A Survey
The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
-
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Scaling a multilingual vision-language model in size and training breadth yields new state-of-the-art results on over 25 benchmarks plus emerging abilities in counting and multilingual detection.
-
Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
A survey synthesizing challenges, system architectures, model optimizations, deployment methods, and resource management techniques for large language model inference at the network edge.
-
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.
Reference graph
Works this paper leans on
-
[1]
Language models are few-shot learners. In Ad- vances in Neural Information Processing Systems 33: Annual Conference on Neural Information Process- ing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual. Marc-Etienne Brunet, Ashton Anderson, and Richard S. Zemel. 2023. ICL markup: Structuring in- context learning using soft-token tags. CoRR, abs/2312...
-
[2]
Data distributional properties drive emergent in-context learning in transformers. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Sys- tems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. Anwoy Chatterjee, Eshaan Tanwar, Subhabrata Dutta, and Tanmoy Chakraborty. 2024. L...
-
[3]
Transformers learn higher-order optimization methods for in-context learning: A study with linear models. CoRR, abs/2310.17086. Yeqi Gao, Zhao Song, and Shenghao Xie. 2023. In- context learning for attention scheme: from single softmax regression to multiple softmax regression via a tensor trick. CoRR, abs/2307.02419. Shivam Garg, Dimitris Tsipras, Percy ...
-
[4]
Association for Computational Linguistics. Michael Hahn and Navin Goyal. 2023. A theory of emergent in-context learning as implicit structure induction. CoRR, abs/2303.07971. Chi Han, Ziqi Wang, Han Zhao, and Heng Ji. 2023a. Explaining emergent in-context learning as kernel regression. Preprint, arXiv:2305.12766. Xiaochuang Han, Daniel Simig, Todor Mihayl...
-
[5]
Self-demos: Eliciting out-of-demonstration generalizability in large language models. CoRR, abs/2404.00884. 11 Clyde Highmore. 2024. In-context learning in large language models: A comprehensive survey. Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. 2023. Instruction induction: From few examples to natural language task descriptions. In Proceed...
-
[6]
arXiv preprint arXiv:2304.09960 , year=
The dual form of neural networks revisited: Connecting test time predictions to training patterns via spotlights of attention. In International Confer- ence on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA , volume 162 of Proceedings of Machine Learning Research, pages 9639–9659. PMLR. Srinivasan Iyer, Xi Victoria Lin, Ramakanth P...
-
[7]
Association for Computational Linguistics. Arvind Mahankali, Tatsunori B. Hashimoto, and Tengyu Ma. 2023. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. CoRR, abs/2307.03576. Costas Mavromatis, Balasubramaniam Srinivasan, Zhengyuan Shen, Jiani Zhang, Huzefa Rangwala, Christos Faloutsos, and...
-
[8]
Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics. Abulhair Saparov and He He. 2023. Language models are greedy reasoners...
work page 2022
-
[9]
Do pretrained transformers learn in-context by gradient descent? Preprint, arXiv:2310.08540. Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush V osoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. 2022. Language models are multilingual chain-of-thought reasoners. ArXiv preprint, abs/2210.03057. Weijia Shi, Sewo...
-
[10]
An information-theoretic approach to prompt engineering without ground truth labels. In Proc. of ACL, pages 819–862, Dublin, Ireland. Association for Computational Linguistics. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond ...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Pretraining data mixtures enable narrow model selection capabilities in transformer models. CoRR, abs/2311.00871. Jinghan Yang, Shuming Ma, and Furu Wei. 2023a. Auto-icl: In-context learning without human supervi- sion. CoRR, abs/2311.09263. Zhe Yang, Damai Dai, Peiyi Wang, and Zhifang Sui. 2023b. Not all demonstration examples are equally beneficial: Rew...
-
[12]
Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron C
OpenReview.net. Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron C. Courville, Behnam Neyshabur, and Hanie Sedghi
-
[13]
Teaching algorithmic reasoning via in-context learning.arXiv preprint arXiv:2211.09066, 2022
Teaching algorithmic reasoning via in-context learning. CoRR, abs/2211.09066. Wangchunshu Zhou, Yuchen Eleanor Jiang, Ryan Cot- terell, and Mrinmaya Sachan. 2023b. Efficient prompting via dynamic in-context learning. CoRR, abs/2305.11170. Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023c. Large l...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.