When Context Sticks: Studying Interference in In-Context Learning
Pith reviewed 2026-05-08 08:28 UTC · model grok-4.3
The pith
Earlier examples in a prompt continue to interfere with a transformer's adaptation to later tasks during in-context learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using controlled sweeps of linear examples followed by quadratic examples in prompts, the study demonstrates three effects: more initial linear examples increase error on quadratic predictions; additional quadratic examples decrease that error, with diminishing returns; and sequential training on the target function class yields the fastest recovery, while random training yields the weakest resilience to interference.
What carries the argument
Persistent interference from preceding context, measured as the degradation in prediction accuracy when switching between linear and quadratic regression tasks in the prompt.
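A minimal formalization of this measure, on our reading of the setup (the notation below is ours, not the paper's): write $\mathcal{P}_{m,k}$ for a prompt with $m$ linear examples followed by $k$ quadratic examples, and measure interference as the excess squared error on a quadratic query point $x_q$:

$$
\Delta(m,k) \;=\; \mathbb{E}\Big[\big(\hat f(x_q \mid \mathcal{P}_{m,k}) - f_{\mathrm{quad}}(x_q)\big)^2\Big] \;-\; \mathbb{E}\Big[\big(\hat f(x_q \mid \mathcal{P}_{0,k}) - f_{\mathrm{quad}}(x_q)\big)^2\Big]
$$

On this reading, the paper's core claims are that $\Delta(m,k)$ grows with $m$ and that the first term falls in $k$ with diminishing returns.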
If this is right
- More preceding examples from one function class will increase error when predicting the other class.
- Error reduction from corrective examples slows after the first few additions.
- Sequential training curricula produce models that recover quickest from context interference.
- Random training curricula result in models with the poorest robustness to task switches.
Where Pith is reading between the lines
- In practice, prompt design should account for example ordering to reduce the influence of earlier examples on later tasks.
- These dynamics might extend to natural language tasks, suggesting that long context windows could accumulate unwanted biases.
- Alternative training methods could be explored to enhance resilience beyond the tested curricula.
Load-bearing premise
That results from these synthetic regression tasks with linear and quadratic functions generalize to the interference effects in real-world in-context learning with language models.
What would settle it
Finding that the number of preceding examples has no systematic effect on prediction error, or that all curricula show identical recovery rates, when tested on the same task switches or on actual language modeling prompts.
Original abstract
This paper investigates context stickiness in in-context learning (ICL), a phenomenon where earlier examples in a prompt interfere with a transformer's ability to adapt to later tasks. Using synthetic regression tasks over linear and quadratic functions, we examine how models trained under sequential, mixed, and random curricula handle abrupt task switches during inference. By sweeping over structured combinations of misleading linear examples followed by recovery quadratic examples, we quantify how prior context biases prediction error and how quickly models realign. Our results show strong evidence of persistent interference: more preceding linear examples reliably degrade quadratic predictions, while additional quadratic examples reduce error but with diminishing returns. We further find that training curricula significantly modulate resilience, with sequential training on the target function class yielding the fastest recovery, and surprisingly, random training producing the least robust behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study of context stickiness in in-context learning (ICL) using transformers trained on synthetic linear and quadratic regression tasks under sequential, mixed, and random curricula. It examines performance on inference prompts with abrupt switches from misleading linear examples to target quadratic examples, claiming persistent interference (additional preceding linear examples degrade quadratic predictions), diminishing returns from recovery examples, and curriculum-dependent adaptation speed (sequential best, random worst).
Significance. The controlled synthetic experiments offer a clean way to isolate and quantify interference effects in ICL, which could help explain transformer adaptation mechanisms if the patterns prove robust. The curriculum comparisons are a useful angle for training design. However, the absence of any scaling or natural-language validation substantially limits the significance for understanding ICL in actual large language models.
major comments (3)
- [§3 (Experimental Setup)] The description of model training, data generation, and evaluation lacks key reproducibility details, including transformer architecture (layers, heads, embedding size), optimization hyperparameters, exact prompt lengths, number of independent seeds/runs, and how error is aggregated. Without these, the 'strong evidence' of interference cannot be verified or reproduced.
- [§4 (Results)] Figures and tables reporting degradation with more linear examples and recovery curves do not include error bars, standard deviations, or any statistical significance tests. This undermines the reliability of claims such as 'reliably degrade' and 'diminishing returns', since variance in synthetic regression could explain the trends.
- [§5 (Discussion) and abstract] The central claim that the work studies interference 'in in-context learning' and provides evidence relevant to transformers/LLMs rests on the untested assumption that linear/quadratic synthetic tasks with artificial switches capture the interference dynamics of high-dimensional, semantically structured natural-language ICL. No bridging experiments, scaling studies, or comparisons to real LLM prompts are provided.
minor comments (2)
- [Introduction] The introduction introduces 'context stickiness' informally; a short formal definition or equation quantifying the interference (e.g., error as a function of prefix length) would improve clarity.
- [Abstract and §4] The 'surprisingly' qualifier on the random-curriculum result is not supported by a direct comparison figure or table reference, making the surprise claim harder to evaluate.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments have prompted us to improve the reproducibility and statistical rigor of the manuscript. We address each major comment below and have made corresponding revisions.
Point-by-point responses
Referee: [§3 (Experimental Setup)] The description of model training, data generation, and evaluation lacks key reproducibility details, including transformer architecture (layers, heads, embedding size), optimization hyperparameters, exact prompt lengths, number of independent seeds/runs, and how error is aggregated. Without these, the 'strong evidence' of interference cannot be verified or reproduced.
Authors: We agree that these details were insufficient in the original submission. The revised manuscript expands Section 3 and adds Appendix A with full specifications: a 4-layer transformer with 8 attention heads and embedding size 256; Adam optimizer with learning rate 1e-4, batch size 64, and 50k training steps; prompts consisting of 10-20 examples (approximately 200-400 tokens); results aggregated as mean over 5 independent random seeds with standard deviation; and data generation procedures for linear/quadratic functions. These additions enable full reproduction of all experiments. revision: yes
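A minimal sketch of the configuration and prompt construction described in this response, assuming the details above; the names and structure below are hypothetical, not taken from the paper's code.

```python
# Hypothetical reconstruction of the experimental setup from the authors'
# response; all identifiers are ours, not the paper's released code.
from dataclasses import dataclass

import numpy as np


@dataclass
class ExperimentConfig:
    n_layers: int = 4         # transformer layers (revised Section 3)
    n_heads: int = 8          # attention heads
    d_embed: int = 256        # embedding size
    lr: float = 1e-4          # Adam learning rate
    batch_size: int = 64
    train_steps: int = 50_000
    n_seeds: int = 5          # results reported as mean +/- std over seeds


def make_switch_prompt(m_linear, k_quadratic, rng):
    """Build one inference prompt: m misleading linear examples followed by
    k recovery examples from the target quadratic function class."""
    a, b = rng.normal(size=2)   # linear task parameters: y = a*x + b
    c = rng.normal()            # quadratic task parameter: y = c*x**2
    xs = rng.uniform(-1.0, 1.0, size=m_linear + k_quadratic)
    is_prefix = np.arange(xs.size) < m_linear
    ys = np.where(is_prefix, a * xs + b, c * xs**2)
    return np.stack([xs, ys], axis=1)   # (m + k, 2) array of (x, y) pairs


rng = np.random.default_rng(seed=0)
prompt = make_switch_prompt(m_linear=10, k_quadratic=5, rng=rng)
print(prompt.shape)  # (15, 2)
```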
Referee: [§4 (Results)] Figures and tables reporting degradation with more linear examples and recovery curves do not include error bars, standard deviations, or any statistical significance tests. This undermines the reliability of claims such as 'reliably degrade' and 'diminishing returns', since variance in synthetic regression could explain the trends.
Authors: We acknowledge the omission of variability measures. All figures in the revised Section 4 now display error bars as mean ± one standard deviation across the 5 seeds. We have added a statistical analysis subsection reporting paired t-tests comparing conditions with varying numbers of linear examples (all p < 0.01 for the reported degradations) and confirming diminishing returns in recovery. The text has been updated to reference these statistics when stating trends. revision: yes
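A minimal sketch of the variability reporting described in this response, assuming per-seed mean squared errors are available per condition; the numbers below are illustrative, not the paper's results.

```python
# Illustrative mean +/- std reporting and paired t-test across seeds,
# mirroring the revised statistical subsection; data here is made up.
import numpy as np
from scipy import stats

per_seed_mse = {  # hypothetical MSE on quadratic queries, 5 seeds each
    "5 linear prefix":  np.array([0.12, 0.15, 0.11, 0.14, 0.13]),
    "10 linear prefix": np.array([0.21, 0.24, 0.20, 0.23, 0.22]),
}

for cond, errs in per_seed_mse.items():
    print(f"{cond}: {errs.mean():.3f} +/- {errs.std(ddof=1):.3f}")

# Paired t-test: does a longer linear prefix yield higher error?
t_stat, p_val = stats.ttest_rel(per_seed_mse["10 linear prefix"],
                                per_seed_mse["5 linear prefix"])
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")
```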
Referee: [§5 (Discussion) and abstract] The central claim that the work studies interference 'in in-context learning' and provides evidence relevant to transformers/LLMs rests on the untested assumption that linear/quadratic synthetic tasks with artificial switches capture the interference dynamics of high-dimensional, semantically structured natural-language ICL. No bridging experiments, scaling studies, or comparisons to real LLM prompts are provided.
Authors: We agree that the synthetic setting does not automatically generalize to natural-language ICL and have revised the abstract and Section 5 to explicitly frame the work as a controlled study of interference mechanisms rather than a direct claim about LLMs. The discussion now includes a dedicated limitations paragraph acknowledging the gap and outlining why synthetic tasks enable isolation of effects not feasible in high-dimensional language data. We have not added new scaling or LLM experiments, as they fall outside the paper's scope of providing mechanistic insights via precise synthetic controls. revision: partial
Circularity Check
No circularity: empirical results from synthetic experiments
Full rationale
This is a purely empirical paper that reports controlled experiments on synthetic linear/quadratic regression tasks with different training curricula and abrupt task switches. No mathematical derivation chain, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methods. The central observations (persistent interference, curriculum effects) are direct measurements from the experimental setup rather than reductions of outputs to inputs by construction. The study is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic regression tasks over linear and quadratic functions capture key aspects of interference in transformer in-context learning.
invented entities (1)
- context stickiness (no independent evidence)