When Context Sticks: Studying Interference in In-Context Learning
Pith reviewed 2026-05-08 08:28 UTC · model grok-4.3
The pith
Earlier examples in a prompt continue to interfere with a transformer's adaptation to later tasks during in-context learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using controlled sweeps of linear examples followed by quadratic examples in prompts, the study demonstrates three effects: more initial linear examples increase error on quadratic predictions; additional quadratic examples decrease that error, with diminishing returns; and sequential training on the target function class yields the fastest recovery, while random training yields the weakest resilience to interference.
What carries the argument
Persistent interference from preceding context, measured as the degradation in prediction accuracy when switching between linear and quadratic regression tasks in the prompt.
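A minimal formalization of this measure, on our reading of the setup (the notation below is ours, not the paper's): write $\mathcal{P}_{m,k}$ for a prompt with $m$ linear examples followed by $k$ quadratic examples, and measure interference as the excess squared error on a quadratic query point $x_q$:

$$
\Delta(m,k) \;=\; \mathbb{E}\Big[\big(\hat f(x_q \mid \mathcal{P}_{m,k}) - f_{\mathrm{quad}}(x_q)\big)^2\Big] \;-\; \mathbb{E}\Big[\big(\hat f(x_q \mid \mathcal{P}_{0,k}) - f_{\mathrm{quad}}(x_q)\big)^2\Big]
$$

On this reading, the paper's core claims are that $\Delta(m,k)$ grows with $m$ and that the first term falls in $k$ with diminishing returns.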
If this is right
- More preceding examples from one function class will increase error when predicting the other class.
- Error reduction from corrective examples slows after the first few additions.
- Sequential training curricula produce models that recover quickest from context interference.
- Random training curricula result in models with the poorest robustness to task switches.
Where Pith is reading between the lines
- In practice, prompt design should account for example ordering to reduce the influence of earlier examples on later tasks.
- These dynamics might extend to natural language tasks, suggesting that long context windows could accumulate unwanted biases.
- Alternative training methods could be explored to enhance resilience beyond the tested curricula.
Load-bearing premise
That results from these synthetic regression tasks with linear and quadratic functions generalize to the interference effects in real-world in-context learning with language models.
What would settle it
Finding that the number of preceding examples has no systematic effect on prediction error, or that all curricula show identical recovery rates, when tested on the same task switches or on actual language modeling prompts.
Original abstract
This paper investigates context stickiness in in-context learning (ICL), a phenomenon where earlier examples in a prompt interfere with a transformer's ability to adapt to later tasks. Using synthetic regression tasks over linear and quadratic functions, we examine how models trained under sequential, mixed, and random curricula handle abrupt task switches during inference. By sweeping over structured combinations of misleading linear examples followed by recovery quadratic examples, we quantify how prior context biases prediction error and how quickly models realign. Our results show strong evidence of persistent interference: more preceding linear examples reliably degrade quadratic predictions, while additional quadratic examples reduce error but with diminishing returns. We further find that training curricula significantly modulate resilience, with sequential training on the target function class yielding the fastest recovery, and surprisingly, random training producing the least robust behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study of context stickiness in in-context learning (ICL) using transformers trained on synthetic linear and quadratic regression tasks under sequential, mixed, and random curricula. It examines performance on inference prompts with abrupt switches from misleading linear examples to target quadratic examples, claiming persistent interference (additional preceding linear examples degrade quadratic predictions), diminishing returns from recovery examples, and curriculum-dependent adaptation speed (sequential best, random worst).
Significance. The controlled synthetic experiments offer a clean way to isolate and quantify interference effects in ICL, which could help explain transformer adaptation mechanisms if the patterns prove robust. The curriculum comparisons are a useful angle for training design. However, the absence of any scaling or natural-language validation substantially limits the significance for understanding ICL in actual large language models.
major comments (3)
- [§3 (Experimental Setup)] The description of model training, data generation, and evaluation lacks key reproducibility details, including transformer architecture (layers, heads, embedding size), optimization hyperparameters, exact prompt lengths, number of independent seeds/runs, and how error is aggregated. Without these, the 'strong evidence' of interference cannot be verified or reproduced.
- [§4 (Results)] Figures and tables reporting degradation with more linear examples and recovery curves do not include error bars, standard deviations, or any statistical significance tests. This undermines the reliability of claims such as 'reliably degrade' and 'diminishing returns', since variance in synthetic regression could explain the trends.
- [§5 (Discussion) and abstract] The central claim that the work studies interference 'in in-context learning' and provides evidence relevant to transformers/LLMs rests on the untested assumption that linear/quadratic synthetic tasks with artificial switches capture the interference dynamics of high-dimensional, semantically structured natural-language ICL. No bridging experiments, scaling studies, or comparisons to real LLM prompts are provided.
minor comments (2)
- [Introduction] The introduction introduces 'context stickiness' informally; a short formal definition or equation quantifying the interference (e.g., error as a function of prefix length) would improve clarity.
- [Abstract and §4] The 'surprisingly' qualifier on the random-curriculum result is not supported by a direct comparison figure or table reference, making the surprise claim harder to evaluate.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments have prompted us to improve the reproducibility and statistical rigor of the manuscript. We address each major comment below and have made corresponding revisions.
Point-by-point responses
Referee: [§3 (Experimental Setup)] The description of model training, data generation, and evaluation lacks key reproducibility details, including transformer architecture (layers, heads, embedding size), optimization hyperparameters, exact prompt lengths, number of independent seeds/runs, and how error is aggregated. Without these, the 'strong evidence' of interference cannot be verified or reproduced.
Authors: We agree that these details were insufficient in the original submission. The revised manuscript expands Section 3 and adds Appendix A with full specifications: a 4-layer transformer with 8 attention heads and embedding size 256; Adam optimizer with learning rate 1e-4, batch size 64, and 50k training steps; prompts consisting of 10-20 examples (approximately 200-400 tokens); results aggregated as mean over 5 independent random seeds with standard deviation; and data generation procedures for linear/quadratic functions. These additions enable full reproduction of all experiments. revision: yes
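A minimal sketch of the configuration and prompt construction described in this response, assuming the details above; the names and structure below are hypothetical, not taken from the paper's code.

```python
# Hypothetical reconstruction of the experimental setup from the authors'
# response; all identifiers are ours, not the paper's released code.
from dataclasses import dataclass

import numpy as np


@dataclass
class ExperimentConfig:
    n_layers: int = 4         # transformer layers (revised Section 3)
    n_heads: int = 8          # attention heads
    d_embed: int = 256        # embedding size
    lr: float = 1e-4          # Adam learning rate
    batch_size: int = 64
    train_steps: int = 50_000
    n_seeds: int = 5          # results reported as mean +/- std over seeds


def make_switch_prompt(m_linear, k_quadratic, rng):
    """Build one inference prompt: m misleading linear examples followed by
    k recovery examples from the target quadratic function class."""
    a, b = rng.normal(size=2)   # linear task parameters: y = a*x + b
    c = rng.normal()            # quadratic task parameter: y = c*x**2
    xs = rng.uniform(-1.0, 1.0, size=m_linear + k_quadratic)
    is_prefix = np.arange(xs.size) < m_linear
    ys = np.where(is_prefix, a * xs + b, c * xs**2)
    return np.stack([xs, ys], axis=1)   # (m + k, 2) array of (x, y) pairs


rng = np.random.default_rng(seed=0)
prompt = make_switch_prompt(m_linear=10, k_quadratic=5, rng=rng)
print(prompt.shape)  # (15, 2)
```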
Referee: [§4 (Results)] Figures and tables reporting degradation with more linear examples and recovery curves do not include error bars, standard deviations, or any statistical significance tests. This undermines the reliability of claims such as 'reliably degrade' and 'diminishing returns', since variance in synthetic regression could explain the trends.
Authors: We acknowledge the omission of variability measures. All figures in the revised Section 4 now display error bars as mean ± one standard deviation across the 5 seeds. We have added a statistical analysis subsection reporting paired t-tests comparing conditions with varying numbers of linear examples (all p < 0.01 for the reported degradations) and confirming diminishing returns in recovery. The text has been updated to reference these statistics when stating trends. revision: yes
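A minimal sketch of the variability reporting described in this response, assuming per-seed mean squared errors are available per condition; the numbers below are illustrative, not the paper's results.

```python
# Illustrative mean +/- std reporting and paired t-test across seeds,
# mirroring the revised statistical subsection; data here is made up.
import numpy as np
from scipy import stats

per_seed_mse = {  # hypothetical MSE on quadratic queries, 5 seeds each
    "5 linear prefix":  np.array([0.12, 0.15, 0.11, 0.14, 0.13]),
    "10 linear prefix": np.array([0.21, 0.24, 0.20, 0.23, 0.22]),
}

for cond, errs in per_seed_mse.items():
    print(f"{cond}: {errs.mean():.3f} +/- {errs.std(ddof=1):.3f}")

# Paired t-test: does a longer linear prefix yield higher error?
t_stat, p_val = stats.ttest_rel(per_seed_mse["10 linear prefix"],
                                per_seed_mse["5 linear prefix"])
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")
```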
Referee: [§5 (Discussion) and abstract] The central claim that the work studies interference 'in in-context learning' and provides evidence relevant to transformers/LLMs rests on the untested assumption that linear/quadratic synthetic tasks with artificial switches capture the interference dynamics of high-dimensional, semantically structured natural-language ICL. No bridging experiments, scaling studies, or comparisons to real LLM prompts are provided.
Authors: We agree that the synthetic setting does not automatically generalize to natural-language ICL and have revised the abstract and Section 5 to explicitly frame the work as a controlled study of interference mechanisms rather than a direct claim about LLMs. The discussion now includes a dedicated limitations paragraph acknowledging the gap and outlining why synthetic tasks enable isolation of effects not feasible in high-dimensional language data. We have not added new scaling or LLM experiments, as they fall outside the paper's scope of providing mechanistic insights via precise synthetic controls. revision: partial
Circularity Check
No circularity: empirical results from synthetic experiments
Full rationale
This is a purely empirical paper that reports controlled experiments on synthetic linear/quadratic regression tasks with different training curricula and abrupt task switches. No mathematical derivation chain, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described methods. The central observations (persistent interference, curriculum effects) are direct measurements from the experimental setup rather than reductions of outputs to inputs by construction. The study is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic regression tasks over linear and quadratic functions capture key aspects of interference in transformer in-context learning.
invented entities (1)
- context stickiness (no independent evidence)