Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Pith review 2026-05-12 03:56 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Under stationary ergodic elliptical inputs, softmax self-attention converges almost surely to a linear readout of the input covariance matrix.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumptions of stationarity, ergodicity and ellipticity, the softmax attention output converges almost surely to Θ_V Σ Θ_K^T Θ_Q x_t, where Σ denotes the covariance of the input sequence. Consequently a single attention head realizes one population gradient step for in-context linear regression; residual stacking of heads iterates the update. Across L layers, the terminal representation converges at rate 1/t to a deterministic function of the current token alone, turning autoregressive sampling into a first-order Markov chain whose attractors explain repetitive generation.
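Concretely, with dot-product scores, the claim reads as follows (a minimal formalization; the paper's temperature and normalization conventions may differ):

```latex
o_t \;=\; \sum_{s=1}^{t}
  \frac{\exp\!\left(x_t^{\top}\Theta_Q^{\top}\Theta_K\, x_s\right)}
       {\sum_{r=1}^{t}\exp\!\left(x_t^{\top}\Theta_Q^{\top}\Theta_K\, x_r\right)}
  \;\Theta_V x_s
\;\xrightarrow[\;t\to\infty\;]{\text{a.s.}}\;
\Theta_V\,\Sigma\,\Theta_K^{\top}\Theta_Q\, x_t,
\qquad \Sigma=\operatorname{Cov}(x_s).
```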
What carries the argument
Covariance readout: the almost-sure convergence of the softmax attention output to Θ_V Σ Θ_K^T Θ_Q x_t, a form that discards token-level detail in favor of second-order input statistics.
If this is right
- A single softmax head implements one step of population gradient descent on in-context linear regression (a numeric sketch of this and the next bullet follows the list).
- Residual stacking of such heads iterates the gradient update over multiple steps.
- In an L-layer stack the final hidden state converges to a deterministic function of the current token at rate 1/t.
- Autoregressive generation collapses to a first-order Markov chain whose attracting orbits produce repetition and mode collapse.
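A minimal numeric check of the first two bullets, using the linear-attention construction of von Oswald et al. [27] (per the paper, a softmax head recovers this behaviour only in the long-context population limit); the dimensions and learning rate below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 64, 0.1

w_star = rng.normal(size=d)            # ground-truth regression vector
X = rng.normal(size=(n, d))            # in-context inputs x_i
y = X @ w_star                         # in-context targets y_i
x_q = rng.normal(size=d)               # query input

# One explicit gradient step on the in-context squared loss, from w_0 = 0:
# w_1 = w_0 + (eta / n) * sum_i (y_i - w_0^T x_i) x_i = (eta / n) * X^T y
w_1 = (eta / n) * X.T @ y
pred_gd = w_1 @ x_q

# The same prediction, read out by a single (linear-)attention head whose
# scores are <x_q, x_i> and whose values are the targets y_i:
pred_attn = (eta / n) * (X @ x_q) @ y

assert np.isclose(pred_gd, pred_attn)  # identical up to float rounding
print(pred_gd, pred_attn)
```

Residual stacking repeats the update: a second head fed the residual-corrected targets y_i − w_1ᵀx_i produces w_2, and so on for L steps.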
Where Pith is reading between the lines
- The result suggests attention layers are structurally biased toward second-order moments, which may limit their ability to capture higher-order dependencies even in very long contexts.
- Architectural changes that preserve token-level detail rather than converging to covariance could reduce unwanted repetition without harming in-context learning.
- The unification predicts a trade-off: stronger covariance readout improves ICL consistency but accelerates drift into repetitive orbits.
Load-bearing premise
The input sequence must be stationary, ergodic, and elliptically distributed.
What would settle it
Generate long stationary ergodic elliptical sequences, compute the empirical attention output, and check whether it approaches the predicted matrix expression Θ_V Σ Θ_K^T Θ_Q x_t within sampling error; divergence on such data would falsify the claim.
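A minimal NumPy sketch of that check, using i.i.d. Gaussian inputs (a stationary, ergodic, elliptical special case); the head parameters, weight scale, and score normalization below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, scale = 4, 0.2  # small weight scale keeps the softmax scores well conditioned

# Illustrative head parameters and a valid input covariance matrix.
Theta_Q = scale * rng.normal(size=(d, d))
Theta_K = scale * rng.normal(size=(d, d))
Theta_V = rng.normal(size=(d, d))
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d

x_t = rng.multivariate_normal(np.zeros(d), Sigma)        # fixed query token
predicted = Theta_V @ Sigma @ Theta_K.T @ Theta_Q @ x_t  # claimed a.s. limit

for t in (10**2, 10**3, 10**4, 10**5):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=t)  # context x_1..x_t
    scores = X @ (Theta_K.T @ Theta_Q @ x_t)   # x_s^T Theta_K^T Theta_Q x_t
    w = np.exp(scores - scores.max())          # numerically stable softmax
    w /= w.sum()
    o_t = Theta_V @ (w @ X)                    # empirical attention output
    print(f"t={t:>6d}  ||o_t - predicted|| = {np.linalg.norm(o_t - predicted):.4f}")
```

Error shrinking as t grows is consistent with the claim; a persistent bias on such data would falsify it.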
Original abstract
Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this ``summarisation and forgetting'' can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V\Sigma\Theta_K^{\top}\Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that, under the assumptions of stationary, ergodic, and elliptical input token distributions, the output of a softmax self-attention layer converges almost surely to the linear covariance readout Θ_V Σ Θ_K^T Θ_Q x_t (where Σ denotes the population covariance of the inputs). This limiting form is then used to show that a single attention head can realize one step of population gradient descent on in-context linear regression, that stacked residual heads iterate this update, and that propagation through an L-layer transformer drives the terminal hidden state to a deterministic function of the current token at rate 1/t, implying that autoregressive generation asymptotically collapses to a first-order Markov chain whose attractors explain repetition and mode collapse.
Significance. If the almost-sure convergence result is correct, the work supplies a single mechanistic derivation for two seemingly unrelated LLM behaviors (in-context learning and repetitive generation) directly from the attention operator, without additional learned parameters or external memory. The explicit use of the ergodic theorem followed by a continuous-mapping argument through the softmax, together with the concrete identification of the limiting linear form, constitutes a clear technical contribution. The paper also supplies falsifiable predictions (e.g., the 1/t decay of hidden-state variance and the Markovian character of long generations) that can be checked empirically.
minor comments (4)
- §3.1, Eq. (8): the statement that the elliptical condition “ensures that only second moments appear” would be strengthened by an explicit reference to the relevant property of elliptical distributions (e.g., that the conditional expectation depends only on the covariance; see the sketch after this list).
- §4.2: the claim that residual stacking “implements multiple gradient-descent steps” is stated without a precise induction or contraction-mapping argument showing that the composite map remains a contraction for finite L; a short lemma would remove the ambiguity.
- Figure 2 and the surrounding text: the plotted trajectories are described as “empirical confirmation,” yet the caption does not state the number of independent runs, the precise input distribution used, or error bars; this makes it difficult to assess the variability of the observed 1/t decay.
- §5.3: the discussion of mode collapse would benefit from a brief comparison with existing analyses of repetition in autoregressive models (e.g., those based on temperature or nucleus sampling) to clarify the novel contribution of the covariance-readout view.
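For reference, the property the first comment alludes to is presumably the linearity of conditional expectations under elliptical laws: for a mean-zero elliptical x with covariance Σ and any fixed direction a,

```latex
\mathbb{E}\!\left[\,x \mid a^{\top}x\,\right]
  \;=\; \frac{\Sigma a}{a^{\top}\Sigma a}\,\bigl(a^{\top}x\bigr),
```

so the average of x weighted by any function of the score a^T x is proportional to Σa, which is why only second moments survive in the limit.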
Simulated Author's Rebuttal
We thank the referee for the positive and constructive review. We are pleased that the technical contribution of the almost-sure convergence result, its unification of in-context learning and repetitive generation, and the falsifiable predictions are recognized. No major comments were raised, so we will proceed with minor revisions to address any editorial or presentational suggestions.
Circularity Check
No significant circularity; the derivation relies on the ergodic theorem and a continuous-mapping argument under explicit assumptions.
full rationale
The central claim derives the almost-sure convergence of the softmax attention output to Θ_V Σ Θ_K^T Θ_Q x_t from the assumptions of stationarity, ergodicity, and ellipticity by applying the ergodic theorem to the empirical second-moment matrix and then invoking continuous mapping through the softmax. This is a standard probabilistic argument that does not reduce any prediction or limit to a fitted parameter, self-definition, or self-citation chain. The elliptical condition is used only to restrict the limit to second moments, which is consistent with the stated result and does not create a definitional loop. No load-bearing steps collapse to the inputs by construction.
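A sketch of that two-step argument, writing a = Θ_K^⊤Θ_Q x_t for the effective score direction (normalizations assumed, not taken from the paper):

```latex
o_t \;=\;
\frac{\tfrac{1}{t}\sum_{s\le t} e^{a^{\top}x_s}\,\Theta_V x_s}
     {\tfrac{1}{t}\sum_{s\le t} e^{a^{\top}x_s}}
\;\xrightarrow[\;t\to\infty\;]{\text{a.s.}}\;
\frac{\mathbb{E}\bigl[e^{a^{\top}x}\,\Theta_V x\bigr]}
     {\mathbb{E}\bigl[e^{a^{\top}x}\bigr]},
```

where the numerator and denominator each converge by Birkhoff's ergodic theorem and the ratio passes through by continuous mapping; the elliptical conditional-expectation property stated above then collapses the right-hand side to a multiple of Θ_V Σ a, i.e. exactly Θ_V Σ Θ_K^T Θ_Q x_t in the Gaussian case.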
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the input sequence is stationary, ergodic, and elliptically distributed.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to Θ_V Σ Θ_K^T Θ_Q x_t"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction — tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "the long-context limit is therefore a linear readout of the input's second-order statistics"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [17] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
- [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [19] Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.
- [20] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019, 2023.
- [21] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.
- [22] Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. A theoretical analysis of the repetition problem in text generation, 2021. URL https://arxiv.org/abs/2012.14660.
- [23] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
- [24] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- [25] Emmanuel Rio et al. Asymptotic theory of weakly dependent random processes, volume 80. Springer, 2017.
- [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [27] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.
- [28] Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. Advances in Neural Information Processing Systems, 36:15614–15638, 2023.
- [29] Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
- [30] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
- [31] Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095, 2022.
- [32] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024.