Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Pith review 2026-05-12 03:56 UTC · model grok-4.3 · Recognition: 2 Lean theorem links
The pith
Under stationary ergodic elliptical inputs, softmax self-attention converges almost surely to a linear readout of the input covariance matrix.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumptions of stationarity, ergodicity and ellipticity, the softmax attention output converges almost surely to Θ_V Σ Θ_K^T Θ_Q x_t, where Σ denotes the covariance of the input sequence. Consequently a single attention head realizes one population gradient step for in-context linear regression; residual stacking of heads iterates the update. Across L layers, the terminal representation converges at rate 1/t to a deterministic function of the current token alone, turning autoregressive sampling into a first-order Markov chain whose attractors explain repetitive generation.
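Concretely, with dot-product scores, the claim reads as follows (a minimal formalization; the paper's temperature and normalization conventions may differ):

```latex
o_t \;=\; \sum_{s=1}^{t}
  \frac{\exp\!\left(x_t^{\top}\Theta_Q^{\top}\Theta_K\, x_s\right)}
       {\sum_{r=1}^{t}\exp\!\left(x_t^{\top}\Theta_Q^{\top}\Theta_K\, x_r\right)}
  \;\Theta_V x_s
\;\xrightarrow[\;t\to\infty\;]{\text{a.s.}}\;
\Theta_V\,\Sigma\,\Theta_K^{\top}\Theta_Q\, x_t,
\qquad \Sigma=\operatorname{Cov}(x_s).
```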
What carries the argument
Covariance readout: the almost-sure convergence of the softmax attention output to Θ_V Σ Θ_K^T Θ_Q x_t, a form that discards token-level detail in favor of second-order input statistics.
If this is right
- A single softmax head implements one step of population gradient descent on in-context linear regression (a numeric sketch of this and the next bullet follows the list).
- Residual stacking of such heads iterates the gradient update over multiple steps.
- In an L-layer stack the final hidden state converges to a deterministic function of the current token at rate 1/t.
- Autoregressive generation collapses to a first-order Markov chain whose attracting orbits produce repetition and mode collapse.
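A minimal numeric check of the first two bullets, using the linear-attention construction of von Oswald et al. [27] (per the paper, a softmax head recovers this behaviour only in the long-context population limit); the dimensions and learning rate below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta = 5, 64, 0.1

w_star = rng.normal(size=d)            # ground-truth regression vector
X = rng.normal(size=(n, d))            # in-context inputs x_i
y = X @ w_star                         # in-context targets y_i
x_q = rng.normal(size=d)               # query input

# One explicit gradient step on the in-context squared loss, from w_0 = 0:
# w_1 = w_0 + (eta / n) * sum_i (y_i - w_0^T x_i) x_i = (eta / n) * X^T y
w_1 = (eta / n) * X.T @ y
pred_gd = w_1 @ x_q

# The same prediction, read out by a single (linear-)attention head whose
# scores are <x_q, x_i> and whose values are the targets y_i:
pred_attn = (eta / n) * (X @ x_q) @ y

assert np.isclose(pred_gd, pred_attn)  # identical up to float rounding
print(pred_gd, pred_attn)
```

Residual stacking repeats the update: a second head fed the residual-corrected targets y_i − w_1ᵀx_i produces w_2, and so on for L steps.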
Where Pith is reading between the lines
- The result suggests attention layers are structurally biased toward second-order moments, which may limit their ability to capture higher-order dependencies even in very long contexts.
- Architectural changes that preserve token-level detail rather than converging to covariance could reduce unwanted repetition without harming in-context learning.
- The unification predicts a trade-off: stronger covariance readout improves ICL consistency but accelerates drift into repetitive orbits.
Load-bearing premise
The input sequence must be stationary, ergodic, and elliptically distributed.
What would settle it
Generate long stationary ergodic elliptical sequences, compute the empirical attention output, and check whether it approaches the predicted matrix expression Θ_V Σ Θ_K^T Θ_Q x_t within sampling error; divergence on such data would falsify the claim.
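A minimal NumPy sketch of that check, using i.i.d. Gaussian inputs (a stationary, ergodic, elliptical special case); the head parameters, weight scale, and score normalization below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, scale = 4, 0.2  # small weight scale keeps the softmax scores well conditioned

# Illustrative head parameters and a valid input covariance matrix.
Theta_Q = scale * rng.normal(size=(d, d))
Theta_K = scale * rng.normal(size=(d, d))
Theta_V = rng.normal(size=(d, d))
A = rng.normal(size=(d, d))
Sigma = A @ A.T / d

x_t = rng.multivariate_normal(np.zeros(d), Sigma)        # fixed query token
predicted = Theta_V @ Sigma @ Theta_K.T @ Theta_Q @ x_t  # claimed a.s. limit

for t in (10**2, 10**3, 10**4, 10**5):
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=t)  # context x_1..x_t
    scores = X @ (Theta_K.T @ Theta_Q @ x_t)   # x_s^T Theta_K^T Theta_Q x_t
    w = np.exp(scores - scores.max())          # numerically stable softmax
    w /= w.sum()
    o_t = Theta_V @ (w @ X)                    # empirical attention output
    print(f"t={t:>6d}  ||o_t - predicted|| = {np.linalg.norm(o_t - predicted):.4f}")
```

Error shrinking as t grows is consistent with the claim; a persistent bias on such data would falsify it.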
Original abstract
Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this ``summarisation and forgetting'' can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V\Sigma\Theta_K^{\top}\Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that, under the assumptions of stationary, ergodic, and elliptical input token distributions, the output of a softmax self-attention layer converges almost surely to the linear covariance readout Θ_V Σ Θ_K^T Θ_Q x_t (where Σ denotes the population covariance of the inputs). This limiting form is then used to show that a single attention head can realize one step of population gradient descent on in-context linear regression, that stacked residual heads iterate this update, and that propagation through an L-layer transformer drives the terminal hidden state to a deterministic function of the current token at rate 1/t, implying that autoregressive generation asymptotically collapses to a first-order Markov chain whose attractors explain repetition and mode collapse.
Significance. If the almost-sure convergence result is correct, the work supplies a single mechanistic derivation for two seemingly unrelated LLM behaviors (in-context learning and repetitive generation) directly from the attention operator, without additional learned parameters or external memory. The explicit use of the ergodic theorem followed by a continuous-mapping argument through the softmax, together with the concrete identification of the limiting linear form, constitutes a clear technical contribution. The paper also supplies falsifiable predictions (e.g., the 1/t decay of hidden-state variance and the Markovian character of long generations) that can be checked empirically.
minor comments (4)
- §3.1, Eq. (8): the statement that the elliptical condition “ensures that only second moments appear” would be strengthened by an explicit reference to the relevant property of elliptical distributions (e.g., that the conditional expectation depends only on the covariance; see the sketch after this list).
- §4.2: the claim that residual stacking “implements multiple gradient-descent steps” is stated without a precise induction or contraction-mapping argument showing that the composite map remains a contraction for finite L; a short lemma would remove the ambiguity.
- Figure 2 and the surrounding text: the plotted trajectories are described as “empirical confirmation,” yet the caption does not state the number of independent runs, the precise input distribution used, or error bars; this makes it difficult to assess the variability of the observed 1/t decay.
- §5.3: the discussion of mode collapse would benefit from a brief comparison with existing analyses of repetition in autoregressive models (e.g., those based on temperature or nucleus sampling) to clarify the novel contribution of the covariance-readout view.
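For reference, the property the first comment alludes to is presumably the linearity of conditional expectations under elliptical laws: for a mean-zero elliptical x with covariance Σ and any fixed direction a,

```latex
\mathbb{E}\!\left[\,x \mid a^{\top}x\,\right]
  \;=\; \frac{\Sigma a}{a^{\top}\Sigma a}\,\bigl(a^{\top}x\bigr),
```

so the average of x weighted by any function of the score a^T x is proportional to Σa, which is why only second moments survive in the limit.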
Simulated Author's Rebuttal
We thank the referee for the positive and constructive review. We are pleased that the technical contribution of the almost-sure convergence result, its unification of in-context learning and repetitive generation, and the falsifiable predictions are recognized. No major comments were raised, so we will proceed with minor revisions to address any editorial or presentational suggestions.
Circularity Check
No significant circularity; the derivation relies on the ergodic theorem and a continuous-mapping argument under explicit assumptions.
full rationale
The central claim derives the almost-sure convergence of the softmax attention output to Θ_V Σ Θ_K^T Θ_Q x_t from the assumptions of stationarity, ergodicity, and ellipticity by applying the ergodic theorem to the empirical second-moment matrix and then invoking continuous mapping through the softmax. This is a standard probabilistic argument that does not reduce any prediction or limit to a fitted parameter, self-definition, or self-citation chain. The elliptical condition is used only to restrict the limit to second moments, which is consistent with the stated result and does not create a definitional loop. No load-bearing steps collapse to the inputs by construction.
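A sketch of that two-step argument, writing a = Θ_K^⊤Θ_Q x_t for the effective score direction (normalizations assumed, not taken from the paper):

```latex
o_t \;=\;
\frac{\tfrac{1}{t}\sum_{s\le t} e^{a^{\top}x_s}\,\Theta_V x_s}
     {\tfrac{1}{t}\sum_{s\le t} e^{a^{\top}x_s}}
\;\xrightarrow[\;t\to\infty\;]{\text{a.s.}}\;
\frac{\mathbb{E}\bigl[e^{a^{\top}x}\,\Theta_V x\bigr]}
     {\mathbb{E}\bigl[e^{a^{\top}x}\bigr]},
```

where the numerator and denominator each converge by Birkhoff's ergodic theorem and the ratio passes through by continuous mapping; the elliptical conditional-expectation property stated above then collapses the right-hand side to a multiple of Θ_V Σ a, i.e. exactly Θ_V Σ Θ_K^T Θ_Q x_t in the Gaussian case.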
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the input sequence is stationary, ergodic, and elliptically distributed.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel — tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to Θ_V Σ Θ_K^T Θ_Q x_t"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction — tagged unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. Passage: "the long-context limit is therefore a linear readout of the input's second-order statistics"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [17] Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? Investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
- [18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- [19] Stephanie Chan, Adam Santoro, Andrew Lampinen, Jane Wang, Aaditya Singh, Pierre Richemond, James McClelland, and Felix Hill. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.
- [20] Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL 2023, pages 4005–4019, 2023.
- [21] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1):12, 2021.
- [22] Zihao Fu, Wai Lam, Anthony Man-Cho So, and Bei Shi. A theoretical analysis of the repetition problem in text generation, 2021. URL https://arxiv.org/abs/2012.14660.
- [23] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
- [24] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- [25] Emmanuel Rio et al. Asymptotic theory of weakly dependent random processes, volume 80. Springer, 2017.
- [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- [27] Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023.
- [28] Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. Advances in Neural Information Processing Systems, 36:15614–15638, 2023.
- [29] Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019.
- [30] Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit Bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
- [31] Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095, 2022.
- [32] Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. Journal of Machine Learning Research, 25(49):1–55, 2024.