Scaling Limits of Long-Context Transformers
Pith reviewed 2026-05-12 01:41 UTC · model grok-4.3
The pith
The critical scaling for attention selectivity depends on the local distance distribution near the query rather than global context features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For i.i.d. uniform keys on the sphere S^{d-1} and fixed query, the attention mechanism undergoes a phase transition at inverse temperature scaling β_n^* ~ n^{2/(d-1)}. Below this scale the output converges to a deterministic local average around the query plus Gaussian fluctuations; exactly at the scale a finite number of nearest keys each receive positive limiting mass; above the scale all mass concentrates on the single closest key. In the subcritical regime with identity value matrix the map approximates the backward heat equation.
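A minimal Monte Carlo sketch of this transition (not code from the paper; the setup below, with keys drawn as normalized Gaussians, query fixed at e_1, and d = 3 so that β_n^* ≍ n, is an illustrative assumption):

import numpy as np

def attention_weights(n, d, beta, rng):
    # n i.i.d. keys uniform on S^{d-1}: normalized standard Gaussian vectors.
    keys = rng.standard_normal((n, d))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    query = np.zeros(d)
    query[0] = 1.0                      # fixed query q = e_1
    scores = beta * (keys @ query)      # softmax logits beta * <q, k_i>
    w = np.exp(scores - scores.max())   # numerically stabilized softmax
    return w / w.sum()

rng = np.random.default_rng(0)
n, d = 200_000, 3
crit = n ** (2 / (d - 1))               # critical scale beta_n^* ~ n^{2/(d-1)} (= n for d = 3)
for label, beta in [("subcritical", 0.01 * crit),
                    ("critical", 1.0 * crit),
                    ("supercritical", 100 * crit)]:
    w = attention_weights(n, d, beta, rng)
    top5 = np.sort(w)[-5:][::-1]
    print(f"{label:13s} beta={beta:.3g}  top-5 weights = {np.round(top5, 3)}")

The three runs should show, respectively, weights spread over many nearby keys, a handful of O(1) weights, and near-total collapse onto the closest key, in line with the three regimes described above.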
What carries the argument
The local exponent of the distance-to-query distribution near zero, which fixes the critical scaling β_n^* ≍ n^{2/(d-1)} and determines the limiting laws of ordered attention weights and outputs across all regimes.
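A sketch of the heuristic behind this exponent (the paper's proof proceeds via extreme-value analysis; the steps below only record the scaling): for K uniform on S^{d-1} and fixed q, writing 1 − ⟨q, K⟩ = 1 − cos θ ≈ θ²/2, the spherical-cap volume gives P(1 − ⟨q, K⟩ ≤ t) ≍ θ^{d-1} ≍ t^{(d-1)/2} as t → 0. Among n i.i.d. keys the smallest gap to the query is therefore of order t_min ≍ n^{−2/(d−1)}, and the softmax becomes selective once β_n t_min ≍ 1, i.e. at β_n^* ≍ n^{2/(d−1)}.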
If this is right
- Below the critical scale the attention output is a local average around the query with deterministic bias and Gaussian fluctuations.
- At the critical scale a finite collection of nearest keys retains macroscopic mass without collapse to a single key.
- Above the critical scale all attention mass concentrates on the single closest key.
- In the subcritical regime with the identity value matrix the attention map approximates the backward heat equation.
Where Pith is reading between the lines
- If real token embeddings deviate from uniform sphere distribution, the critical scaling for selectivity would shift according to the new local distance exponent.
- The regime analysis supplies a concrete way to choose β_n in practice so that long-context attention achieves a chosen balance between averaging and focus.
- The same local-exponent approach could be applied to non-uniform or dependent key distributions that better model actual embedding spaces.
Load-bearing premise
The keys are modeled as independent uniform random points on the sphere with a fixed query, producing a specific power-law tail for small distances.
What would settle it
A numerical computation of the attention output for large n at β_n = n^{2/(d-1) − 0.1}, showing that it fails to converge to the predicted local average with the stated deterministic bias, would falsify the subcritical regime description.
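A minimal sketch of such a check (illustrative, not the paper's protocol): it tracks the largest attention weight and the alignment of the output with q at β_n = n^{2/(d−1) − 0.1}, assuming uniform keys, identity value matrix, and d = 3. At this exponent the effective number of contributing keys grows only roughly like n^{0.1} for d = 3, so the trend is slow.

import numpy as np

def subcritical_check(d=3, shift=0.1, sizes=(10**3, 10**4, 10**5, 10**6), seed=0):
    # Diagnostic at beta_n = n^{2/(d-1) - shift}: in the subcritical regime the
    # largest weight should drift toward 0 and the output should align with q.
    rng = np.random.default_rng(seed)
    q = np.zeros(d)
    q[0] = 1.0
    for n in sizes:
        beta = n ** (2.0 / (d - 1) - shift)
        keys = rng.standard_normal((n, d))
        keys /= np.linalg.norm(keys, axis=1, keepdims=True)
        scores = beta * (keys @ q)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out = w @ keys                  # identity value matrix: output = weighted average of keys
        align = out[0] / np.linalg.norm(out)
        print(f"n={n:>8d}  beta={beta:10.1f}  max weight={w.max():.3f}  <q, out/|out|>={align:.4f}")

subcritical_check()

A persistent failure of the maximum weight to decrease, or of the output to align with q, as n grows would be evidence against the subcritical description; given the slow n^{0.1} growth for d = 3, a sharp verdict needs very large n or larger d.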
Original abstract
We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $\beta_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $\beta_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $\beta_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes the long-context scaling limits of softmax self-attention with a fixed query and n i.i.d. uniform keys on the sphere S^{d-1}. Treating the inverse temperature β_n as the scaling parameter, it shows that the critical scale at which attention becomes selective (rather than uniform averaging) is β_n^* ≍ n^{2/(d-1)}, governed by the local power-law exponent of the distance-to-query distribution near zero. The paper characterizes the limiting laws of the ordered attention weights and attention output in three regimes: subcritical (local averaging around the query with explicit deterministic bias and Gaussian fluctuations), critical (macroscopic mass retained by a finite number of nearest keys), and supercritical (collapse onto the single closest key). It further notes that the subcritical regime with identity value matrix approximates a backward heat equation.
Significance. If the results hold, the work supplies a rigorous probabilistic framework for phase transitions in attention, demonstrating that selectivity thresholds depend on local geometry of the key distribution rather than global context statistics. The explicit limiting distributions derived via order statistics and extreme-value tools, together with the heat-equation connection, constitute a clear theoretical contribution that could guide analysis of long-context transformers. The parameter-free nature of the critical scaling and the regime-specific characterizations are notable strengths.
major comments (2)
- [Main results / critical-scale theorem] The central derivation of β_n^* ≍ n^{2/(d-1)} (stated in the abstract and main theorem) rests on the local exponent of the distance distribution; the manuscript should explicitly derive or cite the spherical-cap volume calculation that produces the factor 2/(d-1) to confirm the exponent is load-bearing and not an artifact of the uniform assumption.
- [Subcritical regime analysis] In the subcritical regime, the claim that the attention output reduces to a local average with deterministic bias and Gaussian fluctuations is load-bearing for the heat-equation interpretation; the error bounds or convergence rates (especially for large d) should be stated explicitly so that the approximation's validity for finite n is clear.
minor comments (3)
- [Notation and setup] Clarify the precise definition of the ordered attention weights (e.g., whether ties are broken randomly or by index) at the first appearance of the notation.
- [Abstract and introduction] The abstract mentions 'all regimes of β_n'; add a short table or diagram summarizing the three regimes, their β_n scalings, and the corresponding limiting behaviors for quick reference.
- [Discussion of value matrix] Include a brief remark on how the results extend (or fail to extend) when the value matrix is not the identity, as this affects the heat-equation claim.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments. We address each major comment below and will incorporate clarifications in the revised manuscript.
Point-by-point responses
Referee: [Main results / critical-scale theorem] The central derivation of β_n^* ≍ n^{2/(d-1)} (stated in the abstract and main theorem) rests on the local exponent of the distance distribution; the manuscript should explicitly derive or cite the spherical-cap volume calculation that produces the factor 2/(d-1) to confirm the exponent is load-bearing and not an artifact of the uniform assumption.
Authors: We agree that an explicit derivation strengthens the presentation. In the revision we will add a short subsection deriving the local volume scaling: for uniform K on S^{d-1}, the surface measure yields P(1 - ⟨q, K⟩ ≤ t) ∼ c_d t^{(d-1)/2} as t → 0 (via the standard parametrization of spherical caps and the quadratic approximation cos θ ≈ 1 - θ²/2). This directly produces the critical scaling β_n^* ≍ n^{2/(d-1)} through the extreme-value analysis of the maximum inner product and is a consequence of local Euclidean geometry rather than a global artifact of uniformity. We will also cite the relevant spherical-geometry references.
Revision: yes
Referee: [Subcritical regime analysis] In the subcritical regime, the claim that the attention output reduces to a local average with deterministic bias and Gaussian fluctuations is load-bearing for the heat-equation interpretation; the error bounds or convergence rates (especially for large d) should be stated explicitly so that the approximation's validity for finite n is clear.
Authors: The subcritical theorems establish convergence in distribution to the stated local-average limit (with explicit bias and Gaussian fluctuations) as n → ∞ under β_n = o(n^{2/(d-1)}). In the revision we will add a remark on convergence rates, noting that the contribution of distant keys decays exponentially in the subcritical regime and that the Gaussian approximation error can be controlled via standard Berry–Esseen bounds on the order statistics of the inner products. The bias and variance constants depend on d through the cap-volume prefactor, which we will make explicit; the results hold for fixed d with n → ∞, and we will clarify the finite-n regime of validity.
Revision: yes
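To illustrate the Berry–Esseen point above (a rough editorial Monte Carlo diagnostic, not part of the authors' response or their proposed bound): the sketch below standardizes a query-orthogonal coordinate of the subcritical attention output across independent contexts and reports its skewness and excess kurtosis, which should be near zero if the Gaussian-fluctuation claim is accurate at that n. The parameters n, d, and the exponent 0.6 are illustrative choices.

import numpy as np

def fluctuation_moments(n=20_000, d=3, exponent=0.6, reps=400, seed=1):
    # beta_n = n^exponent with exponent < 2/(d-1) (= 1 for d = 3), i.e. a subcritical run.
    rng = np.random.default_rng(seed)
    q = np.zeros(d)
    q[0] = 1.0
    beta = n ** exponent
    samples = np.empty(reps)
    for r in range(reps):
        keys = rng.standard_normal((n, d))
        keys /= np.linalg.norm(keys, axis=1, keepdims=True)
        scores = beta * (keys @ q)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        samples[r] = (w @ keys)[1]      # coordinate orthogonal to q; mean 0 by symmetry
    z = (samples - samples.mean()) / samples.std()
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4) - 3.0
    print(f"skewness = {skew:+.3f}, excess kurtosis = {kurt:+.3f} (near 0 if fluctuations are ~Gaussian)")

fluctuation_moments()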
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper derives scaling limits and regime transitions for softmax attention directly from the explicit model of i.i.d. uniform keys on S^{d-1} with fixed query. The critical exponent β_n^* ≍ n^{2/(d-1)} follows from the local volume scaling of spherical caps (distance distribution near zero has power d-1), a standard geometric fact applied via extreme-value statistics. Limiting laws for ordered weights and attention output in sub-, critical, and super-critical regimes are obtained from tail asymptotics and order statistics without fitted parameters, self-definitions, or load-bearing self-citations. All steps remain within the stated probabilistic assumptions and use classical tools, rendering the chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- β_n
- d
axioms (2)
- domain assumption: Keys are i.i.d. uniform on the unit sphere S^{d-1}
- standard math: Standard results from extreme-value theory and local limit laws for distances on the sphere