pith. machine review for the scientific record.

arxiv: 2605.08505 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · math.PR · math.ST · stat.TH

Recognition: no theorem link

Scaling Limits of Long-Context Transformers

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.PR · math.ST · stat.TH
keywords long-context transformers · softmax self-attention · scaling limits · attention collapse · inverse temperature · critical regimes · sphere distribution · heat equation
0 comments

The pith

The critical scaling for attention selectivity depends on the local distance distribution near the query rather than global context features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies softmax self-attention in the long-context limit with a fixed query and n independent uniform random keys on the sphere. It treats the inverse temperature β_n as the parameter that decides whether attention averages everything uniformly or focuses sharply. The authors prove that the threshold where selectivity appears is fixed by the local power-law behavior of distances to the query near zero, giving the explicit scaling β_n^* ≍ n^{2/(d-1)}. They then derive the precise limiting distributions of the ordered attention weights and the output vector in the three regimes below, at, and above this threshold.
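
For reference, the setup above in symbols (our rendering; the notation A_i and ATT_n(q) is ours, not necessarily the paper's):

  % Keys k_1, ..., k_n i.i.d. uniform on S^{d-1}, fixed query q, value matrix V,
  % inverse temperature beta_n. The attention weights and output are
  \[
    A_i = \frac{\exp(\beta_n \langle q, k_i \rangle)}{\sum_{j=1}^{n} \exp(\beta_n \langle q, k_j \rangle)},
    \qquad
    \mathrm{ATT}_n(q) = \sum_{i=1}^{n} A_i \, V k_i ,
  \]
  % and the claimed phase transition is governed by the critical scale
  \[
    \beta_n^{*} \asymp n^{2/(d-1)} .
  \]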

Core claim

For i.i.d. uniform keys on the sphere S^{d-1} and fixed query, the attention mechanism undergoes a phase transition at inverse temperature scaling β_n^* ~ n^{2/(d-1)}. Below this scale the output converges to a deterministic local average around the query plus Gaussian fluctuations; exactly at the scale a finite number of nearest keys each receive positive limiting mass; above the scale all mass concentrates on the single closest key. In the subcritical regime with identity value matrix the map approximates the backward heat equation.
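
A minimal Monte Carlo sketch of this claim (ours, not the authors' code): sample uniform keys on S^{d-1}, sweep β_n across the three scalings, and watch the largest ordered attention weight A_(1).

  # Minimal Monte Carlo sketch of the claimed phase transition (illustrative only).
  import numpy as np

  def max_attention_weight(n, d, beta, rng):
      # i.i.d. uniform keys on S^{d-1}: normalize standard Gaussian vectors.
      keys = rng.standard_normal((n, d))
      keys /= np.linalg.norm(keys, axis=1, keepdims=True)
      q = np.zeros(d); q[0] = 1.0              # fixed query on the sphere
      logits = beta * (keys @ q)
      logits -= logits.max()                   # numerical stability
      w = np.exp(logits); w /= w.sum()
      return w.max()                           # largest ordered weight A_(1)

  rng = np.random.default_rng(0)
  n, d = 100_000, 5
  alpha_c = 2 / (d - 1)                        # critical exponent 2/(d-1)
  for label, beta in [("subcritical", n ** (0.7 * alpha_c)),
                      ("critical", n ** alpha_c),
                      ("supercritical", n ** (1.3 * alpha_c))]:
      a1 = np.mean([max_attention_weight(n, d, beta, rng) for _ in range(20)])
      print(f"{label:13s} beta={beta:9.1f}  mean A_(1) = {a1:.3f}")
  # Expected pattern as n grows: A_(1) near 0 (subcritical), strictly between
  # 0 and 1 (critical), near 1 (supercritical).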

What carries the argument

The local exponent of the distance-to-query distribution near zero, which fixes the critical scaling β_n^* ≍ n^{2/(d-1)} and determines the limiting laws of ordered attention weights and outputs across all regimes.

If this is right

  • Below the critical scale the attention output is a local average around the query with deterministic bias and Gaussian fluctuations.
  • At the critical scale a finite collection of nearest keys retains macroscopic mass without collapse to a single key.
  • Above the critical scale all attention mass concentrates on the single closest key.
  • In the subcritical regime with the identity value matrix the attention map approximates the backward heat equation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If real token embeddings deviate from uniform sphere distribution, the critical scaling for selectivity would shift according to the new local distance exponent.
  • The regime analysis supplies a concrete way to choose β_n in practice so that long-context attention achieves a chosen balance between averaging and focus (a sketch follows this list).
  • The same local-exponent approach could be applied to non-uniform or dependent key distributions that better model actual embedding spaces.
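
One hedged reading of the second bullet above: treat the critical exponent as a length-dependent temperature rule. The constant c, and the leap from the uniform-sphere model to learned embeddings, are editorial assumptions, not a recipe from the paper.

  # Hypothetical length-aware temperature rule extrapolated from the critical scaling;
  # the constant c and the use of the raw context length n are assumptions, not the paper's recipe.
  import numpy as np

  def length_aware_weights(scores, n, d, c=1.0):
      """Softmax over pre-computed scores <q, k_i>, rescaled by beta_n = c * n^(2/(d-1))."""
      beta_n = c * n ** (2.0 / (d - 1))
      z = beta_n * np.asarray(scores, dtype=float)
      z -= z.max()                      # numerical stability
      w = np.exp(z)
      return w / w.sum()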

Load-bearing premise

The keys are modeled as independent uniform random points on the sphere with a fixed query, producing a specific power-law tail for small distances.

What would settle it

Numerical computation of attention weights for large n at β_n = n^{2/(d-1) - 0.1}, showing that the weights do not converge to the predicted local average with its explicit bias, would falsify the subcritical regime description.
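
A sketch of that check under the paper's own sampling model (ours; since the explicit bias term is not reproduced here, it tests only the weaker subcritical consequence that no single key keeps macroscopic mass).

  # Illustrative check at beta_n = n^{2/(d-1) - 0.1}: in the subcritical regime the
  # largest attention weight A_(1) should decay toward 0 as n grows.
  import numpy as np

  rng = np.random.default_rng(1)
  d = 5
  for n in (10_000, 100_000, 1_000_000):
      beta = n ** (2 / (d - 1) - 0.1)
      keys = rng.standard_normal((n, d))
      keys /= np.linalg.norm(keys, axis=1, keepdims=True)
      q = np.zeros(d); q[0] = 1.0
      z = beta * (keys @ q)
      z -= z.max()
      w = np.exp(z); w /= w.sum()
      print(n, w.max())   # should drift toward 0 (slowly) if the subcritical description holds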

Figures

Figures reproduced from arXiv: 2605.08505 by Giuseppe Bruno, Philippe Rigollet, Shi Chen, Yury Polyanskiy, Zhengjiang Lin.

Figure 1
Figure 1: Unified schematic of the ordered-weight and attention-output regimes as the inverse temperature β_n varies, with Y_n(q) = ATT(n) − Vq and the Euclidean distance from key x_i to the query. Two qualitatively distinct failure modes flank the regime of interest: if β_n is too small, the smallest distances to the query are blurred into the bulk, the weights A_j are essentially uniform, and the attention map fails to distinguish relevant tokens…
Figure 2
Figure 2: Rescaled output displacement on S^2 for n = 10^4, V = Id, and ρ(x) ∝ exp(x_1 x_2). The columns use β_n = n^{5/4}, n, n^{3/4}, n^{1/2}, n^{1/4}, with the regime-dependent scalings shown above the panels, and the last column is the deterministic drift field. The bottom row is a geodesic chart centered at (1, 1, 1)/√3.
Figure 3
Figure 3: Ordered attention weights for uniform contexts on S^4. Left: the heatmap shows the empirical mean of A_(1) over 100 trials; the dashed curve β_n = 2n^α marks the critical scaling. Right: with β_n = n^{α/4}, the rescaled ordered weights follow the subcritical prediction e^{−x^α}, where x = k/m_n(q).
Figure 4
Figure 4: Large blocks are separated by gaps of length m, so the corresponding block variables are independent under m-dependence. Let G_n := {1, …, n} \ B_n be the union of the gaps and the final remainder. Then |G_n| = O(n/b_n + b_n) = o(n). Since 0 ≤ X_{i,n} ≤ 1, |∏_{i=1}^n (1 − X_{i,n}) − ∏_{i∈B_n} (1 − X_{i,n})| ≤ ∑_{i∈G_n} X_{i,n}.
read the original abstract

We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $\beta_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $\beta_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $\beta_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript analyzes the long-context scaling limits of softmax self-attention with a fixed query and n i.i.d. uniform keys on the sphere S^{d-1}. Treating the inverse temperature β_n as the scaling parameter, it shows that the critical scale at which attention becomes selective (rather than uniform averaging) is β_n^* ≍ n^{2/(d-1)}, governed by the local power-law exponent of the distance-to-query distribution near zero. The paper characterizes the limiting laws of the ordered attention weights and attention output in three regimes: subcritical (local averaging around the query with explicit deterministic bias and Gaussian fluctuations), critical (macroscopic mass retained by a finite number of nearest keys), and supercritical (collapse onto the single closest key). It further notes that the subcritical regime with identity value matrix approximates a backward heat equation.

Significance. If the results hold, the work supplies a rigorous probabilistic framework for phase transitions in attention, demonstrating that selectivity thresholds depend on local geometry of the key distribution rather than global context statistics. The explicit limiting distributions derived via order statistics and extreme-value tools, together with the heat-equation connection, constitute a clear theoretical contribution that could guide analysis of long-context transformers. The parameter-free nature of the critical scaling and the regime-specific characterizations are notable strengths.

major comments (2)
  1. [Main results / critical-scale theorem] The central derivation of β_n^* ≍ n^{2/(d-1)} (stated in the abstract and main theorem) rests on the local exponent of the distance distribution; the manuscript should explicitly derive or cite the spherical-cap volume calculation that produces the factor 2/(d-1) to confirm the exponent is load-bearing and not an artifact of the uniform assumption.
  2. [Subcritical regime analysis] In the subcritical regime, the claim that the attention output reduces to a local average with deterministic bias and Gaussian fluctuations is load-bearing for the heat-equation interpretation; the error bounds or convergence rates (especially for large d) should be stated explicitly so that the approximation's validity for finite n is clear.
minor comments (3)
  1. [Notation and setup] Clarify the precise definition of the ordered attention weights (e.g., whether ties are broken randomly or by index) at the first appearance of the notation.
  2. [Abstract and introduction] The abstract mentions 'all regimes of β_n'; add a short table or diagram summarizing the three regimes, their β_n scalings, and the corresponding limiting behaviors for quick reference (an illustrative sketch of such a table appears after this list).
  3. [Discussion of value matrix] Include a brief remark on how the results extend (or fail to extend) when the value matrix is not the identity, as this affects the heat-equation claim.
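
For orientation, an illustrative version of the table suggested in minor comment 2, compiled from the abstract's regime descriptions (a restatement, not a new result):

  % Regime summary compiled from the abstract; scalings relative to the critical scale n^{2/(d-1)}.
  \begin{tabular}{lll}
    Regime        & $\beta_n$ scaling            & Limiting behavior \\ \hline
    Subcritical   & $\beta_n \ll n^{2/(d-1)}$    & local average around $q$: deterministic bias $+$ Gaussian fluctuations \\
    Critical      & $\beta_n \asymp n^{2/(d-1)}$ & finitely many nearest keys keep macroscopic mass; no single-key collapse \\
    Supercritical & $\beta_n \gg n^{2/(d-1)}$    & all mass concentrates on the single closest key \\
  \end{tabular}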

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address each major comment below and will incorporate clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [Main results / critical-scale theorem] The central derivation of β_n^* ≍ n^{2/(d-1)} (stated in the abstract and main theorem) rests on the local exponent of the distance distribution; the manuscript should explicitly derive or cite the spherical-cap volume calculation that produces the factor 2/(d-1) to confirm the exponent is load-bearing and not an artifact of the uniform assumption.

    Authors: We agree that an explicit derivation strengthens the presentation. In the revision we will add a short subsection deriving the local volume scaling: for uniform K on S^{d-1}, the surface measure yields P(1 - ⟨q, K⟩ ≤ t) ∼ c_d t^{(d-1)/2} as t → 0 (via the standard parametrization of spherical caps and the quadratic approximation cos θ ≈ 1 - θ²/2). This directly produces the critical scaling β_n^* ≍ n^{2/(d-1)} through the extreme-value analysis of the maximum inner product and is a consequence of local Euclidean geometry rather than a global artifact of uniformity. We will also cite the relevant spherical-geometry references. revision: yes
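
    Worked out one step further (our rendering of the sketch above, under the same uniform-sphere assumption; constants absorbed into c_d):

      % Spherical-cap calculation behind the critical exponent (our rendering of the rebuttal's sketch).
      \[
        \mathbb{P}\big(1 - \langle q, K \rangle \le t\big)
        = \frac{\int_0^{\theta(t)} \sin^{d-2}\varphi \, d\varphi}{\int_0^{\pi} \sin^{d-2}\varphi \, d\varphi}
        \sim c_d \, t^{(d-1)/2}, \qquad t \to 0,
        \quad\text{with } 1 - \cos\theta(t) = t, \ \theta(t) \sim \sqrt{2t}.
      \]
      % Hence \(\min_i \big(1 - \langle q, K_i \rangle\big) \asymp n^{-2/(d-1)}\) in probability,
      % so \(\beta_n\) must grow like \(n^{2/(d-1)}\) before the nearest keys see order-one logit gaps,
      % recovering \(\beta_n^{*} \asymp n^{2/(d-1)}\).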

  2. Referee: [Subcritical regime analysis] In the subcritical regime, the claim that the attention output reduces to a local average with deterministic bias and Gaussian fluctuations is load-bearing for the heat-equation interpretation; the error bounds or convergence rates (especially for large d) should be stated explicitly so that the approximation's validity for finite n is clear.

    Authors: The subcritical theorems establish convergence in distribution to the stated local-average limit (with explicit bias and Gaussian fluctuations) as n → ∞ under β_n = o(n^{2/(d-1)}). In the revision we will add a remark on convergence rates, noting that the contribution of distant keys decays exponentially in the subcritical regime and that the Gaussian approximation error can be controlled via standard Berry–Esseen bounds on the order statistics of the inner products. The bias and variance constants depend on d through the cap-volume prefactor, which we will make explicit; the results hold for fixed d with n → ∞, and we will clarify the finite-n regime of validity. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives scaling limits and regime transitions for softmax attention directly from the explicit model of i.i.d. uniform keys on S^{d-1} with fixed query. The critical exponent β_n^* ≍ n^{2/(d-1)} follows from the local volume scaling of spherical caps (distance distribution near zero has power d-1), a standard geometric fact applied via extreme-value statistics. Limiting laws for ordered weights and attention output in sub-, critical, and super-critical regimes are obtained from tail asymptotics and order statistics without fitted parameters, self-definitions, or load-bearing self-citations. All steps remain within the stated probabilistic assumptions and use classical tools, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The model rests on standard probabilistic assumptions about i.i.d. uniform keys on the sphere and classical limit theorems; no new entities are postulated, and the only free parameters are the scaling sequence β_n and the fixed dimension d.

free parameters (2)
  • β_n
    Inverse-temperature scaling parameter whose growth rate relative to n determines the regime.
  • d
    Dimension of the sphere on which keys lie; treated as fixed but arbitrary.
axioms (2)
  • domain assumption Keys are i.i.d. uniform on the unit sphere S^{d-1}
    Central modeling assumption stated in the abstract.
  • standard math Standard results from extreme-value theory and local limit laws for distances on the sphere
    Invoked to obtain the local exponent of the distance distribution near zero.

pith-pipeline@v0.9.0 · 5516 in / 1516 out tokens · 32410 ms · 2026-05-12T01:41:42.710358+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

260 extracted references · 260 canonical work pages · 2 internal anchors
