Recognition: 2 theorem links · Lean theorem
On the Invariants of Softmax Attention
Pith reviewed 2026-05-10 19:50 UTC · model grok-4.3
The pith
The energy field of softmax attention, defined as the row-centered logit, obeys algebraic invariants and shows variance delocalization across models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We define the energy field as the row-centered attention logit and establish that it exhibits two classes of invariants: mechanism-level invariants (a per-row zero-sum constraint, a rank bound set by the head dimension, and spectral signatures that follow from them), and model-level regularities (variance delocalization across key positions, traced to key incoherence) that hold in every autoregressive language model tested.
What carries the argument
The energy field, defined as the row-centered attention logit: it carries both algebraic constraints imposed by softmax and empirical regularities arising from properties of the key matrix.
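The mechanism-level invariants can be checked numerically from the definition alone. A minimal sketch with synthetic Q and K (illustrative shapes, not the paper's code or data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_head = 64, 16                       # sequence length, head dimension
Q = rng.standard_normal((n, d_head))
K = rng.standard_normal((n, d_head))

Z = Q @ K.T / np.sqrt(d_head)            # raw attention logits
E = Z - Z.mean(axis=1, keepdims=True)    # energy field: row-centered logits

# Mechanism-level invariant 1: every row of E sums to zero.
assert np.allclose(E.sum(axis=1), 0.0)

# Mechanism-level invariant 2: centering cannot raise the rank of Z,
# which is at most d_head, so rank(E) is bounded by the head dimension.
assert np.linalg.matrix_rank(E) <= d_head

# Softmax ignores a constant shift per row, so Z and E induce
# identical attention weights.
P_Z = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
P_E = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)
assert np.allclose(P_Z, P_E)
```

The last check is the sense in which the centering is lossless for the mechanism: it removes only the per-row offset that softmax discards anyway.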
Load-bearing premise
The observed model-level regularities of variance delocalization hold universally in autoregressive language models rather than only in the specific models and architectures tested.
What would settle it
Finding an autoregressive language model where the energy field's variance concentrates on a small number of key positions, despite the keys satisfying incoherence, would falsify the model-level regularity claim.
Original abstract
Softmax attention maps every query--key interaction into a probability distribution, but the underlying structure remains largely unexplored. We define the \emph{energy field}, the row-centered attention logit, and show that it exhibits invariant properties across models, architectures, and inputs. Two classes of invariants emerge. \emph{Mechanism-level} invariants follow from the algebraic structure of softmax attention. They include a per-row zero-sum constraint, a rank bound determined by the head dimension, and spectral signatures that follow from them. \emph{Model-level} regularities are not required by the mechanism, yet hold in every autoregressive language model we test, spanning several architecture families. The energy field distributes its variance over key positions without concentrating at a few. This delocalization traces to a property of the key matrix we call \emph{key incoherence}. These invariants have practical consequences. The rank bound confines the energy field to a low-dimensional subspace. Key incoherence yields a per-head training monitor. All results are verified at multiple context lengths and input texts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines the energy field as the row-centered attention logit and identifies two classes of invariants in softmax attention. Mechanism-level invariants (per-row zero-sum constraint, rank(E) ≤ d_head, spectral signatures) follow algebraically from E = (QK^T / √d)(I − 11^T/n). Model-level regularities (variance delocalization across key positions due to key incoherence) are presented as empirical observations holding in every autoregressive LM tested across architecture families; these yield practical consequences including a low-dimensional subspace confinement and a per-head training monitor. All claims are stated to be verified at multiple context lengths and input texts.
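The identity E = (QKᵀ/√d)(I − 11ᵀ/n) in the summary is just per-row mean subtraction written as right-multiplication by the centering matrix; a quick numerical sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
Z = Q @ K.T / np.sqrt(d)                 # raw logits QK^T / sqrt(d)

C = np.eye(n) - np.ones((n, n)) / n      # centering matrix I - 11^T/n

# Right-multiplying by C subtracts each row's mean, giving the energy field.
E = Z @ C
assert np.allclose(E, Z - Z.mean(axis=1, keepdims=True))

# C is an idempotent projection, so centering twice changes nothing.
assert np.allclose(C @ C, C)
```

Since C is a rank n−1 projection, the rank bound rank(E) ≤ d_head follows immediately from rank(ZC) ≤ min(rank(Z), rank(C)).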
Significance. The algebraic mechanism-level invariants are unconditional and follow directly from the softmax definition, providing a clean structural characterization. If the model-level empirical regularities are shown to be general with proper controls and a quantitative definition of key incoherence, the work could supply useful interpretability tools and training diagnostics for transformers. The practical monitor and rank-bound implications are potentially valuable if the supporting evidence is strengthened.
Major comments (2)
- [Abstract] The model-level claim that variance delocalization due to key incoherence holds in 'every autoregressive language model we test' lacks any enumeration of the exact models, families, sizes, layers, or inputs examined, as well as a quantitative definition of incoherence (e.g., max |k_i · k_j| / ||k||^2 or an RIP constant) and statistical controls (seeds, exclusion criteria, counter-example search). This leaves the generality and causal attribution, which are load-bearing for the practical monitor, unverifiable from the information given.
- [Abstract] The statement that 'all results are verified at multiple context lengths and input texts' provides no methods, datasets, error bars, or analysis details, making it impossible to assess reproducibility or whether the empirical regularities are robust rather than post hoc.
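The referee's suggested quantitative definition, max |k_i · k_j| / ||k||², is the classical mutual coherence of the key vectors. A sketch of how such a statistic could be computed (synthetic keys and illustrative thresholds, not the paper's definition):

```python
import numpy as np

def mutual_coherence(K):
    """Largest |cos| between distinct key vectors: one candidate
    quantitative definition of (in)coherence in the referee's sense."""
    Kn = K / np.linalg.norm(K, axis=1, keepdims=True)
    G = np.abs(Kn @ Kn.T)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)

# Random Gaussian keys are nearly orthogonal in moderate dimension:
K_incoherent = rng.standard_normal((256, 64))
assert mutual_coherence(K_incoherent) < 0.85

# Keys clustered around one direction are highly coherent:
base = rng.standard_normal(64)
K_coherent = np.tile(base, (256, 1)) + 0.01 * rng.standard_normal((256, 64))
assert mutual_coherence(K_coherent) > 0.95
```

Any fixed definition of this kind would let the counter-example search the referee asks for be run mechanically across models and layers.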
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight areas where the abstract can be strengthened to better support the empirical claims. We address each major comment below and will revise the manuscript to incorporate additional details.
Point-by-point responses
-
Referee: [Abstract] The model-level claim that variance delocalization due to key incoherence holds in 'every autoregressive language model we test' lacks any enumeration of the exact models, families, sizes, layers, or inputs examined, as well as a quantitative definition of incoherence (e.g., max |k_i · k_j| / ||k||^2 or an RIP constant) and statistical controls (seeds, exclusion criteria, counter-example search). This leaves the generality and causal attribution, which are load-bearing for the practical monitor, unverifiable from the information given.
Authors: We agree that the abstract is insufficiently specific on these points and that this limits verifiability of the model-level claims. In the revised manuscript we will expand the abstract to enumerate the models and families tested (including specific sizes and layers from GPT-style, LLaMA, and additional autoregressive families), provide a quantitative definition of key incoherence as the maximum absolute value of the normalized inner product between distinct key vectors, and briefly note the statistical controls used (multiple random seeds and explicit counter-example searches). These additions will be made without lengthening the abstract excessively. Revision: yes.
-
Referee: [Abstract] The statement that 'all results are verified at multiple context lengths and input texts' provides no methods, datasets, error bars, or analysis details, making it impossible to assess reproducibility or whether the empirical regularities are robust rather than post hoc.
Authors: We concur that the verification statement requires supporting methodological information. We will revise the abstract to point to a new dedicated methods subsection (or appendix) that specifies the datasets and input texts used, the exact context lengths examined, the analysis procedures, and any error bars or robustness metrics computed across runs. This change will allow readers to evaluate reproducibility directly. Revision: yes.
Circularity Check
No significant circularity: mechanism invariants are direct algebraic consequences; model-level claims are empirical observations.
Full rationale
The paper explicitly separates mechanism-level invariants (per-row zero-sum, rank(E) ≤ d_head, spectral signatures) as following from the algebraic definition of the energy field E as the row-centered attention logit. These reduce immediately to the identity E = (QK^T/√d)(I - 11^T/n) without any fitting, prediction, or self-citation. Model-level regularities (variance delocalization via key incoherence) are presented as observed patterns across tested models rather than theorems or fitted predictions derived from the paper's equations. No load-bearing self-citations, uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear in the derivation chain. The central claims remain independent of the empirical tests and do not reduce to their inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: softmax attention maps every query-key interaction into a probability distribution.
Invented entities (2)
- energy field (no independent evidence)
- key incoherence (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  "The energy field is the row-centered logit: E_ij = Z_ij − μ_i … By construction, ∑ E_ij = 0 … rank(Ẽ) ≤ d_h + 1"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  "Key incoherence μ_K = L · max ||k_j||² / ||K||_F² … mean μ_K = 1.5 across 16 models"
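The quoted statistic μ_K = L · max_j ||k_j||² / ||K||_F² takes only a few lines to compute. The sketch below uses i.i.d. Gaussian keys as a stand-in for a real model's keys:

```python
import numpy as np

def key_incoherence(K):
    """mu_K = L * max_j ||k_j||^2 / ||K||_F^2 (the quoted definition).

    mu_K = 1 when every key has the same norm; mu_K = L when a single
    key carries all the energy. Small values mean no key dominates.
    """
    sq_norms = (K ** 2).sum(axis=1)
    return K.shape[0] * sq_norms.max() / sq_norms.sum()

rng = np.random.default_rng(0)
K = rng.standard_normal((512, 64))      # stand-in for one head's keys

mu = key_incoherence(K)
# For i.i.d. Gaussian keys mu_K stays close to 1, the same regime as
# the quoted empirical mean of 1.5 across 16 models.
assert 1.0 <= mu < 3.0
```

Since μ_K is cheap to evaluate per head at any checkpoint, this is presumably the shape of the "per-head training monitor" the abstract mentions.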
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
GQA: Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP, 2023
2023
-
[2]
The Statistical Analysis of Compositional Data
John Aitchison. The Statistical Analysis of Compositional Data. Chapman & Hall, 1986
1986
-
[3]
Pythia: A suite for analyzing large language models across training and scaling
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023
2023
-
[4]
Exact matrix completion via convex optimization
Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009
2009
-
[5]
Rethinking attention with performers
Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021
2021
-
[6]
Phi-2: The surprising power of small language models
Mojan Javaheripi et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023
2023
-
[7]
Mistral 7B
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023
2023
-
[8]
Transformers are RNNs: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 2020
2020
-
[9]
Compressible softmax-attended language under incompressible attention
Wonsuk Lee. Compressible softmax-attended language under incompressible attention. arXiv preprint, 2026
2026
-
[10]
A Wavelet Tour of Signal Processing: The Sparse Way
Stéphane Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 3rd edition, 2009
2009
-
[11]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019
2019
-
[12]
RoFormer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
2024
-
[13]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
2023
-
[14]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017
2017
-
[15]
Efficient streaming language models with attention sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024
2024
-
[16]
Qwen2 technical report
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024
2024
-
[17]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022
2022
Discussion (0)