pith. machine review for the scientific record.

arxiv: 2605.02907 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

On the Invariants of Softmax Attention

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords softmax attention · energy field · invariants · key incoherence · attention logits · language models · rank bound · variance delocalization

The pith

The energy field of softmax attention, defined as the row-centered logit, obeys algebraic invariants and shows variance delocalization across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the energy field as the attention logits after centering each row to sum to zero. It proves that mechanism-level invariants such as the zero-sum property and a rank bound set by the head dimension follow directly from the structure of softmax. Across tested autoregressive language models the energy field also spreads its variance evenly over key positions instead of concentrating on a few, a pattern traced to key incoherence. These patterns suggest a per-head training monitor and confirm that attention stays inside a low-dimensional subspace. A reader would care because the invariants offer concrete checks and constraints on how attention behaves without inspecting every weight.

Core claim

We define the energy field as the row-centered attention logit and establish that it exhibits two classes of invariants: mechanism-level invariants (a per-row zero-sum constraint, a rank bound set by the head dimension, and the spectral signatures that follow from them), and model-level regularities (variance delocalization across key positions, stemming from key incoherence) that hold in all tested autoregressive language models.
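The mechanism-level invariants follow directly from row-centering and can be checked numerically. A minimal NumPy sketch with random Q and K; the sizes n and d_head are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_head = 64, 16                       # sequence length and head dimension (illustrative)
Q = rng.standard_normal((n, d_head))
K = rng.standard_normal((n, d_head))

logits = Q @ K.T / np.sqrt(d_head)       # raw attention logits
E = logits - logits.mean(axis=1, keepdims=True)   # row-centered: the energy field

# Mechanism-level invariants:
assert np.allclose(E.sum(axis=1), 0.0)            # per-row zero-sum constraint
assert np.linalg.matrix_rank(E) <= d_head         # rank bound set by the head dimension

def softmax(x):
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

# Row-centering subtracts a per-row constant, so the softmax output is unchanged:
assert np.allclose(softmax(logits), softmax(E))
```

The last check is why the energy field can be studied without altering the attention mechanism: softmax is invariant to a per-row shift of its logits.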

What carries the argument

The energy field, defined as the row-centered attention logit, carries both algebraic constraints from softmax and empirical regularities from properties of the key matrix.

Load-bearing premise

The observed model-level regularities of variance delocalization hold universally in autoregressive language models rather than only in the specific models and architectures tested.

What would settle it

Finding an autoregressive language model where the energy field's variance concentrates on a small number of key positions, despite the keys satisfying incoherence, would falsify the model-level regularity claim.
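One way such a falsification test could be operationalized is with a participation-ratio statistic over key positions; the metric and thresholds below are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def variance_participation_ratio(E_row):
    """Participation ratio of squared energy over key positions:
    near 1 when variance concentrates on one position, near n when
    it is spread evenly across all n positions."""
    p = E_row ** 2
    p = p / p.sum()
    return 1.0 / np.sum(p ** 2)

rng = np.random.default_rng(0)
n = 256
delocalized = rng.standard_normal(n)   # variance spread over all key positions
concentrated = np.zeros(n)
concentrated[0] = 1.0                  # all variance on a single key position

assert variance_participation_ratio(concentrated) < 2
assert variance_participation_ratio(delocalized) > n / 4
```

A head whose rows consistently score near 1 on such a statistic, despite incoherent keys, would be the kind of counter-example described above.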

Figures

Figures reproduced from arXiv: 2605.02907 by Wonsuk Lee.

Figure 1. The causal energy field Eij of a single LLaMA-3.2-1B head (layer 5, head 0, L = 256, processing Dickens). (a) Three-dimensional surface over the causal region (j ≤ i). Ridges (red) mark high energy, valleys (blue) mark low energy. Every row sums to zero, so red and blue balance exactly. The acausal region (j > i) is masked. (b) Filled contour plot of the causal energy field, with iso-energy lines. Vertical…
Figure 2. Key incoherence µK vs. model size for all 16 models at L = 256. Markers show the median µK across all heads; error bars span the interquartile range. Despite nearly two orders of magnitude in parameter count, µK shows no systematic trend and remains near 1.5 for every architecture family. The gray dotted line marks µK = 1, perfectly uniform key norms.
Figure 3. Spectral signatures in the flattened energy signal of LLaMA-3.2-1B, layer 5, head 0.
original abstract

Softmax attention maps every query--key interaction into a probability distribution, but the underlying structure remains largely unexplored. We define the \emph{energy field}, the row-centered attention logit, and show that it exhibits invariant properties across models, architectures, and inputs. Two classes of invariants emerge. \emph{Mechanism-level} invariants follow from the algebraic structure of softmax attention. They include a per-row zero-sum constraint, a rank bound determined by the head dimension, and spectral signatures that follow from them. \emph{Model-level} regularities are not required by the mechanism, yet hold in every autoregressive language model we test, spanning several architecture families. The energy field distributes its variance over key positions without concentrating at a few. This delocalization traces to a property of the key matrix we call \emph{key incoherence}. These invariants have practical consequences. The rank bound confines the energy field to a low-dimensional subspace. Key incoherence yields a per-head training monitor. All results are verified at multiple context lengths and input texts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper defines the energy field as the row-centered attention logit and identifies two classes of invariants in softmax attention. Mechanism-level invariants (per-row zero-sum constraint, rank(E) ≤ d_head, spectral signatures) follow algebraically from E = (QK^T / √d)(I − 11^T/n). Model-level regularities (variance delocalization across key positions due to key incoherence) are presented as empirical observations holding in every autoregressive LM tested across architecture families; these yield practical consequences including a low-dimensional subspace confinement and a per-head training monitor. All claims are stated to be verified at multiple context lengths and input texts.

Significance. The algebraic mechanism-level invariants are unconditional and follow directly from the softmax definition, providing a clean structural characterization. If the model-level empirical regularities are shown to be general with proper controls and a quantitative definition of key incoherence, the work could supply useful interpretability tools and training diagnostics for transformers. The practical monitor and rank-bound implications are potentially valuable if the supporting evidence is strengthened.

major comments (2)
  1. [Abstract] Abstract: The model-level claim that variance delocalization due to key incoherence holds in 'every autoregressive language model we test' lacks any enumeration of the exact models, families, sizes, layers, or inputs examined, as well as a quantitative definition of incoherence (e.g., max |k_i · k_j| / ||k||^2 or RIP constant) and statistical controls (seeds, exclusion criteria, counter-example search). This renders the generality and causal attribution load-bearing for the practical monitor but unverifiable from the given information.
  2. [Abstract] Abstract: The statement that 'all results are verified at multiple context lengths and input texts' provides no methods, datasets, error bars, or analysis details, making it impossible to assess reproducibility or whether the empirical regularities are robust rather than post-hoc.
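The quantitative definition the referee asks for is not reproduced on this page; the Figure 2 caption's note that µK = 1 corresponds to perfectly uniform key norms suggests a norm-based measure. A hypothetical norm-ratio proxy, sketched for illustration only (not the paper's definition of µK):

```python
import numpy as np

def key_incoherence(K):
    """Illustrative incoherence proxy: the largest key norm relative to the
    root-mean-square key norm. Equals 1.0 exactly when all key norms are
    identical, and grows when a few keys dominate."""
    norms = np.linalg.norm(K, axis=1)
    return norms.max() / np.sqrt(np.mean(norms ** 2))

rng = np.random.default_rng(0)
uniform_K = rng.standard_normal((256, 64))
uniform_K /= np.linalg.norm(uniform_K, axis=1, keepdims=True)  # equal key norms
assert np.isclose(key_incoherence(uniform_K), 1.0)

spiky_K = uniform_K.copy()
spiky_K[0] *= 10.0            # one outlier key dominates the norm distribution
assert key_incoherence(spiky_K) > 1.0
```

Any published definition (e.g. a max normalized inner product or an RIP-style constant, as the referee suggests) would replace this proxy; the point is only that such a statistic is cheap to compute per head.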

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight areas where the abstract can be strengthened to better support the empirical claims. We address each major comment below and will revise the manuscript to incorporate additional details.

point-by-point responses
  1. Referee: [Abstract] Abstract: The model-level claim that variance delocalization due to key incoherence holds in 'every autoregressive language model we test' lacks any enumeration of the exact models, families, sizes, layers, or inputs examined, as well as a quantitative definition of incoherence (e.g., max |k_i · k_j| / ||k||^2 or RIP constant) and statistical controls (seeds, exclusion criteria, counter-example search). This renders the generality and causal attribution load-bearing for the practical monitor but unverifiable from the given information.

    Authors: We agree that the abstract is insufficiently specific on these points and that this limits verifiability of the model-level claims. In the revised manuscript we will expand the abstract to enumerate the models and families tested (including specific sizes and layers from GPT-style, LLaMA, and additional autoregressive families), provide a quantitative definition of key incoherence as the maximum absolute value of the normalized inner product between distinct key vectors, and briefly note the statistical controls used (multiple random seeds and explicit counter-example searches). These additions will be made without lengthening the abstract excessively. revision: yes

  2. Referee: [Abstract] Abstract: The statement that 'all results are verified at multiple context lengths and input texts' provides no methods, datasets, error bars, or analysis details, making it impossible to assess reproducibility or whether the empirical regularities are robust rather than post-hoc.

    Authors: We concur that the verification statement requires supporting methodological information. We will revise the abstract to point to a new dedicated methods subsection (or appendix) that specifies the datasets and input texts used, the exact context lengths examined, the analysis procedures, and any error bars or robustness metrics computed across runs. This change will allow readers to evaluate reproducibility directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity: mechanism invariants are direct algebraic consequences; model-level claims are empirical observations.

full rationale

The paper explicitly separates mechanism-level invariants (per-row zero-sum, rank(E) ≤ d_head, spectral signatures) as following from the algebraic definition of the energy field E as the row-centered attention logit. These reduce immediately to the identity E = (QK^T/√d)(I - 11^T/n) without any fitting, prediction, or self-citation. Model-level regularities (variance delocalization via key incoherence) are presented as observed patterns across tested models rather than theorems or fitted predictions derived from the paper's equations. No load-bearing self-citations, uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear in the derivation chain. The central claims remain independent of the empirical tests and do not reduce to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the algebraic properties of softmax and empirical observations in tested models; new concepts (energy field, key incoherence) are introduced without independent evidence outside the paper.

axioms (1)
  • domain assumption Softmax attention maps every query-key interaction into a probability distribution
    Stated as the starting point in the abstract.
invented entities (2)
  • energy field no independent evidence
    purpose: Row-centered attention logit to expose invariant properties
    Newly defined in the paper to reveal structure across models and inputs.
  • key incoherence no independent evidence
    purpose: Property of the key matrix explaining variance delocalization
    Introduced to account for the observed model-level regularity.

pith-pipeline@v0.9.0 · 5470 in / 1345 out tokens · 67486 ms · 2026-05-10T19:50:53.969580+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 4 internal anchors

  1. [1]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP, 2023

  2. [2]

    The Statistical Analysis of Compositional Data

    John Aitchison. The Statistical Analysis of Compositional Data. Chapman & Hall, 1986

  3. [3]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023

  4. [4]

    Exact matrix completion via convex optimization

    Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009

  5. [5]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021

  6. [6]

    Phi-2: The surprising power of small language models

    Mojan Javaheripi et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023

  7. [7]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023

  8. [8]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 2020

  9. [9]

    Compressible softmax-attended language under incompressible attention

    Wonsuk Lee. Compressible softmax-attended language under incompressible attention. arXiv preprint, 2026

  10. [10]

    A Wavelet Tour of Signal Processing: The Sparse Way

    Stéphane Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 3rd edition, 2009

  11. [11]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019

  12. [12]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  13. [13]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  14. [14]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

  15. [15]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024

  16. [16]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  17. [17]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022