pith. machine review for the scientific record.

arxiv: 2605.02907 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

On the Invariants of Softmax Attention

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords softmax attention · energy field · invariants · key incoherence · attention logits · language models · rank bound · variance delocalization

The pith

The energy field of softmax attention, defined as the row-centered logit, obeys algebraic invariants and shows variance delocalization across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the energy field as the attention logits after centering each row to sum to zero. It proves that mechanism-level invariants such as the zero-sum property and a rank bound set by the head dimension follow directly from the structure of softmax. Across tested autoregressive language models the energy field also spreads its variance evenly over key positions instead of concentrating on a few, a pattern traced to key incoherence. These patterns suggest a per-head training monitor and confirm that attention stays inside a low-dimensional subspace. A reader would care because the invariants offer concrete checks and constraints on how attention behaves without inspecting every weight.

Core claim

We define the energy field as the row-centered attention logit and establish that it exhibits two classes of invariants: mechanism-level invariants (a per-row zero-sum constraint, a rank bound set by the head dimension, and the spectral signatures that follow from them), and model-level regularities (variance delocalization across key positions, stemming from key incoherence) that hold in all tested autoregressive language models.
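The mechanism-level invariants follow directly from row-centering and can be checked numerically. A minimal NumPy sketch with random Q and K; the sizes n and d_head are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_head = 64, 16                       # sequence length and head dimension (illustrative)
Q = rng.standard_normal((n, d_head))
K = rng.standard_normal((n, d_head))

logits = Q @ K.T / np.sqrt(d_head)       # raw attention logits
E = logits - logits.mean(axis=1, keepdims=True)   # row-centered: the energy field

# Mechanism-level invariants:
assert np.allclose(E.sum(axis=1), 0.0)            # per-row zero-sum constraint
assert np.linalg.matrix_rank(E) <= d_head         # rank bound set by the head dimension

def softmax(x):
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

# Row-centering subtracts a per-row constant, so the softmax output is unchanged:
assert np.allclose(softmax(logits), softmax(E))
```

The last check is why the energy field can be studied without altering the attention mechanism: softmax is invariant to a per-row shift of its logits.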

What carries the argument

The energy field, defined as the row-centered attention logit, carries both algebraic constraints from softmax and empirical regularities from properties of the key matrix.

Load-bearing premise

The observed model-level regularities of variance delocalization hold universally in autoregressive language models rather than only in the specific models and architectures tested.

What would settle it

Finding an autoregressive language model where the energy field's variance concentrates on a small number of key positions, despite the keys satisfying incoherence, would falsify the model-level regularity claim.
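One way such a falsification test could be operationalized is with a participation-ratio statistic over key positions; the metric and thresholds below are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def variance_participation_ratio(E_row):
    """Participation ratio of squared energy over key positions:
    near 1 when variance concentrates on one position, near n when
    it is spread evenly across all n positions."""
    p = E_row ** 2
    p = p / p.sum()
    return 1.0 / np.sum(p ** 2)

rng = np.random.default_rng(0)
n = 256
delocalized = rng.standard_normal(n)   # variance spread over all key positions
concentrated = np.zeros(n)
concentrated[0] = 1.0                  # all variance on a single key position

assert variance_participation_ratio(concentrated) < 2
assert variance_participation_ratio(delocalized) > n / 4
```

A head whose rows consistently score near 1 on such a statistic, despite incoherent keys, would be the kind of counter-example described above.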

Figures

Figures reproduced from arXiv: 2605.02907 by Wonsuk Lee.

Figure 1. The causal energy field Eij of a single LLaMA-3.2-1B head (layer 5, head 0, L = 256, processing Dickens). (a) Three-dimensional surface over the causal region (j ≤ i). Ridges (red) mark high energy, valleys (blue) mark low energy. Every row sums to zero, so red and blue balance exactly. The acausal region (j > i) is masked. (b) Filled contour plot of the causal energy field, with iso-energy lines. Vertical…
Figure 2. Key incoherence µK vs. model size for all 16 models at L = 256. Markers show the median µK across all heads; error bars span the interquartile range. Despite nearly two orders of magnitude in parameter count, µK shows no systematic trend and remains near 1.5 for every architecture family. The gray dotted line marks µK = 1, perfectly uniform key norms.
Figure 3. Spectral signatures in the flattened energy signal of LLaMA-3.2-1B, layer 5, head 0.
original abstract

Softmax attention maps every query--key interaction into a probability distribution, but the underlying structure remains largely unexplored. We define the \emph{energy field}, the row-centered attention logit, and show that it exhibits invariant properties across models, architectures, and inputs. Two classes of invariants emerge. \emph{Mechanism-level} invariants follow from the algebraic structure of softmax attention. They include a per-row zero-sum constraint, a rank bound determined by the head dimension, and spectral signatures that follow from them. \emph{Model-level} regularities are not required by the mechanism, yet hold in every autoregressive language model we test, spanning several architecture families. The energy field distributes its variance over key positions without concentrating at a few. This delocalization traces to a property of the key matrix we call \emph{key incoherence}. These invariants have practical consequences. The rank bound confines the energy field to a low-dimensional subspace. Key incoherence yields a per-head training monitor. All results are verified at multiple context lengths and input texts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper defines the energy field as the row-centered attention logit and identifies two classes of invariants in softmax attention. Mechanism-level invariants (per-row zero-sum constraint, rank(E) ≤ d_head, spectral signatures) follow algebraically from E = (QK^T / √d)(I − 11^T/n). Model-level regularities (variance delocalization across key positions due to key incoherence) are presented as empirical observations holding in every autoregressive LM tested across architecture families; these yield practical consequences including a low-dimensional subspace confinement and a per-head training monitor. All claims are stated to be verified at multiple context lengths and input texts.

Significance. The algebraic mechanism-level invariants are unconditional and follow directly from the softmax definition, providing a clean structural characterization. If the model-level empirical regularities are shown to be general with proper controls and a quantitative definition of key incoherence, the work could supply useful interpretability tools and training diagnostics for transformers. The practical monitor and rank-bound implications are potentially valuable if the supporting evidence is strengthened.

major comments (2)
  1. [Abstract] Abstract: The model-level claim that variance delocalization due to key incoherence holds in 'every autoregressive language model we test' lacks any enumeration of the exact models, families, sizes, layers, or inputs examined, as well as a quantitative definition of incoherence (e.g., max |k_i · k_j| / ||k||^2 or RIP constant) and statistical controls (seeds, exclusion criteria, counter-example search). This renders the generality and causal attribution load-bearing for the practical monitor but unverifiable from the given information.
  2. [Abstract] Abstract: The statement that 'all results are verified at multiple context lengths and input texts' provides no methods, datasets, error bars, or analysis details, making it impossible to assess reproducibility or whether the empirical regularities are robust rather than post-hoc.
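The quantitative definition the referee asks for is not reproduced on this page; the Figure 2 caption's note that µK = 1 corresponds to perfectly uniform key norms suggests a norm-based measure. A hypothetical norm-ratio proxy, sketched for illustration only (not the paper's definition of µK):

```python
import numpy as np

def key_incoherence(K):
    """Illustrative incoherence proxy: the largest key norm relative to the
    root-mean-square key norm. Equals 1.0 exactly when all key norms are
    identical, and grows when a few keys dominate."""
    norms = np.linalg.norm(K, axis=1)
    return norms.max() / np.sqrt(np.mean(norms ** 2))

rng = np.random.default_rng(0)
uniform_K = rng.standard_normal((256, 64))
uniform_K /= np.linalg.norm(uniform_K, axis=1, keepdims=True)  # equal key norms
assert np.isclose(key_incoherence(uniform_K), 1.0)

spiky_K = uniform_K.copy()
spiky_K[0] *= 10.0            # one outlier key dominates the norm distribution
assert key_incoherence(spiky_K) > 1.0
```

Any published definition (e.g. a max normalized inner product or an RIP-style constant, as the referee suggests) would replace this proxy; the point is only that such a statistic is cheap to compute per head.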

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight areas where the abstract can be strengthened to better support the empirical claims. We address each major comment below and will revise the manuscript to incorporate additional details.

point-by-point responses
  1. Referee: [Abstract] Abstract: The model-level claim that variance delocalization due to key incoherence holds in 'every autoregressive language model we test' lacks any enumeration of the exact models, families, sizes, layers, or inputs examined, as well as a quantitative definition of incoherence (e.g., max |k_i · k_j| / ||k||^2 or RIP constant) and statistical controls (seeds, exclusion criteria, counter-example search). This renders the generality and causal attribution load-bearing for the practical monitor but unverifiable from the given information.

    Authors: We agree that the abstract is insufficiently specific on these points and that this limits verifiability of the model-level claims. In the revised manuscript we will expand the abstract to enumerate the models and families tested (including specific sizes and layers from GPT-style, LLaMA, and additional autoregressive families), provide a quantitative definition of key incoherence as the maximum absolute value of the normalized inner product between distinct key vectors, and briefly note the statistical controls used (multiple random seeds and explicit counter-example searches). These additions will be made without lengthening the abstract excessively. revision: yes

  2. Referee: [Abstract] Abstract: The statement that 'all results are verified at multiple context lengths and input texts' provides no methods, datasets, error bars, or analysis details, making it impossible to assess reproducibility or whether the empirical regularities are robust rather than post-hoc.

    Authors: We concur that the verification statement requires supporting methodological information. We will revise the abstract to point to a new dedicated methods subsection (or appendix) that specifies the datasets and input texts used, the exact context lengths examined, the analysis procedures, and any error bars or robustness metrics computed across runs. This change will allow readers to evaluate reproducibility directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity: mechanism invariants are direct algebraic consequences; model-level claims are empirical observations.

full rationale

The paper explicitly separates mechanism-level invariants (per-row zero-sum, rank(E) ≤ d_head, spectral signatures) as following from the algebraic definition of the energy field E as the row-centered attention logit. These reduce immediately to the identity E = (QK^T/√d)(I - 11^T/n) without any fitting, prediction, or self-citation. Model-level regularities (variance delocalization via key incoherence) are presented as observed patterns across tested models rather than theorems or fitted predictions derived from the paper's equations. No load-bearing self-citations, uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results appear in the derivation chain. The central claims remain independent of the empirical tests and do not reduce to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the algebraic properties of softmax and empirical observations in tested models; new concepts (energy field, key incoherence) are introduced without independent evidence outside the paper.

axioms (1)
  • domain assumption Softmax attention maps every query-key interaction into a probability distribution
    Stated as the starting point in the abstract.
invented entities (2)
  • energy field no independent evidence
    purpose: Row-centered attention logit to expose invariant properties
    Newly defined in the paper to reveal structure across models and inputs.
  • key incoherence no independent evidence
    purpose: Property of the key matrix explaining variance delocalization
    Introduced to account for the observed model-level regularity.

pith-pipeline@v0.9.0 · 5470 in / 1345 out tokens · 67486 ms · 2026-05-10T19:50:53.969580+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 4 internal anchors

  1. [1]

    GQA: Training generalized multi-query transformer models from multi-head checkpoints

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP, 2023

  2. [2]

    The Statistical Analysis of Compositional Data

    John Aitchison. The Statistical Analysis of Compositional Data. Chapman & Hall, 1986

  3. [3]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, 2023

  4. [4]

    Exact matrix completion via convex optimization

    Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009

  5. [5]

    Rethinking attention with performers

    Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In International Conference on Learning Representations, 2021

  6. [6]

    Phi-2: The surprising power of small language models

    Mojan Javaheripi et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023

  7. [7]

    Mistral 7B

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023

  8. [8]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, 2020

  9. [9]

    Compressible softmax-attended language under incompressible attention

    Wonsuk Lee. Compressible softmax-attended language under incompressible attention. arXiv preprint, 2026

  10. [10]

    A Wavelet Tour of Signal Processing: The Sparse Way

    Stéphane Mallat. A Wavelet Tour of Signal Processing: The Sparse Way. Academic Press, 3rd edition, 2009

  11. [11]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Technical Report, 2019

  12. [12]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  13. [13]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  14. [14]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

  15. [15]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In International Conference on Learning Representations, 2024

  16. [16]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  17. [17]

    OPT: Open Pre-trained Transformer Language Models

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022