Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

P. J. Narayanan; Tejas Pradeep Shirodkar

arxiv: 2606.19491 · v1 · pith:5Q2EVXA4new · submitted 2026-06-17 · 💻 cs.LG · stat.ML

Pith reviewed 2026-06-26 21:12 UTC · model grok-4.3

keywords: LayerNorm · dead directions · activation covariance · Fisher information · singular directions · transformers · normalization

The pith

The inverse-scale direction of LayerNorm affine parameters is an exact algebraic kernel of the post-final-norm centred activation covariance for any input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in LayerNorm transformers a single direction taken from the scale parameters alone is exactly orthogonal to all centred post-norm activations, creating a dead direction in parameter space where the directional Fisher information vanishes. This identity holds for arbitrary input distributions because it follows directly from the mean-subtraction projector inside standard LayerNorm. The same direction is absent in RMSNorm models, which lack that projector, and the distinction is visible from the weights without any data pass. Because the direction is read in closed form, it supplies the cheapest diagnostic of singular structure at LLM scale and explains why the residual stream's smallest singular value stays stable block-to-block in most measured models.

Core claim

The inverse-scale direction γ^{-1}/‖γ^{-1}‖ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm.

What carries the argument

The inverse-scale direction γ^{-1}/‖γ^{-1}‖ of the LayerNorm affine parameter, which is an algebraic kernel of the centred post-norm covariance due to the mean-subtraction projector.
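As a minimal sketch of how cheap this read is, the direction can be computed from the scale vector alone. The function name `inverse_scale_direction` and the example values of `gamma` are illustrative assumptions, not taken from the paper; the sketch also assumes every entry of γ is nonzero.

```python
import numpy as np

def inverse_scale_direction(gamma):
    """Unit vector along the elementwise inverse of a LayerNorm scale vector.

    Hypothetical helper for illustration; assumes no entry of gamma is zero.
    """
    gamma = np.asarray(gamma, dtype=np.float64)
    if np.any(gamma == 0.0):
        raise ValueError("gamma has a zero entry, so its elementwise inverse is undefined")
    inv = 1.0 / gamma
    return inv / np.linalg.norm(inv)

# Made-up scale vector, used only to show the call.
gamma = np.array([0.5, 1.0, 2.0, 4.0])
u = inverse_scale_direction(gamma)
print(u)
```

No activations, gradients, or eigendecomposition enter the computation; the only input is the stored scale parameter.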

If this is right

The residual stream's smallest singular value is preserved block-to-block on 13 of the 14 measured transformers on their own input distribution.
The presence or absence of the predicted kernel direction classifies a transformer as LayerNorm or RMSNorm from its parameters alone.
At random initialization the predicted direction matches the measured bottom singular direction of the activation covariance to four decimal places on all nine LayerNorm models.
On trained checkpoints the eigenvalue along the kernel direction deepens by roughly three orders of magnitude, opening additional dead directions.
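The random-initialisation comparison and the RMSNorm contrast in the list above can be imitated on a toy scale. The sketch below is not the paper's protocol: it applies a single normalisation layer to synthetic Gaussian inputs with a randomly drawn γ, and the helper names (`layer_norm`, `rms_norm`, `bottom_direction`) are assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, gamma, eps=1e-5):
    # Standard LayerNorm without bias: subtract the mean, divide by the std, scale.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps)

def rms_norm(x, gamma, eps=1e-5):
    # RMSNorm: no mean subtraction, divide by the root mean square, scale.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def bottom_direction(y):
    # Smallest eigenvalue and eigenvector of the centred covariance of y.
    yc = y - y.mean(axis=0, keepdims=True)
    cov = yc.T @ yc / (len(y) - 1)
    evals, evecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return evals[0], evecs[:, 0]

rng = np.random.default_rng(0)
n, d = 8192, 32
x = rng.normal(size=(n, d))            # synthetic inputs, not model activations
gamma = rng.uniform(0.5, 2.0, size=d)  # made-up scale vector

u = (1.0 / gamma) / np.linalg.norm(1.0 / gamma)  # predicted direction

lam_ln, v_ln = bottom_direction(layer_norm(x, gamma))
lam_rms, v_rms = bottom_direction(rms_norm(x, gamma))

cos_ln = abs(u @ v_ln)    # alignment of prediction with measured bottom direction
cos_rms = abs(u @ v_rms)
print(f"LayerNorm: lambda_min={lam_ln:.3e}, |cos|={cos_ln:.6f}")
print(f"RMSNorm:   lambda_min={lam_rms:.3e}, |cos|={cos_rms:.6f}")
```

Under these assumptions the LayerNorm covariance has a bottom eigenvalue at the floating-point floor with its eigenvector aligned to the predicted direction, while the RMSNorm covariance has no such kernel.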

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Checkpoints can be scanned for normalization type and for the depth of this singular structure without running inference or knowing the training code.
The algebraic identity supplies a coordinate along which one could add a regularizer that penalizes further deepening of the dead direction during continued training.
Because the kernel is present from initialization, any training procedure that preserves or amplifies it is automatically selecting for singular minima along a known axis.

Load-bearing premise

The derivation requires the precise mean-subtraction projector inside standard LayerNorm and covariance taken after the final normalization step.

What would settle it

Compute the centred activation covariance matrix after the final LayerNorm on any input distribution and verify whether its quadratic form along γ^{-1}/‖γ^{-1}‖ is numerically zero.
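A toy version of this check, assuming a single standalone LayerNorm with bias rather than the final norm of a real transformer, might look like the following. The heavy-tailed input distribution, the dimensions, and all variable names are illustrative choices, not the paper's setup.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm: mean subtraction, variance normalisation, affine map.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
n, d = 4096, 64
x = rng.lognormal(size=(n, d))         # deliberately non-Gaussian inputs
gamma = rng.uniform(0.5, 2.0, size=d)  # made-up scale
beta = rng.normal(size=d)              # made-up bias

y = layer_norm(x, gamma, beta)
yc = y - y.mean(axis=0, keepdims=True)  # centre the post-norm activations

u = (1.0 / gamma) / np.linalg.norm(1.0 / gamma)

# u^T C u equals the sample variance of the projection of y onto u.
proj = yc @ u
quad_form = proj @ proj / (n - 1)

# Largest covariance eigenvalue, used only to give the result a scale.
cov = yc.T @ yc / (n - 1)
lam_max = np.linalg.eigvalsh(cov)[-1]

print(f"u^T C u = {quad_form:.3e}, relative to lambda_max: {quad_form / lam_max:.3e}")
```

The quadratic form should sit at the floating-point floor whatever distribution `x` is drawn from; swapping in a normalisation without mean subtraction is the natural negative control.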

Figures

Figures reproduced from arXiv: 2606.19491 by P. J. Narayanan, Tejas Pradeep Shirodkar.

**Figure 1.** The result in three pictures. (a) A dead direction u at θ_0 ∈ Σ_T is a unit vector along which the Fisher metric degenerates: K(θ_0 + tu) = c t^{2k} vanishes faster than quadratic, so u^⊤ F(θ_0 + tu) u = Θ(t^{2(k−1)}). The framework of Shirodkar (2026) derives such directions in closed form from a network's affine parameters. (b) For LayerNorm-equipped pretrained transformers the predicted direction is γ^{-1}/‖γ^{-1}‖,…

**Figure 2.** Pythia-1B σ_min developmental arc across 8 pretraining revisions (step 1 → 143,000). (a) Residual-stream σ_min depth profile, one curve per revision; the bottom curve is step1 (initialisation), the top curves are mature checkpoints. (b) Ratio σ_min(X_ℓ)/σ_min(X_0) exceeds 1 at every depth on every revision, so Corollary 4 holds across pretraining time, not only at the mature checkpoint. The ratio amplitude ris…

**Figure 3.** Residual-stream σ_min depth-invariance against the true input embedding X_0 (Corollary 58). (a) σ(r0)(X_ℓ)/σ(r0)(X_0) versus normalized residual-stream depth on the 13 sub-layer pipeline models under the uniform text-only protocol (fp32 forward, fp64 covariance, X_0 reference); 11 stay at or above 1 at every block, and the only two that dip below are the Gemma 4 releases (V-JEPA 2, the 14th model in Tab. 2, p…

**Figure 4.** Pythia-1B σ_min developmental arc across 8 pretraining revisions (step 1 → step 143k). (a) Residual-stream depth profile per revision, viridis-colored by step. (b) Last residual-stream block ratio (against the X_0 embedding) rises ∼12× → ∼299× (peak at step50000), settling to ∼206× during pretraining: depth-invariance is dynamic, not initialization. only the FFN sub-layers (mlp.fc1, mlp.fc2) of all 12 bloc…

**Figure 5.** ViT FFN fine-tuning preserves Corollary 58. (a) DINOv2-base depth profile of …

**Figure 6.** Schur-ratio ∆γ = 1 − R(h) along u = γ^{-1}/‖γ^{-1}‖ at the QKV input (Type A site). Left: cross-model summary (random-init and trained-checkpoint pairs); Pythia random-init values are 5-seed pooled medians with noise-floor recovery applied at sites where both λ_min(A_ℓ) and (A_ℓ)_{u,u} sit at the fp64 cov floor. LayerNorm Pythia at random init clusters at ∆γ ≈ 0 for h ≥ 768 (the framework's algebraic prediction); t…

**Figure 7.** Per-layer rate structure test on TinyLlama- …

Original abstract

Pretrained transformers sit near singular minima of the loss, where the Fisher information metric degenerates along dead directions: directions in parameter space along which the directional Fisher vanishes. Locating such a direction normally needs a forward pass and an eigendecomposition of activations, or a sampling-based complexity estimate; none returns a direction computable from the network's parameters alone. We give one, for LayerNorm transformers. The inverse-scale direction $\gamma^{-1}/\|\gamma^{-1}\|$ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm. We test it on $14$ pretrained transformers ($9$ LayerNorm, $5$ RMSNorm; $160$M-$35$B; language and vision objectives). At random initialisation the predicted direction matches the measured bottom singular direction (one forward pass, direct SVD) to four decimal places on $9/9$ LayerNorm models, and is correctly absent on $5/5$ RMSNorm models, which lack the mean-subtraction projector that creates it. On the trained checkpoint the covariance eigenvalue along this direction deepens by ${\sim}10^3\times$ and further dead directions open; the random-init-to-trained gap is a one-forward-pass, per-checkpoint readout of singular structure along the predicted coordinate. Two consequences follow in closed form: the residual stream's smallest singular value is preserved block-to-block on $13/14$ transformers measured on their own input distribution, the one exception (Gemma$4$-$31$B) a genuine dead direction the same read pinpoints; and the kernel direction's presence classifies a transformer's normalisation from the parameters alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a parameter-only algebraic kernel for a dead direction in LayerNorm transformers that follows directly from the mean subtraction and matches the tests.

The main takeaway is that after standard LayerNorm the post-norm activations have an exact zero-variance direction given by the inverse of the gamma scale vector. This holds for any input because the pre-affine normalized vector sums exactly to zero, forcing the dot product with gamma inverse to be constant.
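Written out, the argument in the preceding paragraph is a short calculation. The sketch below uses the standard LayerNorm definition; the symbols P (mean-subtraction projector), ε (numerical stabiliser), β (LayerNorm bias) and d (hidden width) are notation assumed here rather than quoted from the paper, and every γ_i is taken to be nonzero.

```latex
\begin{align*}
P &= I - \tfrac{1}{d}\mathbf{1}\mathbf{1}^\top, \qquad
z = \frac{P x}{\sqrt{\tfrac{1}{d}\|P x\|^2 + \varepsilon}}, \qquad
y = \gamma \odot z + \beta, \\
\mathbf{1}^\top z &= \frac{\mathbf{1}^\top P x}{\sqrt{\tfrac{1}{d}\|P x\|^2 + \varepsilon}} = 0
  \quad \text{since } \mathbf{1}^\top P = 0, \\
(\gamma^{-1})^\top y &= \sum_{i=1}^{d} \gamma_i^{-1}\,(\gamma_i z_i + \beta_i)
  = \mathbf{1}^\top z + (\gamma^{-1})^\top \beta
  = (\gamma^{-1})^\top \beta, \\
(\gamma^{-1})^\top \operatorname{Cov}(y)\,\gamma^{-1}
  &= \operatorname{Var}\!\big[(\gamma^{-1})^\top y\big] = 0 .
\end{align*}
```

Because Cov(y) is positive semidefinite, a zero quadratic form implies Cov(y) γ^{-1} = 0, so γ^{-1} lies in the kernel. For RMSNorm, P is replaced by the identity, so 1^⊤ z is proportional to 1^⊤ x, which is not zero in general, and the argument does not go through.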

What is new is extracting this kernel straight from the parameters with no forward pass or eigendecomposition. The experiments show the predicted direction matches the measured bottom singular vector to four decimals at initialization on all nine LayerNorm models and is absent on all five RMSNorm models, which lack the mean projector. On trained checkpoints the eigenvalue along this direction shrinks further by orders of magnitude, and the paper notes block-to-block singular-value preservation on thirteen of the fourteen models.

The algebraic identity and the clean separation from RMSNorm are the strongest parts. The checks are straightforward and the architectural dependence is precise.

The softer spots are in the extension to parameter-space dead directions and singular minima. The activation covariance kernel is shown exactly, but the steps connecting it to the Fisher metric or loss landscape are sketched rather than derived in full. The trained-model observations are consistent but remain descriptive.

This is for readers working on transformer internals, normalization effects, or cheap geometric diagnostics at scale. Anyone tracking how architecture choices shape optimization geometry will find the parameter-only readout useful.

It deserves a serious referee. The central claim is exact and the evidence lines up, even if some downstream implications could be tightened.

Referee Report

0 major / 3 minor

Summary. The paper asserts that in transformers using standard LayerNorm, the normalized inverse-scale direction γ^{-1}/‖γ^{-1}‖ of the LayerNorm affine parameters is an exact algebraic kernel of the centered covariance of post-final-norm activations for any input distribution. This follows directly from the mean-subtraction projector enforced by LayerNorm (absent in RMSNorm), forcing zero variance along that direction. The identity is tested on 14 pretrained models (9 LayerNorm, 5 RMSNorm; 160M–35B parameters) at random initialization, where it matches the measured bottom singular direction to four decimal places on all LayerNorm models and is correctly absent on all RMSNorm models. On trained checkpoints the eigenvalue along the direction deepens by ~10^3×; closed-form consequences include block-to-block preservation of the residual stream's smallest singular value on 13/14 models.

Significance. If the algebraic identity holds, the result supplies a parameter-only, forward-pass-free and eigensolve-free diagnostic for a dead direction in the Fisher metric of LayerNorm transformers. The exact match at initialization across scales and the clean separation from RMSNorm models provide strong, architecture-specific evidence. The block-to-block singular-value preservation is a direct, falsifiable downstream prediction that follows in closed form from the kernel property.

minor comments (3)

§3 (derivation): the step from the mean-zero property of z to Var(v·y)=0 is immediate, but the subsequent claim that this induces a dead direction in parameter space would benefit from an explicit one-line link to the directional Fisher (even if standard).
Table 1: the four-decimal-place match is reported via cosine similarity; stating the precise numerical values of the bottom singular vector components for one representative model would make the match more transparent.
§5.2: the statement that the random-init-to-trained gap is 'a one-forward-pass readout' is clear, but the precise definition of the gap (difference in log-eigenvalue or ratio) should be given explicitly.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of the manuscript, the assessment of its significance, and the recommendation to accept. The report correctly identifies the algebraic identity, its parameter-only nature, the empirical verification across scales and normalization types, and the closed-form downstream predictions.

Circularity Check

0 steps flagged

No significant circularity; algebraic identity from LayerNorm definition

The central claim is an exact algebraic identity: the mean-subtraction projector in standard LayerNorm forces ∑z_i=0 on the pre-affine vector z, so the projection of the post-affine output y = γ⊙z + β (with β the LayerNorm bias) onto γ^{-1} equals the constant γ^{-1}·β for every input. The centred output is therefore orthogonal to γ^{-1}, which gives zero variance along that direction in the centred covariance for any input distribution. This follows immediately from the definition of LayerNorm without any fitted parameters, predictions, or self-citations. The paper explicitly notes the identity fails for RMSNorm (which lacks the projector) and confirms this experimentally on 5 RMSNorm models. No load-bearing step reduces to a fit, self-citation chain, or ansatz; the derivation is self-contained against the architecture equations and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the mathematical definition of LayerNorm (including mean subtraction) and the definition of centred activation covariance; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption: LayerNorm includes explicit mean subtraction that creates a projector absent in RMSNorm. The kernel identity is stated to hold only for LayerNorm and to be correctly absent for RMSNorm models.
domain assumption: Covariance is computed on activations after the final normalization layer. The algebraic kernel is defined with respect to the post-final-norm centred covariance.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks
cs.LG 2026-06 unverdicted novelty 7.0

Dead-Direction Conditioners provide gauge-equivariant preconditioning by conditioning optimizer state on symmetry orbits, yielding improved resistance to over-training collapse and higher detection of dead directions ...
Dead-Direction Signatures: A Cheap Spectral Reading of Singular Complexity
cs.LG 2026-06 unverdicted novelty 7.0

Dead-Direction Signatures provide closed-form spectral readings of dead directions in network activations and gradients that track rank deficits at singular minima, offering a cheap directional alternative to SGLD-based LLC.

Reference graph

Works this paper leans on

54 extracted references · 4 canonical work pages · cited by 2 Pith papers

[1] S.-i. Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer, 2016. URL https://link.springer.com/book/10.1007/978-4-431-55978-8
[2] S.-i. Amari, H. Park, and T. Ozeki. Singularities affect dynamics of learning in neuromanifolds. Neural Computation, 18(5):1007–1065, 2006. URL https://doi.org/10.1162/neco.2006.18.5.1007
[3] S. Ashkboos, M. L. Croci, M. G. do Nascimento, T. Hoefler, and J. Hensman. SliceGPT: Compress large language models by deleting rows and columns. In International Conference on Learning Representations (ICLR), 2024.
[4] M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, A. Kamoun, X. Chen, K. Sinha, Y. LeCun, M. Rabbat, and N. Ballas. V-JEPA 2: Self-supervised video models enable understanding, prediction, and planning. arXiv preprint arXiv:2506.09985, 2025.
[5] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling. ICML, 2023.
[6] E. Boix-Adsera, E. Littwin, E. Abbe, S. Bengio, and J. Susskind. Transformers learn through gradual rank increase. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.07042
[7] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), 2021. arXiv:2103.00065; documents the edge-of-stability regime where the top Hessian eigenvalue stabilises near 2/η. Standard sharpness-side phase-transi...
[8] Y. Dong, J.-B. Cordonnier, and A. Loukas. Attention is not all you need: pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), 2021. URL https://arxiv.org/abs/2103.03404
[9] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] R. Eschenhagen, A. Immer, R. E. Turner, F. Schneider, and P. Hennig. Kronecker-factored approximate curvature for modern neural network architectures. In NeurIPS, 2023.
[11] D. Ettori, N. Darabi, S. Tayebati, R. Krishnan, M. Subedar, O. Tickoo, and A. R. Trivedi. EigenTrack: Spectral activation feature tracking for hallucination and out-of-distribution detection in LLMs and VLMs. arXiv:2509.15735, 2025.
[12] T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In NeurIPS, 2018.
[13] B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In ICML, 2019.
[14] R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, 2016. URL https://arxiv.org/abs/1602.01407
[15] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022.
[16] H. Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79(1):109–326, 1964. URL https://www.jstor.org/stable/1970486
[17] J. Hoogland, G. Wang, M. Farrugia-Roberts, L. Carroll, S. Wei, and D. Murfet. Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research, 2024. URL https://arxiv.org/abs/2402.02364
[18] N. K. Jha and B. Reagen. NerVE: Nonlinear eigenspectrum dynamics in LLM feed-forward networks. arXiv:2603.06922, 2026.
[19] R. Karakida, S. Akaho, and S.-i. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. In AISTATS, 2019.
[20] R. Karakida, S. Akaho, and S.-i. Amari. Pathological spectra of the Fisher information metric and its variants in deep neural networks. Neural Computation, 33(8):2274–2307, 2021.
[21] T. X. Khanh, T. Q. Hoa, L. D. Trung, and P. T. Duc. Spectral entropy collapse as an empirical signature of delayed generalisation in grokking. arXiv:2604.13123, 2026.
[22] J. Kim, B. Lee, C. Park, Y. Oh, B. Kim, T. Yoo, S. Shin, D. Han, J. Shin, and K. M. Yoo. Peri-LN: Revisiting normalization layer in the transformer architecture. arXiv preprint, 2025. URL https://arxiv.org/abs/2502.02732. Names the pre-norm + post-norm pattern "Peri-LN" and analyses its effect on activation magnitudes (linear vs exponential growth) and...
[23] F. Kunstner, L. Balles, and P. Hennig. Limitations of the empirical Fisher approximation for natural gradient descent. In NeurIPS, 2019. URL https://arxiv.org/abs/1905.12558
[24] E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei. The local learning coefficient: A singularity-aware complexity measure. In AISTATS, 2025. URL https://proceedings.mlr.press/v258/lau25a.html
[25] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015. URL https://arxiv.org/abs/1503.05671
[26] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023. URL https://arxiv.org/abs/2301.05217
[27] L. Noci, S. Anagnostidis, L. Biggio, A. Orvieto, S. P. Singh, and A. Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2206.03126
[28] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. In-context learning and induction heads. Transformer Cir...
[29] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[30] V. Papyan. Traces of class/cross-class structure pervade deep learning spectra. JMLR, 21(252):1–64, 2020. URL https://jmlr.org/papers/volume21/20-933/20-933.pdf
[31] V. Papyan, X. Y. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020. URL https://doi.org/10.1073/pnas.2015509117
[32] J. Pennington and P. Worah. The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In NeurIPS, 2018.
[33] S. Plummer. Singular fluctuation as specific heat in Bayesian learning. arXiv:2512.21411, 2025.
[34] P. M. Riechers. Geometry and dynamics of LayerNorm. arXiv preprint arXiv:2405.04134, 2024. URL https://arxiv.org/abs/2405.04134
[35] O. Roy and M. Vetterli. The effective rank: A measure of effective dimensionality. 15th European Signal Processing Conference (EUSIPCO), pages 606–610, 2007.
[36] L. Sagun, U. Evci, V. U. Güney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. In ICLR Workshop, 2018. arXiv:1706.04454
[37] A. Sanyal, P. H. S. Torr, and P. K. Dokania. Stable rank normalization for improved generalization in neural networks and GANs. In ICLR, 2020.
[38] T. P. Shirodkar. Dead directions: Geometric singular learning, 2026. URL https://arxiv.org/abs/2606.05957
[39] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
[40] M. Staats, M. Thamm, and B. Rosenow. Small singular values matter: A random matrix analysis of transformer models. arXiv preprint arXiv:2410.17770, 2024. URL https://arxiv.org/abs/2410.17770
[41] M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models. In COLM, 2024. URL https://arxiv.org/abs/2402.17762
[42] G. Team. Gemma 2: Improving open language models at a practical size. Technical report, Google DeepMind, 2024a. URL https://arxiv.org/abs/2408.00118. Documents the Peri-LN sandwich-norm pattern (pre-norm + post-norm RMSNorm) per-block in Gemma 2's Table 1.
[43] G. Team. Gemma: Open models based on gemini research and technology. Technical report, Google DeepMind, 2024b. Original Gemma 1 release; pre-LN only RMSNorm with no post-norm wrapping per block; https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
[44] G. Team. Gemma 3 technical report. Technical report, Google DeepMind, 2025. URL https://arxiv.org/abs/2503.19786. Retains the sandwich-norm pattern of Gemma 2 and replaces logit soft-capping with QK-norm.
[45] G. Team. Gemma 4 technical report. Technical report, Google DeepMind, 2026a. URL https://ai.google.dev/gemma/docs/core. Including E-variants with per-layer-embedding factorisation for on-device deployment.
[46] Q. Team. Qwen3.5 technical report. Technical report, Alibaba, 2026b. Qwen3.5 series model cards and architecture details on HuggingFace Hub; https://huggingface.co/Qwen
[47] Q. Team. Qwen3.6 technical report. Technical report, Alibaba, 2026c. Qwen3.6 series (dense + sparse MoE) model cards and architecture details on HuggingFace Hub; https://huggingface.co/Qwen
[48] M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. Hénaff, J. Harmsen, A. Steiner, and X. Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
[49] G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, and D. Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient, 2024. URL https://arxiv.org/abs/2410.02984
[50] J. Wang, X. Ge, W. Shu, Z. He, and X. Qiu. Dimensional collapse in transformer attention outputs: A challenge for sparse dictionary learning. arXiv preprint arXiv:2508.16929, 2025. URL https://arxiv.org/abs/2508.16929
[51] S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009. URL https://doi.org/10.1017/CBO9780511800474
[52] S. Watanabe. Mathematical Theory of Bayesian Statistics. CRC Press, 2018. URL https://www.routledge.com/9781482238068
[53] Y. Xu. Spectral edge dynamics of training trajectories: Signal–noise geometry across scales. arXiv:2603.15678, 2026.
[54] Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. In IEEE BigData, 2020.

[1] [1]

S.-i. Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer, 2016. URL https://link.springer.com/book/10.1007/978-4-431-55978-8

work page doi:10.1007/978-4-431-55978-8 2016

[2] [2]

Amari, H

S.-i. Amari, H. Park, and T. Ozeki. Singularities affect dynamics of learning in neuromanifolds. Neural Computation, 18 0 (5): 0 1007--1065, 2006. URL https://doi.org/10.1162/neco.2006.18.5.1007

work page doi:10.1162/neco.2006.18.5.1007 2006

[3] [3]

Ashkboos, M

S. Ashkboos, M. L. Croci, M. G. do Nascimento, T. Hoefler, and J. Hensman. SliceGPT : Compress large language models by deleting rows and columns. In International Conference on Learning Representations (ICLR), 2024

2024

[4] [4]

Assran, A

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, A. Kamoun, X. Chen, K. Sinha, Y. LeCun, M. Rabbat, and N. Ballas. V-JEPA 2 : Self-supervised video models enable understanding, prediction, and planning. arXiv preprint arXiv:2506.09985, 2025

Pith/arXiv arXiv 2025

[5] [5]

Biderman, H

S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling. ICML, 2023

2023

[6] [6]

Boix-Adsera, E

E. Boix-Adsera, E. Littwin, E. Abbe, S. Bengio, and J. Susskind. Transformers learn through gradual rank increase. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.07042

arXiv 2023

[7] [7]

J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), 2021. arXiv:2103.00065; documents the edge-of-stability regime where the top Hessian eigenvalue stabilises near 2/ . Standard sharpness-side phase-transi...

arXiv 2021

[8] [8]

Dong, J.-B

Y. Dong, J.-B. Cordonnier, and A. Loukas. Attention is not all you need: pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), 2021. URL https://arxiv.org/abs/2103.03404

arXiv 2021

[9] [9]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021

2021

[10] [10]

Eschenhagen, A

R. Eschenhagen, A. Immer, R. E. Turner, F. Schneider, and P. Hennig. K ronecker-factored approximate curvature for modern neural network architectures. In NeurIPS, 2023

2023

[11] [11]

Ettori, N

D. Ettori, N. Darabi, S. Tayebati, R. Krishnan, M. Subedar, O. Tickoo, and A. R. Trivedi. EigenTrack : Spectral activation feature tracking for hallucination and out-of-distribution detection in LLM s and VLM s. arXiv:2509.15735, 2025

arXiv 2025

[12] [12]

George, C

T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent. Fast approximate natural gradient descent in a K ronecker-factored eigenbasis. In NeurIPS, 2018

2018

[13] [13]

Ghorbani, S

B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via H essian eigenvalue density. In ICML, 2019

2019

[14] [14]

Grosse and J

R. Grosse and J. Martens. A K ronecker-factored approximate F isher matrix for convolution layers. In ICML, 2016. URL https://arxiv.org/abs/1602.01407

Pith/arXiv arXiv 2016

[15] [15]

K. He, X. Chen, S. Xie, Y. Li, P. Doll \'a r, and R. Girshick. Masked autoencoders are scalable vision learners. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pages 16000--16009, 2022

2022

[16] H. Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79(1):109–326, 1964. URL https://www.jstor.org/stable/1970486

[17] J. Hoogland, G. Wang, M. Farrugia-Roberts, L. Carroll, S. Wei, and D. Murfet. Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research, 2024. URL https://arxiv.org/abs/2402.02364

[18] N. K. Jha and B. Reagen. NerVE: Nonlinear eigenspectrum dynamics in LLM feed-forward networks. arXiv:2603.06922, 2026.

[19] R. Karakida, S. Akaho, and S.-i. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. In AISTATS, 2019.

[20] R. Karakida, S. Akaho, and S.-i. Amari. Pathological spectra of the Fisher information metric and its variants in deep neural networks. Neural Computation, 33(8):2274–2307, 2021.

[21] T. X. Khanh, T. Q. Hoa, L. D. Trung, and P. T. Duc. Spectral entropy collapse as an empirical signature of delayed generalisation in grokking. arXiv:2604.13123, 2026.

[22] J. Kim, B. Lee, C. Park, Y. Oh, B. Kim, T. Yoo, S. Shin, D. Han, J. Shin, and K. M. Yoo. Peri-LN: Revisiting normalization layer in the transformer architecture. arXiv preprint, 2025. URL https://arxiv.org/abs/2502.02732. Names the pre-norm + post-norm pattern "Peri-LN" and analyses its effect on activation magnitudes (linear vs exponential growth) and...

[23] F. Kunstner, L. Balles, and P. Hennig. Limitations of the empirical Fisher approximation for natural gradient descent. In NeurIPS, 2019. URL https://arxiv.org/abs/1905.12558

[24] E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei. The local learning coefficient: A singularity-aware complexity measure. In AISTATS, 2025. URL https://proceedings.mlr.press/v258/lau25a.html

[25] J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015. URL https://arxiv.org/abs/1503.05671

[26] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023. URL https://arxiv.org/abs/2301.05217

[27] L. Noci, S. Anagnostidis, L. Biggio, A. Orvieto, S. P. Singh, and A. Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2206.03126

[28] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah. In-context learning and induction heads. Transformer Cir...

[29] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[30] V. Papyan. Traces of class/cross-class structure pervade deep learning spectra. JMLR, 21(252):1–64, 2020. URL https://jmlr.org/papers/volume21/20-933/20-933.pdf

[31] V. Papyan, X. Y. Han, and D. L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020. URL https://doi.org/10.1073/pnas.2015509117

[32] J. Pennington and P. Worah. The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In NeurIPS, 2018.

[33] S. Plummer. Singular fluctuation as specific heat in Bayesian learning. arXiv:2512.21411, 2025.

[34] P. M. Riechers. Geometry and dynamics of LayerNorm. arXiv preprint arXiv:2405.04134, 2024. URL https://arxiv.org/abs/2405.04134

[35] O. Roy and M. Vetterli. The effective rank: A measure of effective dimensionality. In 15th European Signal Processing Conference (EUSIPCO), pages 606–610, 2007.

[36] L. Sagun, U. Evci, V. U. Güney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. In ICLR Workshop, 2018. arXiv:1706.04454

[37] A. Sanyal, P. H. S. Torr, and P. K. Dokania. Stable rank normalization for improved generalization in neural networks and GANs. In ICLR, 2020.

[38] T. P. Shirodkar. Dead directions: Geometric singular learning, 2026. URL https://arxiv.org/abs/2606.05957

[39] O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

[40] M. Staats, M. Thamm, and B. Rosenow. Small singular values matter: A random matrix analysis of transformer models. arXiv preprint arXiv:2410.17770, 2024. URL https://arxiv.org/abs/2410.17770

[41] M. Sun, X. Chen, J. Z. Kolter, and Z. Liu. Massive activations in large language models. In COLM, 2024. URL https://arxiv.org/abs/2402.17762

[42] Gemma Team. Gemma 2: Improving open language models at a practical size. Technical report, Google DeepMind, 2024a. URL https://arxiv.org/abs/2408.00118. Documents the Peri-LN sandwich-norm pattern (pre-norm + post-norm RMSNorm) per-block in Gemma 2's Table 1.

[43] Gemma Team. Gemma: Open models based on Gemini research and technology. Technical report, Google DeepMind, 2024b. URL https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf. Original Gemma 1 release; pre-LN only RMSNorm with no post-norm wrapping per block.

[44] Gemma Team. Gemma 3 technical report. Technical report, Google DeepMind, 2025. URL https://arxiv.org/abs/2503.19786. Retains the sandwich-norm pattern of Gemma 2 and replaces logit soft-capping with QK-norm.

[45] Gemma Team. Gemma 4 technical report. Technical report, Google DeepMind, 2026a. URL https://ai.google.dev/gemma/docs/core. Including E-variants with per-layer-embedding factorisation for on-device deployment.

[46] Qwen Team. Qwen3.5 technical report. Technical report, Alibaba, 2026b. URL https://huggingface.co/Qwen. Qwen3.5 series model cards and architecture details on HuggingFace Hub.

[47] Qwen Team. Qwen3.6 technical report. Technical report, Alibaba, 2026c. URL https://huggingface.co/Qwen. Qwen3.6 series (dense + sparse MoE) model cards and architecture details on HuggingFace Hub.

[48] M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. Hénaff, J. Harmsen, A. Steiner, and X. Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[49] G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, and D. Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient, 2024. URL https://arxiv.org/abs/2410.02984

[50] J. Wang, X. Ge, W. Shu, Z. He, and X. Qiu. Dimensional collapse in transformer attention outputs: A challenge for sparse dictionary learning. arXiv preprint arXiv:2508.16929, 2025. URL https://arxiv.org/abs/2508.16929

[51] S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009. URL https://doi.org/10.1017/CBO9780511800474

[52] S. Watanabe. Mathematical Theory of Bayesian Statistics. CRC Press, 2018. URL https://www.routledge.com/9781482238068

[53] Y. Xu. Spectral edge dynamics of training trajectories: Signal–noise geometry across scales. arXiv:2603.15678, 2026.

[54] Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. In IEEE BigData, 2020.