arxiv: 2607.00603 · v1 · pith:L3IQIX6Knew · submitted 2026-07-01 · 💻 cs.LG

Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment

Pith reviewed 2026-07-02 15:53 UTC · model grok-4.3

classification 💻 cs.LG
keywords: dead directions · directional-Fisher rate · singularity order · learning coefficient · singular learning theory · neural network structure · Watanabe triple

The pith

The directional-Fisher rate recovers the order k of each dead direction at a single frozen checkpoint in any basis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method to measure and decompose singular structure in trained neural networks without continuing descent or forcing canonical alignment. It identifies the directional-Fisher rate as the invariant quantity that directly yields the singularity order k for each dead direction, from which the local learning coefficient follows as 1/(2k). The same quantity classifies directions into architecture-fixed singularities versus flat gauge symmetries and assembles the per-direction orders into the global learning coefficient through typed intersections of loci. The approach applies via a pluggable detector to transformers, convolutions, and normalisation layers, recovering predicted orders in both constructed cells and actual trained models. If the central claim holds, singularity analysis becomes a deterministic post-training reading rather than a preconditioned optimisation task.

Core claim

At a single frozen checkpoint the read recovers the order k of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient 1/(2k) follows exactly, in whatever basis the optimizer left. The same read classifies each direction, separating a genuine singularity whose order the architecture fixes from a flat gauge symmetry; the directional-Fisher magnitude settles the cases the order cannot. Where the singular structure enumerates, the per-direction orders assemble through the typed intersection of the loci into the global coefficient matching the closed form.

What carries the argument

The directional-Fisher rate as the master invariant that determines singularity order k and the associated learning coefficient independently of basis choice.
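As an illustration of what reading an order from a rate could look like, here is a minimal sketch on the toy cell $y = a s^k + \varepsilon$ with unit-variance Gaussian noise. It assumes the Fisher information along $s$ scales as $s^{2(k-1)}$ near the dead point, so the log-log slope gives $k = 1 + \text{slope}/2$. That scaling convention, and the helper names `directional_fisher` and `read_order`, are assumptions made for this example; the text reviewed here does not define the paper's rate.

```python
import numpy as np

def directional_fisher(s, a=1.0, k=3):
    """Fisher information along s for y = a * s**k + unit-variance Gaussian noise."""
    grad = a * k * s ** (k - 1)          # derivative of the mean with respect to s
    return grad ** 2

def read_order(fisher_fn, s_grid):
    """Fit the log-log slope of Fisher against offset and convert it to an order."""
    slope, _ = np.polyfit(np.log(s_grid), np.log(fisher_fn(s_grid)), 1)
    return 1.0 + slope / 2.0             # assumed convention: F ~ s^(2(k-1))

s_grid = np.logspace(-3, -1, 20)         # hypothetical probe offsets along the direction
for k_true in (2, 3):
    k_hat = read_order(lambda s: directional_fisher(s, k=k_true), s_grid)
    # Print the true order, the recovered order, and the implied coefficient 1/(2k).
    print(k_true, round(k_hat, 6), 1.0 / (2.0 * k_hat))
```

On an exact power law the fit is exact up to floating-point error; on a trained network the probe range and noise floor would decide whether a slope is readable at all.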

Load-bearing premise

The directional-Fisher rate functions as an invariant master quantity that determines singularity order independently of basis choice and without requiring descent or canonical alignment.

What would settle it

Measuring inconsistent recovered orders k for the same dead direction when the identical checkpoint is expressed in two different bases would show the rate is not basis-independent.
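A small numerical check of this kind can be sketched as follows. It assumes the directional Fisher is the quadratic form $v^\top F v$ of an empirical Fisher matrix, and that a change of basis is an orthogonal map $\theta' = Q\theta$; the Jacobian `J` is random placeholder data, not output from any real checkpoint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample output Jacobian (n_samples x n_params) at a frozen checkpoint.
J = rng.normal(size=(200, 6))
F = J.T @ J / J.shape[0]           # empirical Fisher for unit-variance Gaussian noise

v = rng.normal(size=6)
v /= np.linalg.norm(v)             # unit probe direction in the original basis

# Random orthogonal change of basis obtained from a QR factorisation.
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))

F_rot = Q @ F @ Q.T                # Fisher in the rotated coordinates theta' = Q theta
v_rot = Q @ v                      # the same direction expressed in the new basis

original = v @ F @ v
rotated = v_rot @ F_rot @ v_rot
print(original, rotated)           # the two values agree up to rounding
```

The check only covers the quadratic form once the direction is transformed along with the basis. Whether a detector run separately in each basis returns the same direction is a different question, and it is the part the proposed test would actually probe.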

Figures

Figures reproduced from arXiv: 2607.00603 by Tejas Pradeep Shirodkar.

Figure 1: Reading the order off canonical alignment. (a) A trained network leaves a dead …
Figure 2: The read in action, on a log-log directional Fisher against …
Figure 3: The recovered order tracks the structure present. The predicted order is the …
Figure 4: The orthogonaliser sets the basis. Dead-subspace axis-alignment by interior block …
Figure 5: The order fixes $\nu(k)$, the live structure absorbs it. Left: on the isolated order-$k$ cell $y = a s^k + \varepsilon$ the functional-variance $\hat{\nu}$ recovers the universality value $\nu(k)$ across training size $n$, data-averaged over 300 draws ($\nu(2)=0.173$, $\nu(3)=0.278$). Right: in the controlled cell $y = b g + s^k c + \varepsilon$, the dead direction's contribution to $\nu$ holds at $\nu(3)$ while the regular and dead bases stay distinct and collapses …
Figure 6: The one-pass census separates the optimizer families. The census learning-coefficient …
Figure 7: Dead subspace by depth (gauge-fixed d=8). The axis-aligned activation-side …
Figure 8: The gauge floors on the fine-tuned DINOv2 ViT-S/14. The LayerNorm kernel and …
Figure 9: The dead subspace emerges through grokking (gauge-fixed d=8). The dead …
Figure 10: The multi-component read at block 1 (d8 squared-ReLU, seed 42). Each panel scans the top eight directions of the dual-factor bottom subspace; a filled marker is an asymptotic component (the order $\hat{k}$ it reads), a cross a pre-asymptotic one (no order). Vanilla Muon forms no readable order; the scaled-polar orthogonaliser and the gauge both read the squared-ReLU order $k=3$ on nearly every component. The reada…
Figure 11: The dead-subspace axis-alignment at task size (1-block d128, 5 seeds). At matched …
Figure 12: The optimiser-axis cell. Left: on the deterministic squared-ReLU teacher–student …
Figure 13: The learning coefficient across granularities (grokked d8 squared-ReLU). The …
Figure 14: Assembling the global learning coefficient against the closed form. The per …
Original abstract

We give a descent-free, alignment-free measurement of singular structure on trained networks. At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient $1/(2k)$ follows exactly, in whatever basis the optimizer left. The same read classifies each direction, separating a genuine singularity, whose order the architecture fixes, from a flat gauge symmetry; the directional-Fisher magnitude settles the cases the order cannot. A pluggable detector supplies the directions for transformer, convolutional, and normalisation layers. The read recovers the architecture-predicted order across constructed cells and trained networks, including a fine-tuned vision transformer whose dead structure is the LayerNorm-kernel gauge and a from-scratch one whose compressed MLP forms a node-death at its activation order. Where the singular structure enumerates, the per-direction orders assemble, through the typed intersection of the loci, into the global coefficient $(\lambda, m)$ matching the closed form. The method removes the canonical-alignment and descent preconditions of the underlying rate result, turning order-recovery into a deterministic, architecture-general reading. We then map its reach into the Watanabe triple: the order determines the universal singular fluctuation $\nu(k)$, though a trained network's realized $\nu$ falls below it as the live structure absorbs the dead direction's data fluctuation, and the multiplicity recovers from the dominant structure under a single-locus assumption.
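The abstract does not spell out the typed-intersection rules, so the following is only a sketch of two textbook cases from singular learning theory in which per-direction orders combine in closed form: a separable sum $K = \sum_i s_i^{2k_i}$, where the coefficients add and the multiplicity is 1, and a normal-crossing product $K = \prod_i s_i^{2k_i}$, where the smallest coefficient wins and the multiplicity counts the directions attaining it. The function name `assemble` and the `kind` labels are hypothetical and are not the paper's typing scheme.

```python
from fractions import Fraction

def assemble(orders, kind):
    """Combine per-direction orders k_i into (lambda, m) for two textbook cases.

    kind="sum":     K = sum_i  s_i^(2 k_i)  -> lambda = sum_i 1/(2 k_i), m = 1
    kind="product": K = prod_i s_i^(2 k_i)  -> lambda = min_i 1/(2 k_i),
                                               m = number of directions attaining the min
    """
    lams = [Fraction(1, 2 * k) for k in orders]
    if kind == "sum":
        return sum(lams, Fraction(0)), 1
    if kind == "product":
        lam = min(lams)
        return lam, lams.count(lam)
    raise ValueError(f"unknown intersection kind: {kind}")

print(assemble([2, 3], "sum"))       # (Fraction(5, 12), 1)
print(assemble([3, 3], "product"))   # (Fraction(1, 6), 2)
```

Exact rationals are used so that an assembled value can be compared with a closed form without floating-point tolerance.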

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a descent-free, alignment-free method to measure singular structure in trained neural networks at a single frozen checkpoint. It claims that the directional-Fisher rate serves as a master invariant from which the order k of each dead direction (and thus the per-direction learning coefficient 1/(2k)) can be recovered exactly in the optimizer's native basis. A pluggable detector isolates directions for transformer, convolutional, and normalization layers; directions are classified as genuine singularities (order fixed by architecture) versus flat gauge symmetries (settled by rate magnitude). Per-direction orders are assembled via typed intersection of loci into global coefficients (λ, m) matching closed forms. The method is demonstrated on constructed cells and trained vision transformers (LayerNorm-kernel gauge in a fine-tuned model; node-death at activation order in a from-scratch model) and is mapped to the Watanabe triple, determining universal singular fluctuation ν(k) while noting that realized ν falls below the universal value due to live structure.
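The gauge-versus-singularity split described in the summary could be sketched as a two-step rule: call a direction flat if its directional Fisher stays below a floor relative to a live reference direction at every probed offset, and otherwise read an order from the log-log slope. The threshold `floor_ratio`, the function name, and the scaling $F \sim t^{2(k-1)}$ are assumptions for illustration, not the paper's stated procedure.

```python
import numpy as np

def classify_direction(offsets, fisher, live_fisher, floor_ratio=1e-3):
    """Label a candidate dead direction from Fisher values probed along it.

    offsets:     positive step sizes along the direction
    fisher:      directional Fisher measured at each offset
    live_fisher: directional Fisher of a reference live direction
    floor_ratio: hypothetical threshold below which a direction counts as flat
    """
    offsets = np.asarray(offsets, dtype=float)
    fisher = np.asarray(fisher, dtype=float)
    if np.all(fisher <= floor_ratio * live_fisher):
        return "gauge", None                      # flat at every probed offset
    slope, _ = np.polyfit(np.log(offsets), np.log(fisher), 1)
    return "singular", 1.0 + slope / 2.0          # assumed F ~ t^(2(k-1))

t = np.logspace(-2, 0, 15)
print(classify_direction(t, 1e-6 * np.ones_like(t), live_fisher=1.0))   # flat floor
print(classify_direction(t, (3 * t ** 2) ** 2, live_fisher=1.0))        # power law, k = 3
```

With this rule a singular direction probed only at very small offsets would also sit under the floor, so the probe range has to reach the regime where the power law rises above it.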

Significance. If the directional-Fisher rate is shown to be a true basis-independent invariant and the detector recovers exact loci, the work would remove the canonical-alignment and descent preconditions of prior rate results, enabling direct, architecture-general reading of singular structure from arbitrary trained checkpoints. The explicit recovery of architecture-predicted orders, the classification of gauge versus singularity, and the assembly into global (λ, m) would be concrete advances for applying singular learning theory to practical networks; the link to ν(k) and multiplicity under a single-locus assumption is also a useful extension.

major comments (3)
  1. [Abstract] The central claim that the directional-Fisher rate is a basis-independent master invariant yielding exact order k (hence 1/(2k)) for any basis the optimizer produced is asserted without derivation; no argument is supplied showing why the rate computation remains invariant under arbitrary orthogonal transformations of the weight space.
  2. [Abstract] The pluggable detector is stated to isolate the precise dead direction (singular vector or gauge direction) for every layer type, but no formal guarantee, algorithm, or verification is given that the detector recovers the exact locus for transformers, convolutions, and LayerNorm; this is load-bearing for the order-recovery claim across architectures.
  3. [Abstract] The assembly of per-direction orders into global (λ, m) via typed intersection of loci is said to match the closed form, yet the text supplies neither the underlying rate result equations nor a demonstration that the directional-Fisher read is independent rather than circular with respect to prior fitted quantities.
minor comments (1)
  1. The abstract references an 'underlying rate result' without a brief recap or citation, which would help readers connect the new claims to the prior work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below. The comments correctly identify that several central claims in the abstract are presented without accompanying derivations or formal details; we agree that the abstract requires revision to better support these claims while the main text provides the supporting material.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the directional-Fisher rate is a basis-independent master invariant yielding exact order k (hence 1/(2k)) for any basis the optimizer produced is asserted without derivation; no argument is supplied showing why the rate computation remains invariant under arbitrary orthogonal transformations of the weight space.

    Authors: We agree that the abstract asserts basis-independence of the directional-Fisher rate without a derivation. The main manuscript defines the rate in a manner that makes the invariance under orthogonal transformations immediate from its directional construction, but this is not referenced in the abstract. We will revise the abstract to include a concise pointer to the relevant definition and derivation in the main text.
    revision: yes

  2. Referee: [Abstract] The pluggable detector is stated to isolate the precise dead direction (singular vector or gauge direction) for every layer type, but no formal guarantee, algorithm, or verification is given that the detector recovers the exact locus for transformers, convolutions, and LayerNorm; this is load-bearing for the order-recovery claim across architectures.

    Authors: We agree that the abstract states the detector isolates the precise locus without supplying a formal guarantee, algorithm, or explicit verification. The manuscript provides algorithmic descriptions and experimental verification on constructed cells and trained networks for the listed layer types, but these are not referenced in the abstract. We will revise the abstract to note the existence of the algorithmic construction and verification in the methods and experiments sections.
    revision: yes

  3. Referee: [Abstract] The assembly of per-direction orders into global (λ, m) via typed intersection of loci is said to match the closed form, yet the text supplies neither the underlying rate result equations nor a demonstration that the directional-Fisher read is independent rather than circular with respect to prior fitted quantities.

    Authors: We agree that the abstract claims the match to closed forms without including the underlying rate result equations or an explicit non-circularity argument. The main text assembles the coefficients from the per-direction orders and demonstrates the match on the tested cases using the directional-Fisher reads directly. We will revise the abstract to reference the rate result equations and clarify the independence from fitted quantities.
    revision: yes

Circularity Check

2 steps flagged

Defining the directional-Fisher rate as the master invariant makes the recovery of k and of 1/(2k) tautological; the match of the global (λ, m) to the closed form then follows by construction from the same rate result.

specific steps
  1. Self-definitional [Abstract]
    "At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient $1/(2k)$ follows exactly, in whatever basis the optimizer left."

    The directional-Fisher rate is introduced as the master invariant that directly produces the coefficient exactly as 1/(2k); recovering k from the rate and obtaining 1/(2k) is therefore equivalent to the definition by construction, with the 'recovery' adding no new content.
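For context, the step from an order to a coefficient in the one-dimensional case is a standard zeta-function computation. The sketch below assumes the local loss takes the normal form $K(s) = s^{2k}$ with a uniform prior on $[0, 1]$; that normal form is an illustrative assumption and is not taken from the paper.

```latex
\zeta(z) \;=\; \int_0^1 K(s)^z \, ds \;=\; \int_0^1 s^{2kz} \, ds \;=\; \frac{1}{2kz + 1},
\qquad \operatorname{Re}(z) > -\frac{1}{2k}.
% The only pole is at z = -1/(2k) and it is simple, so
\lambda = \frac{1}{2k}, \qquad m = 1.
```

Under this normal form the map $k \mapsto 1/(2k)$ carries no empirical content of its own; any empirical content of the read would lie in how $k$ is measured.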

  2. Self-citation, load-bearing [Abstract]
    "The method removes the canonical-alignment and descent preconditions of the underlying rate result, turning order-recovery into a deterministic, architecture-general reading. ... the per-direction orders assemble, through the typed intersection of the loci, into the global coefficient $(\lambda, m)$ matching the closed form."

    The order-recovery claim and the subsequent assembly into global $(\lambda, m)$ that matches the closed form both rest on the underlying rate result (prior work by the same author) whose preconditions are removed by assertion; no independent derivation of the rate's basis invariance is supplied, so the match is forced by the shared premise.

full rationale

The paper's central derivation chain begins from the directional-Fisher rate as an assumed master invariant that yields the learning coefficient exactly as 1/(2k). Recovering k from this rate and then assembling per-direction orders into the global coefficient that matches the closed form therefore reduces directly to the input definition and the prior rate result, without an independent derivation of basis invariance or detector correctness shown. This matches the self-definitional and self-citation-load-bearing patterns; the remainder of the architecture-general reading and Watanabe mapping inherits the same reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review limited to abstract; no explicit free parameters, axioms, or invented entities are identifiable from the provided text. The method appears to rest on the directional-Fisher rate as an invariant and on prior closed-form results for the global coefficient.
