Measuring Dead Directions: Decomposing and Classifying Singular Structure off Canonical Alignment

Tejas Pradeep Shirodkar

arxiv: 2607.00603 · v1 · pith:L3IQIX6Knew · submitted 2026-07-01 · 💻 cs.LG

Pith reviewed 2026-07-02 15:53 UTC · model grok-4.3

keywords: dead directions · directional-Fisher rate · singularity order · learning coefficient · singular learning theory · neural network structure · Watanabe triple

The pith

The directional-Fisher rate recovers the order k of each dead direction at a single frozen checkpoint in any basis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a method to measure and decompose singular structure in trained neural networks without continuing descent or forcing canonical alignment. It identifies the directional-Fisher rate as the invariant quantity that directly yields the singularity order k for each dead direction, from which the per-direction learning coefficient follows as 1/(2k). The same quantity classifies directions into architecture-fixed singularities versus flat gauge symmetries, and the per-direction orders assemble into the global learning coefficient through typed intersections of loci. The approach applies via a pluggable detector to transformer, convolutional, and normalisation layers, recovering the predicted orders in both constructed cells and trained models. If the central claim holds, singularity analysis becomes a deterministic post-training reading rather than a preconditioned optimisation task.

Core claim

At a single frozen checkpoint the read recovers the order k of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient 1/(2k) follows exactly, in whatever basis the optimizer left. The same read classifies each direction, separating a genuine singularity whose order the architecture fixes from a flat gauge symmetry; the directional-Fisher magnitude settles the cases the order cannot. Where the singular structure enumerates, the per-direction orders assemble through the typed intersection of the loci into the global coefficient matching the closed form.
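The order read can be illustrated on the isolated order-k cell that appears in the paper's Figure 5 caption. The sketch below is not the paper's implementation: it assumes the rate is the log-log slope of the squared change in the noise-free output along the dead direction, which scales as t^(2k) on this cell. The names `order_k_cell` and `recover_order` are hypothetical.

```python
import numpy as np

def order_k_cell(theta, k=3):
    """Noise-free mean of the toy cell y = a * s**k + eps, with theta = (a, s)."""
    a, s = theta
    return a * s**k

def recover_order(f, theta, direction, ts):
    """Assumed read: half the log-log slope of the squared output change along `direction`."""
    d = direction / np.linalg.norm(direction)
    base = f(theta)
    energy = [(f(theta + t * d) - base) ** 2 for t in ts]
    slope, _ = np.polyfit(np.log(ts), np.log(energy), 1)
    return slope / 2.0

theta = np.array([1.0, 0.0])       # frozen checkpoint sitting on the locus s = 0
dead = np.array([0.0, 1.0])        # the dead direction (the s axis)
ts = np.logspace(-2, -1, 8)        # small displacements along the direction

k_hat = recover_order(order_k_cell, theta, dead, ts)
coeff = 1.0 / (2.0 * round(k_hat))  # per-direction coefficient 1/(2k)
print(k_hat, coeff)
```

On this noise-free cell the fit is exact up to floating-point error. A trained network would need the Fisher estimated from data and a check that the displacements lie in the asymptotic regime, neither of which this sketch covers.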

What carries the argument

The directional-Fisher rate as the master invariant that determines singularity order k and the associated learning coefficient independently of basis choice.

Load-bearing premise

The directional-Fisher rate functions as an invariant master quantity that determines singularity order independently of basis choice and without requiring descent or canonical alignment.

What would settle it

Measuring inconsistent recovered orders k for the same dead direction when the identical checkpoint is expressed in two different bases would show the rate is not basis-independent.
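A minimal version of this test on a toy cell, under the same assumptions as any slope-based read (the rate is taken to be the log-log slope of the squared output change, and all function names are hypothetical): the checkpoint and the direction are both re-expressed through an orthogonal matrix `Q`, and the order is read in each basis.

```python
import numpy as np

def cell(theta, k=3):
    """Noise-free mean of the toy cell y = a * s**k, with theta = (a, s)."""
    a, s = theta
    return a * s**k

def recover_order(f, theta, direction, ts):
    """Assumed read: half the log-log slope of the squared output change."""
    d = direction / np.linalg.norm(direction)
    base = f(theta)
    energy = [(f(theta + t * d) - base) ** 2 for t in ts]
    slope, _ = np.polyfit(np.log(ts), np.log(energy), 1)
    return slope / 2.0

# An arbitrary orthogonal change of basis (a plane rotation).
angle = 0.7
Q = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])

def cell_rotated(phi):
    """The same function expressed in rotated coordinates phi = Q @ theta."""
    return cell(Q.T @ phi)

theta = np.array([1.0, 0.0])
dead = np.array([0.0, 1.0])
ts = np.logspace(-2, -1, 8)

k_native = recover_order(cell, theta, dead, ts)
k_rotated = recover_order(cell_rotated, Q @ theta, Q @ dead, ts)
print(k_native, k_rotated)
```

The two reads agree here by construction, because the direction is transformed together with the parameters. The informative version of the test is whether a detector run independently in each basis returns the same direction, which this toy does not exercise.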

Figures

Figures reproduced from arXiv: 2607.00603 by Tejas Pradeep Shirodkar.

**Figure 2.** The read in action, on a log-log directional Fisher against …

**Figure 3.** The recovered order tracks the structure present. The predicted order is the …

**Figure 4.** The orthogonaliser sets the basis. Dead-subspace axis-alignment by interior block …

**Figure 5.** The order fixes $\nu(k)$, the live structure absorbs it. Left: on the isolated order-$k$ cell $y = a s^k + \varepsilon$ the functional-variance $\hat{\nu}$ recovers the universality value $\nu(k)$ across training size $n$, data-averaged over 300 draws ($\nu(2)=0.173$, $\nu(3)=0.278$). Right: in the controlled cell $y = b\,g + s^k c + \varepsilon$, the dead direction’s contribution to $\nu$ holds at $\nu(3)$ while the regular and dead bases stay distinct and collapses …

**Figure 6.** The one-pass census separates the optimizer families. The census learning-coefficient …

**Figure 7.** Dead subspace by depth (gauge-fixed d=8). The axis-aligned activation-side …

**Figure 8.** The gauge floors on the fine-tuned DINOv2 ViT-S/14. The LayerNorm kernel and …

**Figure 9.** The dead subspace emerges through grokking (gauge-fixed d=8). The dead …

**Figure 10.** The multi-component read at block 1 (d8 squared-ReLU, seed 42). Each panel scans the top eight directions of the dual-factor bottom subspace; a filled marker is an asymptotic component (the order $\hat{k}$ it reads), a cross a pre-asymptotic one (no order). Vanilla Muon forms no readable order; the scaled-polar orthogonaliser and the gauge both read the squared-ReLU order $k=3$ on nearly every component. The reada…

**Figure 11.** The dead-subspace axis-alignment at task size (1-block d128, 5 seeds). At matched …

**Figure 12.** The optimiser-axis cell. Left: on the deterministic squared-ReLU teacher–student …

**Figure 13.** The learning coefficient across granularities (grokked d8 squared-ReLU). The …

**Figure 14.** Assembling the global learning coefficient against the closed form. The per …

Original abstract

We give a descent-free, alignment-free measurement of singular structure on trained networks. At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient $1/(2k)$ follows exactly, in whatever basis the optimizer left. The same read classifies each direction, separating a genuine singularity, whose order the architecture fixes, from a flat gauge symmetry; the directional-Fisher magnitude settles the cases the order cannot. A pluggable detector supplies the directions for transformer, convolutional, and normalisation layers. The read recovers the architecture-predicted order across constructed cells and trained networks, including a fine-tuned vision transformer whose dead structure is the LayerNorm-kernel gauge and a from-scratch one whose compressed MLP forms a node-death at its activation order. Where the singular structure enumerates, the per-direction orders assemble, through the typed intersection of the loci, into the global coefficient $(\lambda, m)$ matching the closed form. The method removes the canonical-alignment and descent preconditions of the underlying rate result, turning order-recovery into a deterministic, architecture-general reading. We then map its reach into the Watanabe triple: the order determines the universal singular fluctuation $\nu(k)$, though a trained network's realized $\nu$ falls below it as the live structure absorbs the dead direction's data fluctuation, and the multiplicity recovers from the dominant structure under a single-locus assumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a checkpoint-only way to recover singularity orders via the directional-Fisher rate without descent or alignment steps.

The main thing here is a measurement procedure that pulls the order k of dead directions straight from the directional-Fisher rate at one frozen checkpoint, then turns that into the per-direction coefficient 1/(2k) and classifies real singularities versus gauge symmetries. It also assembles the per-direction orders into the global (λ, m) when the structure is fully enumerated.

What is actually new is dropping the canonical-alignment and descent preconditions from the underlying rate result and making the read deterministic and architecture-general. The pluggable detector handles transformer, convolutional, and normalisation layers, and the examples on constructed cells plus two vision transformers (one fine-tuned with LayerNorm gauge, one from-scratch with node death) show the recovered orders matching architecture predictions.

The soft spots sit where the stress-test flagged them. The directional-Fisher rate is treated as a basis-independent master invariant that works in whatever basis the optimizer left, and the detector is assumed to isolate the exact locus for every layer type. The abstract asserts both but supplies no derivation showing invariance under arbitrary orthogonal transformations of the weight space, nor formal checks that the detector recovers the precise singular vector or gauge direction across architectures. If either assumption slips, the recovered k deviates and the global assembly fails. The mapping to the Watanabe triple is sketched but rests on the same unverified steps.

This is for people already working in singular learning theory applied to deep learning who need a practical tool to inspect trained models. A reader who wants to test order recovery on real checkpoints would find the procedure worth trying.

Send it to peer review. The new measurement removes stated preconditions and supplies concrete examples, so the claims deserve referee time even if the invariance and detector guarantees need tightening.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a descent-free, alignment-free method to measure singular structure in trained neural networks at a single frozen checkpoint. It claims that the directional-Fisher rate serves as a master invariant from which the order k of each dead direction (and thus the per-direction learning coefficient 1/(2k)) can be recovered exactly in the optimizer's native basis. A pluggable detector isolates directions for transformer, convolutional, and normalization layers; directions are classified as genuine singularities (order fixed by architecture) versus flat gauge symmetries (settled by rate magnitude). Per-direction orders are assembled via typed intersection of loci into global coefficients (λ, m) matching closed forms. The method is demonstrated on constructed cells and trained vision transformers (LayerNorm-kernel gauge in a fine-tuned model; node-death at activation order in a from-scratch model) and is mapped to the Watanabe triple, determining universal singular fluctuation ν(k) while noting that realized ν falls below the universal value due to live structure.
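The magnitude-based half of the classification can be sketched on a toy LayerNorm layer. This is an illustration under stated assumptions, not the paper's detector: the directional Fisher is approximated by a finite-difference squared output change (which matches the Fisher quadratic form only under unit-variance Gaussian output noise), the flatness threshold `1e-6` is hypothetical, and the gauge direction used is the familiar one in which every pre-normalisation activation is shifted by the same per-example constant.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    """LayerNorm over the last axis, without learned scale or bias."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def model(W, xs):
    """Toy layer: linear map followed by LayerNorm."""
    return layer_norm(xs @ W.T)

def directional_fisher(W, D, xs, t=1e-3):
    """Finite-difference proxy: mean squared output change per unit step along D."""
    D = D / np.linalg.norm(D)
    delta = model(W + t * D, xs) - model(W, xs)
    return np.mean(np.sum(delta**2, axis=-1)) / t**2

rng = np.random.default_rng(0)
d_in, d_out = 6, 8
W = rng.normal(size=(d_out, d_in))
xs = rng.normal(size=(256, d_in))

# Gauge direction: adds the same scalar to every pre-norm unit, which LayerNorm removes.
gauge = np.outer(np.ones(d_out), rng.normal(size=d_in))
# Live direction: a generic perturbation of the weights.
live = rng.normal(size=(d_out, d_in))

f_gauge = directional_fisher(W, gauge, xs)
f_live = directional_fisher(W, live, xs)
ratio = f_gauge / f_live
label = "flat (gauge-like)" if ratio < 1e-6 else "not flat"  # hypothetical threshold
print(ratio, label)
```

The gauge direction sits at the floating-point floor relative to the live one, which is the kind of magnitude gap the summary describes. Where a real threshold should be placed is a choice this sketch does not settle.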

Significance. If the directional-Fisher rate is shown to be a true basis-independent invariant and the detector recovers exact loci, the work would remove the canonical-alignment and descent preconditions of prior rate results, enabling direct, architecture-general reading of singular structure from arbitrary trained checkpoints. The explicit recovery of architecture-predicted orders, the classification of gauge versus singularity, and the assembly into global (λ, m) would be concrete advances for applying singular learning theory to practical networks; the link to ν(k) and multiplicity under a single-locus assumption is also a useful extension.

major comments (3)

[Abstract] The central claim that the directional-Fisher rate is a basis-independent master invariant yielding exact order k (hence 1/(2k)) for any basis the optimizer produced is asserted without derivation; no argument is supplied showing why the rate computation remains invariant under arbitrary orthogonal transformations of the weight space.

[Abstract] The pluggable detector is stated to isolate the precise dead direction (singular vector or gauge direction) for every layer type, but no formal guarantee, algorithm, or verification is given that the detector recovers the exact locus for transformers, convolutions, and LayerNorm; this is load-bearing for the order-recovery claim across architectures.

[Abstract] The assembly of per-direction orders into global (λ, m) via typed intersection of loci is said to match the closed form, yet the text supplies neither the underlying rate result equations nor a demonstration that the directional-Fisher read is independent rather than circular with respect to prior fitted quantities.
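For the first comment, the standard argument a reader would expect can be written in a few lines. It assumes the directional Fisher is the quadratic form $F_\theta(v) = v^\top I(\theta)\, v$ of the Fisher information matrix; the reviewed text does not give the paper's definition, so this is a sketch of what such a derivation could look like rather than the paper's own.

```latex
% Assumption: F_\theta(v) = v^\top I(\theta) v, with I(\theta) the Fisher information matrix.
% Orthogonal change of basis: \theta' = Q\theta, v' = Qv, with Q^\top Q = I_d.
\begin{align*}
\nabla_{\theta'} \log p &= Q\, \nabla_{\theta} \log p
  && \text{(chain rule, since } \theta = Q^\top \theta' \text{)} \\
I'(\theta') &= \mathbb{E}\!\left[ \nabla_{\theta'} \log p \; \nabla_{\theta'} \log p^\top \right]
  = Q\, I(\theta)\, Q^\top \\
F'_{\theta'}(v') &= (Qv)^\top \, Q\, I(\theta)\, Q^\top (Qv)
  = v^\top I(\theta)\, v = F_\theta(v)
\end{align*}
```

Because the line $\theta + t v$ maps to $\theta' + t v'$, the same identity holds at every displacement $t$, so any scaling exponent read along the line is unchanged. The argument covers only the case where the direction is carried along with the basis; it says nothing about whether a detector finds that direction in the new basis.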

minor comments (1)

The abstract references an 'underlying rate result' without a brief recap or citation, which would help readers connect the new claims to the prior work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below. The comments correctly identify that several central claims in the abstract are presented without accompanying derivations or formal details; we agree that the abstract requires revision to better support these claims while the main text provides the supporting material.

Referee: [Abstract] The central claim that the directional-Fisher rate is a basis-independent master invariant yielding exact order k (hence 1/(2k)) for any basis the optimizer produced is asserted without derivation; no argument is supplied showing why the rate computation remains invariant under arbitrary orthogonal transformations of the weight space.

Authors: We agree that the abstract asserts basis-independence of the directional-Fisher rate without a derivation. The main manuscript defines the rate in a manner that makes the invariance under orthogonal transformations immediate from its directional construction, but this is not referenced in the abstract. We will revise the abstract to include a concise pointer to the relevant definition and derivation in the main text.

Revision: yes

Referee: [Abstract] The pluggable detector is stated to isolate the precise dead direction (singular vector or gauge direction) for every layer type, but no formal guarantee, algorithm, or verification is given that the detector recovers the exact locus for transformers, convolutions, and LayerNorm; this is load-bearing for the order-recovery claim across architectures.

Authors: We agree that the abstract states the detector isolates the precise locus without supplying a formal guarantee, algorithm, or explicit verification. The manuscript provides algorithmic descriptions and experimental verification on constructed cells and trained networks for the listed layer types, but these are not referenced in the abstract. We will revise the abstract to note the existence of the algorithmic construction and verification in the methods and experiments sections.

Revision: yes

Referee: [Abstract] The assembly of per-direction orders into global (λ, m) via typed intersection of loci is said to match the closed form, yet the text supplies neither the underlying rate result equations nor a demonstration that the directional-Fisher read is independent rather than circular with respect to prior fitted quantities.

Authors: We agree that the abstract claims the match to closed forms without including the underlying rate result equations or an explicit non-circularity argument. The main text assembles the coefficients from the per-direction orders and demonstrates the match on the tested cases using the directional-Fisher reads directly. We will revise the abstract to reference the rate result equations and clarify the independence from fitted quantities.

Revision: yes
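To make the assembly step concrete, the sketch below covers only the simplest intersection type: a separable sum, where each dead direction contributes a term $t_i^{2k_i}$ in its own coordinate. For that case the standard singular-learning-theory result is that the coefficients add and the multiplicities combine as $\sum_i m_i - (n-1)$. The function name is hypothetical, and other intersection types need other rules that are not attempted here.

```python
from fractions import Fraction

def assemble_separable(orders):
    """Global (lambda, m) for a separable sum of terms t_i**(2*k_i).

    Assumes each direction contributes lambda_i = 1/(2*k_i) with multiplicity 1,
    and that the terms live in disjoint coordinates.
    """
    lam = sum(Fraction(1, 2 * k) for k in orders)
    multiplicities = [1] * len(orders)
    m = sum(multiplicities) - (len(orders) - 1)
    return lam, m

lam, m = assemble_separable([2, 3])
print(lam, m)  # 1/4 + 1/6 = 5/12, multiplicity 1
```

Exact fractions are used so that a comparison against a closed-form value is not blurred by floating-point rounding.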

Circularity Check

2 steps flagged

Directional-Fisher rate defined as master invariant makes k-recovery and 1/(2k) tautological; global (λ,m) match to closed form follows by construction from same rate result

specific steps

Self-definitional [Abstract]

"At a single frozen checkpoint the read recovers the order $k$ of each dead direction from the directional-Fisher rate, the master invariant from which the per-direction learning coefficient $1/(2k)$ follows exactly, in whatever basis the optimizer left."

The directional-Fisher rate is introduced as the master invariant that directly produces the coefficient exactly as 1/(2k); recovering k from the rate and obtaining 1/(2k) is therefore equivalent to the definition by construction, with the 'recovery' adding no new content.

Self-citation, load-bearing [Abstract]

"The method removes the canonical-alignment and descent preconditions of the underlying rate result, turning order-recovery into a deterministic, architecture-general reading. ... the per-direction orders assemble, through the typed intersection of the loci, into the global coefficient $(\lambda, m)$ matching the closed form."

The order-recovery claim and the subsequent assembly into global $(\lambda, m)$ that matches the closed form both rest on the underlying rate result (prior work by the same author) whose preconditions are removed by assertion; no independent derivation of the rate's basis invariance is supplied, so the match is forced by the shared premise.

full rationale

The paper's central derivation chain begins from the directional-Fisher rate as an assumed master invariant that yields the learning coefficient exactly as 1/(2k). Recovering k from this rate and then assembling per-direction orders into the global coefficient that matches the closed form therefore reduces directly to the input definition and the prior rate result, without an independent derivation of basis invariance or detector correctness shown. This matches the self-definitional and self-citation-load-bearing patterns; the remainder of the architecture-general reading and Watanabe mapping inherits the same reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review limited to abstract; no explicit free parameters, axioms, or invented entities are identifiable from the provided text. The method appears to rest on the directional-Fisher rate as an invariant and on prior closed-form results for the global coefficient.
