arxiv: 2606.21158 · v1 · pith:4TF45KASnew · submitted 2026-06-19 · 💻 cs.LG · stat.ML

Dead-Direction Signatures: A Cheap Spectral Reading of Singular Complexity

Pith reviewed 2026-06-26 14:34 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords singular learning theory · dead directions · spectral signatures · local learning coefficient · rank deficit · Fisher Gram matrix · activation spectra · reduced-rank regression

The pith

Dead-Direction Signatures count dead directions at singular minima through the slope of active-volume log det+ on activation and Fisher Gram matrices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that Dead-Direction Signatures (DDS) deliver closed-form spectral readings of singular structure in deep networks. These readings rest on a predicted structural correlation between activation-side and Fisher-side spectra at any singular minimum. The slope of the active-volume log det+(G) tracks the rank deficit r, producing slope ratios near the predicted integers 2, 3, 4 for deficits of those sizes. In controlled reduced-rank regression the signatures rank-track the closed-form RLCT at the framework-predicted sign, while in a modular-addition transformer they separate model width (d_model) across eighteen orders of magnitude where the standard local learning coefficient remains rank-flat at the same sampling budget.

Core claim

At singular minima the slope of the active-volume log det+(G) counts the dead directions: it tracks the rank deficit r across r ∈ {1,2,3,4}, with observed slope ratios 2.0, 3.1, 4.0 at r=2,3,4 against the predicted 2, 3, 4. On reduced-rank regression the DDS observables rank-track the closed-form λ at the framework-predicted sign; on a non-linear modular-addition transformer they separate d_model across eighteen orders of magnitude where calibrated LLC is rank-flat.

What carries the argument

The dead-direction framework that predicts a structural correlation between activation- and Fisher-side spectra at singular minima, together with the rank-multiplicative volume identity realized by the active-volume log det+(G) slope.
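A toy model can make the contrast between a volume observable and a single-eigenvalue monitor concrete. The sketch below is an illustrative assumption, not the paper's construction: it builds a hypothetical Gram matrix G(t) = X(t)ᵀX(t) in which r singular values of X shrink like t, then fits the slopes of log det+(G) and of the log of the smallest positive eigenvalue against log t. The names `toy_gram`, `positive_eigs`, `log_det_plus`, `min_pos_eig`, and `fitted_slopes`, the dimension d=8, and the relative threshold are all invented for illustration.

```python
import numpy as np

def positive_eigs(G, rel_tol=1e-12):
    """Eigenvalues of symmetric G above rel_tol times the largest one."""
    eig = np.linalg.eigvalsh(G)
    return eig[eig > rel_tol * eig.max()]

def log_det_plus(G):
    """Active-volume reading: sum of logs of the retained positive eigenvalues."""
    return float(np.sum(np.log(positive_eigs(G))))

def min_pos_eig(G):
    """Smallest retained positive eigenvalue."""
    return float(positive_eigs(G).min())

def toy_gram(t, r, d=8, seed=0):
    """Hypothetical Gram matrix X^T X with r singular values of X equal to t."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))
    s = np.ones(d)
    s[:r] = t  # the r "dead" directions shrink as t -> 0
    X = U @ np.diag(s) @ V.T
    return X.T @ X

def fitted_slopes(r, ts=None):
    """Least-squares slopes of log det+ and log min-eigenvalue against log t."""
    ts = np.logspace(-3, -1, 9) if ts is None else np.asarray(ts)
    log_t = np.log(ts)
    vol = [log_det_plus(toy_gram(t, r)) for t in ts]
    lam = [np.log(min_pos_eig(toy_gram(t, r))) for t in ts]
    return np.polyfit(log_t, vol, 1)[0], np.polyfit(log_t, lam, 1)[0]

vol1, lam1 = fitted_slopes(1)
for r in (1, 2, 3, 4):
    vol_r, lam_r = fitted_slopes(r)
    # ratio of each slope to the rank-1 slope
    print(r, round(vol_r / vol1, 3), round(lam_r / lam1, 3))
```

In this toy the volume slope is 2r and the smallest-eigenvalue slope is 2 for every r, so the ratio to the rank-1 slope is r for the volume and 1 for the eigenvalue. Whether the Gram matrices of trained networks behave this way is the empirical question the paper addresses.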

If this is right

  • DDS observables rank-track the closed-form λ on reduced-rank regression at the framework-predicted sign.
  • In a non-linear modular-addition transformer DDS separates d_model across eighteen orders of magnitude at budgets where LLC is rank-flat.
  • DDS supplies a directional, layer-local reading that complements the integrated posterior scalar given by LLC.
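The terms "rank-track" and "rank-flat" in the bullets above refer to rank correlation (Spearman ρ, as quoted in the figure captions). A minimal sketch of that statistic follows, using made-up numbers rather than any data from the paper; `spearman_rho`, `widths`, `tracking`, and `flat` are hypothetical names, and the helper assumes inputs without ties.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation, valid when neither input has ties."""
    rx = np.argsort(np.argsort(x)).astype(float)  # 0-based ranks of x
    ry = np.argsort(np.argsort(y)).astype(float)  # 0-based ranks of y
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical sweep variable (e.g. an index over model widths).
widths = np.arange(1, 9)

# Toy observable that falls monotonically over many orders of magnitude.
tracking = 10.0 ** (-2.0 * widths)

# Toy observable whose ordering is nearly unrelated to the sweep variable.
flat = np.array([3.0, 6.0, 1.0, 8.0, 2.0, 7.0, 4.0, 5.0])

rho_tracking = spearman_rho(widths, tracking)  # strongly negative
rho_flat = spearman_rho(widths, flat)          # close to zero
print(rho_tracking, rho_flat)
```

A reading with |ρ| near 1 orders the sweep correctly regardless of scale, while a reading with |ρ| near 0 carries little ordering information, which is the sense in which the page calls LLC rank-flat at the protocol budget.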

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Layer-local spectral monitors could be inserted into training loops to detect when a network enters a new singular regime.
  • The same Gram-matrix construction might be applied to other layer-wise statistics beyond activations and per-sample gradients.
  • If the correlation holds across architectures, DDS could serve as a lightweight diagnostic for comparing singularity structure between models of different widths.

Load-bearing premise

A structural correlation exists between activation-side and Fisher-side spectra at every singular minimum.

What would settle it

Measure the active-volume log det+(G) slopes on controlled reduced-rank regression tasks with known rank deficits r=2,3,4; the claim fails if the observed ratios deviate systematically from the predicted integers 2,3,4.
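As a small worked check, the ratios quoted on this page can be compared against the predicted integers directly. The sketch below uses only the three reported values; the names `observed`, `rel_dev`, and `worst_r` are hypothetical, and with no error bars available the result says nothing about statistical significance.

```python
# Slope ratios reported in the abstract, keyed by rank deficit r.
observed = {2: 2.0, 3: 3.1, 4: 4.0}

# Relative deviation of each observed ratio from the predicted integer r.
rel_dev = {r: abs(ratio - r) / r for r, ratio in observed.items()}

# Rank deficit with the largest relative deviation.
worst_r = max(rel_dev, key=rel_dev.get)

for r, dev in rel_dev.items():
    print(f"r={r}: relative deviation {dev:.1%}")
```

The largest deviation is at r=3, roughly 3.3%; deciding whether that is systematic requires the run-to-run spread, which this page does not report.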

Figures

Figures reproduced from arXiv: 2606.21158 by P. J. Narayanan, Tejas Pradeep Shirodkar.

Figure 1: Dead-Direction Signatures (DDS): the dead-direction primitive and how DDS reads it. (a) The framework’s central object: at a singular minimum with KL order k, the smallest Fisher eigenvalue decays as λ_min(F) = Θ(t^{2(k−1)}) along a dead direction; k=1 is regular (rate 0), k=2, 3 are degenerate (rates 2, 4). (b) Rank-tracking on closed-form RLCT ground truth (§4.1): on the 14-cell Aoyagi 2005 anchor each rat…
Figure 2: A dead direction collapses the Fisher-Gram eigenvalue at the bottleneck. On the Aoyagi 2005 reduced-rank-regression anchor (14 cells, σ=0.1), the smallest positive Fisher-Gram eigenvalue λ⁺_min(G) at the bottleneck layer h1 drops ∼246× the instant a dead direction is present (rank-deficit ≥ 1), while at the dimension-fixed boundary layer h2 it stays flat to 0.3%. The collapse detects the dead direction an…
Figure 3: The cross-cell rank-correlation is a sanity gate, not a discriminator. On the Aoyagi 2005 anchor (14 cells), every complexity-monotone observable clears the cross-cell |ρ| against the analytical λ: the DDS observables, calibrated LLC, and a naive H·r capacity proxy, which in fact scores highest (0.99). The discriminating evidence is the rank-multiplicative volume identity (…
Figure 4: The volume observable counts dead directions; the smallest eigenvalue cannot. (A) Normalised descent of log det⁺(G_ℓ) toward the singular set at an interior layer (deep-linear noisy bridge, L=4, D=20, layer h1), for rank-deficit r=1, …, 4: the slope fans to 1×/2×/3×/4× the rank-1 slope (6.9/13.8/21.0/27.8). (B) The smallest positive eigenvalue log λ⁺_min(G_ℓ) descends at the same rate for every r…
Figure 5: DDS separates the widths where calibrated LLC does not. Nanda modular-addition width sweep (AdamW+CE, 101 grokked cells of 4 widths × 30 seeds). (Top) DDS observables descend monotonically with d_model: λ⁺_min(G) spans ∼10^18 (ρ=−0.91), σ_min is monotone (ρ=−0.65). (Bottom) Calibrated LLC at the 4,400-step protocol budget is rank-flat (ρ=−0.10); a 9× SGLD budget at d=128 leaves the across-seed spread essent…
Figure 6: DDS resolves singular structure dynamically; the static identity holds across regime cells. (a) Barak sparse-parity grokking trajectory (30 seeds, SGD+MSE, 240k steps; 30-seed median + IQR band). Through the val_acc-anchored phases (pre-grok, at-grok, cleanup) the activation-side σ_min(X_h2) drops ∼7 orders and the Fisher-side u⊤G_h2 u at the dead direction drops ∼14 orders, while calibrated LLC (top tra…
Figure 7: λ_min(G_ℓ) numerical recipe: fp64 + no-Tikhonov + smallest-positive eigenvalue is the right default at λ → 0. (a) Recipe matters: when the physical signal is below the Tikhonov floor ε = 10^−8, fp32 + Tikhonov clamps and loses signal (|ρ| = 0.10 on width-ranking); fp64 + no-Tikhonov is faithful down to ∼10^−13 (|ρ| = 0.96). (b) Trustworthiness band: smallest reliably-readable eigenvalue per recipe vs ‖G_ℓ‖_2. …
Figure 8: TMS single-trajectory observable dashboard. (A) loss drops across the main phase tran…
Figure 10: Random-direction null discrimination on the structural correlation λ⁺_min(G_ℓ) ∝ σ_min(X_ℓ)^2. 4 × 2 panel: rows are probe directions, columns are testbeds; each panel overlays the per-seed Phase A scatter (5 seeds, viridis-coloured) of u⊤G_h1 u vs σ_min(X_1)^2 on log-log axes, with cross-seed mean ± std of per-seed Spearman ρ printed in-panel. The four directions are: u_dead (bottom right-singular vector of…
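The numerical point in the Figure 7 caption (a Tikhonov floor or reduced precision hiding a very small eigenvalue) can be reproduced on a synthetic matrix. The sketch below is an assumption-laden illustration, not the paper's protocol: `G` is a hypothetical 6×6 symmetric matrix with a known smallest eigenvalue `true_min` = 1e-11, placed below the ε = 1e-8 floor quoted in the caption, and `smallest_positive_eig` is an invented helper.

```python
import numpy as np

def smallest_positive_eig(G):
    """Smallest strictly positive eigenvalue of a symmetric matrix."""
    eig = np.linalg.eigvalsh(G)
    return float(eig[eig > 0].min())

rng = np.random.default_rng(0)
d, true_min, eps = 6, 1e-11, 1e-8

# Synthetic symmetric matrix with eigenvalues from true_min up to 1.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
G = Q @ np.diag(np.logspace(np.log10(true_min), 0.0, d)) @ Q.T
G = 0.5 * (G + G.T)  # remove round-off asymmetry

# fp64, no regulariser: the small eigenvalue stays readable.
read_fp64 = smallest_positive_eig(G)

# fp64 with a Tikhonov term: the reading is clamped near eps.
read_fp64_tik = smallest_positive_eig(G + eps * np.eye(d))

# fp32 with a Tikhonov term: round-off is far larger than true_min.
G32 = G.astype(np.float32)
read_fp32_tik = smallest_positive_eig(G32 + np.float32(eps) * np.eye(d, dtype=np.float32))

print(read_fp64, read_fp64_tik, read_fp32_tik)
```

In this toy the fp64 reading without a regulariser recovers the planted value, the regularised fp64 reading sits at roughly ε, and the fp32 reading is dominated by round-off, which is consistent with the caption's recommendation of fp64 without Tikhonov as the default.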
Original abstract

Singular learning theory characterises the complexity of a deep network through the geometry of its loss singularities. The local learning coefficient (LLC), the standard estimator of Watanabe's real log canonical threshold (RLCT, $\lambda$), reads this geometry as an integrated Bayesian scalar through SGLD, which needs per-task calibration and $10^4$-$10^6$ forward-backward passes per checkpoint. We introduce Dead-Direction Signatures (DDS), a family of cheap closed-form spectral readings of singular structure: each reads a network's activation matrix or per-sample-gradient Fisher-Gram at a chosen layer, replacing the SGLD posterior chain with spectral linear algebra. The readings rest on a dead-direction framework that predicts a structural correlation between activation- and Fisher-side spectra at any singular minimum, and a rank-multiplicative volume identity that single-eigenvalue monitors cannot produce: the active-volume $\log\det^{+}(G)$ slope counts the dead directions, tracking the rank-deficit $r$ across $r \in \{1,2,3,4\}$ (slope ratios $2.0, 3.1, 4.0$ at $r{=}2,3,4$ against the predicted $2,3,4$), where the smallest eigenvalue is rank-blind. On reduced-rank regression with closed-form $\lambda$, calibrated LLC recovers $\lambda$ at $99\%$ mean and the DDS observables rank-track it at the framework-predicted sign; on a non-linear modular-addition transformer DDS separates $d_{\mathrm{model}}$ across eighteen orders of magnitude where calibrated LLC at the protocol budget is rank-flat. Complementary to LLC's integrated posterior reading, DDS gives a directional, layer-local handle on a network's dead directions, read in closed form from its activation and gradient spectra.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Dead-Direction Signatures (DDS), a family of closed-form spectral observables computed from activation matrices and per-sample-gradient Fisher-Gram matrices at chosen layers. These are positioned as cheap alternatives to SGLD-based local learning coefficient (LLC) estimation of the real log canonical threshold (RLCT, λ). The method rests on an introduced 'dead-direction framework' that posits a structural correlation between activation-side and Fisher-side spectra at singular minima together with a rank-multiplicative volume identity; this identity is claimed to let the slope of log det⁺(G) count dead directions exactly, producing integer slope ratios (predicted 2, 3, 4) that track rank deficit r. Empirical support is given on reduced-rank regression (where DDS rank-tracks closed-form λ) and on a modular-addition transformer (where DDS separates d_model while LLC remains flat).

Significance. If the dead-direction framework and its volume identity can be placed on a rigorous footing, DDS would supply a computationally inexpensive, layer-local directional probe that complements the integrated posterior reading of LLC. The reduced-rank regression results show approximate numerical agreement with the predicted slopes and sign-consistent ranking of λ, while the transformer experiment illustrates scaling behavior across many orders of magnitude in d_model. These results would be strengthened by machine-checked derivations or reproducible code for the spectral quantities.

major comments (3)
  1. [Abstract] Abstract (paragraph beginning 'The readings rest on a dead-direction framework'): the structural correlation between activation-matrix and Fisher-Gram spectra, together with the rank-multiplicative volume identity that makes log det⁺(G) slope count dead directions exactly, is presented as the foundation for all DDS claims, yet no derivation from Watanabe's resolution of singularities or from the local learning coefficient definition is supplied; without it the interpretation of DDS as a spectral reading of singular complexity rests on an unproven assumption whose failure would invalidate both the rank-tracking claim and the comparison to LLC.
  2. [Abstract] Abstract (reported slope ratios): the observed ratios 2.0, 3.1, 4.0 at r=2,3,4 are described as tracking the predicted integers 2,3,4, but the deviations (especially 3.1 vs 3) are not accompanied by error bars, dataset sizes, or the precise experimental protocol; this leaves the central empirical support for the volume identity unverifiable from the given information.
  3. [Abstract] Abstract (reduced-rank regression paragraph): the claim that 'DDS observables rank-track [closed-form λ] at the framework-predicted sign' is load-bearing for the method's utility, yet the manuscript supplies no independent verification that the sign prediction follows from RLCT geometry rather than from the framework's own definition; the circularity risk noted in the reader's report therefore directly affects the strength of this result.
minor comments (1)
  1. [Abstract] Abstract contains a typographic artifact 'r{=}2,3,4' that should be rendered consistently as r=2,3,4.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough review and for highlighting the need to strengthen the presentation of the dead-direction framework and its empirical support. We address each major comment point by point below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph beginning 'The readings rest on a dead-direction framework'): the structural correlation between activation-matrix and Fisher-Gram spectra, together with the rank-multiplicative volume identity that makes log det⁺(G) slope count dead directions exactly, is presented as the foundation for all DDS claims, yet no derivation from Watanabe's resolution of singularities or from the local learning coefficient definition is supplied; without it the interpretation of DDS as a spectral reading of singular complexity rests on an unproven assumption whose failure would invalidate both the rank-tracking claim and the comparison to LLC.

    Authors: The dead-direction framework is introduced in the manuscript as a structural hypothesis motivated by the geometry of singular minima rather than as a theorem derived from the resolution of singularities or the LLC definition. The contribution centers on defining the DDS observables from this hypothesis and testing their practical utility as inexpensive directional probes. We will revise the abstract and add a new subsection in the introduction that explicitly labels the framework as a modeling assumption, discusses its motivation from observed rank deficits, and clarifies its relation to SLT without claiming a full derivation. revision: partial

  2. Referee: [Abstract] Abstract (reported slope ratios): the observed ratios 2.0, 3.1, 4.0 at r=2,3,4 are described as tracking the predicted integers 2,3,4, but the deviations (especially 3.1 vs 3) are not accompanied by error bars, dataset sizes, or the precise experimental protocol; this leaves the central empirical support for the volume identity unverifiable from the given information.

    Authors: The slope ratios are obtained from the reduced-rank regression experiments whose protocol, dataset sizes, and multiple runs are described in the full manuscript. The abstract format limits detail. In revision we will append error bars (computed over independent runs), state the number of samples and runs, and add a reference to the experimental section directly in the abstract. revision: yes

  3. Referee: [Abstract] Abstract (reduced-rank regression paragraph): the claim that 'DDS observables rank-track [closed-form λ] at the framework-predicted sign' is load-bearing for the method's utility, yet the manuscript supplies no independent verification that the sign prediction follows from RLCT geometry rather than from the framework's own definition; the circularity risk noted in the reader's report therefore directly affects the strength of this result.

    Authors: The sign prediction is generated by the volume identity inside the framework and is then checked for consistency against the independently computed closed-form λ in the reduced-rank regression model. This supplies an external test because λ is obtained from the known algebraic geometry of the model, not from the DDS spectra. We nevertheless recognize the risk that the framework's internal logic could influence the interpretation. In revision we will add explicit language separating the framework's derivation of the sign from the subsequent empirical verification against closed-form λ, and we will note that stronger geometric justification remains desirable. revision: partial

Circularity Check

1 step flagged

Dead-direction framework defines the activation-Fisher correlation and volume identity that DDS then 'predicts' and measures

specific steps
  1. self-definitional [Abstract]
    "The readings rest on a dead-direction framework that predicts a structural correlation between activation- and Fisher-side spectra at any singular minimum, and a rank-multiplicative volume identity that single-eigenvalue monitors cannot produce: the active-volume log det+(G) slope counts the dead directions, tracking the rank-deficit r across r ∈ {1,2,3,4} (slope ratios 2.0, 3.1, 4.0 at r=2,3,4 against the predicted 2,3,4)"

    The framework is introduced as predicting the exact correlation and volume identity that enable the DDS observables to count dead directions via the log det+ slope; the DDS family is then defined to read those same quantities, so the 'prediction' of the correlation and the rank-tracking mechanism are equivalent to the framework's definitions rather than derived independently from singular learning theory.

full rationale

The abstract states that DDS readings rest on a dead-direction framework that predicts the structural correlation and rank-multiplicative volume identity, with the active-volume log det+(G) slope counting dead directions at the predicted integer ratios. The framework is presented as foundational without a shown derivation from RLCT geometry or LLC in the provided text, and the observables are constructed directly from the framework's posited correlation and identity. This reduces the central claim (DDS as a spectral reading of singular complexity) to consistency with the framework's own assumptions by construction, with empirical slope ratios (2.0, 3.1, 4.0) serving as confirmation rather than independent test. No self-citation chain is involved; the circularity is internal to the framework-observable pairing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; the central claim rests on an unstated dead-direction framework whose independent support is not visible. No free parameters are explicitly named. One invented entity (dead directions) is introduced without a falsifiable handle outside the paper.

axioms (1)
  • Structural correlation exists between activation-side and Fisher-side spectra at any singular minimum (status: ad hoc to paper)
    Invoked to justify replacing SGLD with spectral linear algebra; location: abstract sentence beginning 'The readings rest on a dead-direction framework'
invented entities (1)
  • dead directions (status: no independent evidence)
    purpose: Directions whose count is read from the active-volume slope of the Gram matrix
    New postulated structural feature whose existence is used to interpret the spectral slopes

pith-pipeline@v0.9.1-grok · 5874 in / 1539 out tokens · 13689 ms · 2026-06-26T14:34:26.092268+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dead-Direction Conditioners: Gauge-Equivariant Preconditioning for Deep Networks

    cs.LG 2026-06 unverdicted novelty 7.0

    Dead-Direction Conditioners provide gauge-equivariant preconditioning by conditioning optimizer state on symmetry orbits, yielding improved resistance to over-training collapse and higher detection of dead directions ...

Reference graph

Works this paper leans on

34 extracted references · 3 canonical work pages · cited by 1 Pith paper

  1. [1]

    M. Aoyagi. Consideration on the learning efficiency of multiple-layered neural networks with linear units. Neural Networks, 172:106132, 2024. URL https://doi.org/10.1016/j.neunet.2024.106132

  2. [2]

    M. Aoyagi and S. Watanabe. Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, 18(7):924–933, 2005. URL https://doi.org/10.1016/j.neunet.2005.03.014

  3. [3]

    B. Barak, B. L. Edelman, S. Goel, S. Kakade, E. Malach, and C. Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. In NeurIPS, 2022. URL https://arxiv.org/abs/2207.08799

  4. [4]

    E. Boix-Adsera, E. Littwin, E. Abbe, S. Bengio, and J. Susskind. Transformers learn through gradual rank increase. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.07042

  5. [5]

    Y. Dong, J.-B. Cordonnier, and A. Loukas. Attention is not all you need: pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning (ICML), 2021. URL https://arxiv.org/abs/2103.03404

  6. [6]

    N. Elhage, T. Hume, C. Olsson, N. Nanda, T. Henighan, S. Johnston, S. E. Showk, N. Joseph, N. DasSarma, B. Mann, D. Hernandez, A. Askell, K. Ndousse, A. Jones, D. Drain, A. Chen, Y. Bai, D. Ganguli, L. Lovitt, Z. Hatfield-Dodds, J. Kernion, T. Conerly, S. Kravec, S. Fort, S. Kadavath, J. Jacobson, E. Tran-Johnson, J. Kaplan, J. Clark, T. Brown, S. McCandl...

  7. [7]

    R. Eschenhagen, A. Immer, R. E. Turner, F. Schneider, and P. Hennig. Kronecker-factored approximate curvature for modern neural network architectures. In NeurIPS, 2023

  8. [8]

    D. Ettori, N. Darabi, S. Tayebati, R. Krishnan, M. Subedar, O. Tickoo, and A. R. Trivedi. EigenTrack: Spectral activation feature tracking for hallucination and out-of-distribution detection in LLMs and VLMs. arXiv:2509.15735, 2025

  9. [9]

    T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent. Fast approximate natural gradient descent in a Kronecker-factored eigenbasis. In NeurIPS, 2018

  10. [10]

    B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In ICML, 2019

  11. [11]

    R. Grosse and J. Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, 2016. URL https://arxiv.org/abs/1602.01407

  12. [12]

    H. Hironaka. Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79(1):109–326, 1964. URL https://www.jstor.org/stable/1970486

  13. [13]

    J. Hoogland, G. Wang, M. Farrugia-Roberts, L. Carroll, S. Wei, and D. Murfet. Loss landscape degeneracy and stagewise development in transformers. Transactions on Machine Learning Research, 2024. URL https://arxiv.org/abs/2402.02364

  14. [14]

    N. K. Jha and B. Reagen. NerVE: Nonlinear eigenspectrum dynamics in LLM feed-forward networks. arXiv:2603.06922, 2026

  15. [15]

    R. Karakida, S. Akaho, and S.-i. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. In AISTATS, 2019

  16. [16]

    R. Karakida, S. Akaho, and S.-i. Amari. Pathological spectra of the Fisher information metric and its variants in deep neural networks. Neural Computation, 33(8):2274–2307, 2021

  17. [17]

    T. X. Khanh, T. Q. Hoa, L. D. Trung, and P. T. Duc. Spectral entropy collapse as an empirical signature of delayed generalisation in grokking. arXiv:2604.13123, 2026

  18. [18]

    F. Kunstner, L. Balles, and P. Hennig. Limitations of the empirical Fisher approximation for natural gradient descent. In NeurIPS, 2019. URL https://arxiv.org/abs/1905.12558

  19. [19]

    E. Lau, Z. Furman, G. Wang, D. Murfet, and S. Wei. The local learning coefficient: A singularity-aware complexity measure. In AISTATS, 2025. URL https://proceedings.mlr.press/v258/lau25a.html

  20. [20]

    J. Martens and R. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015. URL https://arxiv.org/abs/1503.05671

  21. [21]

    N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023. URL https://arxiv.org/abs/2301.05217

  22. [22]

    L. Noci, S. Anagnostidis, L. Biggio, A. Orvieto, S. P. Singh, and A. Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2206.03126

  23. [23]

    J. Pennington and P. Worah. The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In NeurIPS, 2018

  24. [24]

    A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177, 2022

  25. [25]

    L. Sagun, U. Evci, V. U. Güney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. In ICLR Workshop, 2018. arXiv:1706.04454

  26. [26]

    T. P. Shirodkar. Dead directions: Geometric singular learning, 2026. URL https://arxiv.org/abs/2606.05957

  27. [27]

    T. P. Shirodkar and P. J. Narayanan. Algebraic dead directions in LayerNorm transformers: A forward-pass-only diagnostic at LLM scale, 2026. URL https://arxiv.org/abs/2606.19491

  28. [28]

    M. Staats, M. Thamm, and B. Rosenow. Small singular values matter: A random matrix analysis of transformer models. arXiv preprint arXiv:2410.17770, 2024. URL https://arxiv.org/abs/2410.17770

  29. [29]

    Timaeus and collaborators. devinterp: A library for developmental interpretability. https://github.com/timaeus-research/devinterp, 2024. Python package

  30. [30]

    G. Wang, J. Hoogland, S. van Wingerden, Z. Furman, and D. Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient. In International Conference on Learning Representations (ICLR), 2025. URL https://arxiv.org/abs/2410.02984. Spotlight

  31. [31]

    S. Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009. URL https://doi.org/10.1017/CBO9780511800474

  32. [32]

    S. Watanabe. Mathematical Theory of Bayesian Statistics. CRC Press, 2018. URL https://www.routledge.com/9781482238068

  33. [33]

    Y. Xu. Spectral edge dynamics of training trajectories: Signal–noise geometry across scales. arXiv:2603.15678, 2026

  34. [34]

    Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney. PyHessian: Neural networks through the lens of the Hessian. In IEEE BigData, 2020