pith. machine review for the scientific record.

arxiv: 2605.09887 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI · math.DG

Recognition: no theorem link

The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws

Eslam Zaher, Fred Roosta, Maciej Trzaskowski, Quan Nguyen

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:44 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.DG
keywords sparse autoencoders · activation manifolds · scaling laws · layerwise geometry · curvature · intrinsic dimension · reconstruction error · Gemma models

The pith

Manifold geometry of activations sets the per-layer width scaling of sparse autoencoders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse autoencoders reconstruct model activations as sparse linear combinations of features, yet their reconstruction quality and scaling behavior differ sharply from layer to layer in ways single-layer scaling laws cannot capture. The paper argues this variation arises because activation spaces form curved manifolds whose intrinsic dimension changes across layers, so a single linear dictionary cannot approximate all layers equally well. By fitting per-layer scaling laws over 844 Gemma Scope SAE checkpoints spanning 68 layers of two Gemma models and regressing the resulting width exponents on four geometric summaries, the authors find that curvature and intrinsic dimension reliably predict the width exponent, with the same regression coefficients transferring between models. At layers with dense width data, the fitted asymptotic error floor also rises with curvature and dimension, consistent with an irreducible residual left by any linear approximation of a curved manifold.

Core claim

The paper establishes that SAE width-sparsity scaling is a layer-dependent function of manifold structure rather than a universal law. Fitting per-layer scaling-law surfaces on 844 Gemma Scope checkpoints and regressing the derived width exponents on four layerwise geometric summaries yields strong prediction in both models, with coefficients learned on one model transferring to the other. Where richer width grids allow identification of the asymptotic floor, this floor tracks the geometric ordering: higher curvature and intrinsic dimension produce higher floors, matching the second-order residual expected from linear approximations of curved manifolds.

What carries the argument

Regression of per-layer SAE scaling-law parameters (width exponent and asymptotic floor) onto four geometric summaries of the activation manifold, chiefly curvature and intrinsic dimension.
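A minimal sketch of that two-stage pipeline in Python. The power-law-with-floor form, the variable names, and the fitting choices below are illustrative assumptions drawn from the description above, not the authors' code.

```python
import numpy as np
from scipy.optimize import curve_fit
from sklearn.linear_model import LinearRegression

def stage1_fit(widths, nmse):
    """Stage 1: fit a per-layer width scaling law with an asymptotic floor,
    NMSE(W) ~ floor + A * W**(-alpha), and return (floor, A, alpha).
    widths and nmse are 1-D arrays over the SAE width grid at one layer."""
    def law(w, floor, amp, alpha):
        return floor + amp * np.power(w, -alpha)
    p0 = [0.5 * nmse.min(), nmse.max(), 0.5]          # rough starting point
    params, _ = curve_fit(law, widths, nmse, p0=p0, maxfev=20000)
    return params

def stage2_regress(geometry, exponents):
    """Stage 2: regress per-layer width exponents on the four layerwise
    geometric summaries (curvature, intrinsic dimension, plus two others)."""
    return LinearRegression().fit(geometry, exponents)

# geometry:  (n_layers, 4) array of per-layer geometric summaries
# exponents: per-layer alpha values collected from stage1_fit
```

The exponents derived in Stage 1 become the regression targets in Stage 2; the cross-model test in the paper freezes the Stage 2 coefficients learned on one model and scores them on the other.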

If this is right

  • Scaling laws for SAEs must be treated as layer-specific rather than universal across a model.
  • The asymptotic reconstruction floor rises with layerwise curvature and intrinsic dimension.
  • Regression coefficients learned on one model transfer to predict exponents in another model.
  • SAEs encounter a geometry-dependent wall set by the manifold they reconstruct, not solely a finite-resource ceiling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If geometry drives the scaling, then interventions that flatten manifolds or reduce curvature could lower the effective width needed for good SAE performance.
  • The same geometric predictors may apply to other interpretability methods that assume globally linear structure in activation space.
  • Testing the regression on models trained on different data or with different architectures would reveal how general the geometric law is.
  • Layer selection for efficient SAE training could be guided by measuring curvature and dimension first.
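Following the last point, measuring geometry before committing to an SAE training run is cheap. Below is a hedged sketch of the two headline summaries, using a TwoNN-style maximum-likelihood estimate of intrinsic dimension (in the spirit of Facco et al., cited in the reference graph below) and a local-PCA residual as a curvature proxy; the paper's exact estimators and neighborhood sizes are not given here, so these are stand-ins.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_dimension(X):
    """TwoNN-style maximum-likelihood estimate of intrinsic dimension from
    the ratio of second- to first-nearest-neighbor distances."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]        # dist[:, 0] is the point itself
    mu = mu[mu > 1.0]                   # guard against duplicate points
    return len(mu) / np.sum(np.log(mu))

def local_pca_curvature(X, k=32, d=None):
    """Crude curvature proxy: variance left outside the top-d principal
    directions of each k-NN neighborhood, averaged over points."""
    d = d or max(1, int(round(twonn_dimension(X))))
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    resid = []
    for nbrs in idx:
        P = X[nbrs] - X[nbrs].mean(axis=0)
        s = np.linalg.svd(P, compute_uv=False) ** 2   # neighborhood variances
        resid.append(s[d:].sum() / s.sum())
    return float(np.mean(resid))
```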

Load-bearing premise

The four chosen geometric summaries of each layer's activation manifold are the causally relevant properties and are measured independently of the SAE fitting process itself.

What would settle it

Apply the fitted regression coefficients to predict width exponents on a third model or architecture whose measured curvature and intrinsic dimension are known, then check whether the actual fitted SAE scaling exponents deviate systematically from those predictions.
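A sketch of what that check might look like in code; the variable names, the plain R² score, and the residual diagnostic are illustrative choices rather than the paper's protocol.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def transfer_check(geom_src, exp_src, geom_tgt, exp_tgt):
    """Fit exponents ~ geometry on the source model, freeze the coefficients,
    and evaluate how well they predict the target model's fitted exponents."""
    reg = LinearRegression().fit(geom_src, exp_src)
    pred = reg.predict(geom_tgt)
    return r2_score(exp_tgt, pred), pred - exp_tgt   # score and residuals

# A systematic pattern in the residuals (e.g. all positive at late layers)
# would indicate the geometric law does not generalise to the third model.
```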

Figures

Figures reproduced from arXiv: 2605.09887 by Eslam Zaher, Fred Roosta, Maciej Trzaskowski, Quan Nguyen.

Figure 1. Phenomenology of layerwise SAE reconstruction for Gemma 2 9B. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png]
Figure 2. Per-layer width-scaling exponents and the geometric wall. [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]
Figure 3. NMSE versus cosine distance (Gemma 2 9B). [PITH_FULL_IMAGE:figures/full_fig_p015_3.png]
Figure 4. NMSE versus cosine distance (Gemma 2 2B), matching Fig. 3 on the 2B model. [PITH_FULL_IMAGE:figures/full_fig_p015_4.png]
Figure 5. Pairwise distribution of the four geometric features used in the regression analysis. [PITH_FULL_IMAGE:figures/full_fig_p016_5.png]
Figure 6. Per-token NMSE distribution at six representative (model, layer, width, sparsity) cells. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png]
Figure 7. Showcase-layer with-floor surface fits sliced at three sparsity targets. [PITH_FULL_IMAGE:figures/full_fig_p019_7.png]
Figure 8. Phenomenology of layerwise SAE reconstruction for Gemma 2 2B, matching Fig. 1. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png]
Figure 9. Multi-k stability of the Stage 2 regression. Two side-by-side heatmaps (2B left, 9B right) of LOO R² across rows H1 (κms), the per-table H2 (the lowest-|ρ| pair), and Hfull, and across columns k ∈ {16, 25, 32, 50, 64, 75}. The H1 row stays high and flat at 2B and decreases monotonically with k at 9B; Hfull tracks H1 closely at 2B (well above H2 there) but at 9B sits a constant amount below both H1 …
Figure 10. Geometry features vs scaling-law parameters at … [PITH_FULL_IMAGE:figures/full_fig_p023_10.png]
read the original abstract

Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE's width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model's per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that activation manifold geometry (via four layerwise summaries: curvature, intrinsic dimension, and two others) predicts per-layer SAE width exponents and asymptotic floors. The two-stage procedure runs on 844 residual-stream SAEs across 68 layers of Gemma 2 2B and 9B: Stage 1 fits per-layer scaling surfaces, and Stage 2 regresses the extracted exponents and floors on the geometric summaries. Key results include successful prediction within each model and cross-model transfer of regression coefficients, plus correlation of the asymptotic floor with geometric ordering at showcase layers, framing SAE limits as a 'geometric wall' rather than a universal scaling law.

Significance. If the geometric summaries are shown to be independent of the SAE fitting process, the work would meaningfully advance understanding of why SAE reconstruction error varies across layers by tying it to intrinsic manifold properties. The cross-model transfer result is a clear strength, supporting generality beyond single-model correlations. This could inform SAE architecture choices and layer selection for interpretability work, moving the field from purely empirical scaling laws toward geometry-aware predictions.

major comments (3)
  1. [Abstract] Abstract and Stage 2 description: The geometric summaries are computed from the identical residual-stream activations used to train and evaluate the 844 SAEs. This creates a circularity risk in the regression, as both the width exponents (fitted quantities) and the predictors derive from the same activation statistics; the cross-model transfer provides partial mitigation but does not address whether the summaries are independent of SAE loss or activation norm/sparsity confounders.
  2. [Stage 2] Stage 2 regression: No details are supplied on how the four geometric summaries are estimated (e.g., estimators used, held-out data, random seeds), error-bar propagation from Stage 1 fits into the regression, or controls for multiple comparisons across 68 layers and two models. These omissions make it impossible to assess whether the reported predictive power is robust or overfit.
  3. [Showcase layers] Showcase layers analysis: The claim that the fitted asymptotic floor 'tracks the layerwise geometric ordering' (higher curvature/intrinsic dimension yields higher floor) is presented qualitatively. Without reported R² values, p-values, or controls for other layerwise covariates, it is unclear whether this supports the irreducible second-order residual interpretation or could arise from other activation properties.
minor comments (2)
  1. [Introduction] The introduction of the term 'geometric wall' is evocative but would benefit from a precise mathematical definition distinguishing it from standard scaling-law ceilings.
  2. [Figures] Figures showing per-layer exponent vs. geometric summary regressions should include confidence intervals derived from the Stage 1 surface fits to allow visual assessment of uncertainty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting areas where additional rigor and clarity would strengthen the work. We respond to each major comment below. Revisions have been made to expand methodological details, add statistical controls and quantifications, and address potential confounders, while preserving the core claim that manifold geometry provides a transferable predictor of layerwise SAE scaling behavior.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Stage 2 description: The geometric summaries are computed from the identical residual-stream activations used to train and evaluate the 844 SAEs. This creates a circularity risk in the regression, as both the width exponents (fitted quantities) and the predictors derive from the same activation statistics; the cross-model transfer provides partial mitigation but does not address whether the summaries are independent of SAE loss or activation norm/sparsity confounders.

    Authors: The geometric summaries are computed directly from the raw activation vectors prior to any SAE training and do not depend on the SAE loss, dictionary atoms, or reconstruction error; they are intrinsic manifold descriptors (curvature via local tangent-space approximations and intrinsic dimension via maximum-likelihood estimators). The width exponents, by contrast, are derived from the scaling of SAE reconstruction error across widths. While both quantities are obtained from the same activation dataset, this is intentional: the geometry is the property the SAEs attempt to approximate. To address possible confounding by activation norm or sparsity, the revised manuscript includes partial-correlation controls for these quantities in Stage 2; the reported predictive power and cross-model transfer coefficients remain stable under these controls. We have updated the abstract and Stage 2 description to make this independence explicit. revision: partial

  2. Referee: [Stage 2] Stage 2 regression: No details are supplied on how the four geometric summaries are estimated (e.g., estimators used, held-out data, random seeds), error-bar propagation from Stage 1 fits into the regression, or controls for multiple comparisons across 68 layers and two models. These omissions make it impossible to assess whether the reported predictive power is robust or overfit.

    Authors: We agree that these details are essential for reproducibility and robustness assessment. The revised Methods section now specifies: (i) the exact estimators (local PCA for curvature, MLE for intrinsic dimension, plus the two additional summaries), (ii) use of held-out activation batches for summary computation, (iii) fixed random seeds for all sampling and fitting steps, and (iv) Bonferroni correction for the 68-layer × 2-model comparisons. Regarding error propagation, Stage 1 fits exhibit low variance given the dense width grids; we have added a sensitivity analysis showing that plausible perturbations of the Stage 1 exponents do not alter the significance or direction of the Stage 2 coefficients. These additions allow readers to evaluate whether the predictive power is overfit. revision: yes

  3. Referee: [Showcase layers] Showcase layers analysis: The claim that the fitted asymptotic floor 'tracks the layerwise geometric ordering' (higher curvature/intrinsic dimension yields higher floor) is presented qualitatively. Without reported R² values, p-values, or controls for other layerwise covariates, it is unclear whether this supports the irreducible second-order residual interpretation or could arise from other activation properties.

    Authors: We accept that the original presentation was qualitative. The revised showcase-layers section now reports Pearson R² and associated p-values for the relationship between each geometric summary and the fitted asymptotic floor. In addition, we include partial-correlation analyses that control for layerwise activation norm and mean sparsity. These quantitative results remain consistent with the geometric-wall interpretation: the floor correlates with curvature and intrinsic dimension even after covariate adjustment, supporting the view that the residual is irreducible for any sparse linear dictionary on a curved manifold. revision: yes
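A hedged sketch of the statistical controls the responses above describe: a partial correlation that regresses layerwise covariates (e.g. activation norm, mean sparsity) out of both the exponent and a geometric feature, a Bonferroni cut across the per-layer tests, and a sign-stability check that jitters the Stage 1 exponents by their standard errors. Covariate names, the jitter model, and thresholds are placeholders, not the revised manuscript's exact procedure.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

def partial_corr(x, y, covariates):
    """Correlation between x and y after regressing the covariates
    (e.g. activation norm, mean sparsity) out of both."""
    rx = x - LinearRegression().fit(covariates, x).predict(covariates)
    ry = y - LinearRegression().fit(covariates, y).predict(covariates)
    return pearsonr(rx, ry)                    # (statistic, p-value)

def bonferroni_survivors(p_values, alpha=0.05):
    """Which per-layer tests remain significant after Bonferroni correction."""
    p = np.asarray(p_values)
    return p < alpha / len(p)

def coefficient_sign_stability(geometry, exponents, exp_se, n_draws=200, seed=0):
    """Refit Stage 2 with exponents jittered by their Stage 1 standard errors;
    report how often each coefficient keeps the sign of the unjittered fit."""
    rng = np.random.default_rng(seed)
    base = LinearRegression().fit(geometry, exponents).coef_
    same_sign = np.zeros_like(base)
    for _ in range(n_draws):
        jittered = exponents + rng.normal(0.0, exp_se)
        coef = LinearRegression().fit(geometry, jittered).coef_
        same_sign += (np.sign(coef) == np.sign(base))
    return same_sign / n_draws
```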

Circularity Check

0 steps flagged

No significant circularity; empirical two-stage regression on distinct computations

full rationale

The derivation consists of Stage 1 (fitting per-layer scaling surfaces to SAE reconstruction metrics across widths) followed by Stage 2 (regressing the resulting exponents and floors on separately computed geometric summaries of the activation manifold). These are distinct operations: the scaling-law parameters are derived from SAE training outcomes, while curvature and intrinsic dimension are manifold estimators applied to the same activations but not defined in terms of the SAE fits or exponents. The cross-model transfer test further separates the regression coefficients from any single model's data. No equation reduces to its input by construction, no self-citation chain bears the central claim, and no ansatz or uniqueness theorem is smuggled in. The result is an observed correlation, not a definitional equivalence.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the validity of the per-layer scaling-law functional form, the choice and independence of four geometric summary statistics, and the assumption that any sparse linear dictionary must leave a second-order residual on a curved manifold.

free parameters (2)
  • per-layer width exponent
    Fitted parameter of the scaling-law surface in stage 1 for each of the 68 layers.
  • asymptotic floor
    Fitted at showcase layers where wide width grids permit identification of the irreducible residual.
axioms (1)
  • domain assumption: Activation manifolds possess measurable curvature and intrinsic dimension that determine the quality of any sparse linear approximation.
    Invoked to explain why SAE scaling cannot be uniform across layers.
invented entities (1)
  • geometric wall (no independent evidence)
    purpose: Conceptual label for the irreducible error floor imposed by manifold curvature on sparse linear dictionaries.
    Introduced to frame the observation that fitted floors track geometric ordering.
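For concreteness, the ledger's two free parameters can be read as sitting inside a per-layer surface of roughly the following shape; this is one plausible parameterisation consistent with the descriptions above, not the paper's stated functional form.

```latex
% Illustrative per-layer surface (Stage 1) and geometric regression (Stage 2);
% an assumed parameterisation, not the paper's exact one.
\mathrm{NMSE}_{\ell}(W, k) \;\approx\; E_{\infty,\ell} + A_{\ell}\, W^{-\alpha_{\ell}}\, f(k),
\qquad
\alpha_{\ell} \;\approx\; b_{0} + \mathbf{b}^{\top}\mathbf{g}_{\ell}
```

Here W is the SAE width, k the sparsity target, g_ℓ the four geometric summaries of layer ℓ, E_∞,ℓ the asymptotic floor, and α_ℓ the per-layer width exponent; the "geometric wall" is the claim that E_∞,ℓ, like α_ℓ, is ordered by g_ℓ.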

pith-pipeline@v0.9.0 · 5613 in / 1564 out tokens · 63796 ms · 2026-05-12T04:44:58.680679+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages

  1. [1]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In International Conference on Learning Representations (ICLR), 2025. Oral.

  2. [2]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning (ICML), 2024.

  3. [3]

    Toy models of superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/20...

  4. [4]

    Sparse autoencoders find highly interpretable features in language models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations (ICLR), 2024.

  5. [5]

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. To...

  6. [6]

    Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn,...

  7. [7]

    Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278–300, Miami, Florida...

  8. [8]

    Not all language model features are one-dimensionally linear

    Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. In International Conference on Learning Representations (ICLR), 2025.

  9. [9]

    The geometry of concepts: Sparse autoencoder feature structure

    Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark. The geometry of concepts: Sparse autoencoder feature structure. Entropy, 27(4):344, 2025. arXiv:2410.19750, 2024.

  10. [10]

    The geometry of categorical and hierarchical concepts in large language models

    Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. The geometry of categorical and hierarchical concepts in large language models. In International Conference on Learning Representations (ICLR), 2025.

  11. [11]

    What is a linear representation? What is a multidimensional feature?

    Chris Olah and Adam Jermyn. What is a linear representation? What is a multidimensional feature? Transformer Circuits Thread, July 2024 update, 2024. URL https://transformer-circuits.pub/2024/july-update/index.html

  12. [12]

    Understanding sparse autoencoder scaling in the presence of feature manifolds

    Eric J. Michaud, Liv Gorton, and Tom McGrath. Understanding sparse autoencoder scaling in the presence of feature manifolds, 2025

  13. [13]

    The origins of representation manifolds in large language models, 2025

    Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models, 2025

  14. [14]

    When models manipulate manifolds: The geometry of a counting task

    Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, and Joshua Batson. When models manipulate manifolds: The geometry of a counting task. Transformer Circuits Thread, 2026

  15. [15]

    Decomposing the dark matter of sparse autoencoders

    Joshua Engels, Logan Riggs, and Max Tegmark. Decomposing the dark matter of sparse autoencoders. Transactions on Machine Learning Research, 2025. arXiv:2410.14670.

  16. [16]

    SAE reconstruction errors are (empirically) pathological

    Wes Gurnee. SAE reconstruction errors are (empirically) pathological. AI Alignment Forum, 2024. URL https://www.lesswrong.com/posts/rZPiuFxESMxCDHe4B/sae-reconstruction-errors-are-empirically-pathological

  17. [17]

    Intrinsic dimension of data representations in deep neural networks

    Alessio Ansuini, Alessandro Laio, Jakob H. Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

  18. [18]

    The geometry of hidden representations of large transformer models

    Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  19. [19]

    Mohamed A. Mabrok. Latent semantic manifolds in large language models, 2026

  20. [20]

    The information geometry of softmax: Probing and steering, 2026

    Kiho Park, Todd Nief, Yo Joong Choe, and Victor Veitch. The information geometry of softmax: Probing and steering, 2026

  21. [21]

    Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders, 2024

    Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders, 2024

  22. [22]

    Information Geometry and Its Applications

    Shun-ichi Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer, Tokyo, 2016.

  23. [23]

    N. N. Cencov. Statistical Decision Rules and Optimal Inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1982.

  24. [24]

    On a measure of divergence between two statistical populations defined by their probability distributions

    A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35:99–109, 1943.

  25. [25]

    Estimating the intrinsic dimension of datasets by a minimal neighborhood information

    Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7:12140, 2017.

  26. [26]

    Gemma 2: Improving open language models at a practical size, 2024

    Gemma Team. Gemma 2: Improving open language models at a practical size, 2024

  27. [27]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

  28. [28]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017.

  29. [29]

    F. N. Fritsch and R. E. Carlson. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2):238–246, 1980.

  30. [30]

    D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125–127, 1974.

  31. [31]

    M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):111–133, 1974.

  32. [32]

    J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.

  33. [33]

    Introduction to Riemannian Manifolds

    John M. Lee. Introduction to Riemannian Manifolds, volume 176 of Graduate Texts in Mathematics. Springer, Cham, 2nd edition, 2018.

  34. [34]

    Riemannian Geometry

    Manfredo P. do Carmo. Riemannian Geometry. Birkhäuser, Boston, MA, 1992.

  35. [35]

    Information Geometry

    Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. Information Geometry, volume 64 of Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, Cham, 2017.

  36. [36]

    Multiscale geometric methods for data sets I: Multiscale SVD, noise and curvature

    Anna V. Little, Mauro Maggioni, and Lorenzo Rosasco. Multiscale geometric methods for data sets I: Multiscale SVD, noise and curvature. Applied and Computational Harmonic Analysis, 43(3):504–567, 2017.

  37. [37]

    The geometry of algorithms with orthogonality constraints

    Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353,

  38. [38]

    doi: 10.1137/S0895479895290954

  39. [39]

    Adjustment of an inverse matrix corresponding to a change in one element of a given matrix

    Jack Sherman and Winifred J. Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127,

  40. [40]

    The Annals of Mathematical Statistics, 22(1):79–86, 1951. https://doi.org/10.1214/aoms/1177729694

    doi: 10.1214/aoms/1177729893
