The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws
Pith reviewed 2026-05-12 04:44 UTC · model grok-4.3
The pith
Manifold geometry of activations sets the per-layer width scaling of sparse autoencoders.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that SAE width-sparsity scaling is a layer-dependent function of manifold structure rather than a universal law. Fitting per-layer scaling-law surfaces on 844 Gemma Scope checkpoints and regressing the derived width exponents on four layerwise geometric summaries yields strong prediction in both models, with coefficients learned on one model transferring to the other. Where richer width grids allow identification of the asymptotic floor, this floor tracks the geometric ordering: higher curvature and intrinsic dimension produce higher floors, matching the second-order residual expected from linear approximations of curved manifolds.
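The "second-order residual" the floor result appeals to is the standard tangent-plane error bound. As a sketch (notation ours, not the paper's):

```latex
% For a point q on a manifold M embedded in R^n, reconstructing a nearby
% point p = \exp_q(r v), \|v\| = 1, with a linear (tangent-space) dictionary
% leaves an error governed by the second fundamental form II_q:
\operatorname{dist}\bigl(p,\; q + T_q M\bigr)
  \;=\; \tfrac{r^2}{2}\,\bigl\|\mathrm{II}_q(v,v)\bigr\| \;+\; O(r^3).
```

Because this residual grows with curvature and with the size of the neighbourhood each dictionary atom must cover, no increase in width drives it to zero, which is the claimed geometry-dependent floor.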
What carries the argument
Regression of per-layer SAE scaling-law parameters (width exponent and asymptotic floor) onto four geometric summaries of the activation manifold, chiefly curvature and intrinsic dimension.
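A minimal sketch of this two-stage pipeline, assuming a three-parameter power law with floor for Stage 1 and ordinary least squares for Stage 2 (the function names and the exact functional form are our assumptions; the paper's fitted surface may differ):

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_surface(width, floor, amp, exponent):
    # Per-layer scaling law: reconstruction error = floor + amp * width^(-exponent)
    return floor + amp * np.power(width, -exponent)

def fit_layer(widths, errors):
    # Stage 1: fit the three scaling-law parameters for one layer.
    p0 = (0.5 * errors.min(), errors.max() - errors.min(), 0.5)
    params, _ = curve_fit(scaling_surface, widths, errors, p0=p0,
                          bounds=([0, 0, 0], [np.inf, np.inf, 5.0]))
    return params  # (floor, amp, exponent)

def regress_exponents(features, exponents):
    # Stage 2: OLS of per-layer width exponents on geometric summaries
    # (curvature, intrinsic dimension, and the two other layerwise features).
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, exponents, rcond=None)
    return coef  # intercept followed by one weight per summary
```

The same `regress_exponents` output, applied to the other model's feature matrix, is what the cross-model transfer result evaluates.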
If this is right
- Scaling laws for SAEs must be treated as layer-specific rather than universal across a model.
- The asymptotic reconstruction floor rises with layerwise curvature and intrinsic dimension.
- Regression coefficients learned on one model transfer to predict exponents in another model.
- SAEs encounter a geometry-dependent wall set by the manifold they reconstruct, not solely a finite-resource ceiling.
Where Pith is reading between the lines
- If geometry drives the scaling, then interventions that flatten manifolds or reduce curvature could lower the effective width needed for good SAE performance.
- The same geometric predictors may apply to other interpretability methods that assume globally linear structure in activation space.
- Testing the regression on models trained on different data or with different architectures would reveal how general the geometric law is.
- Layer selection for efficient SAE training could be guided by measuring curvature and dimension first.
Load-bearing premise
The four chosen geometric summaries of each layer's activation manifold are the causally relevant properties and are measured independently of the SAE fitting process itself.
What would settle it
Apply the fitted regression coefficients to predict width exponents on a third model or architecture whose measured curvature and intrinsic dimension are known, then check whether the actual fitted SAE scaling exponents deviate systematically from those predictions.
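Once per-layer exponents and geometric summaries exist for a third model, the proposed check is mechanical; a hedged sketch (the scoring metrics and names are our choices):

```python
import numpy as np

def transfer_score(coef_a, features_b, exponents_b):
    """Predict model B's per-layer width exponents from regression
    coefficients fitted on model A, and score the agreement."""
    X_b = np.column_stack([np.ones(len(features_b)), features_b])
    predicted = X_b @ coef_a            # geometric law learnt on model A
    residual = exponents_b - predicted
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    r = float(np.corrcoef(predicted, exponents_b)[0, 1])
    return predicted, r, rmse
```

A systematic (rather than merely noisy) pattern in `residual` across layers is the deviation that would falsify the transferable-law claim.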
original abstract
Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE's width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model's per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that activation manifold geometry (via four layerwise summaries: curvature, intrinsic dimension, and two others) predicts per-layer SAE width exponents and asymptotic floors. Using a two-stage procedure on 844 residual-stream SAEs across 68 layers of Gemma 2 2B and 9B, Stage 1 fits per-layer scaling surfaces while Stage 2 regresses the extracted exponents/floors on the geometric summaries. Key results include successful prediction within each model and cross-model transfer of regression coefficients, plus correlation of the asymptotic floor with geometric ordering at showcase layers, framing SAE limits as a 'geometric wall' rather than a universal scaling law.
Significance. If the geometric summaries are shown to be independent of the SAE fitting process, the work would meaningfully advance understanding of why SAE reconstruction error varies across layers by tying it to intrinsic manifold properties. The cross-model transfer result is a clear strength, supporting generality beyond single-model correlations. This could inform SAE architecture choices and layer selection for interpretability work, moving the field from purely empirical scaling laws toward geometry-aware predictions.
major comments (3)
- [Abstract] Abstract and Stage 2 description: The geometric summaries are computed from the identical residual-stream activations used to train and evaluate the 844 SAEs. This creates a circularity risk in the regression, as both the width exponents (fitted quantities) and the predictors derive from the same activation statistics; the cross-model transfer provides partial mitigation but does not address whether the summaries are independent of SAE loss or activation norm/sparsity confounders.
- [Stage 2] Stage 2 regression: No details are supplied on how the four geometric summaries are estimated (e.g., estimators used, held-out data, random seeds), error-bar propagation from Stage 1 fits into the regression, or controls for multiple comparisons across 68 layers and two models. These omissions make it impossible to assess whether the reported predictive power is robust or overfit.
- [Showcase layers] Showcase layers analysis: The claim that the fitted asymptotic floor 'tracks the layerwise geometric ordering' (higher curvature/intrinsic dimension yields higher floor) is presented qualitatively. Without reported R² values, p-values, or controls for other layerwise covariates, it is unclear whether this supports the irreducible second-order residual interpretation or could arise from other activation properties.
minor comments (2)
- [Introduction] The introduction of the term 'geometric wall' is evocative but would benefit from a precise mathematical definition distinguishing it from standard scaling-law ceilings.
- [Figures] Figures showing per-layer exponent vs. geometric summary regressions should include confidence intervals derived from the Stage 1 surface fits to allow visual assessment of uncertainty.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting areas where additional rigor and clarity would strengthen the work. We respond to each major comment below. Revisions have been made to expand methodological details, add statistical controls and quantifications, and address potential confounders, while preserving the core claim that manifold geometry provides a transferable predictor of layerwise SAE scaling behavior.
point-by-point responses
-
Referee: [Abstract] Abstract and Stage 2 description: The geometric summaries are computed from the identical residual-stream activations used to train and evaluate the 844 SAEs. This creates a circularity risk in the regression, as both the width exponents (fitted quantities) and the predictors derive from the same activation statistics; the cross-model transfer provides partial mitigation but does not address whether the summaries are independent of SAE loss or activation norm/sparsity confounders.
Authors: The geometric summaries are computed directly from the raw activation vectors prior to any SAE training and do not depend on the SAE loss, dictionary atoms, or reconstruction error; they are intrinsic manifold descriptors (curvature via local tangent-space approximations and intrinsic dimension via maximum-likelihood estimators). The width exponents, by contrast, are derived from the scaling of SAE reconstruction error across widths. While both quantities are obtained from the same activation dataset, this is intentional: the geometry is the property the SAEs attempt to approximate. To address possible confounding by activation norm or sparsity, the revised manuscript includes partial-correlation controls for these quantities in Stage 2; the reported predictive power and cross-model transfer coefficients remain stable under these controls. We have updated the abstract and Stage 2 description to make this independence explicit. revision: partial
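The rebuttal names maximum-likelihood intrinsic-dimension estimation without further detail; the standard choice is the Levina-Bickel estimator, sketched here on raw activation vectors (the fixed `k` and the brute-force pairwise distances are our simplifications, not the paper's implementation):

```python
import numpy as np

def mle_intrinsic_dimension(points, k=10):
    # Levina-Bickel MLE: at each point, the inverse mean log-ratio of the
    # k-th neighbour distance to the closer neighbour distances estimates
    # the local dimension; the global estimate averages over points.
    dists = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)          # exclude self-distances
    knn = np.sort(dists, axis=1)[:, :k]      # distances to k nearest neighbours
    # m_hat(x) = [ (1/(k-1)) * sum_{j<k} log(T_k / T_j) ]^{-1}
    logs = np.log(knn[:, -1:] / knn[:, :-1])
    local = (k - 1) / logs.sum(axis=1)
    return float(np.mean(local))
```

Crucially for the independence argument, nothing here touches SAE losses or dictionary atoms: the estimator sees only the activation point cloud.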
-
Referee: [Stage 2] Stage 2 regression: No details are supplied on how the four geometric summaries are estimated (e.g., estimators used, held-out data, random seeds), error-bar propagation from Stage 1 fits into the regression, or controls for multiple comparisons across 68 layers and two models. These omissions make it impossible to assess whether the reported predictive power is robust or overfit.
Authors: We agree that these details are essential for reproducibility and robustness assessment. The revised Methods section now specifies: (i) the exact estimators (local PCA for curvature, MLE for intrinsic dimension, plus the two additional summaries), (ii) use of held-out activation batches for summary computation, (iii) fixed random seeds for all sampling and fitting steps, and (iv) Bonferroni correction for the 68-layer × 2-model comparisons. Regarding error propagation, Stage 1 fits exhibit low variance given the dense width grids; we have added a sensitivity analysis showing that plausible perturbations of the Stage 1 exponents do not alter the significance or direction of the Stage 2 coefficients. These additions allow readers to evaluate whether the predictive power is overfit. revision: yes
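The Bonferroni step the revision adds is simple enough to state exactly; a sketch using the 68-layer × 2-model count from the paper (the helper name is ours):

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    # Scale each raw p-value by the total number of comparisons;
    # a test survives only if the adjusted value stays below alpha.
    p = np.asarray(p_values, dtype=float)
    adjusted = np.minimum(p * p.size, 1.0)
    return adjusted, adjusted < alpha
```

With m = 68 layers × 2 models = 136 tests, a per-layer regression must reach a raw p below 0.05 / 136 ≈ 3.7e-4 to survive the correction.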
-
Referee: [Showcase layers] Showcase layers analysis: The claim that the fitted asymptotic floor 'tracks the layerwise geometric ordering' (higher curvature/intrinsic dimension yields higher floor) is presented qualitatively. Without reported R² values, p-values, or controls for other layerwise covariates, it is unclear whether this supports the irreducible second-order residual interpretation or could arise from other activation properties.
Authors: We accept that the original presentation was qualitative. The revised showcase-layers section now reports Pearson R² and associated p-values for the relationship between each geometric summary and the fitted asymptotic floor. In addition, we include partial-correlation analyses that control for layerwise activation norm and mean sparsity. These quantitative results remain consistent with the geometric-wall interpretation: the floor correlates with curvature and intrinsic dimension even after covariate adjustment, supporting the view that the residual is irreducible for any sparse linear dictionary on a curved manifold. revision: yes
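Partial correlation via double residualisation — regress the covariates out of both variables, then correlate the residuals — is the standard construction behind the controls described here; a sketch (variable names ours):

```python
import numpy as np

def partial_corr(x, y, covariates):
    # Correlation of x and y after removing the linear influence of the
    # covariates from both, i.e. the correlation of the two OLS residuals.
    Z = np.column_stack([np.ones(len(x)), covariates])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])
```

Applied per model, `x` would be a geometric summary, `y` the fitted floor, and the covariates the layerwise activation norm and mean sparsity.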
Circularity Check
No significant circularity; empirical two-stage regression on distinct computations
full rationale
The derivation consists of Stage 1 (fitting per-layer scaling surfaces to SAE reconstruction metrics across widths) followed by Stage 2 (regressing the resulting exponents and floors on separately computed geometric summaries of the activation manifold). These are distinct operations: the scaling-law parameters are derived from SAE training outcomes, while curvature and intrinsic dimension are manifold estimators applied to the same activations but not defined in terms of the SAE fits or exponents. The cross-model transfer test further separates the regression coefficients from any single model's data. No equation reduces to its input by construction, no self-citation chain bears the central claim, and no ansatz or uniqueness theorem is smuggled in. The result is an observed correlation, not a definitional equivalence.
Axiom & Free-Parameter Ledger
free parameters (2)
- per-layer width exponent
- asymptotic floor
axioms (1)
- domain assumption: Activation manifolds possess measurable curvature and intrinsic dimension that determine the quality of any sparse linear approximation.
invented entities (1)
-
geometric wall
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In International Conference on Learning Representations (ICLR), 2025. Oral.
work page 2025
-
[2]
The linear representation hypothesis and the geometry of large language models
Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning (ICML), 2024.
work page 2024
-
[3]
Toy models of superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/20...
work page 2022
-
[4]
Sparse autoencoders find highly interpretable features in language models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In International Conference on Learning Representations (ICLR), 2024.
work page 2024
-
[5]
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. To...
work page 2023
-
[6]
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn,...
work page 2024
-
[7]
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278–300, Miami, Florida...
-
[8]
Not all language model features are one-dimensionally linear
Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. In International Conference on Learning Representations (ICLR), 2025.
work page 2025
-
[9]
The geometry of concepts: Sparse autoencoder feature structure
Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark. The geometry of concepts: Sparse autoencoder feature structure. Entropy, 27(4):344, 2025. arXiv:2410.19750.
-
[10]
The geometry of categorical and hierarchical concepts in large language models
Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch. The geometry of categorical and hierarchical concepts in large language models. In International Conference on Learning Representations (ICLR), 2025.
work page 2025
-
[11]
Chris Olah and Adam Jermyn. What is a linear representation? What is a multidimensional feature? Transformer Circuits Thread, July 2024 update, 2024. URL https://transformer-circuits.pub/2024/july-update/index.html
work page 2024
-
[12]
Understanding sparse autoencoder scaling in the presence of feature manifolds
Eric J. Michaud, Liv Gorton, and Tom McGrath. Understanding sparse autoencoder scaling in the presence of feature manifolds, 2025.
work page 2025
-
[13]
The origins of representation manifolds in large language models
Alexander Modell, Patrick Rubin-Delanchy, and Nick Whiteley. The origins of representation manifolds in large language models, 2025.
work page 2025
-
[14]
When models manipulate manifolds: The geometry of a counting task
Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, and Joshua Batson. When models manipulate manifolds: The geometry of a counting task. Transformer Circuits Thread, 2026.
work page 2026
-
[15]
Decomposing the dark matter of sparse autoencoders
Joshua Engels, Logan Riggs, and Max Tegmark. Decomposing the dark matter of sparse autoencoders. Transactions on Machine Learning Research, 2025. arXiv:2410.14670.
-
[16]
SAE reconstruction errors are (empirically) pathological
Wes Gurnee. SAE reconstruction errors are (empirically) pathological. AI Alignment Forum, 2024. URL https://www.lesswrong.com/posts/rZPiuFxESMxCDHe4B/sae-reconstruction-errors-are-empirically-pathological
work page 2024
-
[17]
Alessio Ansuini, Alessandro Laio, Jakob H. Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.
work page 2019
-
[18]
The geometry of hidden representations of large transformer models
Lucrezia Valeriani, Diego Doimo, Francesca Cuturello, Alessandro Laio, Alessio Ansuini, and Alberto Cazzaniga. The geometry of hidden representations of large transformer models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023.
work page 2023
-
[19]
Mohamed A. Mabrok. Latent semantic manifolds in large language models, 2026.
work page 2026
-
[20]
The information geometry of softmax: Probing and steering
Kiho Park, Todd Nief, Yo Joong Choe, and Victor Veitch. The information geometry of softmax: Probing and steering, 2026.
work page 2026
-
[21]
Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders
Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders, 2024.
work page 2024
-
[22]
Shun-ichi Amari. Information Geometry and Its Applications, volume 194 of Applied Mathematical Sciences. Springer, Tokyo, 2016.
work page 2016
-
[23]
N. N. Cencov. Statistical Decision Rules and Optimal Inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1982.
work page 1982
-
[24]
A. Bhattacharyya. On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematical Society, 35:99–109, 1943.
work page 1943
-
[25]
Elena Facco, Maria d’Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7:12140, 2017.
work page 2017
-
[26]
Gemma 2: Improving open language models at a practical size
Gemma Team. Gemma 2: Improving open language models at a practical size, 2024.
work page 2024
-
[27]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
work page 2020
-
[28]
Pointer sentinel mixture models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations (ICLR), 2017.
work page 2017
-
[29]
F. N. Fritsch and R. E. Carlson. Monotone piecewise cubic interpolation. SIAM Journal on Numerical Analysis, 17(2):238–246, 1980.
work page 1980
-
[30]
D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125–127, 1974.
work page 1974
-
[31]
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):111–133, 1974.
work page 1974
-
[32]
J. A. Nelder and R. Mead. A simplex method for function minimization. The Computer Journal, 7(4):308–313, 1965.
work page 1965
-
[33]
Introduction to Riemannian Manifolds
John M. Lee. Introduction to Riemannian Manifolds, volume 176 of Graduate Texts in Mathematics. Springer, Cham, 2nd edition, 2018.
work page 2018
-
[34]
Manfredo P. do Carmo. Riemannian Geometry. Birkhäuser, Boston, MA, 1992.
work page 1992
-
[35]
Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer. Information Geometry, volume 64 of Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer, Cham, 2017.
work page 2017
-
[36]
Multiscale geometric methods for data sets I: Multiscale SVD, noise and curvature
Anna V. Little, Mauro Maggioni, and Lorenzo Rosasco. Multiscale geometric methods for data sets I: Multiscale SVD, noise and curvature. Applied and Computational Harmonic Analysis, 43(3):504–567, 2017.
work page 2017
-
[37]
The geometry of algorithms with orthogonality constraints
Alan Edelman, Tomás A. Arias, and Steven T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998. doi: 10.1137/S0895479895290954
-
[38]
-
[39]
-
[40]
S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951. doi: 10.1214/aoms/1177729694
-
[41]
doi: 10.1214/aoms/1177729893
the activations are made of a spectrum of components with different amount of structure
work page 2000

Appendix A: Differential-Geometric Foundations
This appendix provides the full differential-geometric development summarised in Section 2.4. Standard references are Lee [33] and do Carmo [34] for Riemannian geometry, and Amari [22] and Ay et al. [35] for information geometry.
Smooth manifolds. A smooth manifold M of dimension m is a space that locally resembles Rm: it is covered by coordinate charts (Uα, φα), each mapping an open patch of M smoothly onto a region of Rm, with smooth (C∞) transition functions on overlaps. At each point q ∈ M, the tangent space TqM collects all velocity vectors of smooth c...
discussion (0)