pith. machine review for the scientific record. sign in

arxiv: 2605.09224 · v1 · submitted 2026-05-09 · 💻 cs.LG

Recognition: no theorem link

SMIXAE: Towards Unsupervised Manifold Discovery in Language Models

Collin Francel

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:45 UTC · model grok-4.3

classification 💻 cs.LG
keywords sparse autoencodersmanifold discoverylanguage model interpretabilitymixture modelsunsupervised feature learningGemma modelstransformer activations
0
0 comments X

The pith

SMIXAE uses a mixture of sparse autoencoders to learn multidimensional manifold structures directly in language model activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard sparse autoencoders break up complex multidimensional features in transformer activations into separate one-dimensional directions that later require manual grouping. The paper introduces the SMIXAE architecture as a way to model these manifold structures as single units from the start. Experiments on the Gemma 2 2B and 9B models show the new model both recovers previously known manifolds and identifies new ones without post-processing. This matters for interpretability because it removes a step that currently limits how easily researchers can find and understand the internal representations learned by large language models.

Core claim

The Sparse MIXture of Autoencoders (SMIXAE) architecture succeeds at directly learning previously identified manifold structures as well as discovering novel structures within the activations of the open-source Gemma 2 2B and 9B models.

What carries the argument

The SMIXAE mixture architecture, which trains multiple sparse autoencoders together so that each component can represent an entire multidimensional manifold rather than isolated directions.

If this is right

  • Multidimensional features can be discovered and interpreted as single units during training rather than after.
  • Unsupervised discovery becomes possible for manifold structures that were previously hard to isolate.
  • The approach demonstrates success on both 2B and 9B scale open models from the Gemma 2 family.
  • Limitations discussed in the paper point to the need for further scaling and validation work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixture approach could be tested on other transformer families to check whether manifold recovery holds beyond Gemma.
  • If SMIXAE reduces the need for post-training feature grouping, it could speed up automated interpretability pipelines that currently rely on clustering steps.
  • Novel structures found by SMIXAE might correspond to functional behaviors that standard SAEs miss, offering a route to test specific hypotheses about model computation.

Load-bearing premise

The mixture architecture can reliably capture multidimensional manifold structures in language model activations without needing post-hoc grouping, based on results from two specific Gemma models.

What would settle it

Running the same SMIXAE training on a different language model family or on a set of known multidimensional features and finding that the learned representations still split across multiple components requiring manual grouping afterward.

Figures

Figures reproduced from arXiv: 2605.09224 by Collin Francel.

Figure 1
Figure 1. Figure 1: Newline counting manifold in Gemma 2 9B, Layer 11, on 150-char wrapped text. (a, b) SMIXAE bottleneck acti￾vations for the top two experts by ∆R 2 per: (a) Expert 541, rank 1 (Score = 0.548); (b) Expert 1521, rank 2 (Score = 0.500). Points are per-class mean activations colored by number of characters since the previous newline. (c) Visualization from Sinii & Balagansky (2026): PCA over the same layer’s ac… view at source ↗
Figure 2
Figure 2. Figure 2: SMIXAE Bottleneck Activations on Gemma 2 9B, Layer 11. Each plot shows the 3-D bottleneck activations of a single SMIXAE expert; small points are individual token activations colored by ground-truth label, and larger points mark per-class means. The + symbol marks the origin of the plot. Weekdays: (a) Expert 76, rank 1, 7-Day Ring (R 2 = 0.855). Hours: (b) Expert 1884, rank 2, AM vs PM (Accuracy = 0.997). … view at source ↗
Figure 3
Figure 3. Figure 3: Gemma 2 2B, Layer 12. Hours: (a) Expert 589, rank 1, 24-Hour Ring (R 2 = 0.681). Temperature: (b) Expert 789, rank 1, Fahrenheit (R 2 = 0.663). Time Units: (c) Expert 1009, rank 2, log _10 Duration (R 2 = 0.686). Each plot shows the 3-D bottleneck activations of a single SMIXAE expert; small points are individual token activations colored by ground-truth label, and larger points mark per-class means. In [… view at source ↗
Figure 4
Figure 4. Figure 4: Gemma 2 9B, Layer 20. Weekdays: (a) Expert 391, rank 1, 7-Day Ring (R 2 = 0.724). Hours: (b) Expert 1041, rank 2, AM vs PM (Accuracy = 0.997). (c) Expert 1078, rank 1, 24-Hour Ring (R 2 = 0.590). Temperature: (d) Expert 1716, rank 1, Fahrenheit (R 2 = 0.875). Time Units: (e) Expert 923, rank 1, log _10 Duration (R 2 = 0.856). Each plot shows the 3-D bottleneck activations of a single SMIXAE expert; small p… view at source ↗
Figure 5
Figure 5. Figure 5: Gemma 2 2B, Layer 12. Random Experts: (a) Expert 1359. (b) Expert 1568. (c) Expert 1578. (d) Expert 213. (e) Expert 234. (f) Expert 293. (g) Expert 477. (h) Expert 51. (i) Expert 524. (j) Expert 587. Each plot shows the 3-D bottleneck activations of a single SMIXAE expert; points are individual token activations colored by distance from the origin. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Gemma 2 9B, Layer 11. Random Experts: (a) Expert 1326. (b) Expert 1527. (c) Expert 1537. (d) Expert 210. (e) Expert 229. (f) Expert 289. (g) Expert 464. (h) Expert 508. (i) Expert 51. (j) Expert 571. Each plot shows the 3-D bottleneck activations of a single SMIXAE expert; points are individual token activations colored by distance from the origin. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Gemma 2 9B, Layer 20. Random Experts: (a) Expert 1360. (b) Expert 1567. (c) Expert 1578. (d) Expert 217. (e) Expert 236. (f) Expert 296. (g) Expert 474. (h) Expert 520. (i) Expert 53. (j) Expert 583. Each plot shows the 3-D bottleneck activations of a single SMIXAE expert; points are individual token activations colored by distance from the origin. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Newline Position (150 chars) — Gemma 2 2B, Layer 12. Newline Position: (a) Expert 1895, rank 2, Periodic Gain (Score = 0.406). (b) Expert 647, rank 1, Periodic Gain (Score = 0.225). Points represent per-class mean activations in the bottleneck space, colored by distance since the last newline. (a) (b) Newline Position [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Newline Position (150 chars) — Gemma 2 9B, Layer 20. Newline Position: (a) Expert 1749, rank 2, Periodic Gain (Score = 0.320). (b) Expert 1520, rank 1, Periodic Gain (Score = 0.280). Points represent per-class mean activations in the bottleneck space, colored by distance since the last newline [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Visualization of a single autoencoder trained on a torus or helix. We observe low MSE loss, showcasing that each autoencoder in SMIXAE is capable of learning manifolds when provided sufficient training data in a noise free setting. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
read the original abstract

Sparse autoencoders (SAEs) have been used widely to decompose and interpret neural network activations, especially those of transformer language models. One key issue with SAEs is their inability to directly model multidimensional features. Instead, SAEs may tile such features by a set of independent directions that must be grouped together after the SAE training phase, impeding discoverability and interpretation of learned feature representations. We begin to address this issue by introducing the Sparse MIXture of Autoencoders (SMIXAE) architecture. Empirically, we provide evidence that SMIXAE models have success both in directly learning previously identified manifold structures, as well as finding novel structures, within the open source Gemma 2 2B and 9B models. Finally, we discuss several limitations and point towards areas for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Sparse MIXture of Autoencoders (SMIXAE), a mixture-of-autoencoders architecture intended to overcome the tendency of standard sparse autoencoders (SAEs) to tile multidimensional manifold features in language-model activations into independent directions that must be grouped post hoc. The central claim is that SMIXAE directly learns both previously identified and novel manifold structures in the activations of open-source Gemma 2 2B and 9B models, with supporting empirical evidence presented and limitations discussed.

Significance. If the empirical results can be shown to rest on well-defined quantitative metrics, ablations, and validation procedures rather than subjective interpretation, the work would address a recognized limitation in mechanistic interpretability and could reduce dependence on post-training feature grouping. The use of publicly available Gemma 2 models is a positive step toward reproducibility.

major comments (3)
  1. [Experiments] Experiments section: the abstract and main text assert empirical success in directly learning multidimensional manifolds without post-hoc grouping, yet no quantitative metrics (e.g., per-component effective dimensionality, reconstruction error on held-out known manifolds, or clustering purity scores versus SAE baselines) or ablation studies isolating the mixture component are reported. This absence prevents assessment of whether the claimed advantage is architectural or due to hyperparameter choices.
  2. [Method] Method section: the description of how mixture components are assigned to manifold dimensions (as opposed to independent directions) is not accompanied by a formal argument or diagnostic showing that the assignment is enforced by the architecture rather than emerging from training dynamics or initialization; without this, the central distinction from tiled SAE features remains unproven.
  3. [Experiments] Validation of 'previously identified' and 'novel' structures: the paper does not specify the procedure used to confirm that recovered structures correspond to true manifolds (e.g., comparison against known feature dictionaries, geometric tests for dimensionality, or human-interpretability controls), leaving open the possibility that success is driven by subjective selection.
minor comments (2)
  1. [Method] Notation for the mixture weights and sparsity penalties should be unified across equations and text to avoid ambiguity in the definition of the SMIXAE objective.
  2. [Limitations] The limitations section could usefully include a brief discussion of computational overhead relative to standard SAEs, as mixture models typically increase training cost.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments identify important opportunities to strengthen the quantitative rigor, theoretical grounding, and validation procedures in the manuscript. We address each major comment below and will incorporate revisions to address them.

read point-by-point responses
  1. Referee: Experiments section: the abstract and main text assert empirical success in directly learning multidimensional manifolds without post-hoc grouping, yet no quantitative metrics (e.g., per-component effective dimensionality, reconstruction error on held-out known manifolds, or clustering purity scores versus SAE baselines) or ablation studies isolating the mixture component are reported. This absence prevents assessment of whether the claimed advantage is architectural or due to hyperparameter choices.

    Authors: We agree that additional quantitative metrics and ablations would strengthen the claims. The current version relies primarily on visualizations and qualitative recovery of known structures. In the revised manuscript we will add per-component effective dimensionality, reconstruction error on held-out data, clustering purity comparisons against SAE baselines, and ablation studies that isolate the mixture component. These additions will clarify whether the observed advantages arise from the architecture. revision: yes

  2. Referee: Method section: the description of how mixture components are assigned to manifold dimensions (as opposed to independent directions) is not accompanied by a formal argument or diagnostic showing that the assignment is enforced by the architecture rather than emerging from training dynamics or initialization; without this, the central distinction from tiled SAE features remains unproven.

    Authors: We acknowledge that the manuscript would benefit from an explicit formal argument. The mixture-of-autoencoders design is intended to encourage each component to capture a coherent subspace rather than isolated directions, but this was not accompanied by a proof or diagnostic. In the revision we will add a formal argument in the Method section together with empirical diagnostics (component-wise activation statistics and per-component dimensionality measurements) to demonstrate that the manifold assignment is promoted by the architecture. revision: yes

  3. Referee: Validation of 'previously identified' and 'novel' structures: the paper does not specify the procedure used to confirm that recovered structures correspond to true manifolds (e.g., comparison against known feature dictionaries, geometric tests for dimensionality, or human-interpretability controls), leaving open the possibility that success is driven by subjective selection.

    Authors: We agree that the validation procedure should be stated explicitly. The current text describes recovery of known and novel structures but does not detail the exact checks performed. In the revised Experiments section we will specify the full procedure, including direct comparison to published feature dictionaries, geometric dimensionality tests, and human-interpretability controls, to make the identification of manifolds more objective and reproducible. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal on external models

full rationale

The paper introduces the SMIXAE architecture as a direct response to a known SAE limitation (tiling of multidimensional features) and reports empirical results on activations from independent open-source models (Gemma 2 2B/9B). No derivation chain, equations, or fitted parameters are presented that reduce to self-defined terms or prior self-citations. The central claim rests on application to external data rather than any self-referential construction or renaming of inputs. This is a standard empirical contribution with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work appears to build on standard autoencoder assumptions without new postulates detailed here.

pith-pipeline@v0.9.0 · 5426 in / 1012 out tokens · 44972 ms · 2026-05-12T02:45:39.619945+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Toy Models of Superposition

    Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan, Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah, Christopher. Toy models of superposition. arXiv:2209.10652

  2. [2]

    The origins of representation manifolds in large language models.arXiv preprint arXiv:2505.18235,

    Modell, Alexander and Rubin-Delanchy, Patrick and Whiteley, Nick. The origins of representation manifolds in large language models. arXiv:2505.18235

  3. [3]

    When models manipulate manifolds: The geometry of a counting task, 2026

    Gurnee, Wes and Ameisen, Emmanuel and Kauvar, Isaac and Tarng, Julius and Pearce, Adam and Olah, Chris and Batson, Joshua. When models manipulate manifolds: The geometry of a counting task. arXiv:2601.04480

  4. [4]

    Chasing the Counting Manifold in Open LLMs , author=

  5. [5]

    The Thirteenth International Conference on Learning Representations , year=

    Not All Language Model Features Are One-Dimensionally Linear , author=. The Thirteenth International Conference on Learning Representations , year=

  6. [6]

    ArXiv , year=

    Progress measures for grokking via mechanistic interpretability , author=. ArXiv , year=

  7. [7]

    and Marzen, Sarah E

    Shai, Adam S. and Marzen, Sarah E. and Teixeira, Lucas and Oldenziel, Alexander Gietelink and Riechers, Paul M. , booktitle =. Transformers Represent Belief State Geometry in their Residual Stream , url =. doi:10.52202/079017-2387 , editor =

  8. [8]

    Extensions of Lipschitz maps into a Hilbert space , volume =

    Johnson, William and Lindenstrauss, Joram , year =. Extensions of Lipschitz maps into a Hilbert space , volume =. Contemporary Mathematics , doi =

  9. [9]

    2023 , eprint=

    Sparse Autoencoders Find Highly Interpretable Features in Language Models , author=. 2023 , eprint=

  10. [10]

    2023 , journal=

    Towards Monosemanticity: Decomposing Language Models With Dictionary Learning , author=. 2023 , journal=

  11. [11]

    2025 , eprint=

    Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling , author=. 2025 , eprint=

  12. [12]

    2025 , eprint=

    Learning Multi-Level Features with Matryoshka Sparse Autoencoders , author=. 2025 , eprint=

  13. [13]

    2024 , eprint=

    Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders , author=. 2024 , eprint=

  14. [14]

    2025 , note =

    Tom Conerly and Hoagy Cunningham and Adly Templeton and Jack Lindsey and Basil Hosmer and Adam Jermyn , title =. 2025 , note =

  15. [15]

    Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

    Finding Manifolds With Bilinear Autoencoders , author=. Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

  16. [16]

    2026 , eprint=

    PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding , author=. 2026 , eprint=

  17. [17]

    A theory of steady-state activity in nerve-fiber networks: I

    Householder, Alston S. A theory of steady-state activity in nerve-fiber networks: I. Definitions and preliminary lemmas. Bulletin of Mathematical Biophysics

  18. [18]

    2024 , eprint=

    Scaling and evaluating sparse autoencoders , author=. 2024 , eprint=

  19. [19]

    2024 , eprint=

    BatchTopK Sparse Autoencoders , author=. 2024 , eprint=

  20. [20]

    2018 , month =

    Christopher Tralie and Nathaniel Saul and Rann Bar-On , title =. 2018 , month =. doi:10.21105/joss.00925 , url =

  21. [21]

    Bauer, Ulrich , TITLE =. J. Appl. Comput. Topol. , FJOURNAL =. 2021 , NUMBER =. doi:10.1007/s41468-021-00071-5 , URL =

  22. [22]

    Schonsheck and Jie Chen and Rongjie Lai , title =

    Stefan C. Schonsheck and Jie Chen and Rongjie Lai , title =. CoRR , volume =. 2019 , url =. 1912.10094 , timestamp =

  23. [23]

    2023 , eprint=

    Minimalistic Unsupervised Learning with the Sparse Manifold Transform , author=. 2023 , eprint=

  24. [24]

    2025 , eprint=

    Not All Language Model Features Are One-Dimensionally Linear , author=. 2025 , eprint=

  25. [25]

    2025 , eprint=

    A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders , author=. 2025 , eprint=

  26. [26]

    2025 , eprint=

    Data Whitening Improves Sparse Autoencoder Learning , author=. 2025 , eprint=

  27. [27]

    2017 , eprint=

    Deep Unsupervised Clustering Using Mixture of Autoencoders , author=. 2017 , eprint=

  28. [28]

    2026 , eprint=

    Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control , author=. 2026 , eprint=

  29. [29]

    2026 , eprint=

    Latent Structure of Affective Representations in Large Language Models , author=. 2026 , eprint=

  30. [30]

    Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

    Understanding sparse autoencoder scaling in the presence of feature manifolds , author=. Mechanistic Interpretability Workshop at NeurIPS 2025 , year=

  31. [31]

    2026 , eprint=

    Do Sparse Autoencoders Capture Concept Manifolds? , author=. 2026 , eprint=

  32. [32]

    2024 , eprint=

    The Linear Representation Hypothesis and the Geometry of Large Language Models , author=. 2024 , eprint=

  33. [33]

    2018 , eprint=

    Understanding Deep Neural Networks with Rectified Linear Units , author=. 2018 , eprint=

  34. [34]

    Approximation capabilities of multilayer feedforward networks , journal =

    Kurt Hornik , keywords =. Approximation capabilities of multilayer feedforward networks , journal =. 1991 , issn =. doi:https://doi.org/10.1016/0893-6080(91)90009-T , url =

  35. [35]

    2020 , eprint=

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling , author=. 2020 , eprint=

  36. [36]

    2023 , note =

    monology , title =. 2023 , note =

  37. [37]

    2024 , eprint=

    Gemma 2: Improving Open Language Models at a Practical Size , author=. 2024 , eprint=

  38. [38]

    2024 , howpublished =

    SAELens , author =. 2024 , howpublished =

  39. [39]

    2024 , howpublished =

    Neuronpedia , author =. 2024 , howpublished =

  40. [40]

    ICML 2024 Workshop on Mechanistic Interpretability , year=

    The Geometry of Categorical and Hierarchical Concepts in Large Language Models , author=. ICML 2024 Workshop on Mechanistic Interpretability , year=

  41. [41]

    The neural basis of the Weber-Fechner law: a logarithmic mental number line

    Dehaene, Stanislas. The neural basis of the Weber-Fechner law: a logarithmic mental number line. Trends Cogn. Sci

  42. [42]

    Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin L...

  43. [43]

    2025 , eprint=

    SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability , author=. 2025 , eprint=

  44. [44]

    2025 , eprint=

    Are Sparse Autoencoders Useful? A Case Study in Sparse Probing , author=. 2025 , eprint=

  45. [45]

    2025 , eprint=

    Priors in Time: Missing Inductive Biases for Language Model Interpretability , author=. 2025 , eprint=

  46. [46]

    2026 , eprint=

    Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior , author=. 2026 , eprint=