Recognition: no theorem link
SMIXAE: Towards Unsupervised Manifold Discovery in Language Models
Pith reviewed 2026-05-12 02:45 UTC · model grok-4.3
The pith
SMIXAE uses a mixture of sparse autoencoders to learn multidimensional manifold structures directly in language model activations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Sparse MIXture of Autoencoders (SMIXAE) architecture succeeds at directly learning previously identified manifold structures as well as discovering novel structures within the activations of the open-source Gemma 2 2B and 9B models.
What carries the argument
The SMIXAE mixture architecture, which trains multiple sparse autoencoders together so that each component can represent an entire multidimensional manifold rather than isolated directions.
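The mixture idea can be sketched in a few lines. This is a minimal numpy stand-in, not the paper's actual SMIXAE: the softmax gate, ReLU sparse codes, component count, and latent width below are all illustrative assumptions, chosen only to show how a gate can route each activation vector to one small autoencoder so that a single component can hold a whole low-dimensional manifold.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class TinyMixtureSAE:
    """Hypothetical sketch of a mixture of sparse autoencoders.

    A gate assigns soft weights over K components; each component is a
    small autoencoder whose latent space can span a multidimensional
    manifold, rather than a single direction per feature. Details such
    as the gating rule and sparsity mechanism are assumptions, not the
    paper's specification.
    """

    def __init__(self, d_model=16, k_components=4, d_latent=3):
        self.gate = rng.standard_normal((d_model, k_components)) * 0.1
        self.enc = rng.standard_normal((k_components, d_model, d_latent)) * 0.1
        self.dec = rng.standard_normal((k_components, d_latent, d_model)) * 0.1

    def forward(self, x):
        # softmax gate over components for each input row
        logits = x @ self.gate
        logits -= logits.max(axis=1, keepdims=True)
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)
        # per-component nonnegative (sparse-style) codes and reconstructions
        z = relu(np.einsum('nd,kdl->nkl', x, self.enc))   # (n, K, d_latent)
        xhat_k = np.einsum('nkl,kld->nkd', z, self.dec)   # (n, K, d_model)
        xhat = np.einsum('nk,nkd->nd', w, xhat_k)         # gate-weighted mix
        return xhat, z, w

mix = TinyMixtureSAE()
x = rng.standard_normal((8, 16))
xhat, z, w = mix.forward(x)
```

In a real training run the gate and components would be optimized jointly with a reconstruction loss plus a sparsity penalty; the point of the sketch is only the routing structure that lets one component own one manifold.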
If this is right
- Multidimensional features can be discovered and interpreted as single units during training rather than after.
- Unsupervised discovery becomes possible for manifold structures that were previously hard to isolate.
- The approach demonstrates success on both 2B and 9B scale open models from the Gemma 2 family.
- Limitations discussed in the paper point to the need for further scaling and validation work.
Where Pith is reading between the lines
- The same mixture approach could be tested on other transformer families to check whether manifold recovery holds beyond Gemma.
- If SMIXAE reduces the need for post-training feature grouping, it could speed up automated interpretability pipelines that currently rely on clustering steps.
- Novel structures found by SMIXAE might correspond to functional behaviors that standard SAEs miss, offering a route to test specific hypotheses about model computation.
Load-bearing premise
The mixture architecture can reliably capture multidimensional manifold structures in language model activations without needing post-hoc grouping, based on results from two specific Gemma models.
What would settle it
Running the same SMIXAE training on a different language model family, or on a set of known multidimensional features, and inspecting how those features are represented. If the learned representations still split across multiple components that require manual grouping afterward, the load-bearing premise fails; if each manifold is captured within a single component, it holds.
Original abstract
Sparse autoencoders (SAEs) have been used widely to decompose and interpret neural network activations, especially those of transformer language models. One key issue with SAEs is their inability to directly model multidimensional features. Instead, SAEs may tile such features by a set of independent directions that must be grouped together after the SAE training phase, impeding discoverability and interpretation of learned feature representations. We begin to address this issue by introducing the Sparse MIXture of Autoencoders (SMIXAE) architecture. Empirically, we provide evidence that SMIXAE models have success both in directly learning previously identified manifold structures, as well as finding novel structures, within the open source Gemma 2 2B and 9B models. Finally, we discuss several limitations and point towards areas for future work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Sparse MIXture of Autoencoders (SMIXAE), a mixture-of-autoencoders architecture intended to overcome the tendency of standard sparse autoencoders (SAEs) to tile multidimensional manifold features in language-model activations into independent directions that must be grouped post hoc. The central claim is that SMIXAE directly learns both previously identified and novel manifold structures in the activations of open-source Gemma 2 2B and 9B models, with supporting empirical evidence presented and limitations discussed.
Significance. If the empirical results can be shown to rest on well-defined quantitative metrics, ablations, and validation procedures rather than subjective interpretation, the work would address a recognized limitation in mechanistic interpretability and could reduce dependence on post-training feature grouping. The use of publicly available Gemma 2 models is a positive step toward reproducibility.
major comments (3)
- [Experiments] Experiments section: the abstract and main text assert empirical success in directly learning multidimensional manifolds without post-hoc grouping, yet no quantitative metrics (e.g., per-component effective dimensionality, reconstruction error on held-out known manifolds, or clustering purity scores versus SAE baselines) or ablation studies isolating the mixture component are reported. This absence prevents assessment of whether the claimed advantage is architectural or due to hyperparameter choices.
- [Method] Method section: the description of how mixture components are assigned to manifold dimensions (as opposed to independent directions) is not accompanied by a formal argument or diagnostic showing that the assignment is enforced by the architecture rather than emerging from training dynamics or initialization; without this, the central distinction from tiled SAE features remains unproven.
- [Experiments] Validation of 'previously identified' and 'novel' structures: the paper does not specify the procedure used to confirm that recovered structures correspond to true manifolds (e.g., comparison against known feature dictionaries, geometric tests for dimensionality, or human-interpretability controls), leaving open the possibility that success is driven by subjective selection.
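One of the quantitative metrics the first major comment asks for, per-component effective dimensionality, has a standard candidate: the participation ratio of the covariance eigenvalues. The sketch below is a hedged suggestion for such a diagnostic, not a metric taken from the paper; the noisy-circle test data is likewise illustrative.

```python
import numpy as np

def effective_dim(acts):
    """Participation ratio (sum(lam))^2 / sum(lam^2) of the covariance
    eigenvalues of a set of activation vectors: a common proxy for
    effective dimensionality. Offered here as one possible per-component
    metric, not a procedure from the paper."""
    acts = acts - acts.mean(axis=0)
    cov = acts.T @ acts / max(len(acts) - 1, 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    return eig.sum() ** 2 / (np.square(eig).sum() + 1e-12)

rng = np.random.default_rng(0)
# points on a noisy circle embedded in 10-D: a genuinely 2-D manifold
theta = rng.uniform(0, 2 * np.pi, 500)
circle = np.zeros((500, 10))
circle[:, 0], circle[:, 1] = np.cos(theta), np.sin(theta)
circle += 0.01 * rng.standard_normal(circle.shape)
print(effective_dim(circle))  # close to 2 for a 2-D ring
```

A mixture component that has captured a whole manifold should show effective dimensionality near the manifold's intrinsic dimension, whereas tiled SAE directions would each score near 1.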
minor comments (2)
- [Method] Notation for the mixture weights and sparsity penalties should be unified across equations and text to avoid ambiguity in the definition of the SMIXAE objective.
- [Limitations] The limitations section could usefully include a brief discussion of computational overhead relative to standard SAEs, as mixture models typically increase training cost.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments identify important opportunities to strengthen the quantitative rigor, theoretical grounding, and validation procedures in the manuscript. We respond to each major comment below and will incorporate the corresponding revisions.
Point-by-point responses
Referee: Experiments section: the abstract and main text assert empirical success in directly learning multidimensional manifolds without post-hoc grouping, yet no quantitative metrics (e.g., per-component effective dimensionality, reconstruction error on held-out known manifolds, or clustering purity scores versus SAE baselines) or ablation studies isolating the mixture component are reported. This absence prevents assessment of whether the claimed advantage is architectural or due to hyperparameter choices.
Authors: We agree that additional quantitative metrics and ablations would strengthen the claims. The current version relies primarily on visualizations and qualitative recovery of known structures. In the revised manuscript we will add per-component effective dimensionality, reconstruction error on held-out data, clustering purity comparisons against SAE baselines, and ablation studies that isolate the mixture component. These additions will clarify whether the observed advantages arise from the architecture. revision: yes
Referee: Method section: the description of how mixture components are assigned to manifold dimensions (as opposed to independent directions) is not accompanied by a formal argument or diagnostic showing that the assignment is enforced by the architecture rather than emerging from training dynamics or initialization; without this, the central distinction from tiled SAE features remains unproven.
Authors: We acknowledge that the manuscript would benefit from an explicit formal argument. The mixture-of-autoencoders design is intended to encourage each component to capture a coherent subspace rather than isolated directions, but this was not accompanied by a proof or diagnostic. In the revision we will add a formal argument in the Method section together with empirical diagnostics (component-wise activation statistics and per-component dimensionality measurements) to demonstrate that the manifold assignment is promoted by the architecture. revision: yes
Referee: Validation of 'previously identified' and 'novel' structures: the paper does not specify the procedure used to confirm that recovered structures correspond to true manifolds (e.g., comparison against known feature dictionaries, geometric tests for dimensionality, or human-interpretability controls), leaving open the possibility that success is driven by subjective selection.
Authors: We agree that the validation procedure should be stated explicitly. The current text describes recovery of known and novel structures but does not detail the exact checks performed. In the revised Experiments section we will specify the full procedure, including direct comparison to published feature dictionaries, geometric dimensionality tests, and human-interpretability controls, to make the identification of manifolds more objective and reproducible. revision: yes
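One concrete shape the promised validation could take, for the comparison against known feature dictionaries, is a clustering-purity score: how often a mixture component's majority ground-truth label matches each point's own label. Everything below, including the `weekday`/`month` toy labels, is an illustrative assumption rather than the authors' actual procedure.

```python
import numpy as np

def clustering_purity(component_ids, true_feature_ids):
    """Purity of component assignments against known feature labels:
    the fraction of points falling in their component's majority label.
    A hedged sketch of one validation check, not the paper's method."""
    component_ids = np.asarray(component_ids)
    true_feature_ids = np.asarray(true_feature_ids)
    correct = 0
    for c in np.unique(component_ids):
        labels = true_feature_ids[component_ids == c]
        _, counts = np.unique(labels, return_counts=True)
        correct += counts.max()  # points matching the majority label
    return correct / len(component_ids)

# toy check: two components, each dominated by one ground-truth feature
comp = [0, 0, 0, 1, 1, 1, 1, 0]
true = ['weekday', 'weekday', 'month', 'month',
        'month', 'month', 'month', 'weekday']
print(clustering_purity(comp, true))  # 7 of 8 points match -> 0.875
```

High purity against a published feature dictionary would support the claim that components align with known manifolds; comparing the same score for a tiled-SAE baseline would make the architectural advantage measurable.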
Circularity Check
No circularity: empirical architecture proposal on external models
Full rationale
The paper introduces the SMIXAE architecture as a direct response to a known SAE limitation (tiling of multidimensional features) and reports empirical results on activations from independent open-source models (Gemma 2 2B/9B). No derivation chain, equations, or fitted parameters are presented that reduce to self-defined terms or prior self-citations. The central claim rests on application to external data rather than any self-referential construction or renaming of inputs. This is a standard empirical contribution with no load-bearing self-referential steps.