Do Sparse Autoencoders Capture Concept Manifolds?
Pith reviewed 2026-05-07 05:20 UTC · model grok-4.3
The pith
Sparse autoencoders capture concept manifolds either by spanning them globally with a few features or by tiling them locally, but in practice they typically mix both strategies in a diluted, fragmented way.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAEs capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing the manifold across features that each selectively tile a restricted region of the underlying geometry. Empirically, SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime called dilution. This explains why manifold structure is rarely visible at the level of individual concepts.
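To make the distinction concrete, here is a minimal numerical sketch (ours, not the paper's; the dimension, sample count, and atom counts are illustrative assumptions). A circle embedded in R^8 is reconstructed exactly from the linear span of two atoms (global capture), or approximately by many atoms used one at a time, each covering only a small arc (local tiling):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Random 2-D subspace carrying the circle.
B, _ = np.linalg.qr(rng.normal(size=(d, 2)))           # d x 2, orthonormal columns
theta = rng.uniform(0, 2 * np.pi, size=1000)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ B.T   # (1000, d) points on the circle

# Global capture: two atoms whose linear span contains the whole manifold.
# Codes are dense (both coordinates generically nonzero), which is exactly
# the tension with sparsity penalties.
D_global = B.T                                         # (2, d)
Z = X @ np.linalg.pinv(D_global)
err_global = np.linalg.norm(X - Z @ D_global) / np.linalg.norm(X)

# Local capture: m atoms tiling the circle; each point is coded 1-sparsely
# by its nearest atom, so every feature fires only on a small arc.
m = 32
phis = np.linspace(0, 2 * np.pi, m, endpoint=False)
D_local = np.stack([np.cos(phis), np.sin(phis)], axis=1) @ B.T   # (m, d), unit-norm rows
sims = X @ D_local.T                                   # sims[i, j] = cos(theta_i - phi_j)
nearest = sims.argmax(axis=1)
X_hat = sims[np.arange(len(X)), nearest][:, None] * D_local[nearest]
err_local = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

print(f"global span, 2 dense atoms:       rel. error {err_global:.3f}")  # ~0 (exact span)
print(f"local tiling, {m} 1-sparse atoms: rel. error {err_local:.3f}")   # small; shrinks with m
```

Dilution, in these terms, is a trained dictionary that does neither cleanly: more than two atoms carry the circle, yet none of them tile it selectively.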
What carries the argument
The contrast between global capture, where the linear span of a compact group of atoms contains the manifold, and local capture, where individual features selectively tile restricted regions; mixing the two produces the observed dilution regime.
If this is right
- Manifold structure is rarely visible at the level of individual concepts because of the dilution effect.
- Post-hoc unsupervised discovery methods that search for coherent groups of atoms can recover the manifolds (a minimal version of such a search is sketched after this list).
- Future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
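What such a group search could look like, as a hedged sketch (the clustering method, threshold `t`, and function names are our illustrative choices, not the paper's algorithm): cluster decoder atoms by cosine similarity, then score each cluster by how well its linear span reconstructs a candidate manifold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def atom_groups(D: np.ndarray, t: float = 0.6) -> np.ndarray:
    """Cluster decoder atoms (rows of D) by cosine distance; returns group labels."""
    Dn = D / np.linalg.norm(D, axis=1, keepdims=True)
    dist = 1.0 - Dn @ Dn.T
    condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed form for linkage
    return fcluster(linkage(condensed, method="average"), t=t, criterion="distance")

def span_r2(D_group: np.ndarray, X: np.ndarray) -> float:
    """R^2 of projecting manifold samples X onto the span of a group's atoms."""
    Q, _ = np.linalg.qr(D_group.T)        # orthonormal basis of the group's span
    X0 = X - X.mean(axis=0)
    resid = X0 - (X0 @ Q) @ Q.T
    return 1.0 - float((resid ** 2).sum() / (X0 ** 2).sum())
```

A group whose span reconstructs the manifold with R^2 near 1, while each member atom alone explains little, is exactly the signature the dilution account predicts.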
Where Pith is reading between the lines
- Existing SAE dictionaries might contain recoverable manifold structure if features are clustered or grouped rather than examined in isolation.
- The dilution regime could be mitigated by modifying the SAE training objective to penalize fragmentation of continuous structures.
- This distinction between global and local capture may generalize to other sparse coding and dictionary learning approaches.
Load-bearing premise
The theoretical models of how SAEs capture manifolds globally or locally apply to SAEs trained on actual neural network activations from real models.
What would settle it
Training an SAE on activations from a network where concepts are known to form a continuous manifold and checking whether the learned features either form a small spanning set or cleanly tile the manifold without mixing; persistent dilution across multiple such tests would support the claim of suboptimal recovery.
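A hedged sketch of that test (the SAE architecture, hyperparameters, and step counts are our assumptions; the greedy atom selection and restricted R^2 mirror the evaluation the paper describes):

```python
import torch

torch.manual_seed(0)
d, c, n_pts, l1 = 8, 64, 4096, 3e-3
B, _ = torch.linalg.qr(torch.randn(d, 2))              # plane carrying the circle
theta = torch.rand(n_pts) * 2 * torch.pi
X = torch.stack([theta.cos(), theta.sin()], dim=1) @ B.T

# A plain ReLU SAE with an L1 sparsity penalty (one common variant).
enc = torch.nn.Linear(d, c)
dec = torch.nn.Linear(c, d, bias=False)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
for _ in range(3000):
    z = torch.relu(enc(X))
    loss = ((dec(z) - X) ** 2).mean() + l1 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Greedy selection: repeatedly pick the decoder atom explaining the most
# residual variance, project it out, and track the restricted R^2.
with torch.no_grad():
    D = dec.weight.T                                   # (c, d), rows are atoms
    D = D / D.norm(dim=1, keepdim=True).clamp_min(1e-8)
    X0 = X - X.mean(0)
    resid, total = X0.clone(), (X0 ** 2).sum()
    for n in range(1, 7):
        j = int((resid @ D.T).pow(2).sum(0).argmax())
        proj = (resid @ D[j])[:, None] * D[j][None, :]
        resid = resid - proj
        r2 = float(1 - (resid ** 2).sum() / total)
        print(f"n={n}: restricted R^2 = {r2:.3f}")
```

R^2 jumping to ~1 by n = 2 would indicate compact subspace capture; clean tiling would instead show many atoms each reconstructing its own arc well; R^2 creeping upward over many atoms with neither pattern is the dilution signature.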
Original abstract
Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a theoretical framework showing that sparse autoencoders (SAEs) can capture low-dimensional concept manifolds in two ways: globally, via a compact group of atoms whose linear span contains the full manifold, or locally, via features that each tile a restricted region of the geometry. It empirically identifies a 'dilution' regime in which standard SAE training (linear decoder + L1 sparsity) mixes these solutions suboptimally, fragmenting recovery of continuous structures. This accounts for the rarity of visible manifold geometry at the level of single features and motivates post-hoc unsupervised methods for discovering coherent atom groups rather than isolated directions. The work concludes that future representation learning should treat geometric objects as basic units of interpretability.
Significance. If the dilution finding and the global/local distinction hold on real data, the result would be significant for mechanistic interpretability. It supplies a coherent explanation for why SAEs often fail to surface continuous relationships and gives a concrete rationale for shifting from single-feature analysis to group-based discovery. The framework is conceptually clean and the empirical observation, if robust, directly informs both SAE training objectives and post-processing pipelines. Credit is due for grounding the claims in a new theoretical analysis rather than purely observational fitting.
major comments (2)
- [Empirical Evaluation] Empirical section (details implied by abstract and experiments on synthetic manifolds): the central claim that SAEs 'suboptimally recover continuous structures' in a dilution regime rests on toy manifold constructions (e.g., circles or lines with uniform sampling). Real neural activations are high-dimensional, noisy, non-uniform, and entangled; without explicit experiments showing that the same global/local fragmentation appears when SAEs are trained on actual model activations (e.g., from language-model residual streams), the 'suboptimal' characterization and the motivation for post-hoc group discovery remain unsupported for the motivating use case. This is load-bearing for the empirical contribution.
- [Theoretical Framework] Theoretical framework (section introducing global vs. local capture): the distinction is plausible under idealized low-dimensional, noise-free assumptions, but the paper does not derive or bound how dilution necessarily arises from standard SAE training on data whose statistics match real activations. A concrete test (e.g., a theorem or controlled ablation showing fragmentation persists when noise and entanglement are added) would strengthen the claim that dilution is a general property rather than an artifact of the synthetic setup.
minor comments (2)
- The term 'dilution' is introduced as a new regime mixing global span and local tiling; a formal metric or equation quantifying the degree of mixing (e.g., in terms of atom allocation or reconstruction error decomposition) would make the concept more precise and reproducible (one candidate metric is sketched after this list).
- Provide the exact SAE training hyperparameters, manifold sampling procedures, and quantitative metrics used to identify dilution (e.g., any tables or figures reporting R² or reconstruction quality per regime) so that the empirical observation can be directly replicated.
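One way the requested mixing metric could be made concrete (our construction, not the paper's): measure how many atoms a manifold's codes are effectively spread across via a participation ratio, and how locally each atom fires via its active fraction.

```python
import numpy as np

def dilution_stats(Z: np.ndarray, eps: float = 1e-8) -> tuple[float, float]:
    """Z: (n_points, n_atoms) nonnegative SAE codes for one manifold instance."""
    energy = (Z ** 2).sum(axis=0)
    p = energy / (energy.sum() + eps)
    effective_atoms = 1.0 / ((p ** 2).sum() + eps)     # participation ratio over atoms
    active = Z > eps
    used = active.any(axis=0)
    locality = active[:, used].mean(axis=0)            # active fraction per used atom
    return float(effective_atoms), float(locality.mean())
```

Pure global capture would give an effective atom count near the manifold's embedding dimension with near-global activation; pure tiling gives many atoms, each active on a small fraction of points; dilution sits in between on both axes.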
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful report. The comments correctly highlight the need to connect our synthetic analysis more directly to real neural activations and to further test the robustness of the dilution regime. We will revise the manuscript accordingly by adding experiments on language-model activations and controlled ablations with noise and non-uniform sampling. Point-by-point responses to the major comments are provided below.
Point-by-point responses
Referee: [Empirical Evaluation] Empirical section (details implied by abstract and experiments on synthetic manifolds): the central claim that SAEs 'suboptimally recover continuous structures' in a dilution regime rests on toy manifold constructions (e.g., circles or lines with uniform sampling). Real neural activations are high-dimensional, noisy, non-uniform, and entangled; without explicit experiments showing that the same global/local fragmentation appears when SAEs are trained on actual model activations (e.g., from language-model residual streams), the 'suboptimal' characterization and the motivation for post-hoc group discovery remain unsupported for the motivating use case. This is load-bearing for the empirical contribution.
Authors: We agree that validation on real activations is important for the motivating application in mechanistic interpretability. The synthetic manifolds were chosen to isolate the global subspace and local tiling mechanisms and to quantify dilution precisely under controlled geometry. In the revised manuscript we will add experiments training SAEs on residual-stream activations from a language model (e.g., GPT-2 small) and evaluate recovery of known continuous structures such as positional or syntactic manifolds. This will test whether fragmentation patterns analogous to the dilution regime appear on real data and thereby support the motivation for post-hoc group discovery. Revision: yes.
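For concreteness, a sketch of how the promised real-activation data could be collected (the model, layer index, and the choice of hidden state are our assumed details, not settled by the response):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()
batch = tok(["The quick brown fox jumps over the lazy dog."], return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_hidden_states=True)
# hidden_states[k] is the residual stream after block k; layer 6 is one
# arbitrary mid-depth choice for the 12-layer model.
acts = out.hidden_states[6].reshape(-1, model.config.hidden_size)
# `acts` then plays the role of X in an SAE training loop, and recovery of
# known continuous structures (e.g., positional geometry) can be scored with
# the same restricted R^2 diagnostic.
```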
Referee: [Theoretical Framework] Theoretical framework (section introducing global vs. local capture): the distinction is plausible under idealized low-dimensional, noise-free assumptions, but the paper does not derive or bound how dilution necessarily arises from standard SAE training on data whose statistics match real activations. A concrete test (e.g., a theorem or controlled ablation showing fragmentation persists when noise and entanglement are added) would strengthen the claim that dilution is a general property rather than an artifact of the synthetic setup.
Authors: The framework establishes that both global (compact spanning set) and local (selective tiling) capture are feasible for low-dimensional manifolds and that standard L1-regularized training can produce a mixture. Dilution is characterized empirically as the fragmented outcome of this mixture. While a general theorem for arbitrary real activation statistics is not derived, the revised manuscript will include controlled ablations that add Gaussian noise, non-uniform sampling, and mild entanglement to the synthetic manifolds. These will show that the fragmentation persists, providing the concrete test requested and indicating that dilution is not an artifact of the idealized uniform case. Revision: partial.
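A sketch of the promised ablation data (noise scale, sampling skew, and entanglement strength are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 4096
B, _ = np.linalg.qr(rng.normal(size=(d, 3)))           # circle plane plus one extra axis

theta = rng.beta(2.0, 5.0, size=n) * 2 * np.pi          # non-uniform angle sampling
s = rng.normal(size=n)                                  # second latent factor
latent = np.stack([np.cos(theta),
                   np.sin(theta),
                   0.3 * s + 0.2 * np.sin(theta)],      # mild entanglement with theta
                  axis=1)
X = latent @ B.T + 0.05 * rng.normal(size=(n, d))       # Gaussian observation noise
# Retrain the same SAE on X; fragmentation that persists here is harder to
# attribute to the idealized uniform, noise-free setup.
```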
Circularity Check
No significant circularity; theoretical framework and dilution regime are independently derived.
full rationale
The paper introduces a new theoretical distinction between global (compact atoms spanning the full manifold via linear span) and local (region-tiling atoms) manifold capture by SAEs, then empirically identifies dilution as the observed mixing of these strategies in trained models on both synthetic and real activations. No derivation step reduces by construction to its inputs, self-defines a key quantity, renames a fitted parameter as a prediction, or relies on a load-bearing self-citation chain whose prior result is unverified. The framework provides independent mathematical content, and the central claims remain falsifiable via external experiments rather than tautological. This aligns with the default expectation that most papers exhibit no circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Concepts in neural network representations are organized along low-dimensional manifolds that encode continuous geometric relationships.
- domain assumption SAE atoms can be allocated such that their linear combinations either globally span or locally tile manifold geometry.
invented entities (1)
- dilution regime — no independent evidence
Forward citations
Cited by 1 Pith paper
- FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry — Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.