Toward Identifiable Sparse Autoencoders

Francesco Locatello; Theofanis Karaletsos; Walter Nelson

arxiv: 2605.31245 · v1 · pith:DTSEAFVEnew · submitted 2026-05-29 · 💻 cs.LG

Toward Identifiable Sparse Autoencoders

Walter Nelson , Theofanis Karaletsos , Francesco Locatello This is my paper

Pith reviewed 2026-06-28 22:59 UTC · model grok-4.3

classification 💻 cs.LG

keywords sparse autoencodersidentifiabilitydictionary learningrestricted isometryneural interpretabilityTopK SAEstability

0 comments

The pith

Minimal changes to TopK sparse autoencoders yield stable and near-identifiable models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse autoencoders for interpreting neural network representations are prone to instability, with different training runs producing different dictionaries and codes. The authors identify hindering model properties and address them with minimal architectural and training adjustments. This produces identifiable SAEs with lower reconstruction error and better stability. The improvement is explained by connecting to dictionary learning, where learned dictionaries approximately satisfy the restricted isometry condition, making sparse codes near-identifiable.

Core claim

By introducing minimal changes to the standard TopK SAE architecture and training procedure, the authors create two versions of an identifiable SAE (iSAE) that achieve lower reconstruction error and improved stability across training runs. They connect SAEs to traditional dictionary learning and demonstrate that the learned dictionaries satisfy an approximate restricted isometry condition, which renders the sparse codes near-identifiable.

What carries the argument

The iSAE variant of TopK SAE, whose learned dictionary satisfies an approximate restricted isometry condition to ensure near-identifiability of sparse codes

If this is right

iSAEs exhibit improved stability, producing consistent dictionaries and codes across different training runs
The modifications result in lower reconstruction error compared to standard TopK SAEs
Sparse codes in iSAEs are near-identifiable due to the dictionary properties
The approach links sparse autoencoders to classical dictionary learning for theoretical analysis

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If iSAEs become standard, mechanistic interpretability studies could rely on more reproducible feature dictionaries
Similar stability improvements might be applicable to other sparse coding methods in machine learning
This could enable more reliable scaling of interpretability techniques to larger models

Load-bearing premise

Dictionaries learned by the modified SAEs in practice satisfy an approximate restricted isometry condition

What would settle it

Running multiple independent trainings of the iSAE and checking whether the resulting dictionaries and sparse codes are highly similar or identical; alternatively, verifying whether the learned dictionary matrix satisfies the approximate restricted isometry property

Figures

Figures reproduced from arXiv: 2605.31245 by Francesco Locatello, Theofanis Karaletsos, Walter Nelson.

**Figure 1.** Figure 1: Sparse autoencoders approximate nonlinear manifolds (dark blue, mostly occluded) with linear patches (light blue). We show that identifiability hinges on four key ingredients: (a) the approximation being good enough (low reconstruction error), (b) the manifold being sampled densely enough, (c) co-occurring concepts being distinct enough (an approximate restricted isometry property), and (d) sufficiently … view at source ↗

read the original abstract

Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are likely to produce different concept dictionaries and sparse codes. We characterize the model properties that hinder the stability of real-world SAEs, and address each of these problems through minimal changes to the architecture and training procedure. Together, these changes yield two versions of an \textbf{i}dentifiable SAE (iSAE), a variant of the standard TopK SAE with lower reconstruction error and improved stability. We explain this improvement theoretically by connecting SAEs with traditional dictionary learning approaches, and show that the dictionaries learned in practice satisfy an approximate restricted isometry condition, rendering the corresponding sparse codes in those models near-identifiable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives two iSAE variants with better stability and lower error via small changes, plus a dictionary-learning link, but the near-identifiability claim rests on an unquantified approximate RIC.

read the letter

The core claim is that two minimal tweaks to TopK SAEs produce identifiable versions with lower reconstruction error and more stable dictionaries across runs. The authors identify sources of instability in standard SAEs, fix them directly, and explain the gains by linking to classical dictionary learning plus an approximate restricted isometry condition on the learned matrices.

What is new is the specific combination of those architecture and training adjustments together with the explicit dictionary-learning framing. The empirical checks on stability and the RIC observation on real matrices are the practical payoff. The connection to existing identifiability results in dictionary learning is a clean move that avoids reinventing the wheel.

The soft spot is exactly where the stress-test note flags it. The argument needs the learned dictionaries to satisfy an approximate RIC with a constant small enough relative to the sparsity level k to support near-identifiability. Reporting that the condition “holds approximately” without a measured δ_{2k} value or a scaling argument tied to the observed k leaves the theoretical step qualitative rather than tight. That gap is real but not fatal to the rest of the work.

This paper is for the mechanistic interpretability groups that already use SAEs and want more reliable features. Readers who care about consistent dictionaries or who want to import tools from sparse coding will get direct value from the changes and the framing. The ideas are coherent, the experiments are checkable, and the theoretical angle is worth referee scrutiny even if it needs sharpening.

Send it to peer review.

Referee Report

2 major / 0 minor

Summary. The paper claims that standard sparse autoencoders (SAEs) are highly unstable across training runs, theoretically characterizes the model properties responsible, and proposes minimal changes to architecture and training that produce two variants of an identifiable SAE (iSAE) with lower reconstruction error and improved stability. It connects SAEs to classical dictionary learning, asserts that the learned dictionaries satisfy an approximate restricted isometry condition (RIC), and concludes that the resulting sparse codes are therefore near-identifiable.

Significance. If the RIC claim is placed on a quantitative footing that ties the observed constant to sparsity level k and recovery error, the work would supply a concrete theoretical explanation for SAE instability and a practical route to more stable, interpretable dictionaries; the explicit linkage to dictionary-learning recovery guarantees is a strength that could influence how future SAE training objectives are designed.

major comments (2)

[Empirical verification of the RIC (section discussing dictionary properties and identifiability)] The assertion that learned dictionaries satisfy an approximate restricted isometry condition (invoked to conclude near-identifiability of the sparse codes) is load-bearing for the central theoretical claim, yet the manuscript reports only that the condition “holds approximately” without measured values of δ_{2k} or a demonstration that δ_{2k} is small enough relative to the observed sparsity k to satisfy standard dictionary-learning recovery bounds (e.g., δ_{2k} < 1/3 for basis pursuit).
[Theoretical characterization of instability] The theoretical characterization of SAE instability is stated in the abstract and used to motivate the architectural changes, but the manuscript provides no explicit derivation or equation showing how the identified model properties (e.g., non-identifiability of the dictionary or lack of RIP) produce the observed run-to-run variability in concept dictionaries.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the empirical and theoretical foundations of our claims regarding identifiable sparse autoencoders. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses

Referee: [Empirical verification of the RIC (section discussing dictionary properties and identifiability)] The assertion that learned dictionaries satisfy an approximate restricted isometry condition (invoked to conclude near-identifiability of the sparse codes) is load-bearing for the central theoretical claim, yet the manuscript reports only that the condition “holds approximately” without measured values of δ_{2k} or a demonstration that δ_{2k} is small enough relative to the observed sparsity k to satisfy standard dictionary-learning recovery bounds (e.g., δ_{2k} < 1/3 for basis pursuit).

Authors: We agree that quantitative verification of the RIC is necessary to make the identifiability claim rigorous. In the revised manuscript we will add explicit computations of δ_{2k} on the learned dictionaries from multiple runs, report the observed values as a function of k, and verify that they fall below standard recovery thresholds (e.g., δ_{2k} < 1/3) sufficient for basis pursuit guarantees. This will directly link the constant to sparsity level and reconstruction error. revision: yes
Referee: [Theoretical characterization of instability] The theoretical characterization of SAE instability is stated in the abstract and used to motivate the architectural changes, but the manuscript provides no explicit derivation or equation showing how the identified model properties (e.g., non-identifiability of the dictionary or lack of RIP) produce the observed run-to-run variability in concept dictionaries.

Authors: The manuscript motivates the instability claim via the connection to non-unique dictionary recovery in the absence of RIP, but we acknowledge that an explicit step-by-step derivation linking these properties to run-to-run variability is not presented with dedicated equations. We will add a short subsection in the revision that derives the multiplicity of consistent dictionaries under violated RIP and shows how this induces the observed variability across random initializations. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external dictionary learning connections and empirical RIC checks

full rationale

The paper's central chain derives SAE instability theoretically from model properties, introduces minimal architectural and training changes to produce iSAE variants, then invokes standard results from traditional dictionary learning to explain improved stability. It reports that the learned dictionaries satisfy an approximate restricted isometry condition via direct inspection of the matrices obtained in practice, rather than by redefining identifiability in terms of the fitted parameters or renaming a fitted quantity as a prediction. No self-citation is shown to be load-bearing for the identifiability claim, and no step reduces the conclusion to a tautology or ansatz smuggled through prior author work. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters. The central claim rests on the domain assumption that learned dictionaries satisfy an approximate restricted isometry condition.

axioms (1)

domain assumption Dictionaries learned by the modified SAEs satisfy an approximate restricted isometry condition
Invoked in the abstract to conclude that the corresponding sparse codes are near-identifiable.

pith-pipeline@v0.9.1-grok · 5672 in / 1196 out tokens · 20680 ms · 2026-06-28T22:59:02.234108+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages

[1]

URL https://arxiv.org/abs/math/ 0503066. Chen, S. and Donoho, D. Basis pursuit. InProceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1, pp. 41–44 vol.1, 1994. doi: 10.1109/ACSSC.1994.471413. Chen, S., Billings, S. A., and Luo, W. Orthogonal least squares methods and their application to non-linear sys- tem identificatio...

work page doi:10.1109/acssc.1994.471413 1994
[2]

Zoom in: An introduction to circuits

URL https://openreview.net/forum? id=mQxt8l7JL04. Li, A. J., Srinivas, S., Bhalla, U., and Lakkaraju, H. Eval- uating adversarial robustness of concept representations in sparse autoencoders, 2026. URL https://arxiv. org/abs/2505.16004. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Sch¨olkopf, B., and Bachem, O. Challenging common assumptio...

work page doi:10.23915/distill.00024.001 2026
[3]

This means the nonzero principal angles betweenU S andU S′ are exactly the principal angles betweenU 1 andU ′

=k− |I| . This means the nonzero principal angles betweenU S andU S′ are exactly the principal angles betweenU 1 andU ′
[4]

Denote byP I the orthogonal projector ontoU I

In particular, we have the claim. Denote byP I the orthogonal projector ontoU I. Claim.The projected dictionary(I−P I)DA∪B satisfies RIP at the same levelδ. Proof of claim.Let zA∪B ∈R |A∪B| denote an arbitrary vector and let zI denote the least-squares minimizer of ∥DA∪BzA∪B −D I zI ∥2. Let r=D A∪BzA∪B −D I zI = (I−P I)DA∪BzA∪B ∈ U ⊥ I be the residual. St...

[1] [1]

URL https://arxiv.org/abs/math/ 0503066. Chen, S. and Donoho, D. Basis pursuit. InProceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, volume 1, pp. 41–44 vol.1, 1994. doi: 10.1109/ACSSC.1994.471413. Chen, S., Billings, S. A., and Luo, W. Orthogonal least squares methods and their application to non-linear sys- tem identificatio...

work page doi:10.1109/acssc.1994.471413 1994

[2] [2]

Zoom in: An introduction to circuits

URL https://openreview.net/forum? id=mQxt8l7JL04. Li, A. J., Srinivas, S., Bhalla, U., and Lakkaraju, H. Eval- uating adversarial robustness of concept representations in sparse autoencoders, 2026. URL https://arxiv. org/abs/2505.16004. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Sch¨olkopf, B., and Bachem, O. Challenging common assumptio...

work page doi:10.23915/distill.00024.001 2026

[3] [3]

This means the nonzero principal angles betweenU S andU S′ are exactly the principal angles betweenU 1 andU ′

=k− |I| . This means the nonzero principal angles betweenU S andU S′ are exactly the principal angles betweenU 1 andU ′

[4] [4]

Denote byP I the orthogonal projector ontoU I

In particular, we have the claim. Denote byP I the orthogonal projector ontoU I. Claim.The projected dictionary(I−P I)DA∪B satisfies RIP at the same levelδ. Proof of claim.Let zA∪B ∈R |A∪B| denote an arbitrary vector and let zI denote the least-squares minimizer of ∥DA∪BzA∪B −D I zI ∥2. Let r=D A∪BzA∪B −D I zI = (I−P I)DA∪BzA∪B ∈ U ⊥ I be the residual. St...