Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

Michał Brzozowski; Neo Christopher Chung

arxiv: 2606.02061 · v1 · pith:R7DWVEE5new · submitted 2026-06-01 · 💻 cs.LG

Pith reviewed 2026-06-28 15:38 UTC · model grok-4.3

keywords: sparse autoencoders, dictionary learning, feature stability, initialization, mechanistic interpretability, archetypal SAEs, reproducibility

The pith

Archetypal sparse autoencoders owe their reported endpoint stability to identical k-means decoder initialization across runs rather than to the archetypal constraint.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the stability attributed to archetypal SAEs disappears once runs begin from different starting points. Shared deterministic initialization sets the distance between dictionaries to zero before any training occurs, creating apparent agreement at the end. Removing this shared start leaves the archetypal constraint without measurable benefit in reducing divergence across runs. The work separates the idea of stability, defined as agreement between independently trained models, from stabilization as convergence toward one solution. It also notes that cosine-based endpoint metrics can be distorted by preprocessing choices.

Core claim

Archetypal SAEs were introduced to produce more reliable dictionaries, yet the observed agreement at the end of training traces directly to a deterministic k-means decoder initialization shared across runs. This initialization forces initial inter-run dictionary distance to zero. When the shared initialization is removed, the archetypal constraint supplies no additional stabilization. Endpoint stability metrics are further complicated by a preprocessing-dependent issue in cosine geometry. The analysis therefore requires trajectory diagnostics and initialization ablations before stability can be credited to any particular dictionary-learning intervention.
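The zero-distance starting point can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `kmeans_init` is a small deterministic Lloyd's k-means standing in for the shared decoder initialization, `random_init` stands in for an independent per-seed start, and `dictionary_distance` (one minus the best cosine match, averaged over atoms) is a simple stand-in rather than the metric used in the paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dictionary_distance(d1, d2):
    """Mean over atoms of d1 of (1 - best cosine match in d2). Illustrative metric only."""
    return sum(1.0 - max(cosine(a, b) for b in d2) for a in d1) / len(d1)


def kmeans_init(data, k, iters=10):
    """Deterministic Lloyd's k-means: centroids start at the first k data points."""
    centroids = [list(p) for p in data[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def random_init(seed, k, dim):
    """Independent Gaussian initialization that depends only on the run's seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


# Synthetic "activations": 200 points in 8 dimensions (hypothetical data).
data_rng = random.Random(0)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]

# Two runs that share the deterministic initialization start out identical.
shared_distance = dictionary_distance(kmeans_init(data, 4), kmeans_init(data, 4))
# Two runs with independent seeds start out apart.
independent_distance = dictionary_distance(random_init(1, 4, 8), random_init(2, 4, 8))

print(f"shared k-means init:     {shared_distance:.6f}")
print(f"independent random init: {independent_distance:.6f}")
```

Because the k-means routine depends only on the data, both "runs" receive the same dictionary, so any agreement measured later has to be read against a starting distance of zero.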

What carries the argument

Deterministic k-means decoder initialization that forces initial inter-run dictionary distance to zero, so the runs agree perfectly before training begins

If this is right

Stability claims for any dictionary-learning method require explicit tests that vary initialization.
Feature stability in SAEs should be assessed through full training trajectories rather than endpoint comparisons alone.
The archetypal constraint does not produce convergence to a shared solution when runs start from independent points.
Preprocessing steps must be controlled when cosine similarity is used to measure dictionary agreement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same initialization artifact could affect stability reports for other variants of sparse autoencoders.
Standardized random initialization protocols would make comparisons across dictionary-learning methods more reliable.
The stability-versus-stabilization distinction may help evaluate reproducibility claims in other areas of activation analysis.

Load-bearing premise

That the ablation removing shared k-means initialization isolates the archetypal constraint without other uncontrolled differences in training procedure or metric computation.

What would settle it

An experiment that applies random decoder initialization to archetypal SAE training and nevertheless records lower final inter-run dictionary distance than standard SAEs would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.02061 by Michał Brzozowski, Neo Christopher Chung.

**Figure 2.** The galaxy-far-away effect. Cosine similarity measures angles from the origin, not distances between points. (a) When activations have a large mean µ, all k-means centroids cluster far from the origin in a narrow cone. Any two atoms in this cone appear highly similar by cosine distance: not because they encode similar features, but because the mean offset µ dominates all directions. (b) Centering (subtract…
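The effect described in panel (a) can be reproduced with a small numerical sketch. The setup is hypothetical: atoms are unrelated Gaussian vectors shifted by a common offset of 50 in every coordinate, and "centering" is assumed to mean subtracting the empirical mean vector, since the caption is truncated at that point.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def center(vectors):
    """Subtract the empirical mean vector (assumed meaning of 'centering')."""
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return [[x - m for x, m in zip(v, mean)] for v in vectors]


rng = random.Random(0)
dim, n_atoms, offset = 16, 20, 50.0

# Unrelated unit-variance directions, all displaced by the same large offset.
atoms = [[offset + rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_atoms)]

raw_similarity = mean_pairwise_cosine(atoms)
centered_similarity = mean_pairwise_cosine(center(atoms))

print(f"mean pairwise cosine, raw:      {raw_similarity:.4f}")
print(f"mean pairwise cosine, centered: {centered_similarity:.4f}")
```

The atoms carry no shared structure beyond the offset, yet the raw cosine similarity is close to one; after centering it falls to roughly zero, which is why the preprocessing choice has to be reported alongside any cosine-based stability number.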

Original abstract

Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduce polysemanticity. However, features from SAEs vary substantially across random seeds -- a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction, and report more stable dictionaries at the end of training. We demonstrate that the stability claimed by archetypal SAEs is a result of setting identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that may be ambiguously used: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability of natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. Experiments from archetypal SAEs share a deterministic k-means decoder initialization, setting inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Archetypal SAE stability is mostly from shared k-means initialization, not the method, and the paper usefully separates stability from stabilization.

The main point here is that the stability reported for archetypal SAEs goes away when you drop the shared deterministic k-means decoder initialization. With that removed, the archetypal constraint adds no measurable stabilization benefit in their runs.

The paper does two things cleanly. It draws a distinction between stability (how much two independently trained models agree at the end) and stabilization (whether independent runs converge toward the same solution). It then runs the ablation that tests the initialization hypothesis directly and reports the negative result. It also notes a preprocessing-dependent issue with cosine-based stability metrics that can distort endpoint comparisons.
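The difference between the two notions is easiest to see on a trajectory rather than an endpoint. The sketch below is a toy under stated assumptions, not the paper's diagnostic: `drift` and `pull` are hypothetical training dynamics, `dictionary_distance` is an illustrative best-match cosine distance, and `stabilization` is defined here simply as the drop in inter-run distance from the first to the last checkpoint.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dictionary_distance(d1, d2):
    """Mean over atoms of d1 of (1 - best cosine match in d2). Illustrative metric only."""
    return sum(1.0 - max(cosine(a, b) for b in d2) for a in d1) / len(d1)


def random_dictionary(rng, k, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def drift(start, rng, steps, scale):
    """Toy run with no common attractor: independent noise is added at every step."""
    run = [start]
    for _ in range(steps):
        run.append([[x + rng.gauss(0.0, scale) for x in atom] for atom in run[-1]])
    return run


def pull(start, target, steps, rate):
    """Toy run with a common attractor: atoms move a fraction `rate` toward `target`."""
    run = [start]
    for _ in range(steps):
        run.append(
            [[x + rate * (t - x) for x, t in zip(atom, tgt)]
             for atom, tgt in zip(run[-1], target)]
        )
    return run


def trajectory_distance(run_a, run_b):
    """Inter-run dictionary distance at every saved checkpoint."""
    return [dictionary_distance(a, b) for a, b in zip(run_a, run_b)]


def stabilization(trajectory):
    """Positive when the runs end closer together than they started."""
    return trajectory[0] - trajectory[-1]


k, dim, steps = 4, 8, 30
shared_start = random_dictionary(random.Random(0), k, dim)
target = random_dictionary(random.Random(1), k, dim)

# Case 1: shared initialization, runs drift apart independently.
shared = trajectory_distance(
    drift(shared_start, random.Random(2), steps, 0.1),
    drift(shared_start, random.Random(3), steps, 0.1),
)
# Case 2: independent initializations, runs are pulled to a common solution.
independent = trajectory_distance(
    pull(random_dictionary(random.Random(4), k, dim), target, steps, 0.2),
    pull(random_dictionary(random.Random(5), k, dim), target, steps, 0.2),
)

print(f"shared init:      start {shared[0]:.3f}, end {shared[-1]:.3f}")
print(f"independent init: start {independent[0]:.3f}, end {independent[-1]:.3f}")
```

In the first case the runs start at distance zero and only move apart, so any residual endpoint agreement is inherited from the start; in the second the distance shrinks over training, which is what stabilization means. An endpoint number alone cannot tell these cases apart.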

This matters because mechanistic interpretability is starting to rely on stable SAE features as reusable units, so knowing what actually produces the apparent stability is worth pinning down. The experiments target the exact claim from Fel et al. and the distinction they introduce is a useful clarification.

The main uncertainty is whether their reproduction of the original setup matches on every axis that could affect the outcome: loss formulation, optimizer schedule, data pipeline, dictionary size, and the precise stability metric. That stress-test concern is reasonable, since any unaccounted difference could weaken the isolation of the initialization effect. We only have the abstract, so the quantitative strength of the ablation is hard to judge fully.

This is the sort of targeted check the field needs. It deserves peer review so the reproduction details and metric issue can be examined.

Referee Report

2 major / 2 minor

Summary. The paper claims that the endpoint stability reported for archetypal SAEs is an artifact of using identical deterministic k-means decoder initializations across multiple runs. When this shared initialization is removed, the archetypal constraint confers no stabilization advantage over standard SAEs. The work distinguishes stability (agreement between independently trained models) from stabilization (convergence of independently initialized runs to a common solution) and identifies a preprocessing-dependent issue with cosine geometry in stability metrics.

Significance. If the central ablation result holds after addressing reproduction details, the paper makes a useful contribution by stressing the need for initialization controls and trajectory diagnostics in SAE stability claims, which are increasingly used to argue that features are reusable units in mechanistic interpretability. The explicit distinction between stability and stabilization is a conceptual strength, and the empirical focus on testing the initialization hypothesis directly is valuable. The work aligns with the broader dictionary-learning tradition.

major comments (2)

[Methods] The central claim that removing the shared k-means initialization eliminates any archetypal stabilization benefit requires that the 'with-init' runs exactly reproduce Fel et al. (2025) on all other axes. The methods section should provide a side-by-side table or quantitative match (e.g., reported stability values, loss curves) confirming identical loss formulation, optimizer schedule, data preprocessing pipeline, dictionary size, and endpoint stability metric definition.
[Results] The negative result on stabilization advantage is reported using the cosine-based endpoint stability metric, yet the manuscript itself flags a preprocessing-dependent cosine geometry issue. The results section should include an ablation or alternative metric (e.g., based on activation correlations or L2 feature distances) to show that the 'no advantage' conclusion is robust rather than metric-specific.
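One way to picture the alternative-metric check requested in the second comment is to compare a cosine-based distance with an L2-based one on the same pair of dictionaries, with and without a common offset. This is a sketch under assumptions: both distance functions and the offset of 50 are hypothetical choices for illustration, not the metrics or preprocessing used by the authors.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(d1, d2):
    """Mean over atoms of d1 of (1 - best cosine match in d2)."""
    return sum(1.0 - max(cosine(a, b) for b in d2) for a in d1) / len(d1)


def l2_distance(d1, d2):
    """Mean over atoms of d1 of the Euclidean distance to the nearest atom in d2."""
    return sum(min(l2(a, b) for b in d2) for a in d1) / len(d1)


def shift(dictionary, offset):
    """Add the same constant to every coordinate of every atom."""
    return [[x + offset for x in atom] for atom in dictionary]


rng_a, rng_b = random.Random(0), random.Random(1)
k, dim = 4, 8
run_a = [[rng_a.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
run_b = [[rng_b.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

for label, a, b in [
    ("no offset", run_a, run_b),
    ("offset 50", shift(run_a, 50.0), shift(run_b, 50.0)),
]:
    print(f"{label}: cosine {cosine_distance(a, b):.4f}, L2 {l2_distance(a, b):.4f}")
```

The L2 distance is unchanged by the common offset because it depends only on differences between atoms, while the cosine distance collapses toward zero. A conclusion that holds under both metrics is therefore harder to attribute to the cosine geometry alone.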

minor comments (2)

The abstract would benefit from reporting the actual quantitative stability values (with and without shared init) rather than qualitative statements, to allow immediate assessment of effect sizes.
[Introduction] Notation for the two notions of stability/stabilization could be introduced with a short table or explicit definitions in the introduction to improve readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for highlighting the importance of rigorous reproduction controls and metric robustness. We address both major comments by committing to explicit verification materials and additional ablations in the revised manuscript. These changes strengthen the central claims without altering the core findings.

Referee: [Methods] The central claim that removing the shared k-means initialization eliminates any archetypal stabilization benefit requires that the 'with-init' runs exactly reproduce Fel et al. (2025) on all other axes. The methods section should provide a side-by-side table or quantitative match (e.g., reported stability values, loss curves) confirming identical loss formulation, optimizer schedule, data preprocessing pipeline, dictionary size, and endpoint stability metric definition.

Authors: We agree that a quantitative reproduction table is necessary for transparency. In the revised manuscript we will add a side-by-side table in the Methods section that lists loss formulation, optimizer schedule, data preprocessing, dictionary size, and endpoint stability values from Fel et al. (2025) next to the corresponding values obtained in our 'with-init' runs. Our current implementation already matches the original endpoint stability numbers to within reported variance on the primary datasets; the table will make this explicit. revision: yes
Referee: [Results] The negative result on stabilization advantage is reported using the cosine-based endpoint stability metric, yet the manuscript itself flags a preprocessing-dependent cosine geometry issue. The results section should include an ablation or alternative metric (e.g., based on activation correlations or L2 feature distances) to show that the 'no advantage' conclusion is robust rather than metric-specific.

Authors: We concur that robustness to metric choice should be demonstrated. The revised Results section will include an additional ablation that recomputes the stabilization comparison using both L2 feature distances and pairwise activation correlations. Preliminary checks confirm that the conclusion of no archetypal stabilization advantage persists under these alternatives, consistent with the cosine-geometry caveat already noted in the manuscript. revision: yes

Circularity Check

0 steps flagged

Empirical ablation study; no mathematical derivation or self-referential reduction

The manuscript presents experimental ablations comparing SAE training runs with and without deterministic k-means decoder initialization. The central claim—that reported endpoint stability is an artifact of shared initialization rather than the archetypal constraint—is supported by direct empirical comparison of inter-run dictionary distances, not by any equation or first-principles derivation that reduces to its own inputs. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the abstract or described methodology. The distinction between stability and stabilization is introduced as a clarification of terminology based on the experimental outcomes, not as a circular premise. This is a standard empirical reproduction and ablation paper whose result is falsifiable by independent replication and does not rely on internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical ablation study. With only the abstract available, the ledger reflects standard ML experimental assumptions rather than novel postulates. No free parameters, invented entities, or non-standard axioms are evident from the provided text.

axioms (1)

domain assumption: The original archetypal SAE implementation used a deterministic k-means decoder initialization across runs
This is the central premise the ablation tests; stated as the source of the stability artifact in the abstract.

pith-pipeline@v0.9.1-grok · 5787 in / 1320 out tokens · 29193 ms · 2026-06-28T15:38:29.726940+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
cs.LG 2026-05 unverdicted novelty 7.0

Aligned training reparameterizes SAEs to enforce unit inner product between encoder and decoder directions, eliminating dead features and enhancing stability without hyperparameters.
Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)
cs.LG 2026-05 unverdicted novelty 6.0

Aligned training reparameterizes SAEs to enforce unit alignment between encoder and decoder directions, yielding Pareto gains on SAEBench while removing dead features and improving stability.

Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, and 1 others. 2023. Pythia: A suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conferenc... https://arxiv.org/abs/2304.01373

[2] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, and 6 others. 2023. Towards monosemanticity: Decomposing language models with d...

[3] Michał Brzozowski and Neo Christopher Chung. 2026. Aligned training: A parameter-free method to improve feature quality and stability of sparse autoencoders (SAE). Preprint, arXiv:2605.18629. https://arxiv.org/abs/2605.18629

[4] Adele Cutler and Leo Breiman. 1994. Archetypal analysis. Technometrics, 36(4):338--347.

[5] Thomas Fel, Victor Boutin, Louis Béthune, Remi Cadene, Mazda Moayeri, Léo Andéol, Mathieu Chalvidal, and Thomas Serre. 2023. A holistic approach to unifying automatic concept extraction and concept importance estimation. In Advances in Neural Information Processing Systems. https://openreview.net/forum?id=MziFFGjpkb

[6] Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E. Ba, and Talia Konkle. 2025. Archetypal SAE: Adaptive and stable dictionary learning for concept extraction in large vision models. In Forty-second International Conferenc... https://openreview.net/forum?id=9v1eW8HgMU

[7] Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. 2025. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=tcsZt9ZNKD

[8] Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. 2023. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations.

[9] Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark. 2025. The geometry of concepts: Sparse autoencoder feature structure. Entropy, 27(4). https://doi.org/10.3390/e27040344

[10] Gonçalo Paulo and Nora Belrose. 2026. Sparse autoencoders trained on the same data learn different features. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=EjInprGpk9

[11] Lee Sharkey, Dan Braun, and beren. 2022. [Interim research report] Taking features out of superposition with sparse autoencoders. AI Alignment Forum. https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition

[12] Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, and Kun Zhang. 2025. Position: Mechanistic interpretability should prioritize feature consistency in SAEs. In Mechanistic Interpretability Workshop at NeurIPS 2025. https://openreview.net/forum?id=d9ACURK6bI
