
arxiv: 2605.18629 · v2 · pith:5FYGF45Jnew · submitted 2026-05-18 · 💻 cs.LG

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

Pith reviewed 2026-06-30 18:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords sparse autoencoders · aligned training · alignment score · dead features · training stability · model interpretability · reparameterization · feature quality

The pith

Enforcing unit inner product between each encoder and decoder direction removes degeneracy in sparse autoencoders and eliminates dead features without new hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse autoencoders decompose neural activations into features but produce many dead features that never activate and show unstable behavior across training runs. The authors observe that alignment scores, the inner products between corresponding encoder and decoder vectors, follow a bimodal distribution, which they treat as evidence of a training degeneracy. Aligned training is a reparameterization that forces every feature's alignment score to exactly one while leaving all other training choices unchanged. Experiments across models, dictionary sizes, and sparsity levels show the method removes dead features, raises stability, and improves reconstruction on SAEBench benchmarks. The change adds no parameters and combines with existing variants such as Top-K and p-annealing.
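As a minimal sketch of the quantity described above, the alignment score of feature i can be computed as the row-wise inner product of the encoder and decoder weights. The matrix layout (one row per feature), the names `W_enc` and `W_dec`, and the toy weights are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def alignment_scores(W_enc, W_dec):
    """Return a_i = <e_i, d_i> for every feature i.

    Assumed layout: both arrays have shape (n_features, d_model),
    with row i holding the encoder / decoder direction of feature i.
    """
    W_enc = np.asarray(W_enc, dtype=float)
    W_dec = np.asarray(W_dec, dtype=float)
    if W_enc.shape != W_dec.shape:
        raise ValueError("encoder and decoder must have the same shape")
    # Row-wise dot product, one score per feature.
    return np.einsum("ij,ij->i", W_enc, W_dec)

# Toy weights (hypothetical): 4 well-aligned features, 4 nearly orthogonal ones.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(8, 16))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm decoder rows
W_enc = W_dec.copy()                                   # rows 0-3: score exactly 1
W_enc[4:] = 0.01 * rng.normal(size=(4, 16))            # rows 4-7: score near 0

scores = alignment_scores(W_enc, W_dec)
print(np.round(scores, 3))
```

A histogram of `scores` over a trained SAE is the kind of plot in which the bimodality discussed here would show up; the toy weights only mimic two clusters by construction.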

Core claim

The central claim is that the bimodal distribution of alignment scores reveals a harmful degeneracy in which some encoder-decoder pairs are misaligned, and that reparameterizing the SAE to enforce an inner product of exactly one for every feature removes this degeneracy, yielding zero dead features, higher stability across seeds, and better reconstruction without any added hyperparameters or computational cost.

What carries the argument

Aligned training, a reparameterization that normalizes the decoder and adjusts the encoder so the inner product between each encoder and decoder direction equals one.
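One way such a constraint can be enforced by construction is sketched below. This is an illustrative assumption, not necessarily the construction used in the paper: the decoder rows are normalized to unit length, and each free encoder row is shifted along its decoder direction until the inner product equals one. The function name `aligned_pair` and the parameter names are hypothetical.

```python
import numpy as np

def aligned_pair(W_free, W_dec_raw, eps=1e-12):
    """Map unconstrained weights to (E, D) with <E_i, D_i> = 1 for every row i.

    Assumed layout: shape (n_features, d_model), one feature per row.
    """
    W_free = np.asarray(W_free, dtype=float)
    W_dec_raw = np.asarray(W_dec_raw, dtype=float)
    # Step 1: normalize each decoder row to unit length.
    norms = np.linalg.norm(W_dec_raw, axis=1, keepdims=True)
    D = W_dec_raw / np.maximum(norms, eps)
    # Step 2: current inner product of each free encoder row with its decoder row.
    a = np.einsum("ij,ij->i", W_free, D)
    # Step 3: shift along D_i so that <E_i, D_i> = a_i + (1 - a_i) * ||D_i||^2 = 1.
    E = W_free + (1.0 - a)[:, None] * D
    return E, D

rng = np.random.default_rng(0)
W_free = rng.normal(size=(6, 10))
W_dec_raw = rng.normal(size=(6, 10))
E, D = aligned_pair(W_free, W_dec_raw)
check = np.einsum("ij,ij->i", E, D)
print(np.round(check, 6))
```

Under this sketch the unconstrained arrays would be the trainable parameters and `(E, D)` would be recomputed in each forward pass, so the constraint holds at every step without a penalty term or an extra hyperparameter. The component of each encoder row orthogonal to its decoder direction is left untouched.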

If this is right

  • Dead features disappear across dictionary sizes and sparsity levels.
  • Learned features become more stable across random seeds.
  • Reconstruction quality and SAEBench scores improve without extra cost.
  • The method combines directly with Top-K, BatchTop-K, and p-annealing variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reparameterization could be tested on other dictionary-learning objectives that also learn paired encoder and decoder matrices.
  • If the unit-alignment constraint proves robust, it could become a default training-time constraint for any SAE variant used in interpretability work.
  • The observation of bimodality may motivate further geometric analysis of how feature directions interact during training.

Load-bearing premise

The bimodal distribution of alignment scores indicates a harmful degeneracy that is best corrected by forcing every alignment score exactly to one rather than some other fixed value or no constraint at all.

What would settle it

A controlled comparison on the same models and data in which aligned training still produces dead features or lower stability than the baseline SAE.

Figures

Figures reproduced from arXiv: 2605.18629 by Michał Brzozowski, Neo Christopher Chung.

Figure 1: Geometric interpretation of aligned training for a single feature
Figure 2: Aligned training improves recovered cross-entropy across different sparsity levels. Dictio...
Figure 3: Aligned training improves TopK and BatchTopK autoencoders in the low-sparsity regime.
Figure 4: Aligned training reduces dead features to near zero without resampling or auxiliary losses.
Figure 5: The dead-feature reduction extends to TopK and BatchTopK. Dictionary size 65K, layer 12 ...
Figure 6: Aligned training significantly improves cross-seed stability for both ReLU and TopK ...
Figure 7: Reconstruction metrics for Pythia 160M (layer 8), dictionary size 4096, 3 random seeds.
Figure 8: Alive-feature fraction for Pythia 160M (layer 8), dictionary size 4096.
Figure 9: SCR metric from SAEBench, dictionary size 65K, Pythia 160M and Gemma 2 2B.
Figure 10: Bimodality of SAE alignment scores across different models and architectures.
Figure 11: MCS vs. alignment score (Pearson r = 0.65). The red vertical line marks a_i = 1.
    C.3 Alignment Scores Are Correlated with Autointerpretability. The alignment score is positively correlated with autointerpretability (Pearson r = 0.32; ...
Figure 12: Autointerpretability vs. alignment score (Pearson ...
Figure 13: Reconstruction metrics for Pythia 160M (layer 8) and Gemma 2 2B (layer 12), dictionary ...
Figure 14: Weight tying reduces dead features but at the cost of reconstruction quality. Pythia 160M ...
Figure 15: Reconstruction metrics, dictionary size 16384, Pythia 160M (layer 8) and Gemma 2 2B ...
Figure 16: Alive-feature fraction, dictionary size 16384.
Figure 17: Reconstruction metrics, dictionary size 65K.
Figure 18: Alive-feature fraction, dictionary size 65K.
Figure 19: Reconstruction metrics at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12).
Figure 20: Alive-feature fraction at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12).
Original abstract

Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings where a large fraction of features are never activated and are unstable. Despite variants of SAEs that attempt to mitigate these issues, they require additional data, resampling, or training. We propose the aligned training, a parameter-free reparameterization of SAEs that simultaneously improves reconstruction quality, eliminates dead features, and significantly enhances stability across training seeds. Our approach is motivated by an overlooked observation that SAE feature quality, measured by the inner product between encoder and decoder directions (which we call the alignment score), follows a bimodal distribution across all modern architectures. The proposed aligned training enforces a geometric constraint between the encoder and decoder such that their inner product equals one for every feature, which removes a source of degeneracy in the SAE training without adding any hyperparameters. Across multiple models, dictionary sizes, and sparsity levels, the aligned training shows Pareto improvements on the SAEBench benchmarks. Beyond improving dead features, stability and reconstruction, our method readily integrates with techniques in mechanistic interpretability such as Top/BatchTop-K architectures and p-Annealing. Overall, the aligned training substantially improves feature quality and stability of SAE without computational complexity or cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes aligned training, a parameter-free reparameterization of sparse autoencoders (SAEs) that enforces the inner product between each feature's encoder and decoder vectors to equal exactly 1. Motivated by an observed bimodal distribution of these alignment scores, the method is claimed to eliminate dead features, improve reconstruction quality and training stability across seeds, and deliver Pareto improvements on SAEBench benchmarks across models, dictionary sizes, and sparsity levels. It integrates with existing SAE variants such as TopK and p-Annealing without added hyperparameters or computational cost.

Significance. If the central claims hold under rigorous verification, the contribution would be significant for mechanistic interpretability: SAEs remain a core tool for feature extraction, and the well-documented problems of dead features and seed instability have previously required ad-hoc fixes with extra hyperparameters or data. A truly parameter-free geometric constraint that simultaneously improves multiple metrics would be a practical advance, especially given its claimed compatibility with other architectures.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (method): The load-bearing premise that the bimodal alignment-score distribution signals a harmful degeneracy best removed by forcing every inner product exactly to 1 is not supported by comparative evidence. The manuscript should demonstrate (via ablation) that target=1 outperforms other fixed targets, the per-feature mean, or the unconstrained baseline; without this, the observed gains could stem from altered gradient flow or capacity rather than removal of the cited degeneracy.
  2. [§4] §4 (experiments): The Pareto-improvement claim on SAEBench is presented across multiple models and settings, but the paper does not report whether the aligned-training runs were matched for total compute or whether the baseline SAEs used identical hyperparameter sweeps; if the baselines were under-optimized, the relative gains are overstated.
  3. [§4.2] §4.2 (dead-feature results): The assertion that aligned training 'eliminates' dead features requires a precise operational definition and quantitative comparison (e.g., fraction of features with activation frequency < threshold); the current description leaves open whether the improvement is absolute or merely relative to a particular baseline.
minor comments (2)
  1. [§2] Notation: the term 'alignment score' is introduced in the abstract but should be formally defined with an equation (e.g., a_i = <e_i, d_i>) at first use in §2 or §3.
  2. [Figure 1] Figure clarity: the histogram of alignment scores (presumably Figure 1) should include the post-aligned-training distribution for direct visual comparison.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point below and will incorporate clarifications and additional evidence in the revised manuscript.

point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): The load-bearing premise that the bimodal alignment-score distribution signals a harmful degeneracy best removed by forcing every inner product exactly to 1 is not supported by comparative evidence. The manuscript should demonstrate (via ablation) that target=1 outperforms other fixed targets, the per-feature mean, or the unconstrained baseline; without this, the observed gains could stem from altered gradient flow or capacity rather than removal of the cited degeneracy.

    Authors: The target of exactly 1 follows directly from the geometric requirement that each decoder vector equals its corresponding encoder direction (removing the observed bimodality where some features have near-zero alignment). We agree that explicit ablations would strengthen this claim. In the revision we will add a controlled ablation comparing fixed targets of 0.5, 1.0 and 1.5, the per-feature mean alignment, and the unconstrained baseline, reporting effects on dead-feature rate, seed stability, and SAEBench scores. revision: yes

  2. Referee: [§4] §4 (experiments): The Pareto-improvement claim on SAEBench is presented across multiple models and settings, but the paper does not report whether the aligned-training runs were matched for total compute or whether the baseline SAEs used identical hyperparameter sweeps; if the baselines were under-optimized, the relative gains are overstated.

    Authors: All runs used identical training budgets (same step count, batch size, and optimizer schedule). Hyperparameter grids for sparsity, learning rate, and dictionary size were swept identically for both aligned and baseline models. We will add an explicit statement in §4 confirming matched compute and identical sweep protocols. revision: yes

  3. Referee: [§4.2] §4.2 (dead-feature results): The assertion that aligned training 'eliminates' dead features requires a precise operational definition and quantitative comparison (e.g., fraction of features with activation frequency < threshold); the current description leaves open whether the improvement is absolute or merely relative to a particular baseline.

    Authors: We will revise §4.2 to define dead features as those with activation frequency below 10^{-5} on the held-out evaluation set and will report the exact fractions for aligned training versus each baseline across all model sizes and sparsity levels, demonstrating that the reduction is absolute (near-zero dead features) rather than merely relative. revision: yes
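The operational definition proposed in the response above can be sketched as follows. The threshold of 10^{-5} comes from that response; treating a strictly positive activation as "firing", the array layout (tokens by features), and the function name `dead_feature_fraction` are assumptions made for illustration.

```python
import numpy as np

def dead_feature_fraction(acts, threshold=1e-5):
    """Fraction of features whose activation frequency is below `threshold`.

    acts: array of shape (n_tokens, n_features) holding SAE feature
    activations on an evaluation set (assumed layout). A feature counts
    as firing on a token when its activation is strictly positive.
    """
    acts = np.asarray(acts)
    freq = (acts > 0).mean(axis=0)   # per-feature firing frequency
    dead = freq < threshold          # boolean mask of dead features
    return float(dead.mean()), freq

# Toy activations (hypothetical): 1000 tokens, 4 features.
acts = np.zeros((1000, 4))
acts[:, 0] = 1.0      # fires on every token
acts[0, 2] = 0.3      # fires on exactly one token
acts[::2, 3] = 2.0    # fires on every second token
# Feature 1 never fires.

fraction, freq = dead_feature_fraction(acts)
print(fraction, freq)
```

With only 1000 tokens the smallest non-zero frequency is 10^{-3}, so the threshold here only separates features that never fire. The evaluation set needs at least 10^{5} tokens before a cutoff of 10^{-5} can distinguish rare features from dead ones.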

Circularity Check

0 steps flagged

No significant circularity; aligned training is a direct geometric reparameterization with empirical results

full rationale

The paper's chain starts from an empirical observation of bimodal alignment scores (inner product between encoder and decoder) and introduces a reparameterization that directly enforces this inner product to equal 1 for all features. This constraint is imposed as a training modification rather than derived from or equivalent to any fitted parameters, prior predictions, or self-referential equations. Claimed benefits (elimination of dead features, stability, Pareto gains on SAEBench) are shown via external benchmark evaluations across models and settings, not by algebraic reduction to the input assumptions. No self-citations, uniqueness theorems, or ansatzes from prior author work are used to justify the core method or target value of 1. The derivation remains self-contained as an independent geometric intervention.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that forcing alignment exactly to one removes degeneracy without side effects. No free parameters are introduced. No new entities are postulated.

axioms (1)
  • domain assumption: The bimodal distribution of alignment scores indicates a source of degeneracy best addressed by enforcing inner product exactly equal to one for every feature.
    Stated in the abstract as the motivation for the method.

pith-pipeline@v0.9.1-grok · 5783 in / 1269 out tokens · 23155 ms · 2026-06-30T18:23:12.855369+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

    cs.LG · 2026-06 · unverdicted · novelty 7.0

    Archetypal SAEs appear stable only because of shared deterministic initialization; removing it eliminates any stabilization benefit from the archetypal constraint.
