Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

Micha{\l} Brzozowski; Neo Christopher Chung

arxiv: 2605.18629 · v2 · pith:5FYGF45Jnew · submitted 2026-05-18 · 💻 cs.LG

Aligned Training: A Parameter-Free Method to Improve Feature Quality and Stability of Sparse Autoencoders (SAE)

Micha{\l} Brzozowski , Neo Christopher Chung This is my paper

Pith reviewed 2026-06-30 18:23 UTC · model grok-4.3

classification 💻 cs.LG

keywords sparse autoencodersaligned trainingalignment scoredead featurestraining stabilitymodel interpretabilityreparameterizationfeature quality

0 comments

The pith

Enforcing unit inner product between each encoder and decoder direction removes degeneracy in sparse autoencoders and eliminates dead features without new hyperparameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse autoencoders decompose neural activations into features but produce many dead features that never activate and show unstable behavior across training runs. The authors observe that alignment scores, the inner products between corresponding encoder and decoder vectors, follow a bimodal distribution, which they treat as evidence of a training degeneracy. Aligned training is a reparameterization that forces every feature's alignment score to exactly one while leaving all other training choices unchanged. Experiments across models, dictionary sizes, and sparsity levels show the method removes dead features, raises stability, and improves reconstruction on SAEBench benchmarks. The change adds no parameters and combines with existing variants such as Top-K and p-annealing.

Core claim

The central claim is that the bimodal distribution of alignment scores reveals a harmful degeneracy in which some encoder-decoder pairs are misaligned, and that reparameterizing the SAE to enforce an inner product of exactly one for every feature removes this degeneracy, yielding zero dead features, higher stability across seeds, and better reconstruction without any added hyperparameters or computational cost.

What carries the argument

Aligned training, a reparameterization that normalizes the decoder and adjusts the encoder so the inner product between each encoder and decoder direction equals one.

If this is right

Dead features disappear across dictionary sizes and sparsity levels.
Training becomes stable across different random seeds.
Reconstruction quality and SAEBench scores improve without extra cost.
The method combines directly with Top-K, BatchTop-K, and p-annealing variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reparameterization could be tested on other dictionary-learning objectives that also learn paired encoder and decoder matrices.
If the unit-alignment constraint proves robust, it could become a default preprocessing step before applying any SAE variant for interpretability work.
The observation of bimodality may motivate further geometric analysis of how feature directions interact during training.

Load-bearing premise

The bimodal distribution of alignment scores indicates a harmful degeneracy that is best corrected by forcing every alignment score exactly to one rather than some other fixed value or no constraint at all.

What would settle it

A controlled comparison on the same models and data in which aligned training still produces dead features or lower stability than the baseline SAE.

Figures

Figures reproduced from arXiv: 2605.18629 by Micha{\l} Brzozowski, Neo Christopher Chung.

**Figure 2.** Figure 2: Aligned training improves recovered cross-entropy across different sparsity levels. Dictio [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Aligned training improves TopK and BatchTopK autoencoders in the low-sparsity regime. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Aligned training reduces dead features to near zero without resampling or auxiliary losses. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: The dead-feature reduction extends to TopK and BatchTopK. Dictionary size 65K, layer 12 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Aligned training significantly improves cross-seed stability for both ReLU and TopK [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Reconstruction metrics for Pythia 160M (layer 8), dictionary size 4096, 3 random seeds. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Alive-feature fraction for Pythia 160M (layer 8), dictionary size 4096. [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: SCR metric from SAEBench, dictionary size 65K, Pythia 160M and Gemma 2 2B. [PITH_FULL_IMAGE:figures/full_fig_p009_9.png] view at source ↗

**Figure 10.** Figure 10: Bimodality of SAE alignment scores across different models and architectures. [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

**Figure 10.** Figure 10: Bimodality of SAE alignment scores across different models and architectures. [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: MCS vs. alignment score (Pearson r = 0.65). The red vertical line marks ai = 1. C.3 Alignment Scores Are Correlated with Autointerpretability The alignment score is positively correlated with autointerpretability (Pearson r = 0.32; [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

**Figure 12.** Figure 12: Autointerpretability vs. alignment score (Pearson [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗

**Figure 13.** Figure 13: Reconstruction metrics for Pythia 160M (layer 8) and Gemma 2 2B (layer 12), dictionary [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗

**Figure 14.** Figure 14: Weight tying reduces dead features but at the cost of reconstruction quality. Pythia 160M [PITH_FULL_IMAGE:figures/full_fig_p015_14.png] view at source ↗

**Figure 15.** Figure 15: Reconstruction metrics, dictionary size 16384, Pythia 160M (layer 8) and Gemma 2 2B [PITH_FULL_IMAGE:figures/full_fig_p016_15.png] view at source ↗

**Figure 16.** Figure 16: Alive-feature fraction, dictionary size 16384. [PITH_FULL_IMAGE:figures/full_fig_p016_16.png] view at source ↗

**Figure 17.** Figure 17: Reconstruction metrics, dictionary size 65K. [PITH_FULL_IMAGE:figures/full_fig_p017_17.png] view at source ↗

**Figure 18.** Figure 18: Alive-feature fraction, dictionary size 65K. [PITH_FULL_IMAGE:figures/full_fig_p017_18.png] view at source ↗

**Figure 19.** Figure 19: Reconstruction metrics at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12). [PITH_FULL_IMAGE:figures/full_fig_p017_19.png] view at source ↗

**Figure 20.** Figure 20: Alive-feature fraction at 500M tokens, dictionary size 65K, Gemma 2 2B (layer 12). [PITH_FULL_IMAGE:figures/full_fig_p018_20.png] view at source ↗

read the original abstract

Sparse autoencoders (SAEs) are one of the main methods to interpret the inner workings of deep neural networks (DNNs), decomposing activations into higher-dimensional features. However, they exhibit critical shortcomings where a large fraction of features are never activated and are unstable. Despite variants of SAEs that attempt to mitigate these issues, they require additional data, resampling, or training. We propose the \textbf{aligned training}, a parameter-free reparameterization of SAEs that simultaneously improves reconstruction quality, eliminates dead features, and significantly enhances stability across training seeds. Our approach is motivated by an overlooked observation that SAE feature quality, measured by the inner product between encoder and decoder directions (which we call the \textbf{alignment score}), follows a bimodal distribution across all modern architectures. The proposed aligned training enforces a geometric constraint between the encoder and decoder such that their inner product equals one for every feature, which removes a source of degeneracy in the SAE training without adding any hyperparameters. Across multiple models, dictionary sizes, and sparsity levels, the aligned training shows Pareto improvements on the SAEBench benchmarks. Beyond improving dead features, stability and reconstruction, our method readily integrates with techniques in mechanical interpretability such as Top/BatchTop-K architectures and p-Annealing. Overall, the aligned training substantially improves feature quality and stability of SAE without computational complexity or cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The aligned training reparameterization forces encoder-decoder inner product to exactly 1 and reports cleaner SAEs with fewer dead features, but the case that this specific target fixes a real degeneracy rather than trading one issue for another is still thin.

read the letter

The new piece is a reparameterization that hard-constrains every feature's encoder-decoder alignment score to 1. It is parameter-free, integrates with existing SAE tricks like Top-K, and the authors report it removes dead features, raises stability across seeds, and improves reconstruction on SAEBench across several models and sparsity settings.

That is useful if it holds. A zero-cost change that reliably reduces dead features would matter inside the SAE literature.

The soft spot is the justification. The bimodal alignment distribution is treated as evidence of degeneracy that is best removed by pinning the value at 1. Nothing shown demonstrates that 1 is better than the per-feature mean, a different constant, or no constraint at all, nor that the bimodality itself drives the problems rather than being a side effect of the usual training dynamics. The gains could come from altered gradient flow or effective capacity instead. The abstract and stress-test note leave this untested.

The experimental coverage looks reasonable for the claim, but the causal story needs tighter checks.

This is for people already training or benchmarking SAEs. It is coherent enough to deserve referee time; the method is simple enough that reviewers can check the results quickly.

Referee Report

3 major / 2 minor

Summary. The paper proposes aligned training, a parameter-free reparameterization of sparse autoencoders (SAEs) that enforces the inner product between each feature's encoder and decoder vectors to equal exactly 1. Motivated by an observed bimodal distribution of these alignment scores, the method is claimed to eliminate dead features, improve reconstruction quality and training stability across seeds, and deliver Pareto improvements on SAEBench benchmarks across models, dictionary sizes, and sparsity levels. It integrates with existing SAE variants such as TopK and p-Annealing without added hyperparameters or computational cost.

Significance. If the central claims hold under rigorous verification, the contribution would be significant for mechanistic interpretability: SAEs remain a core tool for feature extraction, and the well-documented problems of dead features and seed instability have previously required ad-hoc fixes with extra hyperparameters or data. A truly parameter-free geometric constraint that simultaneously improves multiple metrics would be a practical advance, especially given its claimed compatibility with other architectures.

major comments (3)

[Abstract, §3] Abstract and §3 (method): The load-bearing premise that the bimodal alignment-score distribution signals a harmful degeneracy best removed by forcing every inner product exactly to 1 is not supported by comparative evidence. The manuscript should demonstrate (via ablation) that target=1 outperforms other fixed targets, the per-feature mean, or the unconstrained baseline; without this, the observed gains could stem from altered gradient flow or capacity rather than removal of the cited degeneracy.
[§4] §4 (experiments): The Pareto-improvement claim on SAEBench is presented across multiple models and settings, but the paper does not report whether the aligned-training runs were matched for total compute or whether the baseline SAEs used identical hyperparameter sweeps; if the baselines were under-optimized, the relative gains are overstated.
[§4.2] §4.2 (dead-feature results): The assertion that aligned training 'eliminates' dead features requires a precise operational definition and quantitative comparison (e.g., fraction of features with activation frequency < threshold); the current description leaves open whether the improvement is absolute or merely relative to a particular baseline.

minor comments (2)

[§2] Notation: the term 'alignment score' is introduced in the abstract but should be formally defined with an equation (e.g., a_i = <e_i, d_i>) at first use in §2 or §3.
[Figure 1] Figure clarity: the histogram of alignment scores (presumably Figure 1) should include the post-aligned-training distribution for direct visual comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point below and will incorporate clarifications and additional evidence in the revised manuscript.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (method): The load-bearing premise that the bimodal alignment-score distribution signals a harmful degeneracy best removed by forcing every inner product exactly to 1 is not supported by comparative evidence. The manuscript should demonstrate (via ablation) that target=1 outperforms other fixed targets, the per-feature mean, or the unconstrained baseline; without this, the observed gains could stem from altered gradient flow or capacity rather than removal of the cited degeneracy.

Authors: The target of exactly 1 follows directly from the geometric requirement that each decoder vector equals its corresponding encoder direction (removing the observed bimodality where some features have near-zero alignment). We agree that explicit ablations would strengthen this claim. In the revision we will add a controlled ablation comparing fixed targets of 0.5, 1.0 and 1.5, the per-feature mean alignment, and the unconstrained baseline, reporting effects on dead-feature rate, seed stability, and SAEBench scores. revision: yes
Referee: [§4] §4 (experiments): The Pareto-improvement claim on SAEBench is presented across multiple models and settings, but the paper does not report whether the aligned-training runs were matched for total compute or whether the baseline SAEs used identical hyperparameter sweeps; if the baselines were under-optimized, the relative gains are overstated.

Authors: All runs used identical training budgets (same step count, batch size, and optimizer schedule). Hyperparameter grids for sparsity, learning rate, and dictionary size were swept identically for both aligned and baseline models. We will add an explicit statement in §4 confirming matched compute and identical sweep protocols. revision: yes
Referee: [§4.2] §4.2 (dead-feature results): The assertion that aligned training 'eliminates' dead features requires a precise operational definition and quantitative comparison (e.g., fraction of features with activation frequency < threshold); the current description leaves open whether the improvement is absolute or merely relative to a particular baseline.

Authors: We will revise §4.2 to define dead features as those with activation frequency below 10^{-5} on the held-out evaluation set and will report the exact fractions for aligned training versus each baseline across all model sizes and sparsity levels, demonstrating that the reduction is absolute (near-zero dead features) rather than merely relative. revision: yes

Circularity Check

0 steps flagged

No significant circularity; aligned training is a direct geometric reparameterization with empirical results

full rationale

The paper's chain starts from an empirical observation of bimodal alignment scores (inner product between encoder and decoder) and introduces a reparameterization that directly enforces this inner product to equal 1 for all features. This constraint is imposed as a training modification rather than derived from or equivalent to any fitted parameters, prior predictions, or self-referential equations. Claimed benefits (elimination of dead features, stability, Pareto gains on SAEBench) are shown via external benchmark evaluations across models and settings, not by algebraic reduction to the input assumptions. No self-citations, uniqueness theorems, or ansatzes from prior author work are used to justify the core method or target value of 1. The derivation remains self-contained as an independent geometric intervention.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that forcing alignment exactly to one removes degeneracy without side effects. No free parameters are introduced. No new entities are postulated.

axioms (1)

domain assumption The bimodal distribution of alignment scores indicates a source of degeneracy best addressed by enforcing inner product exactly equal to one for every feature.
Stated in the abstract as the motivation for the method.

pith-pipeline@v0.9.1-grok · 5783 in / 1269 out tokens · 23155 ms · 2026-06-30T18:23:12.855369+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design
cs.LG 2026-06 unverdicted novelty 7.0

Archetypal SAEs appear stable only because of shared deterministic initialization; removing it eliminates any stabilization benefit from the archetypal constraint.

Reference graph

Works this paper leans on

34 extracted references · 7 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Addressing feature suppression in saes.AI Alignment Forum, 2024

Lee Sharkey Benjamin Wright. Addressing feature suppression in saes.AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/ addressing-feature-suppression-in-saes

2024
[2]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023

2023
[3]

Language models can explain neu- rons in language models, 2023

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neu- rons in language models, 2023. URL https://openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html

2023
[4]

Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. Saelens. https://github. com/jbloomAus/SAELens, 2024

2024
[5]

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

2023
[6]

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

Michał Brzozowski and Neo Christopher Chung. Ablating archetypes: The stability of archety- pal saes is an artifact of initialization and metric design, 2026. URL https://arxiv.org/ abs/2606.02061

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Batchtopk sparse autoencoders

Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. InNeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024. URL https: //openreview.net/forum?id=d4dpOCqybL

2024
[8]

Update on dictionary learning improvements.Transformer Circuits Thread, 2024

Tom Conerly, Adly Templeton, Trenton Bricken, Jonathan Marcus, and Tom Henighan. Update on dictionary learning improvements.Transformer Circuits Thread, 2024. URL https: //transformer-circuits.pub/2024/april-update/index.html#training-saes

2024
[9]

Autointerpretation finds sparse coding beats alternatives.AI Alignment Forum, 2023

Hoagy Cunningham. Autointerpretation finds sparse coding beats alternatives.AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/ursraZGcpfMjCXtnn/ autointerpretation-finds-sparse-coding-beats-alternatives

2023
[10]

[replication] conjec- ture’s sparse coding in small transformers.Less Wrong, 2023

Hoagy Cunningham and Logan Riggs. [replication] conjec- ture’s sparse coding in small transformers.Less Wrong, 2023. URL https://www.lesswrong.com/posts/vBcsAw4rvLsri3JAj/ replication-conjecture-s-sparse-coding-in-small-transformers

2023
[11]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022. URL https: //transformer-circuits.pub/20...

2022
[12]

Prince, Matthew Kowal, Victor Boutin, Is- abel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E

Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Is- abel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E. Ba, and Talia Konkle. Archetypal SAE: Adaptive and stable dictionary learning for concept extraction in large vi- sion models. InForty-second International Conference on Machine Learning, 2025. URL https://openrev...

2025
[13]

Neocognitron: A hierarchical neural network capable of visual pat- tern recognition.Neural Networks, 1(2):119–130, 1988

Kunihiko Fukushima. Neocognitron: A hierarchical neural network capable of visual pat- tern recognition.Neural Networks, 1(2):119–130, 1988. ISSN 0893-6080. doi: https://doi. org/10.1016/0893-6080(88)90014-7. URL https://www.sciencedirect.com/science/ article/pii/0893608088900147. 10

work page doi:10.1016/0893-6080(88)90014-7 1988
[14]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[15]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. InThe Thirteenth International Conference on Learning Representations, 2025. URL https:// openreview.net/forum?id=tcsZt9ZNKD

2025
[16]

[research update] sparse autoencoder features are bimodal.From AI to ZI, 2023

Robert Huben. [research update] sparse autoencoder features are bimodal.From AI to ZI, 2023. URLhttps://aizi.substack.com/p/research-update-sparse-autoencoder

2023
[17]

Sparse autoencoders find highly interpretable features in language models

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. InThe Twelfth International Conference on Learning Representations, 2023

2023
[18]

Ghost grads: An improvement on resampling.Transformer Circuits Thread, 2024

Adam Jermyn and Adly Templeton. Ghost grads: An improvement on resampling.Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/ index.html#dict-learning-resampling

2024
[19]

Evaluating sparse autoencoders on targeted concept erasure tasks, 2024

Adam Karvonen, Can Rager, Samuel Marks, and Neel Nanda. Evaluating sparse autoencoders on targeted concept erasure tasks, 2024. URLhttps://arxiv.org/abs/2411.18895

work page arXiv 2024
[20]

Measuring progress in dictionary learning for language model interpretability with board game models

Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. InICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum? id=qzsDKwGJyB

2024
[21]

Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URL https://arxiv.org/abs/2503. 09532

2025
[22]

Lecun, L

Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

work page doi:10.1109/5.726791 1998
[23]

Enhancing neural network interpretability with feature-aligned sparse autoencoders.arXiv preprint arXiv:2411.01220,

Luke Marks, Alasdair Paren, David Krueger, and Fazl Barez. Enhancing neural network interpretability with feature-aligned sparse autoencoders, 2024. URL https://arxiv.org/ abs/2411.01220

work page arXiv 2024
[24]

Sparse feature circuits: Discovering and editing interpretable causal graphs in language models

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=I4e82CIDxv

2025
[25]

Sparse autoencoders trained on the same data learn different features

Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=EjInprGpk9

2026
[26]

Improving sparse decomposition of lan- guage model activations with gated sparse autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving sparse decomposition of lan- guage model activations with gated sparse autoencoders. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sy...

2024
[27]

Jumping ahead: Improving reconstruction fidelity with jumpreLU sparse autoencoders, 2025

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumpreLU sparse autoencoders, 2025. URL https://openreview.net/forum?id= mMPaQzgzAN. 11

2025
[28]

(tentatively) found 600+ monosemantic features in a small lm using sparse autoencoders.AI Alignment Forum, 2023

Logan Riggs. (tentatively) found 600+ monosemantic features in a small lm using sparse autoencoders.AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/wqRqb7h6ZC48iDgfK/ tentatively-found-600-monosemantic-features-in-a-small-lm

2023
[29]

Einops: Clear and reliable tensor manipulations with einstein-like notation

Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with einstein-like notation. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=oapKSVM2bcj

2022
[30]

dictionary_learning, 2024

Adam Karvonen Samuel Marks and Aaron Mueller. dictionary_learning, 2024. URL https: //github.com/saprmarks/dictionary_learning

2024
[31]

Taking features out of superposition with sparse autoencoders.Alignment Forum, 2023

Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders.Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/ interim-research-report-taking-features-out-of-superposition

2023
[32]

Diab, Virginia Smith, and Kun Zhang

Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, and Kun Zhang. Position: Mechanistic interpretability should prioritize feature consistency in SAEs. InMechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=d9ACURK6bI

2025
[33]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

activation_dim -1␣ dict_size

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering LLMs? even simple base- 12 lines outperform sparse autoencoders. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=K2CckZjNy0. A Implementation Details All SA...

2025

[1] [1]

Addressing feature suppression in saes.AI Alignment Forum, 2024

Lee Sharkey Benjamin Wright. Addressing feature suppression in saes.AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/ addressing-feature-suppression-in-saes

2024

[2] [2]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023

2023

[3] [3]

Language models can explain neu- rons in language models, 2023

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neu- rons in language models, 2023. URL https://openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html

2023

[4] [4]

Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. Saelens. https://github. com/jbloomAus/SAELens, 2024

2024

[5] [5]

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

2023

[6] [6]

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

Michał Brzozowski and Neo Christopher Chung. Ablating archetypes: The stability of archety- pal saes is an artifact of initialization and metric design, 2026. URL https://arxiv.org/ abs/2606.02061

work page internal anchor Pith review Pith/arXiv arXiv 2026

[7] [7]

Batchtopk sparse autoencoders

Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. InNeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024. URL https: //openreview.net/forum?id=d4dpOCqybL

2024

[8] [8]

Update on dictionary learning improvements.Transformer Circuits Thread, 2024

Tom Conerly, Adly Templeton, Trenton Bricken, Jonathan Marcus, and Tom Henighan. Update on dictionary learning improvements.Transformer Circuits Thread, 2024. URL https: //transformer-circuits.pub/2024/april-update/index.html#training-saes

2024

[9] [9]

Autointerpretation finds sparse coding beats alternatives.AI Alignment Forum, 2023

Hoagy Cunningham. Autointerpretation finds sparse coding beats alternatives.AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/ursraZGcpfMjCXtnn/ autointerpretation-finds-sparse-coding-beats-alternatives

2023

[10] [10]

[replication] conjec- ture’s sparse coding in small transformers.Less Wrong, 2023

Hoagy Cunningham and Logan Riggs. [replication] conjec- ture’s sparse coding in small transformers.Less Wrong, 2023. URL https://www.lesswrong.com/posts/vBcsAw4rvLsri3JAj/ replication-conjecture-s-sparse-coding-in-small-transformers

2023

[11] [11]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Transformer Circuits Thread, 2022. URL https: //transformer-circuits.pub/20...

2022

[12] [12]

Prince, Matthew Kowal, Victor Boutin, Is- abel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E

Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Is- abel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba E. Ba, and Talia Konkle. Archetypal SAE: Adaptive and stable dictionary learning for concept extraction in large vi- sion models. InForty-second International Conference on Machine Learning, 2025. URL https://openrev...

2025

[13] [13]

Neocognitron: A hierarchical neural network capable of visual pat- tern recognition.Neural Networks, 1(2):119–130, 1988

Kunihiko Fukushima. Neocognitron: A hierarchical neural network capable of visual pat- tern recognition.Neural Networks, 1(2):119–130, 1988. ISSN 0893-6080. doi: https://doi. org/10.1016/0893-6080(88)90014-7. URL https://www.sciencedirect.com/science/ article/pii/0893608088900147. 10

work page doi:10.1016/0893-6080(88)90014-7 1988

[14] [14]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling.arXiv preprint arXiv:2101.00027, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020

[15] [15]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. InThe Thirteenth International Conference on Learning Representations, 2025. URL https:// openreview.net/forum?id=tcsZt9ZNKD

2025

[16] [16]

[research update] sparse autoencoder features are bimodal.From AI to ZI, 2023

Robert Huben. [research update] sparse autoencoder features are bimodal.From AI to ZI, 2023. URLhttps://aizi.substack.com/p/research-update-sparse-autoencoder

2023

[17] [17]

Sparse autoencoders find highly interpretable features in language models

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. InThe Twelfth International Conference on Learning Representations, 2023

2023

[18] [18]

Ghost grads: An improvement on resampling.Transformer Circuits Thread, 2024

Adam Jermyn and Adly Templeton. Ghost grads: An improvement on resampling.Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/jan-update/ index.html#dict-learning-resampling

2024

[19] [19]

Evaluating sparse autoencoders on targeted concept erasure tasks, 2024

Adam Karvonen, Can Rager, Samuel Marks, and Neel Nanda. Evaluating sparse autoencoders on targeted concept erasure tasks, 2024. URLhttps://arxiv.org/abs/2411.18895

work page arXiv 2024

[20] [20]

Measuring progress in dictionary learning for language model interpretability with board game models

Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. InICML 2024 Workshop on Mechanistic Interpretability, 2024. URL https://openreview.net/forum? id=qzsDKwGJyB

2024

[21] [21]

Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability, 2025. URL https://arxiv.org/abs/2503. 09532

2025

[22] [22]

Lecun, L

Y . Lecun, L. Bottou, Y . Bengio, and P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE, 86(11):2278–2324, 1998. doi: 10.1109/5.726791

work page doi:10.1109/5.726791 1998

[23] [23]

Enhancing neural network interpretability with feature-aligned sparse autoencoders.arXiv preprint arXiv:2411.01220,

Luke Marks, Alasdair Paren, David Krueger, and Fazl Barez. Enhancing neural network interpretability with feature-aligned sparse autoencoders, 2024. URL https://arxiv.org/ abs/2411.01220

work page arXiv 2024

[24] [24]

Sparse feature circuits: Discovering and editing interpretable causal graphs in language models

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=I4e82CIDxv

2025

[25] [25]

Sparse autoencoders trained on the same data learn different features

Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=EjInprGpk9

2026

[26] [26]

Improving sparse decomposition of lan- guage model activations with gated sparse autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving sparse decomposition of lan- guage model activations with gated sparse autoencoders. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Sy...

2024

[27] [27]

Jumping ahead: Improving reconstruction fidelity with jumpreLU sparse autoencoders, 2025

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumpreLU sparse autoencoders, 2025. URL https://openreview.net/forum?id= mMPaQzgzAN. 11

2025

[28] [28]

(tentatively) found 600+ monosemantic features in a small lm using sparse autoencoders.AI Alignment Forum, 2023

Logan Riggs. (tentatively) found 600+ monosemantic features in a small lm using sparse autoencoders.AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/wqRqb7h6ZC48iDgfK/ tentatively-found-600-monosemantic-features-in-a-small-lm

2023

[29] [29]

Einops: Clear and reliable tensor manipulations with einstein-like notation

Alex Rogozhnikov. Einops: Clear and reliable tensor manipulations with einstein-like notation. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=oapKSVM2bcj

2022

[30] [30]

dictionary_learning, 2024

Adam Karvonen Samuel Marks and Aaron Mueller. dictionary_learning, 2024. URL https: //github.com/saprmarks/dictionary_learning

2024

[31] [31]

Taking features out of superposition with sparse autoencoders.Alignment Forum, 2023

Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders.Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/ interim-research-report-taking-features-out-of-superposition

2023

[32] [32]

Diab, Virginia Smith, and Kun Zhang

Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, and Kun Zhang. Position: Mechanistic interpretability should prioritize feature consistency in SAEs. InMechanistic Interpretability Workshop at NeurIPS 2025, 2025. URL https://openreview.net/forum?id=d9ACURK6bI

2025

[33] [33]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

activation_dim -1␣ dict_size

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering LLMs? even simple base- 12 lines outperform sparse autoencoders. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=K2CckZjNy0. A Implementation Details All SA...

2025