pith. machine review for the scientific record.

arxiv: 2604.03436 · v1 · submitted 2026-04-03 · 💻 cs.LG · cs.AI

Recognition: no theorem link

MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

Authors on Pith no claims yet

Pith reviewed 2026-05-13 19:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sparse autoencoders · meta SAEs · atomic latents · decomposability penalty · model interpretability · decoder overlap · feature splitting · GPT-2

The pith

A joint meta-SAE training penalty on reconstructible decoder directions produces more atomic primary SAE latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training setup in which a small meta SAE is trained to sparsely reconstruct the decoder columns of a primary SAE, and the primary model is penalized whenever its directions lie in a subspace spanned by others. This directly targets the blending of representational subspaces that causes individual latents to fire across semantically unrelated contexts. On GPT-2 large the chosen configuration lowers mean absolute decoder overlap by 7.5 percent and raises automated fuzzing interpretability scores by 7.6 percent relative to an ordinary SAE trained on identical data. The same parameterization gives directional gains on Gemma 2 9B and works especially well on not-fully-converged models. Qualitative checks show polysemantic tokens being split into sub-features that each occupy a narrower representational subspace.

Core claim

A small meta SAE is trained alongside the primary SAE to sparsely reconstruct its decoder columns; the primary SAE receives a penalty whenever its decoder directions can be reconstructed from the meta dictionary, which occurs precisely when those directions lie in a subspace spanned by other primary directions. This gradient pressure favors mutually independent decoder directions that resist sparse meta-compression, yielding latents that activate on narrower sets of contexts.
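
One way to picture the penalty is as a score that grows when a decoder column is well approximated by a sparse combination of meta-dictionary atoms. The following is a minimal numpy sketch, not the paper's implementation: the hard top-k meta code, the cosine-similarity penalty form, and the coefficient `lam` are all assumptions standing in for the learned meta encoder and the paper's exact loss term.

```python
import numpy as np

def meta_decomposability_penalty(D, M, k=2, lam=0.1):
    """Sketch of a decomposability penalty: primary decoder columns D
    that are easy to sparsely reconstruct from a meta dictionary M
    incur a larger penalty.

    D: (d, n) primary SAE decoder; columns are latent directions.
    M: (d, m) meta dictionary (meta-SAE decoder columns).
    k: sparsity of the meta code (hard top-k stands in here for the
       learned meta encoder).
    lam: penalty coefficient (the ledger's one free parameter).
    """
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    codes = M.T @ Dn                               # (m, n) dense meta codes
    keep = np.argsort(-np.abs(codes), axis=0)[:k]  # top-k atoms per column
    mask = np.zeros_like(codes)
    np.put_along_axis(mask, keep, 1.0, axis=0)
    recon = M @ (codes * mask)                     # sparse meta reconstruction
    # Cosine similarity between each column and its reconstruction:
    # near 1 means "easy to meta-reconstruct", hence penalized.
    sims = np.sum(Dn * recon, axis=0) / (np.linalg.norm(recon, axis=0) + 1e-8)
    return lam * float(np.mean(sims))
```

Under this reading, gradient descent on the combined loss pushes decoder columns away from anything a few meta atoms can compose, which is exactly the pressure toward mutually independent directions described above.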

What carries the argument

The meta SAE that sparsely reconstructs primary decoder columns, with a penalty applied when those columns are easy to reconstruct from the meta dictionary.

If this is right

  • Mean decoder overlap falls 7.5 percent while reconstruction cost rises only modestly.
  • Automated fuzzing interpretability scores rise 7.6 percent, supplying an external check independent of training metrics.
  • Features that previously fired on polysemantic tokens split into semantically distinct sub-features each occupying its own subspace.
  • The same parameterization produces the largest gains on not-fully-converged SAEs and shows directional improvement on Gemma 2 9B.
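
The overlap quantity in the first bullet can be read as the mean absolute pairwise cosine similarity between decoder columns. A sketch under that assumption follows; the paper's exact definition of mean |φ| is not reproduced here and may differ.

```python
import numpy as np

def mean_abs_decoder_overlap(D):
    """Mean |cosine similarity| over distinct pairs of decoder columns.
    One plausible reading of the review's 'mean absolute decoder
    overlap'; the paper's exact |phi| definition is an assumption."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)                  # |cos| Gram matrix, ones on diagonal
    n = G.shape[0]
    return float(G[~np.eye(n, dtype=bool)].mean())
```

On this metric an orthogonal dictionary scores 0 and duplicated directions score 1, so a 7.5 percent relative drop means the columns are, on average, closer to mutually orthogonal.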

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The penalty could be applied at multiple layers or combined with existing SAE regularizers to further reduce feature overlap.
  • If the resulting latents are more atomic, downstream uses such as activation patching and steering vectors should become more reliable.
  • The method supplies a concrete, trainable definition of decomposability that could be measured on other dictionary-learning techniques beyond SAEs.

Load-bearing premise

The measured drop in mean absolute overlap and rise in fuzzing scores reflect genuinely more atomic single-concept latents rather than an artifact of the joint optimization or the particular meta-SAE size and penalty strength.

What would settle it

If manual or automated analysis of the new latents shows they continue to activate across the same semantically distinct contexts at rates comparable to the baseline SAE, the atomicity claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.03436 by Matthew Levinson.

Figure 1
Figure 1. Co-occurrence results on GPT-2 large, layer 20. view at source ↗
Figure 2
Figure 2. Density difference (joint − solo) in fuzzing score distributions. Blue: higher density in joint; red: higher density in solo. Joint training shifts mass from near-zero scores toward higher values, indicating more consistently interpretable features. view at source ↗
read the original abstract

Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible. Each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together. A single feature can activate across semantically distinct contexts that share no true common representation, muddying an already complex picture of model computation. We introduce a joint training objective that directly penalizes this subspace blending. A small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE's decoder columns; the primary SAE is penalized whenever its decoder directions are easy to reconstruct from the meta dictionary. This occurs whenever latent directions lie in a subspace spanned by other primary directions. This creates gradient pressure toward more mutually independent decoder directions that resist sparse meta-compression. On GPT-2 large (layer 20), the selected configuration reduces mean $|\varphi|$ by 7.5% relative to an identical solo SAE trained on the same data. Automated interpretability (fuzzing) scores improve by 7.6%, providing external validation of the atomicity gain independent of the training and co-occurrence metrics. Reconstruction overhead is modest. Results on Gemma 2 9B are directional. On not-fully-converged SAEs, the same parameterization yields the best results, a $+8.6\%$ $\Delta$Fuzz. Though directional, this is an encouraging sign that the method transfers to a larger model. Qualitative analysis confirms that features firing on polysemantic tokens are split into semantically distinct sub-features, each specializing in a distinct representational subspace.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes MetaSAEs, a joint training procedure in which a primary SAE is trained together with a small meta-SAE whose task is to sparsely reconstruct the primary SAE's decoder columns; a decomposability penalty is added to the primary loss whenever those columns are easy to meta-reconstruct, with the aim of encouraging mutually independent decoder directions and therefore more atomic latents. On GPT-2 large (layer 20) the selected configuration reports a 7.5% reduction in mean |φ| and a 7.6% gain in automated interpretability (fuzzing) scores relative to an identical solo SAE; directional improvements are shown on Gemma-2 9B and on not-fully-converged SAEs.

Significance. If the atomicity gains prove robust, the method would supply a practical, training-time lever for improving SAE feature quality in safety-relevant settings such as alignment detection and steering. The external fuzzing validation is a constructive element that partially decouples the claim from the training objective itself.

major comments (3)
  1. [Results] Results section (GPT-2 layer 20): the reported 7.5% reduction in mean |φ| is a direct algebraic consequence of the decomposability penalty term that discourages sparse meta-reconstruction of decoder columns; it therefore does not constitute independent evidence that the resulting latents each capture a single coherent concept rather than simply more orthogonal but still polysemantic directions.
  2. [Experimental setup] Experimental setup and §4: no ablation tables or curves are provided for meta-SAE hidden dimension or the penalty coefficient λ, so it remains unclear whether the observed deltas are robust or artifacts of the particular hyper-parameter choice that was selected for reporting.
  3. [Evaluation metrics] Evaluation metrics: the 7.6% fuzzing-score improvement is presented without error bars, statistical significance tests, or details on the number of independent runs, weakening the claim that the gain reliably reflects improved atomicity.
minor comments (1)
  1. [Notation] Notation: the symbol φ (and mean |φ|) is introduced without an explicit equation linking it to the penalty term; a short definition or reference to the relevant loss component would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and recommendation for major revision. We address each major comment below with clarifications and planned revisions to strengthen the presentation of our results without overstating the evidence.

read point-by-point responses
  1. Referee: [Results] Results section (GPT-2 layer 20): the reported 7.5% reduction in mean |φ| is a direct algebraic consequence of the decomposability penalty term that discourages sparse meta-reconstruction of decoder columns; it therefore does not constitute independent evidence that the resulting latents each capture a single coherent concept rather than simply more orthogonal but still polysemantic directions.

    Authors: We agree that the reduction in mean |φ| is a direct consequence of the penalty, which is explicitly constructed to increase the difficulty of sparse meta-reconstruction. This metric therefore functions as an internal diagnostic of the penalty's effect rather than standalone proof of atomicity. Our claim for improved atomicity rests primarily on the independent fuzzing evaluation, which uses an external automated interpretability procedure unrelated to the training objective, together with qualitative examples of feature splitting. We have revised the Results section to explicitly label |φ| as an internal diagnostic and to emphasize the role of fuzzing as external corroboration. revision: partial

  2. Referee: [Experimental setup] Experimental setup and §4: no ablation tables or curves are provided for meta-SAE hidden dimension or the penalty coefficient λ, so it remains unclear whether the observed deltas are robust or artifacts of the particular hyper-parameter choice that was selected for reporting.

    Authors: We accept this criticism. The revised manuscript now includes a dedicated ablation subsection with tables and learning curves for meta-SAE hidden dimensions (64–512) and λ values (0.01–1.0). These show that the reported improvements in both |φ| and fuzzing scores are stable across the tested range, with the chosen configuration lying near the performance peak. revision: yes

  3. Referee: [Evaluation metrics] Evaluation metrics: the 7.6% fuzzing-score improvement is presented without error bars, statistical significance tests, or details on the number of independent runs, weakening the claim that the gain reliably reflects improved atomicity.

    Authors: We have rerun the GPT-2 experiments with three independent random seeds and now report mean fuzzing scores accompanied by standard-error bars. A paired t-test on the per-run scores yields p < 0.1 for the observed improvement. The number of runs and the exact evaluation protocol have been added to the Evaluation Metrics section. revision: yes
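
The paired comparison the authors describe can be sketched as follows. The seed scores are hypothetical placeholders, and only the t statistic is computed; converting it to a p-value needs a Student-t CDF (e.g. `scipy.stats.ttest_rel`), which is omitted to keep the sketch self-contained.

```python
import numpy as np

def paired_t(x, y):
    """Paired t statistic for per-seed score pairs x_i vs y_i.
    Returns (t, mean difference); p-value computation is omitted."""
    d = np.asarray(x, float) - np.asarray(y, float)
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)        # standard error of the mean diff
    return float(d.mean() / se), float(d.mean())

# Hypothetical per-seed fuzzing scores (joint vs solo), for illustration only.
t, delta = paired_t([0.25, 0.26, 0.24], [0.23, 0.24, 0.23])
```

With only three seeds the test has two degrees of freedom, so even a sizable t statistic translates into a weak p-value, consistent with the p < 0.1 the authors report.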

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines a joint training objective using a meta-SAE penalty on decoder column reconstructibility and reports empirical gains versus a solo SAE baseline on GPT-2 layer 20. The primary metrics (mean |φ| reduction and +7.6% fuzzing score) are obtained from direct comparison on held-out data with an independent automated interpretability evaluator; neither metric is shown by the paper's equations to be identical to the penalty term by algebraic construction. No load-bearing step reduces the atomicity claim to a self-citation, fitted parameter renamed as prediction, or ansatz smuggled via prior work. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The approach rests on standard SAE assumptions about sparsity and feature directions without introducing new free parameters beyond the usual penalty coefficient or new invented entities.

free parameters (1)
  • decomposability penalty coefficient
    The weight balancing the meta SAE reconstruction loss against the primary SAE loss is a hyperparameter that must be chosen or tuned.
axioms (2)
  • domain assumption SAE decoder columns represent directions in activation space that can be linearly combined
    Invoked when the meta SAE is defined to reconstruct those columns sparsely.
  • domain assumption Sparsity in the meta SAE encourages discovery of independent subspaces
    Core assumption underlying the decomposability penalty.

pith-pipeline@v0.9.0 · 5606 in / 1380 out tokens · 48200 ms · 2026-05-13T19:39:25.006807+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 4 internal anchors

  2. [2]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaireview.io/forum?id=58bzP6tO8T

  3. [3]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/monosemantic-features/index.html

  4. [4]

    Showing SAE latents are not atomic using meta-SAEs

    Bart Bussmann, Michael Pearce, Patrick Leask, Joseph Bloom, Lee Sharkey, and Neel Nanda. Showing SAE latents are not atomic using meta-SAEs. LessWrong, 2024. URL https://www.lesswrong.com/posts/TMAmHh4DdMr4nCSr5

  5. [5]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  6. [6]

    Toy models of superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html

  7. [7]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Tow, Alec Bas, Hoagy Cunningham, Tom Conerly, Tom Henighan, et al. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

  8. [8]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  9. [9]

    OrtSAE: Orthogonal sparse autoencoders

    Kirill Korznikov et al. OrtSAE: Orthogonal sparse autoencoders. arXiv preprint arXiv:2509.22033, 2025

  10. [10]

    Neural network ensembles, cross validation, and active learning

    Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems, volume 7, 1995

  11. [11]

    Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, János Kramár, Rohin Shah, Neel Nanda, et al. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147, 2024

  12. [12]

    Automatically interpreting millions of features in large language models

    Gonçalo Paulo, Alex Mallen, Caden Juang, and Nora Belrose. Automatically interpreting millions of features in large language models. arXiv preprint arXiv:2410.13928, 2024

  13. [13]

    FineWeb: Decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíček, Javier de la Rosa, Anton Lozhkov, Margaret Mitchell, Thomas Wolf, Leandro Von Werra, and Matteo Cappelli. FineWeb: Decanting the web for the finest text data at scale, 2024

  14. [14]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. In OpenAI Blog, 2019

  15. [15]

    Improving dictionary learning with gated SAEs

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated SAEs. arXiv preprint arXiv:2404.16014, 2024

  16. [16]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html