pith. sign in

arxiv: 2607.01799 · v1 · pith:VDPRZQVBnew · submitted 2026-07-02 · 💻 cs.LG · cs.AI· cs.IT· math.IT

Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability

Pith reviewed 2026-07-03 17:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.ITmath.IT
keywords expander sparse autoencoderssparse autoencodersmechanistic interpretabilityparameter-efficient dictionariestopk saeexpander graphsdictionary learningcompressed sensing
0
0 comments X

The pith

Expander SAEs reduce learned decoder values by 293 times while recovering 84 percent of dense cross-entropy loss on Qwen2.5-3B.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Expander SAEs as TopK sparse autoencoders whose decoder and tied encoder are supported only on a left-d-regular expander mask with d much smaller than the feature dimension m. This mask reduces the number of learned decoder values from mn to dn while leaving the underlying sparse coding problem unchanged. Experiments on residual stream activations from Pythia, Qwen2.5-3B, and Llama models show that varying d produces a consistent storage-fidelity frontier. At the most compressed modern-LM point, d equals 7 yields 293 times fewer decoder values than a dense decoder yet retains 84 percent of the recovered cross-entropy loss.

Core claim

Expander SAEs supported on left-d-regular expander masks learn only dn decoder values instead of mn, and at d=7 on Qwen2.5-3B achieve 293 times fewer values while retaining 84 percent of dense CE-loss recovered. Control experiments attribute the storage-fidelity gain to the sparse diverse support structure induced by the expander mask rather than the sheer reduction in parameters. Expansion and column flatness are shown to be sufficient for identifiability of noiseless k-sparse codes, with complementary conditions guaranteeing exact support recovery by OMP.

What carries the argument

The left-d-regular expander mask that restricts decoder and encoder support, reducing storage to dn values and converting the OMP correlation step into an O(dn) gather-and-reduce operation.

If this is right

  • Storage cost for overcomplete dictionaries scales linearly with d instead of m.
  • The matching-pursuit correlation step in OMP becomes an O(dn) operation.
  • Varying d traces a tunable storage-fidelity curve across model scales.
  • Theoretical identifiability holds when the mask satisfies expansion and column flatness.
  • Encoder amortisation accounts for part of any remaining gap versus dense decoders at matched parameter count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same expander construction could be applied to other dictionary-learning objectives that rely on overcomplete bases.
  • If expander structure is the key driver, deterministic constructions such as Ramanujan graphs might further improve the frontier.
  • The reduction in decoder storage could make feature dictionaries with millions of entries practical on current hardware.
  • Connections to compressed-sensing recovery guarantees suggest that expander SAEs may inherit robustness properties from that literature.

Load-bearing premise

The storage-fidelity improvement is produced by the expander mask's sparse and diverse support structure rather than by the reduction in learned decoder values alone.

What would settle it

A control run in which a random d-regular mask with the same number of nonzeros per column matches the expander mask's fidelity at identical parameter count would show that expansion is not required.

Figures

Figures reproduced from arXiv: 2607.01799 by Rodrigo Mendoza-Smith.

Figure 1
Figure 1. Figure 1: Storage–fidelity frontier on Pythia-70M layer 3 (m=512, n=4096, trained encoder). The left panel shows corrected CE-loss recovered versus TopK sparsity k, and the right panel shows relative reconstruction error versus k. Larger d approaches the dense fidelity floor, while smaller d retains high CE recovery with substantially fewer learned decoder values. which measures ℓ2 reconstruction quality on a held-o… view at source ↗
Figure 2
Figure 2. Figure 2: Three controls at k=64 on Pythia-70M layer 3, three seeds. The top row shows reconstruction error, CE-loss recovered, and dead-feature fraction (log scale) for Expander (solid) and Clustered-sparse (dashed). The bottom-left panel compares Expander (solid) against Pruned-retuned dense (dotted) at d ∈ {7, 50, 200}. The bottom-right panel reports the reduced-n dense baseline at matched parameter budget. packs… view at source ↗
Figure 3
Figure 3. Figure 3: Relative reconstruction error versus k for each decoding method on frozen Expander SAE decoders, shared y-range across d. Error bars show the seed min/max range. B.2. Structured OMP for expander dictionaries The d-regular column structure admits three optimisations that turn iterative OMP from a CPU-bound diagnostic into a GPU-batched inference procedure, and a block-size sweep then exposes a continuum of … view at source ↗
Figure 4
Figure 4. Figure 4: Empirical certificate ratios on the trained Expander-SAE grid, one line per d. The dashed horizontal line marks the certificate threshold, and values below the line satisfy the worst-case sufficient condition. Both ratios sit well above the certificate threshold across the entire trained grid, while OMP and the trained encoder reconstruct successfully on the same models. The current sufficient certificates… view at source ↗
Figure 5
Figure 5. Figure 5: Active-support collision diagnostic at k=64 on held-out activations, three seeds. The left panel shows per-sample empirical expansion deficit on the active TopK support, with bars at the median and error bars spanning min to p95. The right panel shows per-sample duplicate-edge count on a log scale. On the trained Expander-SAE’s active supports, the collision deficit and duplicate-edge count fall below the … view at source ↗
Figure 6
Figure 6. Figure 6: Feature novelty diagnostics for Expander-SAE against the Dense-SAE reference (trained encoder). The top-left panel shows activation-Jaccard novelty, the fraction of Expander features whose best-match Jaccard against the dense reference falls below 0.1. The top-right panel shows median per-feature target-token entropy as a monosemanticity proxy. The bottom panels show the firing-rate￾matched novelty null at… view at source ↗
Figure 7
Figure 7. Figure 7: Per-architecture distribution of mean coherence ratings under blinded LLM evaluation, averaged across two judges and three calls each then rounded to integers 1 to 5. n=25 features per architecture, stratified by firing-rate quartile so that quartile distributions match across architectures. Architecture is invisible to the judges, with features identified only by anonymised ID. Mean coherence is 3.59 ± 0.… view at source ↗
Figure 8
Figure 8. Figure 8: Expander-SAE against Clustered-sparse on three metrics at matched (d, k) with m=512, n=4096, trained encoder. Solid lines are Expander-SAE, dashed lines are Clustered-sparse at the matching d, and black is the Dense-SAE reference. The left panel shows reconstruction error, the middle panel shows corrected CE-loss recovered, and the right panel shows dead-feature fraction on a log scale. Expander-SAE acts a… view at source ↗
Figure 9
Figure 9. Figure 9: Expander-SAE against Pruned dense at matched (d, k), three seeds (mean ± SEM). Solid lines are Expander-SAE and dotted lines are Pruned dense. Panels show trained-encoder reconstruction error and corrected CE-loss recovered. excluded. At every shared budget, Expander-SAE recovers more CE than the reduced-n dense baselines, with gaps of 0.10 to 0.21. The parameter saving is therefore not equivalent to runni… view at source ↗
Figure 10
Figure 10. Figure 10: CE-loss recovered at k=64 versus learned-parameter budget (trained encoder). Each group of bars sits at the budget shared by an Expander-SAE configuration (d ∈ {7, 30, 50, 100, 200} at n=4096) and three dense baselines reduced to the same budget by lowering n, annotated above each bar. Bars show min/max range across seeds. where 1 matches the clean model’s CE and 0 matches zero-ablation. The CE evaluation… view at source ↗
Figure 11
Figure 11. Figure 11: Fraction of Expander decoder columns whose maximum absolute cosine against any Dense-SAE decoder column is below 0.1. 10 1 10 2 ksynth 0.0 0.2 0.4 0.6 0.8 1.0 exact support recovery rate = 0.0 10 1 10 2 ksynth = 0.01 10 1 10 2 ksynth = 0.05 d=7 d=30 d=50 d=100 d=200 Synthetic OMP recovery on d-regular Expander decoders (m=512, n=4096) [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Exact support-recovery rate as a function of ksynth, with one curve per Expander degree d ∈ {7, 30, 50, 100, 200} and one panel per noise level σ. Points are means over three mask seeds and error bars show the standard error. • ...Debbie Gregory D[Deb] • ...Frederick L[Fred] • ...Buy Nureflex [Buy] A “beginning-of-token” detector firing at sentence-initial positions. The high target-token entropy but low … view at source ↗
Figure 13
Figure 13. Figure 13: Relative reconstruction error ∥Wxˆ − Wx∥2/∥Wx∥2 for the same synthetic experiment as [PITH_FULL_IMAGE:figures/full_fig_p025_13.png] view at source ↗
read the original abstract

Sparse autoencoders (SAEs) decompose internal activations of neural networks into sparse linear combinations of learned features by fitting an overcomplete dictionary $\mathbf{W}\in\mathbb{R}^{m\times n}$ with $m<n$, and inferring a sparse code $\mathbf{x}\in\mathbb{R}^n$ from $\mathbf{h}\approx\mathbf{W}\mathbf{x}$. This inference problem closely resembles the canonical setup of compressed sensing, but dense decoders requires $O(mn)$ learned values, which becomes costly at large feature counts. We introduce Expander SAEs: TopK SAEs whose decoder and tied encoder are supported on a left-$d$-regular expander mask with $d\ll m$, learning only $dn$ decoder values while keeping the sparse-coding problem $(m,n,k)$ fixed. The same structure reduces storage and turns the matching-pursuit correlation step $\mathbf{W}^\top \mathbf{r}$ in OMP into an $O(dn)$ gather-and-reduce operation. Our experiments show that across Pythia-70M/160M, Qwen2.5-3B, and Llama-3.2-1B residual-stream activations, varying $d$ traces a consistent storage--fidelity frontier, and that at the most compressed modern-LM setting, Qwen2.5-3B with $d=7$ uses $293\times$ fewer learned decoder values than the full dense decoder while retaining $84$% of dense CE-loss recovered. Control experiments show that the improved storage--fidelity tradeoff is driven by sparse, diverse decoder support structure rather than by fewer learned decoder values, and that when sparse and dense decoders are compared at matched parameter count, part of the remaining gap comes from encoder amortisation. On the theoretical side, we show that expansion and column flatness are sufficient for identifiability of noiseless $k$-sparse codes, and we derive complementary sufficient conditions under which OMP recovers the support exactly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Expander SAEs as TopK SAEs whose decoder (and tied encoder) are supported on a left-d-regular expander mask, reducing learned decoder parameters from O(mn) to dn while fixing the (m,n,k) sparse-coding problem. Experiments across Pythia, Qwen2.5-3B, and Llama models trace a storage-fidelity frontier, with the Qwen2.5-3B d=7 setting claiming 293× fewer decoder values at 84% of dense CE-loss recovery. Control experiments are presented to attribute gains to the expander-induced sparse diverse support rather than parameter count alone. Theory supplies sufficient conditions (expansion + column flatness) for noiseless k-sparse identifiability and exact OMP support recovery.

Significance. If the control experiments convincingly isolate the expander structure, the work offers a practical route to scale SAEs for mechanistic interpretability with substantially lower storage and OMP inference cost. The consistent frontier across four model families and the derivation of identifiability/OMP conditions from standard expander-graph properties are strengths. The explicit parameter reduction and the attempt to separate mask structure from mere sparsity count add value for large-scale dictionary learning.

major comments (2)
  1. [Control experiments (abstract and Experiments section)] Control experiments (abstract and Experiments section): The central empirical claim that the storage-fidelity improvement is driven by the expander mask's sparse, diverse decoder support (rather than reduction to dn learned values) rests on these controls. The provided description does not specify whether the baselines include matched-optimization random d-regular masks, identical training schedules, or other ablations that would rule out differences in effective capacity or optimization trajectory; without such detail the attribution remains vulnerable to the alternative explanations noted in the stress-test.
  2. [Theory section] Theoretical results (Theory section): The identifiability and exact OMP recovery theorems assume noiseless k-sparse codes. The empirical claims concern learned representations from noisy neural activations; the manuscript does not provide a quantitative bridge (e.g., robustness bounds or noise-model analysis) showing that the expansion/flatness conditions remain predictive in the learned noisy regime that the storage-fidelity results actually use.
minor comments (2)
  1. [Abstract] The abstract states that 'part of the remaining gap comes from encoder amortisation' when sparse and dense decoders are compared at matched parameter count; a brief parenthetical or footnote clarifying the amortisation mechanism would improve readability.
  2. [Methods] Notation for the left-d-regular expander mask construction and the column-flatness condition could be accompanied by a small illustrative diagram or pseudocode in the methods to aid readers outside graph theory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the paper's significance. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Control experiments (abstract and Experiments section): The central empirical claim that the storage-fidelity improvement is driven by the expander mask's sparse, diverse decoder support (rather than reduction to dn learned values) rests on these controls. The provided description does not specify whether the baselines include matched-optimization random d-regular masks, identical training schedules, or other ablations that would rule out differences in effective capacity or optimization trajectory; without such detail the attribution remains vulnerable to the alternative explanations noted in the stress-test.

    Authors: We agree that the control experiments section requires additional detail to fully substantiate the attribution to the expander structure. In the revised manuscript we will expand the description to state that random d-regular mask baselines were trained with identical optimization schedules, learning rates, batch sizes, and initialization procedures as the expander masks. We will also report additional ablations that enforce matched parameter counts using random sparse supports of comparable density but without the expansion property. These clarifications should address concerns about effective capacity and optimization trajectory differences. revision: yes

  2. Referee: Theoretical results (Theory section): The identifiability and exact OMP recovery theorems assume noiseless k-sparse codes. The empirical claims concern learned representations from noisy neural activations; the manuscript does not provide a quantitative bridge (e.g., robustness bounds or noise-model analysis) showing that the expansion/flatness conditions remain predictive in the learned noisy regime that the storage-fidelity results actually use.

    Authors: The theorems are deliberately stated for the noiseless case to obtain clean sufficient conditions derived from standard expander-graph properties. We acknowledge that the manuscript provides no quantitative robustness bounds or explicit noise-model analysis connecting these conditions to the noisy regime of learned neural activations. In revision we will add a short discussion paragraph in the Theory section noting this gap, explaining that the noiseless results supply intuition for why expander structure aids recovery, and identifying extension to noisy settings as future work. No new quantitative bridge will be derived. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper defines the expander mask via standard left-d-regular expander graph properties drawn from external graph theory, then reports empirical storage-fidelity results on real LM activations together with control experiments that compare against reduced-parameter baselines. Neither the identifiability theorem (expansion + column flatness sufficient for noiseless k-sparse recovery) nor the reported performance numbers (e.g., 293× parameter reduction at 84 % CE recovery) are obtained by fitting a parameter to a subset of the target data and then relabeling that fit as a prediction. No self-citation chain is invoked to justify the core construction or to forbid alternatives; the theoretical conditions are stated as sufficient rather than as a uniqueness result imported from the authors' prior work. The derivation chain therefore remains self-contained against external benchmarks and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the existence of left-d-regular expander graphs with suitable expansion and flatness properties and on the empirical claim that the mask structure itself (rather than parameter count alone) drives the observed tradeoff.

free parameters (1)
  • d
    The regularity degree of the expander mask; varied across experiments to trace the storage-fidelity frontier.
axioms (1)
  • domain assumption Expansion and column flatness of the left-d-regular expander mask are sufficient for identifiability of noiseless k-sparse codes.
    Invoked in the theoretical section to guarantee unique recovery.

pith-pipeline@v0.9.1-grok · 5900 in / 1407 out tokens · 33099 ms · 2026-07-03T17:17:31.461554+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 15 canonical work pages · 8 internal anchors

  1. [1]

    Deeply-sparse signal representations (DS 2P)

    Demba Ba. Deeply-sparse signal representations (DS 2P). arXiv:1807.01958,

  2. [2]

    Automatica49(11), 3222–3233 (2013)

    doi: 10.1016/j. acha.2010.07.001. Thomas Blumensath and Mike E. Davies. Gradient pursuits. IEEE Transactions on Signal Processing, 56(6):2370– 2382,

  3. [3]

    BatchTopK sparse autoencoders.arXiv:2412.06410,

    Bart Bussmann, Patrick Leask, and Neel Nanda. BatchTopK sparse autoencoders.arXiv:2412.06410,

  4. [4]

    Matryoshka sparse autoencoders

    Bart Bussmann et al. Matryoshka sparse autoencoders. arXiv:2503.17547,

  5. [5]

    arXiv preprint arXiv:2506.03093 , year=

    Victor Costa, Thomas Fel, Ekdeep Singh Lubana, Ba- hareh Tolooshams, and Demba Ba. From flat to hier- archical: Matching pursuit for sparse representations. arXiv:2506.03093,

  6. [6]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    8 Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability Hoagy Cunningham et al. Sparse autoencoders find highly interpretable features in language models. arXiv:2309.08600,

  7. [7]

    Donoho and Jared Tanner

    David L. Donoho and Jared Tanner. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal process- ing.Philosophical Transactions of the Royal Society A, 367(1906):4273–4293,

  8. [8]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao et al. The pile: An 800GB dataset of diverse text for language modeling.arXiv:2101.00027,

  9. [9]

    Scaling and evaluating sparse autoencoders

    Leo Gao et al. Scaling and evaluating sparse autoencoders. arXiv:2406.04093,

  10. [10]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  11. [11]

    Superposition is compressed sensing

    David Klindt et al. Superposition is compressed sensing. arXiv:2503.01824,

  12. [12]

    Expander $\ell_0$-Decoding

    Rodrigo Mendoza-Smith and Jared Tanner. Expander ℓ0- decoding.arXiv:1508.01256,

  13. [13]

    A robust parallel algorithm for combinatorial compressed sensing

    Rodrigo Mendoza-Smith and Jared Tanner. Robust expander ℓ0-decoding.arXiv:1704.09012,

  14. [14]

    Switch sparse autoencoders

    Anish Mudide et al. Switch sparse autoencoders. arXiv:2410.08201,

  15. [15]

    Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

    Senthooran Rajamanoharan et al. Gated sparse autoencoders. InNeural Information Processing Systems (NeurIPS), 2024a. Senthooran Rajamanoharan et al. JumpReLU sparse autoen- coders.arXiv:2407.14435, 2024b. Adly Templeton et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet.Transformer Circuits Thread,

  16. [16]

    Qwen2.5 Technical Report

    An Yang et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115,

  17. [17]

    Mini-batches are drawn with shuffle=True, drop_last=True , and the loader cycles throughD train untilTsteps are completed

    with β1 = 0.9, β2 = 0.999, ϵ= 10 −8, and a single-period cosine learning-rate schedule overT= 5000steps, ηt = 10−5 + 1 2 3·10 −4 −10 −5 1 + cos(πt/T) .(14) Gradients are clipped to global ℓ2 norm 1.0. Mini-batches are drawn with shuffle=True, drop_last=True , and the loader cycles throughD train untilTsteps are completed. Dead-feature resampling.Following...

  18. [18]

    (4) on the same held-out activations, and we report per-sample relative reconstruction error averaged over activations and seeds

    all solve Eq. (4) on the same held-out activations, and we report per-sample relative reconstruction error averaged over activations and seeds. OMP reduces reconstruction error by 15 to 25% relative to the trained encoder across every (d, k) tested, consistent with the amortisation-gap findings of O’Neill et al. (2025) and Google DeepMind MI Team (2024). ...

  19. [19]

    L controls the number of columns added to the active set per outer iteration, withL=1 recovering iterative OMP and L=64 collapsing to single-shot decoding

    Table 5.Block-size L sweep for the Cholesky-refit + Triton-correlation + structured-Gram OMP variant atd=7 (m=512, n=4096, k=64, A10G, bf16, B=1024, seed 0). L controls the number of columns added to the active set per outer iteration, withL=1 recovering iterative OMP and L=64 collapsing to single-shot decoding. The final column compares each row to the s...

  20. [20]

    A value of 1 means the reconstruction matches the clean model’s CE,0 means it is no better than zero-ablation, and negative values mean it is worse than zero-ablation

    CE-loss recovered is computed by replacing the layer-ℓ residual stream of the language model with the SAE reconstruction and measuring next-token cross-entropy on held-out sequences, CE recovered = CEzero −CE recon CEzero −CE clean ,(17) where CEclean is the cross-entropy of the unmodified language model, CEzero replaces the activation by the all-zero vec...

  21. [21]

    has one of three seeds diverge during training (rel err 1.13, included in the reported mean); without that seed the rel err mean is0.704and CE rec is0.854. Expander-SAE Dense-SAE (fulln) Matched-n ′ Dense-SAE Layer metricd=7d=30d=102n=16,384n ′=240n ′=816 12 rel err0.764 0.703 0.681 0.489 0.608 0.556 CE rec0.828 0.894 0.919 0.983 0.906 0.945 24 rel err0.6...

  22. [22]

    Therefore sX r=1 brar ≤εd sX r=1 r(ar −a r+1) =εd sX r=1 ar

    By summation by parts, sX r=1 brar = sX r=1 Cr(ar −a r+1). Therefore sX r=1 brar ≤εd sX r=1 r(ar −a r+1) =εd sX r=1 ar. M.2. Flatness bounds learned weighted cancellations The previous subsection treats M as a combinatorial object and ignores the values of W. This subsection extracts the two consequences of the flatness factor that the main proof needs, b...

  23. [23]

    The second corollary moves from existential identifiability to a constructive recovery guarantee. The classical analysis of Tropp (2004) certifies OMP exact recovery whenever the cumulative coherence satisfiesµ1(k;W) +µ 1(k−1;W)<1 , and we show that the prefix collision bound of Lemma M.1 controlsµ 1(s;W)in our setting. Corollary M.4(A sufficient cumulati...

  24. [24]

    Taking the maximum over j and T gives the cumulative-coherence bound, and the final statement follows from the standard OMP sufficient conditionµ 1(k;W) +µ 1(k−1;W)<1(Tropp, 2004)

    SinceP ℓ∈T cjℓ is bounded by this collision count, X ℓ∈T |⟨wj,w ℓ⟩| ≤β 2ε(s+ 1). Taking the maximum over j and T gives the cumulative-coherence bound, and the final statement follows from the standard OMP sufficient conditionµ 1(k;W) +µ 1(k−1;W)<1(Tropp, 2004). The third corollary upgrades identifiability to a quantitative stability statement. Under the a...