pith. sign in

arxiv: 2605.31518 · v1 · pith:VMJ5SKF7new · submitted 2026-05-29 · 💻 cs.LG

On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders

Pith reviewed 2026-06-28 23:32 UTC · model grok-4.3

classification 💻 cs.LG
keywords sparse autoencodersfeature deathactivation outliersmean-centeringdictionary learningneural interpretabilitypre-activation shift
0
0 comments X

The pith

Activation outliers in neural network layers cause many sparse autoencoder features to die at initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that certain dimensions in model activations have large average magnitudes compared to their variation across tokens. These outliers shift the initial pre-activations of SAE features depending on how aligned each feature is with the mean activation. Features pointing opposite to the mean start with negative values and stay dead. A simple ratio gamma measuring outlier strength strongly predicts how many features die across many models. Mean-centering the activations removes this problem entirely.

Core claim

Dimension-level activation outliers cause feature death in SAEs because they shift pre-activations at initialization: features anti-aligned with the activation mean receive permanently negative pre-activations and never fire. Outlier severity gamma equals the norm of the mean divided by the norm of the standard deviation, and this value predicts initial death rates with high Spearman correlation across 454 model-layer combinations. Mean-centering eliminates the death.

What carries the argument

The outlier severity measure gamma = ||mu|| / ||sigma||, which quantifies how much the activation mean dominates per-token variation and drives the pre-activation shift.

Load-bearing premise

The assumption that the initialization-time pre-activation shift from outliers is the primary cause of the observed feature death rates rather than other training dynamics.

What would settle it

Finding a model-layer combination where gamma does not correlate with death rates, or where mean-centering fails to reduce death despite high gamma.

Figures

Figures reproduced from arXiv: 2605.31518 by Elana Simon, Etowah Adams, James Zou.

Figure 1
Figure 1. Figure 1: Activation outliers predetermine feature fate at initialization. (Left) Centered activations (e.g. GPT-2, γ < 1) versus activations with dimension-level outliers (e.g. AF3, γ ≈ 15), where γ = ∥µ∥/∥σ∥ measures how much the activation mean dominates per-token variation. (Middle) In the offset case, only features positively aligned with the activation mean fire and receive gradient; anti-aligned features are … view at source ↗
Figure 2
Figure 2. Figure 2: Dead feature rates vary dramatically across models and coincide with dimension-level activation outliers. (a) Dead feature rates for TopK SAEs across a subset of language, vision, protein, and genomic models (additional models in Figure S14). AuxK helps but does not resolve death where it is most severe. (b) Activation magnitude landscapes (LayerNorm-normalized per token) for three models. Green highlights… view at source ↗
Figure 4
Figure 4. Figure 4: γ analytically predicts dead feature rates for both death pathways. (a) Synthetic activations with controlled γ: dead features increase monotonically (Spearman ρ = 1.0). The ReLU theory curve Φ(−C/γ) matches dead-by-ReLU rates (circles); the TopK theory curve Φ(tk − C/γ) tracks dead-by-TopK (triangles), becoming tight for γ > 3. The dotted line at 99.2% marks the 1 − k/n asymptote. (b) 454 real model-layer… view at source ↗
Figure 5
Figure 5. Figure 5: Recovery from outlier-induced death is bottlenecked by slow bias learning. All panels use synthetic activations with γ set directly; the same two-phase pattern holds on real models (Figure S7). (a) Dead feature recovery slows as γ increases. (b) Bias convergence to µ slows as γ increases. (c) Dead-by-TopK at γ = 20 drops steeply around 100K steps (inset) regardless of condition. (d) Dead-by-ReLU at γ = 20 … view at source ↗
Figure 6
Figure 6. Figure 6: Mean-centering sidesteps the recovery bottleneck: features are never dead to begin with. Top: synthetic data (γ = 40); mean-centering achieves near-zero death, near-perfect recovery of ground-truth features, and lower reconstruction error. Bottom: ESM3 layer 24 (γ ≈ 8); baseline reaches around 75% dead by end of training, AuxK plateaus ∼25%, mean-centering stays near zero and recovers more biological conce… view at source ↗
read the original abstract

Sparse autoencoders (SAEs) decompose neural network activations into interpretable features, but many learned features never activate, a problem called feature death that wastes dictionary capacity and can reintroduce superposition. Death rates vary dramatically between models: near-zero on GPT-2, over 70% on AlphaFold3 with identical configurations. We find that dimension-level activation outliers (dimensions whose mean magnitude is large relative to per-token variation) cause this by shifting pre-activations at initialization based on each feature's alignment with the activation mean. Features anti-aligned with the mean receive permanently negative pre-activations and never fire. We formalize outlier severity as $\gamma = \|\mu\|/\|\sigma\|$; it predicts initial death rates (Spearman $\rho = 0.89$ for dead-by-TopK, $0.82$ for dead-by-ReLU) across 454 model-layer combinations spanning language, vision, protein, and genomic models. Dead features can revive during training, but recovery requires the SAE bias to learn the activation mean, a process that is prohibitively slow at high $\gamma$. Mean-centering (subtracting the activation mean) sidesteps this and eliminates outlier-induced death across all tested models, confirming the mechanism and providing a principled basis for when and why this preprocessing step is necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that feature death in sparse autoencoders arises from dimension-level activation outliers, which induce a pre-activation shift at initialization proportional to each feature's alignment with the activation mean μ. Outlier severity is formalized as γ = ‖μ‖/‖σ‖, which predicts initial death rates via high Spearman correlations (ρ = 0.89 for dead-by-TopK, ρ = 0.82 for dead-by-ReLU) across 454 model-layer combinations spanning language, vision, protein, and genomic models. Mean-centering is shown to eliminate outlier-induced death, while recovery during training is described as slow when γ is large because the SAE bias must learn μ.

Significance. If the central empirical relationships and intervention hold, the result is significant: it supplies a mechanistic account for the large observed variation in SAE death rates across models and a simple, targeted preprocessing remedy. Credit is due for the scale of the study (454 combinations) and the direct causal test via mean-centering, which isolates the proposed initialization-time mechanism from generic training dynamics.

minor comments (3)
  1. [Methods] The description of the 454 model-layer combinations and any data-exclusion rules should be expanded in the methods section to support full reproducibility of the reported correlations.
  2. [Results] Error bars or bootstrap confidence intervals on the Spearman ρ values would strengthen the claim that γ is a reliable predictor.
  3. [Discussion] The statement that recovery is 'prohibitively slow at high γ' would benefit from a quantitative plot or table showing bias-learning time scales versus γ.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its significance, and the recommendation for minor revision. The scale of the study (454 model-layer combinations) and the direct causal test via mean-centering are indeed central to the contribution, and we are glad these aspects were highlighted.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims rest on direct empirical measurements (Spearman correlations of γ with death rates across 454 model-layer combinations) and a controlled intervention (mean-centering eliminates death). No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or definitional equivalence; γ is introduced as an explicit ratio of observed statistics, and the reported predictive power is an external correlation rather than a tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper is empirical and introduces no new free parameters or invented entities. It relies on standard definitions of vector norms and means plus domain assumptions about SAE initialization and activation statistics.

axioms (2)
  • domain assumption Standard random initialization of SAE encoder weights and biases without prior mean adjustment.
    The mechanism description depends on how pre-activations are computed from random weights at the start of training.
  • domain assumption Activation statistics (mean μ and per-dimension variation σ) are stable properties of the source model layers.
    γ is defined directly from these statistics and used to predict death rates.

pith-pipeline@v0.9.1-grok · 5771 in / 1377 out tokens · 32799 ms · 2026-06-28T23:32:20.295128+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

9 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    URL https://transformer-circuits. pub/2025/january-update. Cornman, A., West-Roberts, J., Camargo, A. P., Roux, S., Beracochea, M., Mirdita, M., Ovchinnikov, S., and Hwang, Y . The omg dataset: An open metagenomic corpus for mixed-modality genomic language modeling. bioRxiv, pp. 2024–08, 2024. Dettmers, T., Lewis, M., Belkada, Y ., and Zettlemoyer, L. LLM...

  2. [2]

    Jermyn, A

    URL https://openreview.net/forum? id=F76bwRSLeK. Jermyn, A. and Templeton, A. Ghost grads: An improve- ment on resampling.Transformer Circuits Thread,

  3. [3]

    1126/science.ade2574

    URL https://transformer-circuits. pub/2024/jan-update/index.html# dict-learning-resampling. Joseph, S., Suresh, P., Goldfarb, E., Hufe, L., Gandelsman, Y ., Graham, R., Bzdok, D., Samek, W., and Richards, B. A. Steering CLIP’s vision transformer with sparse au- toencoders, 2025. URL https://arxiv.org/abs/ 2504.08729. Karvonen, A., Rager, C., Lin, J., Tigg...

  4. [4]

    Qiu, Z., Huang, Z., Wen, K., Jin, P., Zheng, B., Zhou, Y ., Huang, H., Wang, Z., Li, X., Zhang, H., et al

    URL https://openreview.net/forum? id=DaNnkQJSQf. Qiu, Z., Huang, Z., Wen, K., Jin, P., Zheng, B., Zhou, Y ., Huang, H., Wang, Z., Li, X., Zhang, H., et al. A unified view of attention and residual sinks: Outlier-driven rescal- ing is essential for transformer training.arXiv preprint arXiv:2601.22966, 2026. Radford, A., Wu, J., Child, R., Luan, D., Amodei,...

  5. [5]

    URL http://arxiv.org/abs/2511. 13981. arXiv:2511.13981 [cs]. Sim´eoni, O., V o, H. V ., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V ., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al. DINOv3.arXiv preprint arXiv:2508.10104, 2025. Simon, E. and Zou, J. Interplm: discovering interpretable features in protein language models via sparse aut...

  6. [6]

    Its projection onto any fixed vector is approximately Gaussian (Appendix B.1)

    Each feature’s encoder weight wi is a random unit vector. Its projection onto any fixed vector is approximately Gaussian (Appendix B.1)

  7. [7]

    Use this to decompose the pre-activation zi(x) =s i +r i(x) into a data-independentshift si =w i ·µ and a data-dependentsignalr i(x) =w i ·(x−µ), both with Gaussian distributions parameterized byµandσ

  8. [8]

    For each death pathway (dead-by-ReLU and dead-by-TopK), find the shift threshold below which no sample can rescue the feature

  9. [9]

    Assumptions.The key assumption: for a fixed encoder weight wi, the signal wi ·(x−µ) is approximately Gaussian over the data distribution

    Compute the fraction of features whose shift falls below this threshold; the dead-by-TopK extension follows the same approach with a different threshold. Assumptions.The key assumption: for a fixed encoder weight wi, the signal wi ·(x−µ) is approximately Gaussian over the data distribution. This holds when activations are not too heavy-tailed; heavier-tai...