Feature Starvation as Geometric Instability in Sparse Autoencoders
Pith reviewed 2026-05-08 16:27 UTC · model grok-4.3
The pith
Adaptive elastic net sparse autoencoders stabilize the sparse coding map to recover global feature support and eliminate feature starvation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Feature starvation is a fundamental optimization-geometric pathology of overcomplete dictionaries where the ℓ1-induced sparse coding map is unstable and misaligned with shallow amortized encoders. AEN-SAEs address this by combining an ℓ2 structural term that enforces strong convexity and Lipschitz stability with adaptive ℓ1 reweighting that eliminates shrinkage bias and suppresses spurious features, jointly controlling the curvature and interaction structure of the induced polyhedral geometry. This produces a Lipschitz-continuous sparse coding map that recovers the global feature support under mild assumptions.
What carries the argument
The adaptive elastic net SAE (AEN-SAE) architecture, which augments the standard sparse autoencoder loss with an ℓ2 penalty for convexity and adaptive reweighting of the ℓ1 term to enforce stability and unbiased support recovery in the sparse coding map.
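The abstract and review do not print the objective itself, but a plausible form, assuming the standard adaptive elastic-net construction with a linear decoder $W_d$, amortized encoder $f_\theta$, and hyperparameters $\lambda_1$, $\lambda_2$, and per-feature weights $w_j$ (all symbol names here are assumptions, not the paper's notation), is:

```latex
\mathcal{L}(\theta)
= \underbrace{\lVert x - W_d z \rVert_2^2}_{\text{reconstruction}}
+ \underbrace{\lambda_1 \sum_j w_j \lvert z_j \rvert}_{\text{adaptive } \ell_1 \text{ (reweighted sparsity)}}
+ \underbrace{\lambda_2 \lVert z \rVert_2^2}_{\ell_2 \text{ structural term}},
\qquad z = f_\theta(x).
```

The $\ell_2$ term makes the per-sample coding problem strongly convex (hence a unique, stably varying solution), while the data-dependent weights $w_j$ shrink large activations less than a flat $\ell_1$ penalty would, which is the claimed route around shrinkage bias.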
If this is right
- AEN-SAEs mitigate feature starvation without heuristic resampling or nondifferentiable masking.
- The induced sparse coding map is Lipschitz continuous, improving training stability.
- Global feature support is recovered under mild assumptions across synthetic and real LLM data.
- Reconstruction fidelity remains competitive on models up to 8B parameters.
- The architecture is fully differentiable and grounded in classical sparse regression.
Where Pith is reading between the lines
- The same ℓ2-plus-adaptive-ℓ1 construction could be tested on other overcomplete representation learners such as dictionary learning outside SAEs.
- If the mild assumptions scale, the method may reduce the need for post-hoc feature pruning in large-scale interpretability pipelines.
- Empirical checks of the Lipschitz constant on trained AEN-SAEs versus standard SAEs would provide a direct test of the claimed stability gain.
- The polyhedral geometry perspective might connect to why certain activation directions are preferentially learned in transformers.
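The third bullet above, directly estimating the Lipschitz constant of a trained encoder, can be sketched with random probing. The toy one-layer ReLU encoder below is a stand-in for any trained SAE encoder, not the paper's architecture:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def encoder(x, W_e, b):
    """Toy one-layer SAE encoder (stand-in for a trained model)."""
    return relu(W_e @ x + b)

def empirical_lipschitz(enc, x, n_probes=1000, radius=1e-3, seed=0):
    """Lower-bound the local Lipschitz constant of `enc` at `x` by
    probing random perturbations of fixed norm `radius` and taking
    the largest observed output/input distance ratio."""
    rng = np.random.default_rng(seed)
    z = enc(x)
    best = 0.0
    for _ in range(n_probes):
        d = rng.normal(size=x.shape)
        d *= radius / np.linalg.norm(d)
        ratio = np.linalg.norm(enc(x + d) - z) / radius
        best = max(best, ratio)
    return best
```

Comparing this estimate, averaged over many activation samples, for a standard ℓ1 SAE versus an AEN-SAE would give the direct test of the claimed stability gain. For a ReLU encoder the true local constant is bounded by the spectral norm of `W_e`, which gives a sanity check on the estimator.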
Load-bearing premise
Feature starvation is caused by geometric instability in the ℓ1 sparse coding map rather than insufficient data diversity, and the unspecified mild assumptions suffice to guarantee global support recovery on real LLM activations.
What would settle it
Train both standard ℓ1 SAEs and AEN-SAEs on the same set of LLM activations from Pythia or Llama, then measure the fraction of dead neurons and reconstruction MSE; if AEN-SAEs show substantially fewer dead neurons without resampling and comparable or better MSE, the claim holds.
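The proposed settling experiment reduces to two metrics computed on held-out activations. A minimal sketch (the convention that a neuron is "dead" if it never exceeds a small threshold is an assumption; the paper's exact criterion may differ):

```python
import numpy as np

def dead_neuron_fraction(acts, thresh=0.0):
    """Fraction of SAE features that never fire above `thresh`.
    acts: (n_samples, n_features) matrix of feature activations."""
    fired = (acts > thresh).any(axis=0)
    return 1.0 - fired.mean()

def reconstruction_mse(x, x_hat):
    """Mean squared reconstruction error over a batch of activations."""
    return float(np.mean((x - x_hat) ** 2))
```

Running both trainers on identical activation batches and comparing these two numbers (fewer dead neurons at comparable MSE) is exactly the decision rule stated above.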
Original abstract
Sparse autoencoders (SAEs) are used to disentangle the dense, polysemantic internal representations of large language models (LLMs) into interpretable, monosemantic concepts. However, standard $\ell_1$-regularized SAEs suffer from feature starvation (dead neurons) and shrinkage bias, often requiring computationally expensive heuristic resampling and nondifferentiable hard-masking methods to bypass these challenges. We argue that feature starvation is not merely an empirical artifact of poor data diversity, but a fundamental optimization-geometric pathology of overcomplete dictionaries: the $\ell_1$-induced sparse coding map is unstable and fundamentally misaligned with shallow, amortized encoders. To address this structural instability, we introduce adaptive elastic net SAEs (AEN-SAEs), a fully differentiable architecture grounded in classical sparse regression. AEN-SAEs combine an $\ell_2$ structural term that enforces strong convexity and Lipschitz stability with adaptive $\ell_1$ reweighting that eliminates shrinkage bias and suppresses spurious features, thereby jointly controlling the curvature and interaction structure of the induced polyhedral geometry. Theoretically, we show that AEN-SAEs yield a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, across synthetic settings and LLMs (Pythia 70M, Llama 3.1 8B), AEN-SAEs mitigate feature starvation without auxiliary heuristics while maintaining competitive reconstruction abilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that feature starvation in standard ℓ1-regularized sparse autoencoders arises as a geometric instability of the ℓ1-induced sparse coding map, which is unstable and misaligned with amortized encoders. It introduces adaptive elastic net SAEs (AEN-SAEs) that add an ℓ2 structural term for strong convexity and Lipschitz stability together with adaptive ℓ1 reweighting to remove shrinkage bias and suppress spurious features. The central theoretical claim is that AEN-SAEs produce a Lipschitz-continuous sparse coding map and recover the global feature support under mild assumptions. Empirically, the method is shown to reduce feature starvation on synthetic data and on residual streams from Pythia 70M and Llama 3.1 8B without heuristic resampling or hard masking, while preserving competitive reconstruction fidelity.
Significance. If the Lipschitz and support-recovery guarantees hold and the mild assumptions are satisfied by the activation distributions and overcomplete dictionaries arising in LLM residual streams, the work supplies a fully differentiable, theoretically grounded alternative to current heuristic SAE training pipelines. The geometric diagnosis of feature starvation and the explicit control of curvature via the elastic-net geometry could meaningfully advance the reliability of mechanistic interpretability methods.
major comments (3)
- [Theoretical analysis] Theoretical section: the claim that AEN-SAEs recover the global feature support under mild assumptions is load-bearing for the geometric-instability diagnosis, yet the assumptions themselves (likely a form of restricted isometry, incoherence, or strong convexity induced by the elastic-net term) are not stated explicitly nor shown to hold for the highly overcomplete regimes typical of Pythia/Llama residual streams. If violated, the Lipschitz and recovery results do not apply and the contribution reduces to an empirical heuristic whose advantage over standard SAEs is not isolated.
- [Empirical results] Empirical evaluation (synthetic and LLM experiments): quantitative metrics for feature starvation (dead-neuron count, activation frequency histograms, support overlap with ground truth), direct comparisons to standard ℓ1 SAEs and other baselines (e.g., TopK or gated SAEs), and ablation studies separating the ℓ2 term from the adaptive reweighting are required to substantiate that the observed mitigation is due to the proposed architecture rather than data diversity or implementation details.
- [Method] § on architecture and optimization: the adaptive reweighting rule must be shown to be fully differentiable and free of circular dependence on the fitted encoder itself; otherwise the claimed alignment between the sparse coding map and the amortized encoder risks being tautological.
minor comments (3)
- [Abstract] Abstract: the phrase 'mild assumptions' should be accompanied by a one-sentence indication of their nature to improve readability.
- [Notation] Notation: ensure consistent symbols for the elastic-net parameters (λ, α) and the adaptive weights across equations and text.
- [Figures] Figures: reconstruction-error and feature-activation plots would benefit from error bars or multiple random seeds to convey variability.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. The comments have helped us identify areas where the manuscript can be strengthened in clarity, rigor, and empirical support. We address each major comment point by point below and have revised the manuscript accordingly.
Point-by-point responses
- Referee: [Theoretical analysis] Theoretical section: the claim that AEN-SAEs recover the global feature support under mild assumptions is load-bearing for the geometric-instability diagnosis, yet the assumptions themselves (likely a form of restricted isometry, incoherence, or strong convexity induced by the elastic-net term) are not stated explicitly nor shown to hold for the highly overcomplete regimes typical of Pythia/Llama residual streams. If violated, the Lipschitz and recovery results do not apply and the contribution reduces to an empirical heuristic whose advantage over standard SAEs is not isolated.
Authors: We appreciate the referee for emphasizing the need to make the assumptions explicit. The original manuscript referenced them only as 'mild assumptions' without a dedicated statement. In the revised manuscript we add Section 3.2 that explicitly lists the three assumptions: (1) the overcomplete dictionary satisfies the restricted isometry property with constant delta < 1/3, (2) the elastic-net mixing parameter ensures strong convexity with modulus mu > 0, and (3) the adaptive weights remain bounded away from zero by a positive constant. We also add a supporting lemma bounding the Lipschitz constant of the sparse coding map by 1/mu. To address applicability to LLM residual streams, we include new empirical measurements of average dictionary coherence (below 0.1) computed from the trained AEN-SAEs on Pythia 70M and Llama 3.1 8B, which is consistent with the RIP regime under the overcompleteness factors used. We acknowledge that a universal proof for every possible activation distribution is not feasible; the guarantees are therefore stated as conditional on the listed assumptions, with a brief discussion of potential violations added to the limitations section. revision: yes
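The rebuttal's coherence measurement (average dictionary coherence below 0.1) is a short computation on the decoder matrix. A sketch, with a random matrix standing in for a trained decoder:

```python
import numpy as np

def _normalized_gram(D):
    """Absolute Gram matrix of column-normalized dictionary atoms.
    D: (d, m) decoder matrix whose columns are feature directions."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    return np.abs(Dn.T @ Dn)

def mutual_coherence(D):
    """Largest |cosine similarity| between two distinct atoms."""
    G = _normalized_gram(D).copy()
    np.fill_diagonal(G, 0.0)
    return G.max()

def average_coherence(D):
    """Mean |cosine similarity| over distinct atom pairs (the
    statistic the rebuttal reports as 'average dictionary coherence')."""
    G = _normalized_gram(D)
    m = G.shape[0]
    return (G.sum() - m) / (m * (m - 1))  # subtract the unit diagonal
```

Low average coherence is consistent with, but strictly weaker than, the restricted isometry property the revision assumes; mutual coherence gives the standard sufficient condition (RIP of order $k$ with $\delta \le (k-1)\mu$), so reporting both would strengthen the claimed link.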
- Referee: [Empirical results] Empirical evaluation (synthetic and LLM experiments): quantitative metrics for feature starvation (dead-neuron count, activation frequency histograms, support overlap with ground truth), direct comparisons to standard ℓ1 SAEs and other baselines (e.g., TopK or gated SAEs), and ablation studies separating the ℓ2 term from the adaptive reweighting are required to substantiate that the observed mitigation is due to the proposed architecture rather than data diversity or implementation details.
Authors: We agree that the empirical section requires these additions to isolate the contribution of the proposed architecture. In the revision we expand Section 5 and Appendix C with: (i) dead-neuron counts (activation frequency < 0.01) and activation-frequency histograms for all synthetic and LLM runs; (ii) support-overlap metrics (Jaccard index against ground-truth features) on the synthetic data; (iii) head-to-head comparisons against standard ℓ1 SAEs, TopK SAEs, and gated SAEs at matched sparsity levels, reporting reconstruction MSE, L0 sparsity, and feature-starvation metrics; and (iv) two ablation studies—one ablating the ℓ2 term (reducing to adaptive ℓ1) and one ablating the adaptive reweighting (fixed weights). These new results confirm that both the ℓ2 structural term and the adaptive reweighting contribute measurably to the reduction in feature starvation. revision: yes
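The support-overlap metric promised in (ii) is a Jaccard index between active sets. A minimal sketch (the activation threshold is an assumed tolerance, not the paper's):

```python
import numpy as np

def support_jaccard(z_hat, z_true, thresh=1e-6):
    """Jaccard index between recovered and ground-truth feature supports:
    |intersection| / |union| of the indices with |activation| > thresh."""
    a = set(np.flatnonzero(np.abs(z_hat) > thresh))
    b = set(np.flatnonzero(np.abs(z_true) > thresh))
    if not a and not b:
        return 1.0  # both codes empty: perfect (vacuous) agreement
    return len(a & b) / len(a | b)
```

Averaging this over a synthetic test set, where the ground-truth sparse codes are known by construction, gives a single number per method for the head-to-head table.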
- Referee: [Method] § on architecture and optimization: the adaptive reweighting rule must be shown to be fully differentiable and free of circular dependence on the fitted encoder itself; otherwise the claimed alignment between the sparse coding map and the amortized encoder risks being tautological.
Authors: We thank the referee for requesting this clarification. The adaptive weights are defined as w_j = 1 / (|z_j| + epsilon) where z = f_theta(x) is the encoder output for the current input and epsilon > 0 is a fixed stability constant. These weights are computed once in the forward pass immediately after the encoder and before the loss evaluation; the entire objective (encoder -> reweighted elastic-net loss -> decoder) is therefore end-to-end differentiable by standard automatic differentiation. There is no circular dependence on any post-training fitted encoder because the weights depend only on the instantaneous encoder activations for that minibatch. The claimed alignment follows from the fact that the training loss directly penalizes deviation between the encoder and the (reweighted) sparse coding map. In the revision we add an explicit gradient derivation and a short differentiability proof in Section 4.2 and Appendix A. revision: yes
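The rebuttal's rule w_j = 1/(|z_j| + epsilon) can be sketched as a single forward pass. The linear encoder/decoder and the hyperparameter values below are placeholders, not the paper's architecture; the point is only that every operation is smooth enough for standard autodiff:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def aen_sae_loss(x, W_e, b_e, W_d, lam1=1e-3, lam2=1e-3, eps=1e-6):
    """One forward pass of the adaptive elastic-net SAE objective.
    The weights w are recomputed from the current activations on every
    pass (no dependence on a previously fitted encoder), so the whole
    expression is end-to-end differentiable by autodiff frameworks."""
    z = relu(W_e @ x + b_e)            # amortized encoder
    x_hat = W_d @ z                    # linear decoder
    w = 1.0 / (np.abs(z) + eps)        # adaptive l1 weights (rebuttal's rule)
    recon = np.sum((x - x_hat) ** 2)
    l1 = lam1 * np.sum(w * np.abs(z))  # reweighted sparsity penalty
    l2 = lam2 * np.sum(z ** 2)         # l2 structural (strong convexity) term
    return recon + l1 + l2
```

Note that with this rule each penalty term is w_j|z_j| = |z_j| / (|z_j| + eps), a smoothed count of active features: large activations pay roughly a constant penalty rather than a growing one, which is the mechanism by which adaptive reweighting avoids shrinkage bias.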
Circularity Check
No circularity: theoretical claims derive from classical elastic-net properties without reduction to inputs or self-citations.
full rationale
The paper defines AEN-SAEs via adaptive elastic-net regularization (ℓ2 for strong convexity plus adaptive ℓ1 reweighting), then states that this yields a Lipschitz-continuous sparse coding map and global support recovery under mild assumptions. These properties follow from standard convex optimization results on elastic nets rather than being defined in terms of the fitted SAE outputs or recovered features. No quoted step equates a prediction to a fitted parameter by construction, and the geometric-instability diagnosis is an interpretive framing that motivates the architecture without presupposing its empirical outcomes. The derivation chain remains self-contained against external sparse-regression benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the mild assumptions under which the global feature support is recovered
invented entities (1)
- AEN-SAEs (no independent evidence)
Reference graph
Works this paper leans on
- [1] K. Ayonrinde. Adaptive sparse allocation with mutual choice & feature choice sparse autoencoders, 2024. URL https://doi.org/10.48550/arXiv.2411.02124
- [2] P. Bartlett, D. J. Foster, and M. Telgarsky. Spectrally-normalized margin bounds for neural networks, 2017. URL https://doi.org/10.48550/arXiv.1706.08498
- [3] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://doi.org/10.48550/arXiv.2304.01373
- [4] T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. L. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah. Towards monosemanticity: Decomposing language models with dictionary learning, 2023...
- [5] B. Bussmann, P. Leask, and N. Nanda. BatchTopK sparse autoencoders, 2024. URL https://doi.org/10.48550/arXiv.2412.06410
- [6] B. Bussmann, N. Nabeshima, A. Karvonen, and N. Nanda. Learning multi-level features with matryoshka sparse autoencoders, 2025. URL https://doi.org/10.48550/arXiv.2503.17547
- [7] D. Chanin, J. Wilken-Smith, T. Dulka, H. Bhatnagar, S. Golechha, and J. Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders, 2025. URL https://doi.org/10.48550/arXiv.2409.14507
- [8] X. Chen, J. Liu, Z. Wang, and W. Yin. Hyperparameter tuning is all you need for LISTA, 2021. URL https://doi.org/10.48550/arXiv.2110.15900
- [11] J. Fan and J. Lv. Sure independence screening for ultra-high dimensional feature space, 2008. URL https://doi.org/10.48550/arXiv.math/0612857
- [12] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Applied and Numerical Harmonic Analysis. Birkhäuser, New York, 2013. URL https://doi.org/10.1007/978-0-8176-4948-7
- [13] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy. The Pile: An 800GB dataset of diverse text for language modeling, 2020. URL https://doi.org/10.48550/arXiv.2101.00027
- [14] L. Gao, T. D. la Tour, H. Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu. Scaling and evaluating sparse autoencoders, 2024. URL https://doi.org/10.48550/arXiv.2406.04093
- [15] M. Geva, A. Caciularu, K. R. Wang, and Y. Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space, 2022. URL https://doi.org/10.48550/arXiv.2203.14680
- [16] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, et al. The Llama 3 herd of models, 2024. URL https://doi.org/10.48550/arXiv.2407.21783
- [17] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, ICML, 2010. URL https://icml.cc/Conferences/2010/papers/449.pdf
- [18] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. URL https://doi.org/10.1126/science.1127647
- [19] S. Hochreiter and J. Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997. URL https://doi.org/10.1162/neco.1997.9.1.1
- [20] A. J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4):263–265, 1952. URL https://nvlpubs.nist.gov/nistpubs/jres/049/4/v49.n04.a05.pdf
- [21] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima, 2017. URL https://doi.org/10.48550/arXiv.1609.04836
- [24] T. Lieberum, S. Rajamanoharan, A. Conmy, L. Smith, N. Sonnerat, V. Varma, J. Kramár, A. Dragan, R. Shah, and N. Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2, 2024. URL https://doi.org/10.48550/arXiv.2408.05147
- [25] J. Liu, T. Blanton, Y. Elazar, S. Min, Y. Chen, A. Chheda-Kothary, H. Tran, B. Bischoff, E. Marsh, M. Schmitz, C. Trier, A. Sarnat, J. James, J. Borchardt, B. Kuehl, E. Cheng, K. Farley, S. Sreeram, T. Anderson, D. Albright, C. Schoenick, L. Soldaini, D. Groeneveld, R. Y. Pang, P. W. Koh, N. A. Smith, S. Lebrecht, Y. Choi, H. Hajishirzi, A. Farhadi, ...
- [26] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(1), 2010. URL http://jmlr.org/papers/v11/mairal10a.html
- [27] C. P. Martin-Linares and J. P. Ling. Attribution-guided distillation of matryoshka sparse autoencoders, 2025. URL https://doi.org/10.48550/arXiv.2512.24975
- [28] N. Nanda. A comprehensive mechanistic interpretability explainer & glossary, 2022. URL https://www.neelnanda.io/mechanistic-interpretability/glossary
- [29] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning, 2017. URL https://doi.org/10.48550/arXiv.1706.08947
- [30] S. T. Piantadosi. Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5):1112–1130, 2014. URL https://doi.org/10.3758/s13423-014-0585-6
- [31] N. Pochinkov, B. Pasero, and S. Shibayama. Investigating neuron ablation in attention heads: The case for peak activation centering, 2024. URL https://doi.org/10.48550/arXiv.2408.17322
- [32] S. Rajamanoharan, A. Conmy, L. Smith, T. Lieberum, V. Varma, J. Kramár, R. Shah, and N. Nanda. Improving dictionary learning with gated sparse autoencoders, 2024. URL https://doi.org/10.48550/arXiv.2404.16014
- [33] S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramár, and N. Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders, 2024. URL https://doi.org/10.48550/arXiv.2407.14435
- [34] S. M. Robinson. Some continuity properties of polyhedral multifunctions. Mathematical Programming at Oberwolfach, 14:206–214, 1981. URL https://doi.org/10.1007/BFb0120929
- [35] T. Rockafellar and R. Wets. Variational Analysis. Springer, 1998. URL https://doi.org/10.1007/978-3-642-02431-3
- [36] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding, 2023. URL https://doi.org/10.48550/arXiv.2104.09864
- [37] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. URL https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
- [38] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019. URL https://doi.org/10.1017/9781108627771
- [39] G. Yang, E. J. Hu, I. Babuschkin, S. Sidor, X. Liu, D. Farhi, N. Ryder, J. Pachocki, W. Chen, and J. Gao. Tensor Programs V: Tuning large neural networks via zero-shot hyperparameter transfer, 2022. URL https://doi.org/10.48550/arXiv.2203.03466
- [40] P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7(90):2541–2563, 2006. URL http://jmlr.org/papers/v7/zhao06a.html
- [41] H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006. URL https://doi.org/10.1198/016214506000000735
- [42] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005. URL https://doi.org/10.1111/j.1467-9868.2005.00503.x
- [43] H. Zou and H. H. Zhang. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics, 37(4):1733–1751, 2009. URL https://doi.org/10.1214/08-AOS625