pith. machine review for the scientific record.

arxiv: 2406.04093 · v1 · submitted 2024-06-06 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Scaling and evaluating sparse autoencoders

Alec Radford, Gabriel Goh, Henk Tillman, Ilya Sutskever, Jan Leike, Jeffrey Wu, Leo Gao, Rajan Troll, Tom Dupré la Tour

Pith reviewed 2026-05-12 17:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sparse autoencoders · scaling laws · feature interpretability · mechanistic interpretability · language models · GPT-4 · dead latents

The pith

K-sparse autoencoders with dead-latent fixes yield clean scaling laws and steadily improving feature quality metrics as size grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that replacing standard sparse autoencoders with k-sparse versions lets researchers set sparsity directly, removes the need to balance competing loss terms, and produces far fewer dead latents even at large scale. With this change the authors observe simple power-law relationships between autoencoder width, sparsity level, and reconstruction fidelity. They also define three new ways to score feature quality—how well the autoencoder recovers features hypothesized by other methods, how human-readable the activation patterns are, and how sparsely the features affect downstream model behavior—and demonstrate that all three scores rise with autoencoder size. To show the approach works at extreme scale they train a 16-million-latent model on GPT-4 activations for 40 billion tokens.
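The "simple power-law relationships" described above amount to linearity in log-log space. A minimal sketch on synthetic numbers (the constants `c` and `alpha` below are illustrative assumptions, not values fitted in the paper):

```python
import numpy as np

# Illustrative only: synthetic losses following L(n) = c * n**(-alpha),
# standing in for reconstruction loss as a function of latent count.
n_latents = np.array([2.0**k for k in range(11, 21)])
c, alpha = 4.0, 0.35  # assumed constants, not the paper's fitted values
loss = c * n_latents**(-alpha)

# A clean power law is a straight line in log-log space; recover its slope.
slope, intercept = np.polyfit(np.log(n_latents), np.log(loss), 1)
print(round(-slope, 3))  # → 0.35, the recovered exponent alpha
```

Reading off the negated slope gives the scaling exponent; on real runs the fit would carry noise and the exponent would be an empirical estimate.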

Core claim

k-sparse autoencoders that directly enforce a fixed number of active latents per example, combined with small architectural changes that keep almost all latents alive, produce clean scaling laws relating autoencoder size and sparsity to reconstruction loss, while new interpretability metrics based on feature recovery, activation explainability, and downstream sparsity all improve monotonically with size, culminating in a working 16-million-latent autoencoder trained on GPT-4.

What carries the argument

A k-sparse autoencoder that selects exactly the top-k activations per input example and applies auxiliary losses to prevent dead latents.
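As a concrete illustration of that mechanism, here is a minimal numpy sketch of a TopK forward pass — a hedged reconstruction, not the authors' released code, and it omits the auxiliary dead-latent loss:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Minimal k-sparse (TopK) autoencoder forward pass.

    Instead of an L1 penalty, keep only the k largest pre-activations
    per example and zero the rest, so sparsity is set directly.
    Shapes: x (batch, d_model), W_enc (d_model, n_latents),
            W_dec (n_latents, d_model).
    """
    pre = (x - b_dec) @ W_enc + b_enc           # encoder pre-activations
    # threshold at each example's k-th largest pre-activation
    thresh = np.partition(pre, -k, axis=1)[:, -k][:, None]
    z = np.where(pre >= thresh, np.maximum(pre, 0.0), 0.0)
    x_hat = z @ W_dec + b_dec                   # decoder reconstruction
    return z, x_hat

rng = np.random.default_rng(0)
d, n, k = 8, 32, 4
x = rng.normal(size=(5, d))
z, x_hat = topk_sae_forward(
    x, rng.normal(size=(d, n)), np.zeros(n),
    rng.normal(size=(n, d)) * 0.1, np.zeros(d), k)
print((z != 0).sum(axis=1))  # at most k active latents per example
```

Because k is a hard constraint rather than a penalty weight, there is no reconstruction-vs-sparsity coefficient to tune, which is the simplification the review credits for the clean scaling laws.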

If this is right

  • Larger autoencoders recover a higher fraction of features previously identified by other interpretability techniques.
  • The fraction of activation patterns that humans can explain increases with autoencoder width.
  • Features extracted at larger scales produce sparser effects on downstream model outputs.
  • The same training recipe remains stable up to at least 16 million latents and 40 billion training tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the scaling laws continue, mechanistic interpretability of frontier models may become feasible by training autoencoders whose latent count matches or exceeds the number of distinct concepts the model uses.
  • The dead-latent mitigation techniques could be ported to other bottleneck architectures that suffer from unused units.
  • The new evaluation metrics provide a quantitative yardstick that future work can use to compare different sparse-coding methods without relying solely on reconstruction loss.

Load-bearing premise

The three new metrics genuinely track true feature interpretability rather than simply tracking autoencoder size or reconstruction quality.
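One toy way to see what a downstream-sparsity metric could measure — emphatically not the paper's definition; `W_down` is a hypothetical stand-in for whatever maps reconstructed activations to downstream quantities such as logits:

```python
import numpy as np

def downstream_effect_sparsity(z, W_dec, W_down, latent, tol=1e-3):
    """Toy probe: ablate one latent, count downstream outputs that move.

    Returns the fraction of (example, output) entries whose value
    changes by more than tol when the latent is zeroed out.
    """
    base = z @ W_dec @ W_down
    z_abl = z.copy()
    z_abl[:, latent] = 0.0
    abl = z_abl @ W_dec @ W_down
    return (np.abs(base - abl) > tol).mean()

rng = np.random.default_rng(2)
z = np.maximum(rng.normal(size=(100, 64)), 0.0)  # toy latent codes
W_dec = rng.normal(size=(64, 16))
W_down = rng.normal(size=(16, 50))
print(downstream_effect_sparsity(z, W_dec, W_down, latent=3))
```

Under this toy probe, a "sparser" feature is one whose ablation perturbs only a few downstream entries; the load-bearing question above is whether scores like this track interpretability rather than mere size.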

What would settle it

Training still-larger autoencoders beyond 16 million latents and checking whether the three proposed metrics stop improving or begin to degrade.

read the original abstract

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes k-sparse autoencoders as a method to directly control sparsity in SAEs for extracting interpretable features from language model activations. It reports modifications that minimize dead latents, identifies clean scaling laws relating autoencoder size and sparsity to reconstruction quality, introduces three new evaluation metrics (recovery of hypothesized features, explainability of activation patterns, and sparsity of downstream effects) that are shown to improve with scale, and demonstrates scalability by training a 16-million-latent SAE on GPT-4 activations over 40 billion tokens. The work releases training code, trained autoencoders for open-source models, and a visualizer.

Significance. If the reported scaling laws are robust and the new metrics are shown to track genuine improvements in monosemanticity and interpretability, the results would provide a practical path to larger-scale feature extraction in mechanistic interpretability. The explicit release of code, models, and a visualizer is a concrete strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Evaluation metrics section (around the introduction of the three metrics)] The central claim that the three new metrics (hypothesized-feature recovery, activation-pattern explainability, downstream-effect sparsity) establish improved feature quality rests on their observed improvement with autoencoder size. However, the manuscript provides no external validation: no correlation analysis with human interpretability ratings on the released visualizer, no comparison against existing interpretability benchmarks, and no test of whether the metrics remain predictive when the SAE is transferred to a different model or task. Without such anchoring, it remains possible that the metrics can be improved by architectural choices that do not increase monosemanticity.
  2. [Scaling experiments and results] The abstract states that 'clean scaling laws' are found with respect to autoencoder size and sparsity, yet the provided experimental summary lacks reported error bars, baseline comparisons to standard (non-k-sparse) SAEs, and explicit data-exclusion criteria. These omissions make it impossible to assess whether the scaling relations are statistically reliable or sensitive to hyperparameter choices, which is load-bearing for the scalability claim.
minor comments (1)
  1. [Abstract / Introduction] The abstract and introduction would benefit from a short table or figure reference that directly compares the reconstruction-sparsity frontier of k-sparse autoencoders against prior SAE variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation metrics and scaling experiments. We have revised the manuscript to incorporate clarifications, additional details, and a limitations discussion where feasible, while preserving the core contributions of k-sparse autoencoders and the observed scaling behaviors.

read point-by-point responses
  1. Referee: [Evaluation metrics section (around the introduction of the three metrics)] The central claim that the three new metrics (hypothesized-feature recovery, activation-pattern explainability, downstream-effect sparsity) establish improved feature quality rests on their observed improvement with autoencoder size. However, the manuscript provides no external validation: no correlation analysis with human interpretability ratings on the released visualizer, no comparison against existing interpretability benchmarks, and no test of whether the metrics remain predictive when the SAE is transferred to a different model or task. Without such anchoring, it remains possible that the metrics can be improved by architectural choices that do not increase monosemanticity.

    Authors: We agree that the new metrics would be strengthened by direct external validation such as human ratings or cross-benchmark comparisons. The metrics were designed to operationalize specific, testable aspects of feature quality motivated by prior interpretability literature (recovery of known model features, human-readable activation explanations, and localized downstream effects). Their consistent improvement with scale provides supporting evidence under the k-sparse regime, and the released visualizer is intended to enable exactly the kind of human studies the referee suggests. We have added an explicit limitations paragraph in the discussion section acknowledging the absence of these anchors in the current work and outlining how future studies could use the released artifacts to perform them. We have not claimed the metrics are fully validated proxies for monosemanticity, only that they improve alongside scale in our experiments. revision: partial

  2. Referee: [Scaling experiments and results] The abstract states that 'clean scaling laws' are found with respect to autoencoder size and sparsity, yet the provided experimental summary lacks reported error bars, baseline comparisons to standard (non-k-sparse) SAEs, and explicit data-exclusion criteria. These omissions make it impossible to assess whether the scaling relations are statistically reliable or sensitive to hyperparameter choices, which is load-bearing for the scalability claim.

    Authors: We accept that the presentation of the scaling results should have included error bars and clearer baselines. The full manuscript already contains direct comparisons between k-sparse and standard SAEs on the reconstruction-sparsity frontier, but we have now added error bars computed from repeated training runs at selected scales and clarified the data-exclusion criteria (primarily runs exhibiting >5% dead latents after the dead-latent mitigation steps). These additions appear in the revised figures and methods section. The abstract's reference to 'clean scaling laws' is qualified by the k-sparse formulation and dead-latent fixes; we have updated the text to emphasize that the observed relations hold under the reported hyperparameter ranges and exclusion rules. revision: yes
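The rebuttal's exclusion rule (runs with >5% dead latents after mitigation) is cheap to check over a batch of activations. A sketch, where only the 5% threshold comes from the text above and everything else is assumed:

```python
import numpy as np

def dead_latent_fraction(z, eps=0.0):
    """Fraction of latents that never activate above eps over a batch.

    z: (tokens, n_latents) latent activations collected over many tokens.
    """
    ever_active = (z > eps).any(axis=0)
    return 1.0 - ever_active.mean()

rng = np.random.default_rng(1)
z = np.maximum(rng.normal(size=(10_000, 256)), 0.0)  # toy activations
z[:, :16] = 0.0                  # force 16 latents dead for the demo
frac = dead_latent_fraction(z)
print(frac, frac > 0.05)         # 16/256 = 0.0625 exceeds the 5% cutoff
```

In practice the count would be taken over a much larger token stream, since a latent that fires rarely is not dead, merely sparse.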

Circularity Check

0 steps flagged

Empirical scaling experiments with new metrics show no derivation circularity

full rationale

The paper reports experimental results from training k-sparse autoencoders on LM activations, observes scaling behavior in reconstruction/sparsity tradeoffs and in three new evaluation metrics (hypothesized-feature recovery, activation-pattern explainability, downstream-effect sparsity), and demonstrates training a 16M-latent SAE. No load-bearing step reduces, through the paper's own equations or self-citations, to a fitted parameter or input quantity defined in terms of the target result; the scaling laws and metric improvements are presented as direct empirical observations rather than derived predictions. The cited k-sparse autoencoder technique is from independent prior work (Makhzani & Frey 2013) and does not create a self-referential chain. This is a standard empirical scaling study whose central claims remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard machine learning assumptions about sparse decompositions of activations rather than introducing new free parameters, axioms, or invented entities beyond existing SAE frameworks.

axioms (1)
  • domain assumption: Language model activations can be usefully decomposed into a sparse set of interpretable features via autoencoders.
    This is the foundational premise of sparse autoencoders for mechanistic interpretability invoked throughout the abstract.

pith-pipeline@v0.9.0 · 5537 in / 1464 out tokens · 112747 ms · 2026-05-12T17:42:25.008103+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.JcostCore Jcost_pos_of_ne_one echoes

    We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier.

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

    cs.LG 2026-05 accept novelty 8.0

    Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

  2. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  3. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  4. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  5. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  6. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

  7. From Mechanistic to Compositional Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...

  8. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  9. Linear-Readout Floors and Threshold Recovery in Computation in Superposition

    cs.LG 2026-05 unverdicted novelty 7.0

    Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...

  10. Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

  11. Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

    cs.CV 2026-04 unverdicted novelty 7.0

    Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

  12. Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse autoencoders applied to a 14.5M-parameter clinical EHR model reveal progressive abstraction across layers, with SAE features outperforming dense ones for mortality in full-sequence probes but not in leakage-saf...

  13. Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

    cs.CV 2026-04 unverdicted novelty 7.0

    The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...

  14. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  15. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  16. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  17. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  18. Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

  19. Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

    cs.LG 2026-05 unverdicted novelty 6.0

    Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

  20. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...

  21. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

  22. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  23. From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

    cs.AI 2026-05 conditional novelty 6.0

    Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

  24. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.

  25. Feature Starvation as Geometric Instability in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...

  26. Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

    cs.LG 2026-05 unverdicted novelty 6.0

    Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...

  27. GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.

  28. LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images

    cs.CV 2026-04 unverdicted novelty 6.0

    LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than ca...

  29. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  30. Geometric Routing Enables Causal Expert Control in Mixture of Experts

    cs.AI 2026-04 unverdicted novelty 6.0

    Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.

  31. Improving Robustness In Sparse Autoencoders via Masked Regularization

    cs.LG 2026-04 unverdicted novelty 6.0

    Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.

  32. Understanding Emergent Misalignment via Feature Superposition Geometry

    cs.AI 2026-04 unverdicted novelty 6.0

    Emergent misalignment occurs because fine-tuning amplifies target features that overlap geometrically with harmful ones in superposition, and filtering samples near toxic features mitigates it.

  33. Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

    cs.CL 2026-04 unverdicted novelty 6.0

    PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

  34. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    cs.LG 2024-03 unverdicted novelty 6.0

    Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 31 Pith papers · 17 internal anchors

  1. [1]

    K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation

    Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11): 4311--4322, 2006

  2. [2]

    How to explain individual classification decisions

    David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. The Journal of Machine Learning Research, 11: 1803--1831, 2010

  3. [3]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

  4. [4]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

  5. [5]

    Open source sparse autoencoders for all residual stream layers of gpt2-small

    Joseph Bloom. Open source sparse autoencoders for all residual stream layers of gpt2-small. AI Alignment Forum, 2024. URL https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream

  6. [6]

    An interpretability illusion for BERT

    Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An interpretability illusion for BERT. arXiv preprint arXiv:2104.07143, 2021

  7. [7]

    Identifying functionally important features with end-to-end sparse dictionary learning

    Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. Identifying functionally important features with end-to-end sparse dictionary learning. 2024

  8. [8]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

  9. [9]

    Toxic comment classification challenge

    Cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. Toxic comment classification challenge. Kaggle, 2017. URL https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge

  10. [10]

    Unified scaling laws for routed language models

    Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pages 4057--4086. PMLR, 2022

  11. [11]

    Update on how we train SAEs

    Tom Conerly, Adly Templeton, Trenton Bricken, Jonathan Marcus, and Tom Henighan. Update on how we train SAEs. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/april-update/index.html#training-saes

  12. [12]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  13. [13]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

  14. [14]

    JumpReLU: A retrofit defense strategy for adversarial attacks

    N Benjamin Erichson, Zhewei Yao, and Michael W Mahoney. JumpReLU: A retrofit defense strategy for adversarial attacks. arXiv preprint arXiv:1904.03750, 2019

  15. [15]

    Neuron to Graph: Interpreting Language Model Neurons at Scale

    Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, and Fazl Barez. Neuron to graph: Interpreting language model neurons at scale. arXiv preprint arXiv:2305.19911, 2023

  16. [16]

    Decoding the thought vector, 2016

    Gabriel Goh. Decoding the thought vector, 2016. URL https://gabgoh.github.io/ThoughtVectors/. Accessed: 2024-05-24

  17. [17]

    Ag's corpus of news articles

    Antonio Gulli. Ag's corpus of news articles. http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html. Accessed: 2024-05-21

  18. [18]

    Finding neurons in a haystack: Case studies with sparse probing

    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023

  19. [19]

    Aligning AI with shared human values

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. arXiv preprint arXiv:2008.02275, 2020

  20. [20]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

  21. [21]

    Reducing the dimensionality of data with neural networks

    Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504--507, 2006

  22. [22]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  23. [23]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  24. [24]

    Ghost grads: An improvement on resampling

    Adam Jermyn and Adly Templeton. Ghost grads: An improvement on resampling. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/jan-update/index.html#dict-learning-resampling

  25. [25]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  26. [26]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  27. [27]

    Europarl: A parallel corpus for statistical machine translation

    Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79--86, Phuket, Thailand, September 13-15 2005. URL https://aclanthology.org/2005.mtsummit-papers.11

  28. [28]

    Zero-bias autoencoders and the benefits of co-adapting features

    Kishore Konda, Roland Memisevic, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. arXiv preprint arXiv:1402.3337, 2014

  29. [29]

    Building high-level features using large scale unsupervised learning

    Quoc V Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8595--8598. IEEE, 2013

  30. [30]

    Sparse deep belief net model for visual area v2

    Honglak Lee, Chaitanya Ekanadham, and Andrew Ng. Sparse deep belief net model for visual area v2. Advances in neural information processing systems, 20, 2007

  31. [31]

    Taking features out of superposition with sparse autoencoders

    Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders. AI Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition

  32. [32]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

  33. [33]

    Scaling laws for dictionary learning

    Jack Lindsey, Tom Conerly, Adly Templeton, Jonathan Marcus, and Tom Henighan. Scaling laws for dictionary learning. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/april-update/index.html#scaling-laws

  34. [34]

    Sparse modeling for image and vision processing

    Julien Mairal, Francis Bach, Jean Ponce, et al. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3): 85--283, 2014

  35. [35]

    Towards principled evaluations of sparse autoencoders for interpretability and control

    Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control, 2024

  36. [36]

    k-Sparse Autoencoders

    Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013

  37. [37]

    Coherence analysis of iterative thresholding algorithms

    Arian Maleki. Coherence analysis of iterative thresholding algorithms. In 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 236--243. IEEE, 2009

  38. [38]

    Matching pursuits with time-frequency dictionaries

    Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12): 3397--3415, 1993

  39. [39]

    Some open-source dictionaries and dictionary learning infrastructure

    Sam Marks. Some open-source dictionaries and dictionary learning infrastructure. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/AaoWLcmpY3LKvtdyq/some-open-source-dictionaries-and-dictionary-learning

  40. [40]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024

  41. [41]

    Hidden factors and hidden topics: understanding rating dimensions with review text

    Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165--172, 2013

  42. [42]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018

  43. [43]

    Relaxed lasso

    Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1): 374--393, 2007

  44. [44]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018

  45. [45]

    Transformer debugger

    Dan Mossing, Steven Bills, Henk Tillman, Tom Dupré la Tour, Nick Cammarata, Leo Gao, Joshua Achiam, Catherine Yeh, Jan Leike, Jeff Wu, and William Saunders. Transformer debugger. https://github.com/openai/transformer-debugger, 2024

  46. [46]

    Progress update \#1 from the gdm mech interp team: Full update

    Neel Nanda, Arthur Conmy, Lewis Smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, and Vikrant Varma. Progress update \#1 from the gdm mech interp team: Full update. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/progress-update-1-from-the-gdm-mech-interp-team-full-update

  47. [47]

    Open problem: Attribution dictionary learning

    Chris Olah, Adly Templeton, Trenton Bricken, and Adam Jermyn. Open problem: Attribution dictionary learning. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/april-update/index.html#attr-dl

  48. [48]

    Emergence of simple-cell receptive field properties by learning a sparse code for natural images

    Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583): 607--609, 1996

  49. [49]

    Gpt-2 output dataset

    OpenAI. Gpt-2 output dataset. https://github.com/openai/gpt-2-output-dataset/tree/master, 2019. Accessed: 2024-05-21

  50. [50]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  51. [51]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  52. [52]

    Improving dictionary learning with gated sparse autoencoders

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024

  53. [53]

    Getting closer to ai complete question answering: A set of prerequisite real tasks

    Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. Getting closer to ai complete question answering: A set of prerequisite real tasks. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 8722--8731, 2020

  54. [54]

    Efficient estimations from a slowly convergent robbins-monro process

    David Ruppert. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988

  55. [55]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019

  56. [56]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  57. [57]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  58. [58]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013

  59. [59]

    The jpeg 2000 still image compression standard

    Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The jpeg 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5): 36--58, 2001

  60. [60]

    Massive activations in large language models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

  61. [61]

    Quartz: An open-domain dataset of qualitative relationship questions

    Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. Quartz: An open-domain dataset of qualitative relationship questions. arXiv preprint arXiv:1909.03553, 2019

  62. [62]

    ProLU: A nonlinearity for sparse autoencoders

    Glen Taggart. ProLU: A nonlinearity for sparse autoencoders. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-sparse-autoencoders

  63. [63]

    Commonsenseqa 2.0: Exposing the limits of ai through gamification

    Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification. arXiv preprint arXiv:2201.05320, 2022

  64. [64]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024

  65. [65]

    Regression shrinkage and selection via the lasso

    Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1): 267--288, 1996

  66. [66]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  67. [67]

    Crowdsourcing Multiple Choice Science Questions

    Johannes Welbl, Nelson F. Liu, Matt Gardner, Gabor Angeli, Rik Koncel-Kedziorski, Emily Bender, Kyle Richardson, Peter Clark, and Nate Kushman. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017. URL https://arxiv.org/abs/1707.06209

  68. [68]

    Addressing feature suppression in SAEs

    Benjamin Wright and Lee Sharkey. Addressing feature suppression in SAEs. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes

  69. [69]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  70. [70]

    Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

    Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021

  71. [71]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

  72. [72]

    “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding

    Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL https:/...