pith. machine review for the scientific record.

arxiv: 2406.04093 · v1 · submitted 2024-06-06 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Scaling and evaluating sparse autoencoders

Alec Radford, Gabriel Goh, Henk Tillman, Ilya Sutskever, Jan Leike, Jeffrey Wu, Leo Gao, Rajan Troll, Tom Dupré la Tour

Pith reviewed 2026-05-12 17:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sparse autoencoders · scaling laws · feature interpretability · mechanistic interpretability · language models · GPT-4 · dead latents

The pith

K-sparse autoencoders with dead-latent fixes yield clean scaling laws and steadily improving feature quality metrics as size grows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that replacing standard sparse autoencoders with k-sparse versions lets researchers set sparsity directly, removes the need to balance competing loss terms, and produces far fewer dead latents even at large scale. With this change the authors observe simple power-law relationships between autoencoder width, sparsity level, and reconstruction fidelity. They also define three new ways to score feature quality—how well the autoencoder recovers features hypothesized by other methods, how human-readable the activation patterns are, and how sparsely the features affect downstream model behavior—and demonstrate that all three scores rise with autoencoder size. To show the approach works at extreme scale they train a 16-million-latent model on GPT-4 activations for 40 billion tokens.
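The "simple power-law relationships" described above amount to linearity in log-log space. A minimal sketch on synthetic numbers (the constants `c` and `alpha` below are illustrative assumptions, not values fitted in the paper):

```python
import numpy as np

# Illustrative only: synthetic losses following L(n) = c * n**(-alpha),
# standing in for reconstruction loss as a function of latent count.
n_latents = np.array([2.0**k for k in range(11, 21)])
c, alpha = 4.0, 0.35  # assumed constants, not the paper's fitted values
loss = c * n_latents**(-alpha)

# A clean power law is a straight line in log-log space; recover its slope.
slope, intercept = np.polyfit(np.log(n_latents), np.log(loss), 1)
print(round(-slope, 3))  # → 0.35, the recovered exponent alpha
```

Reading off the negated slope gives the scaling exponent; on real runs the fit would carry noise and the exponent would be an empirical estimate.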

Core claim

k-sparse autoencoders that directly enforce a fixed number of active latents per example, combined with small architectural changes that keep almost all latents alive, produce clean scaling laws relating autoencoder size and sparsity to reconstruction loss, while new interpretability metrics based on feature recovery, activation explainability, and downstream sparsity all improve monotonically with size, culminating in a working 16-million-latent autoencoder trained on GPT-4.

What carries the argument

A k-sparse autoencoder that selects exactly the top-k activations per input example and applies auxiliary losses to prevent dead latents.
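As a concrete illustration of that mechanism, here is a minimal numpy sketch of a TopK forward pass — a hedged reconstruction, not the authors' released code, and it omits the auxiliary dead-latent loss:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Minimal k-sparse (TopK) autoencoder forward pass.

    Instead of an L1 penalty, keep only the k largest pre-activations
    per example and zero the rest, so sparsity is set directly.
    Shapes: x (batch, d_model), W_enc (d_model, n_latents),
            W_dec (n_latents, d_model).
    """
    pre = (x - b_dec) @ W_enc + b_enc           # encoder pre-activations
    # threshold at each example's k-th largest pre-activation
    thresh = np.partition(pre, -k, axis=1)[:, -k][:, None]
    z = np.where(pre >= thresh, np.maximum(pre, 0.0), 0.0)
    x_hat = z @ W_dec + b_dec                   # decoder reconstruction
    return z, x_hat

rng = np.random.default_rng(0)
d, n, k = 8, 32, 4
x = rng.normal(size=(5, d))
z, x_hat = topk_sae_forward(
    x, rng.normal(size=(d, n)), np.zeros(n),
    rng.normal(size=(n, d)) * 0.1, np.zeros(d), k)
print((z != 0).sum(axis=1))  # at most k active latents per example
```

Because k is a hard constraint rather than a penalty weight, there is no reconstruction-vs-sparsity coefficient to tune, which is the simplification the review credits for the clean scaling laws.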

If this is right

  • Larger autoencoders recover a higher fraction of features previously identified by other interpretability techniques.
  • The fraction of activation patterns that humans can explain increases with autoencoder width.
  • Features extracted at larger scales produce sparser effects on downstream model outputs.
  • The same training recipe remains stable up to at least 16 million latents and 40 billion training tokens.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the scaling laws continue, mechanistic interpretability of frontier models may become feasible by training autoencoders whose latent count matches or exceeds the number of distinct concepts the model uses.
  • The dead-latent mitigation techniques could be ported to other bottleneck architectures that suffer from unused units.
  • The new evaluation metrics provide a quantitative yardstick that future work can use to compare different sparse-coding methods without relying solely on reconstruction loss.

Load-bearing premise

The three new metrics genuinely track true feature interpretability rather than simply tracking autoencoder size or reconstruction quality.
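One toy way to see what a downstream-sparsity metric could measure — emphatically not the paper's definition; `W_down` is a hypothetical stand-in for whatever maps reconstructed activations to downstream quantities such as logits:

```python
import numpy as np

def downstream_effect_sparsity(z, W_dec, W_down, latent, tol=1e-3):
    """Toy probe: ablate one latent, count downstream outputs that move.

    Returns the fraction of (example, output) entries whose value
    changes by more than tol when the latent is zeroed out.
    """
    base = z @ W_dec @ W_down
    z_abl = z.copy()
    z_abl[:, latent] = 0.0
    abl = z_abl @ W_dec @ W_down
    return (np.abs(base - abl) > tol).mean()

rng = np.random.default_rng(2)
z = np.maximum(rng.normal(size=(100, 64)), 0.0)  # toy latent codes
W_dec = rng.normal(size=(64, 16))
W_down = rng.normal(size=(16, 50))
print(downstream_effect_sparsity(z, W_dec, W_down, latent=3))
```

Under this toy probe, a "sparser" feature is one whose ablation perturbs only a few downstream entries; the load-bearing question above is whether scores like this track interpretability rather than mere size.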

What would settle it

Training still-larger autoencoders beyond 16 million latents and checking whether the three proposed metrics stop improving or begin to degrade.

read the original abstract

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes k-sparse autoencoders as a method to directly control sparsity in SAEs for extracting interpretable features from language model activations. It reports modifications that minimize dead latents, identifies clean scaling laws relating autoencoder size and sparsity to reconstruction quality, introduces three new evaluation metrics (recovery of hypothesized features, explainability of activation patterns, and sparsity of downstream effects) that are shown to improve with scale, and demonstrates scalability by training a 16-million-latent SAE on GPT-4 activations over 40 billion tokens. The work releases training code, trained autoencoders for open-source models, and a visualizer.

Significance. If the reported scaling laws are robust and the new metrics are shown to track genuine improvements in monosemanticity and interpretability, the results would provide a practical path to larger-scale feature extraction in mechanistic interpretability. The explicit release of code, models, and a visualizer is a concrete strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Evaluation metrics section (around the introduction of the three metrics)] The central claim that the three new metrics (hypothesized-feature recovery, activation-pattern explainability, downstream-effect sparsity) establish improved feature quality rests on their observed improvement with autoencoder size. However, the manuscript provides no external validation: no correlation analysis with human interpretability ratings on the released visualizer, no comparison against existing interpretability benchmarks, and no test of whether the metrics remain predictive when the SAE is transferred to a different model or task. Without such anchoring, it remains possible that the metrics can be improved by architectural choices that do not increase monosemanticity.
  2. [Scaling experiments and results] The abstract states that 'clean scaling laws' are found with respect to autoencoder size and sparsity, yet the provided experimental summary lacks reported error bars, baseline comparisons to standard (non-k-sparse) SAEs, and explicit data-exclusion criteria. These omissions make it impossible to assess whether the scaling relations are statistically reliable or sensitive to hyperparameter choices, which is load-bearing for the scalability claim.
minor comments (1)
  1. [Abstract / Introduction] The abstract and introduction would benefit from a short table or figure reference that directly compares the reconstruction-sparsity frontier of k-sparse autoencoders against prior SAE variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation metrics and scaling experiments. We have revised the manuscript to incorporate clarifications, additional details, and a limitations discussion where feasible, while preserving the core contributions of k-sparse autoencoders and the observed scaling behaviors.

read point-by-point responses
  1. Referee: [Evaluation metrics section (around the introduction of the three metrics)] The central claim that the three new metrics (hypothesized-feature recovery, activation-pattern explainability, downstream-effect sparsity) establish improved feature quality rests on their observed improvement with autoencoder size. However, the manuscript provides no external validation: no correlation analysis with human interpretability ratings on the released visualizer, no comparison against existing interpretability benchmarks, and no test of whether the metrics remain predictive when the SAE is transferred to a different model or task. Without such anchoring, it remains possible that the metrics can be improved by architectural choices that do not increase monosemanticity.

    Authors: We agree that the new metrics would be strengthened by direct external validation such as human ratings or cross-benchmark comparisons. The metrics were designed to operationalize specific, testable aspects of feature quality motivated by prior interpretability literature (recovery of known model features, human-readable activation explanations, and localized downstream effects). Their consistent improvement with scale provides supporting evidence under the k-sparse regime, and the released visualizer is intended to enable exactly the kind of human studies the referee suggests. We have added an explicit limitations paragraph in the discussion section acknowledging the absence of these anchors in the current work and outlining how future studies could use the released artifacts to perform them. We have not claimed the metrics are fully validated proxies for monosemanticity, only that they improve alongside scale in our experiments. revision: partial

  2. Referee: [Scaling experiments and results] The abstract states that 'clean scaling laws' are found with respect to autoencoder size and sparsity, yet the provided experimental summary lacks reported error bars, baseline comparisons to standard (non-k-sparse) SAEs, and explicit data-exclusion criteria. These omissions make it impossible to assess whether the scaling relations are statistically reliable or sensitive to hyperparameter choices, which is load-bearing for the scalability claim.

    Authors: We accept that the presentation of the scaling results should have included error bars and clearer baselines. The full manuscript already contains direct comparisons between k-sparse and standard SAEs on the reconstruction-sparsity frontier, but we have now added error bars computed from repeated training runs at selected scales and clarified the data-exclusion criteria (primarily runs exhibiting >5% dead latents after the dead-latent mitigation steps). These additions appear in the revised figures and methods section. The abstract's reference to 'clean scaling laws' is qualified by the k-sparse formulation and dead-latent fixes; we have updated the text to emphasize that the observed relations hold under the reported hyperparameter ranges and exclusion rules. revision: yes
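The rebuttal's exclusion rule (runs with >5% dead latents after mitigation) is cheap to check over a batch of activations. A sketch, where only the 5% threshold comes from the text above and everything else is assumed:

```python
import numpy as np

def dead_latent_fraction(z, eps=0.0):
    """Fraction of latents that never activate above eps over a batch.

    z: (tokens, n_latents) latent activations collected over many tokens.
    """
    ever_active = (z > eps).any(axis=0)
    return 1.0 - ever_active.mean()

rng = np.random.default_rng(1)
z = np.maximum(rng.normal(size=(10_000, 256)), 0.0)  # toy activations
z[:, :16] = 0.0                  # force 16 latents dead for the demo
frac = dead_latent_fraction(z)
print(frac, frac > 0.05)         # 16/256 = 0.0625 exceeds the 5% cutoff
```

In practice the count would be taken over a much larger token stream, since a latent that fires rarely is not dead, merely sparse.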

Circularity Check

0 steps flagged

Empirical scaling experiments with new metrics show no derivation circularity

full rationale

The paper reports experimental results from training k-sparse autoencoders on LM activations, observes scaling behavior in reconstruction/sparsity tradeoffs and in three new evaluation metrics (hypothesized-feature recovery, activation-pattern explainability, downstream-effect sparsity), and demonstrates training a 16M-latent SAE. No load-bearing step reduces, through the paper's own equations or self-citations, to a fitted parameter or input quantity defined in terms of the target result; the scaling laws and metric improvements are presented as direct empirical observations rather than derived predictions. The cited k-sparse autoencoder technique is from independent prior work (Makhzani & Frey 2013) and does not create a self-referential chain. This is a standard empirical scaling study whose central claims remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard machine learning assumptions about sparse decompositions of activations rather than introducing new free parameters, axioms, or invented entities beyond existing SAE frameworks.

axioms (1)
  • domain assumption: Language model activations can be usefully decomposed into a sparse set of interpretable features via autoencoders.
    This is the foundational premise of sparse autoencoders for mechanistic interpretability invoked throughout the abstract.

pith-pipeline@v0.9.0 · 5537 in / 1464 out tokens · 112747 ms · 2026-05-12T17:42:25.008103+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.JcostCore Jcost_pos_of_ne_one echoes

    We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier.

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

    cs.LG 2026-05 accept novelty 8.0

    Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

  2. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  3. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  4. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  5. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  6. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

  7. From Mechanistic to Compositional Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaran...

  8. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  9. Linear-Readout Floors and Threshold Recovery in Computation in Superposition

    cs.LG 2026-05 unverdicted novelty 7.0

    Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...

  10. Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

    cs.CV 2026-04 unverdicted novelty 7.0

    Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

  11. Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

    cs.CV 2026-04 unverdicted novelty 7.0

    Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

  12. Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse autoencoders applied to a 14.5M-parameter clinical EHR model reveal progressive abstraction across layers, with SAE features outperforming dense ones for mortality in full-sequence probes but not in leakage-saf...

  13. Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

    cs.CV 2026-04 unverdicted novelty 7.0

    The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model pr...

  14. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  15. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  16. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  17. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  18. Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

  19. Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

    cs.LG 2026-05 unverdicted novelty 6.0

    Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.

  20. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...

  21. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

  22. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  23. From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

    cs.AI 2026-05 conditional novelty 6.0

    Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

  24. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.

  25. Feature Starvation as Geometric Instability in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...

  26. Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

    cs.LG 2026-05 unverdicted novelty 6.0

    Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...

  27. GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.

  28. LatentDiff: Scaling Semantic Dataset Comparison to Millions of Images

    cs.CV 2026-04 unverdicted novelty 6.0

    LatentDiff scales semantic dataset comparison to millions of images using latent spaces of vision encoders combined with sparse autoencoders and density ratio estimation, showing better accuracy and robustness than ca...

  29. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  30. Geometric Routing Enables Causal Expert Control in Mixture of Experts

    cs.AI 2026-04 unverdicted novelty 6.0

    Cosine-similarity routing in low-dimensional space makes MoE experts monosemantic by construction and enables direct causal control via centroid interventions.

  31. Improving Robustness In Sparse Autoencoders via Masked Regularization

    cs.LG 2026-04 unverdicted novelty 6.0

    Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.

  32. Understanding Emergent Misalignment via Feature Superposition Geometry

    cs.AI 2026-04 unverdicted novelty 6.0

    Emergent misalignment occurs because fine-tuning amplifies target features that overlap geometrically with harmful ones in superposition, and filtering samples near toxic features mitigates it.

  33. Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

    cs.CL 2026-04 unverdicted novelty 6.0

    PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

  34. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    cs.LG 2024-03 unverdicted novelty 6.0

    Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 31 Pith papers · 17 internal anchors

  1. [1]

    K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation

    Michal Aharon, Michael Elad, and Alfred Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11): 4311--4322, 2006

  2. [2]

    How to explain individual classification decisions

    David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. The Journal of Machine Learning Research, 11: 1803--1831, 2010

  3. [3]

    Language models can explain neurons in language models

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models. OpenAI Blog, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

  4. [4]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

  5. [5]

    Open source sparse autoencoders for all residual stream layers of gpt2-small

    Joseph Bloom. Open source sparse autoencoders for all residual stream layers of gpt2-small. AI Alignment Forum, 2024. URL https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/open-source-sparse-autoencoders-for-all-residual-stream

  6. [6]

    An interpretability illusion for BERT

    Tolga Bolukbasi, Adam Pearce, Ann Yuan, Andy Coenen, Emily Reif, Fernanda Viégas, and Martin Wattenberg. An interpretability illusion for BERT. arXiv preprint arXiv:2104.07143, 2021

  7. [7]

    Identifying functionally important features with end-to-end sparse dictionary learning

    Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. Identifying functionally important features with end-to-end sparse dictionary learning. 2024

  8. [8]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

  9. [9]

    Toxic comment classification challenge

    Cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. Toxic comment classification challenge. Kaggle, 2017. URL https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge

  10. [10]

    Unified scaling laws for routed language models

    Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, et al. Unified scaling laws for routed language models. In International conference on machine learning, pages 4057--4086. PMLR, 2022

  11. [11]

    Update on how we train SAEs

    Tom Conerly, Adly Templeton, Trenton Bricken, Jonathan Marcus, and Tom Henighan. Update on how we train SAEs. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/april-update/index.html#training-saes

  12. [12]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  13. [13]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022

  14. [14]

    JumpReLU: A retrofit defense strategy for adversarial attacks

    N Benjamin Erichson, Zhewei Yao, and Michael W Mahoney. JumpReLU: A retrofit defense strategy for adversarial attacks. arXiv preprint arXiv:1904.03750, 2019

  15. [15]

    Neuron to Graph: Interpreting Language Model Neurons at Scale

    Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, and Fazl Barez. Neuron to graph: Interpreting language model neurons at scale. arXiv preprint arXiv:2305.19911, 2023

  16. [16]

    Decoding the thought vector, 2016

    Gabriel Goh. Decoding the thought vector, 2016. URL https://gabgoh.github.io/ThoughtVectors/. Accessed: 2024-05-24

  17. [17]

    Ag's corpus of news articles

    Antonio Gulli. Ag's corpus of news articles. http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html. Accessed: 2024-05-21

  18. [18]

    Finding neurons in a haystack: Case studies with sparse probing

    Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023

  19. [19]

    Aligning AI with shared human values

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. arXiv preprint arXiv:2008.02275, 2020

  20. [20]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

  21. [21]

    Reducing the dimensionality of data with neural networks

    Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786): 504--507, 2006

  22. [22]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  23. [23]

    Gpipe: Efficient training of giant neural networks using pipeline parallelism

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

  24. [24]

    Ghost grads: An improvement on resampling

    Adam Jermyn and Adly Templeton. Ghost grads: An improvement on resampling. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/jan-update/index.html#dict-learning-resampling

  25. [25]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  26. [26]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  27. [27]

    Europarl: A parallel corpus for statistical machine translation

    Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79--86, Phuket, Thailand, September 13-15 2005. URL https://aclanthology.org/2005.mtsummit-papers.11

  28. [28]

    Zero-bias autoencoders and the benefits of co-adapting features

    Kishore Konda, Roland Memisevic, and David Krueger. Zero-bias autoencoders and the benefits of co-adapting features. arXiv preprint arXiv:1402.3337, 2014

  29. [29]

    Building high-level features using large scale unsupervised learning

    Quoc V Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S Corrado, Jeff Dean, and Andrew Y Ng. Building high-level features using large scale unsupervised learning. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 8595--8598. IEEE, 2013

  30. [30]

    Sparse deep belief net model for visual area v2

    Honglak Lee, Chaitanya Ekanadham, and Andrew Ng. Sparse deep belief net model for visual area v2. Advances in neural information processing systems, 20, 2007

  31. [31]

    Taking features out of superposition with sparse autoencoders

    Lee Sharkey, Dan Braun, and Beren Millidge. Taking features out of superposition with sparse autoencoders. AI Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition

  32. [32]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

  33. [33]

    Scaling laws for dictionary learning

    Jack Lindsey, Tom Conerly, Adly Templeton, Jonathan Marcus, and Tom Henighan. Scaling laws for dictionary learning. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/april-update/index.html#scaling-laws

  34. [34]

    Sparse modeling for image and vision processing

    Julien Mairal, Francis Bach, Jean Ponce, et al. Sparse modeling for image and vision processing. Foundations and Trends in Computer Graphics and Vision, 8(2-3): 85--283, 2014

  35. [35]

    Towards principled evaluations of sparse autoencoders for interpretability and control

    Aleksandar Makelov, George Lange, and Neel Nanda. Towards principled evaluations of sparse autoencoders for interpretability and control, 2024

  36. [36]

    k-Sparse Autoencoders

    Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013

  37. [37]

    Coherence analysis of iterative thresholding algorithms

    Arian Maleki. Coherence analysis of iterative thresholding algorithms. In 2009 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 236--243. IEEE, 2009

  38. [38]

    Matching pursuits with time-frequency dictionaries

    Stéphane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12): 3397--3415, 1993

  39. [39]

    Some open-source dictionaries and dictionary learning infrastructure

    Sam Marks. Some open-source dictionaries and dictionary learning infrastructure. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/AaoWLcmpY3LKvtdyq/some-open-source-dictionaries-and-dictionary-learning

  40. [40]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024

  41. [41]

    Hidden factors and hidden topics: understanding rating dimensions with review text

    Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pages 165--172, 2013

  42. [42]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018

  43. [43]

    Relaxed lasso

    Nicolai Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1): 374--393, 2007

  44. [44]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018

  45. [45]

    Transformer debugger

    Dan Mossing, Steven Bills, Henk Tillman, Tom Dupré la Tour, Nick Cammarata, Leo Gao, Joshua Achiam, Catherine Yeh, Jan Leike, Jeff Wu, and William Saunders. Transformer debugger. https://github.com/openai/transformer-debugger, 2024

  46. [46]

    Progress update \#1 from the gdm mech interp team: Full update

    Neel Nanda, Arthur Conmy, Lewis Smith, Senthooran Rajamanoharan, Tom Lieberum, János Kramár, and Vikrant Varma. Progress update \#1 from the gdm mech interp team: Full update. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/progress-update-1-from-the-gdm-mech-interp-team-full-update

  47. [47]

    Open problem: Attribution dictionary learning

    Chris Olah, Adly Templeton, Trenton Bricken, and Adam Jermyn. Open problem: Attribution dictionary learning. Transformer Circuits Thread, 2024. https://transformer-circuits.pub/2024/april-update/index.html#attr-dl

  48. [48]

    Emergence of simple-cell receptive field properties by learning a sparse code for natural images

    Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583): 607--609, 1996

  49. [49]

    Gpt-2 output dataset

    OpenAI. Gpt-2 output dataset. https://github.com/openai/gpt-2-output-dataset/tree/master, 2019. Accessed: 2024-05-21

  50. [50]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  51. [51]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9, 2019

  52. [52]

    Improving dictionary learning with gated sparse autoencoders

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024

  53. [53]

    Getting closer to ai complete question answering: A set of prerequisite real tasks

    Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. Getting closer to ai complete question answering: A set of prerequisite real tasks. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 8722--8731, 2020

  54. [54]

    Efficient estimations from a slowly convergent robbins-monro process

    David Ruppert. Efficient estimations from a slowly convergent robbins-monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988

  55. [55]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019

  56. [56]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017

  57. [57]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

  58. [58]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013

  59. [59]

    The jpeg 2000 still image compression standard

    Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The jpeg 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5): 36--58, 2001

  60. [60]

    Massive activations in large language models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

  61. [61]

    Quartz: An open-domain dataset of qualitative relationship questions

    Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. Quartz: An open-domain dataset of qualitative relationship questions. arXiv preprint arXiv:1909.03553, 2019

  62. [62]

    ProLU: A nonlinearity for sparse autoencoders

    Glen Taggart. ProLU: A nonlinearity for sparse autoencoders. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/HEpufTdakGTTKgoYF/prolu-a-pareto-improvement-for-sparse-autoencoders

  63. [63]

    Commonsenseqa 2.0: Exposing the limits of ai through gamification

    Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification. arXiv preprint arXiv:2201.05320, 2022

  64. [64]

    Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024

  65. [65]

    Regression shrinkage and selection via the lasso

    Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1): 267--288, 1996

  66. [66]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  67. [67]

    Crowdsourcing Multiple Choice Science Questions

    Johannes Welbl, Nelson F. Liu, Matt Gardner, Gabor Angeli, Rik Koncel-Kedziorski, Emily Bender, Kyle Richardson, Peter Clark, and Nate Kushman. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017. URL https://arxiv.org/abs/1707.06209

  68. [68]

    Addressing feature suppression in SAEs

    Benjamin Wright and Lee Sharkey. Addressing feature suppression in SAEs. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes

  69. [69]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  70. [70]

    Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

    Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021

  71. [71]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023

  72. [72]

    “Going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding

    Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019. URL https:/...