hub Canonical reference

Scaling and evaluating sparse autoencoders

· 2024 · cs.LG · arXiv 2406.04093

Canonical reference. 73% of citing Pith papers cite this work as background.

74 Pith papers citing it

Background 73% of classified citations

open full Pith review browse 74 citing papers arXiv PDF

abstract

Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 method 3

citation-polarity summary

background 8 use method 3

representative citing papers

Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability

cs.LG · 2026-07-02 · conditional · novelty 8.0

Expander SAEs apply left-d-regular expander masks to TopK SAEs, learning only dn decoder parameters instead of mn and tracing a storage-fidelity frontier that reaches 293x compression with 84% retained performance on Qwen2.5-3B.

Mechanistic Interpretability and Causal Feature Steering of Neural Quantum States via Sparse Autoencoders

quant-ph · 2026-07-01 · unverdicted · novelty 8.0

Sparse autoencoders applied to Neural Quantum States extract unsupervised features correlating with and causally steering physical observables such as order parameters while preserving variational energy.

Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

cs.LG · 2026-05-13 · accept · novelty 8.0

Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.

Can neurons speak? Semantic narration of vision at single-cell resolution

q-bio.NC · 2026-06-17 · unverdicted · novelty 7.0

NEURRATOR bridges neural spike trains to frozen CLIP patch embeddings via a learned encoder, then uses a multimodal LM and sparse autoencoder to produce validated natural-language narrations of viewed scenes from Neuropixel recordings.

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

cs.LG · 2026-06-04 · conditional · novelty 7.0

SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.

How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Auto-interpretation labels for SAE features generalize poorly across languages and scripts, missing the same semantic content up to 4x more often in Serbian than English and more in Cyrillic than Latin despite deterministic transliteration.

Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations

cs.LG · 2026-05-27 · conditional · novelty 7.0

SA-GSAE with Bi-Jump-ReLU enables one latent to encode both polarities of anticorrelated features, Pareto-dominating or matching full-width gated SAEs while reducing dead latents by up to 500x on some LLM hookpoints.

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

cs.LG · 2026-05-24 · unverdicted · novelty 7.0

Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CEDAR learns an invertible rotation of vision-language embeddings to concentrate semantics into sparse, axis-aligned coordinates for improved interpretability.

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

cs.CV · 2026-05-19 · conditional · novelty 7.0

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

What Cohort INRs Encode and Where to Freeze Them

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

Linear-Readout Floors and Threshold Recovery in Computation in Superposition

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contradiction via distinct readout invariants.

Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

Sparse Autoencoder Decomposition of Clinical Sequence Model Representations: Feature Complexity, Task Specialisation, and Mortality Prediction

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Sparse autoencoders applied to a 14.5M-parameter clinical EHR model reveal progressive abstraction across layers, with SAE features outperforming dense ones for mortality in full-sequence probes but not in leakage-safe windows where dense representations match or exceed them.

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.

MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

cs.LG · 2026-04-03 · conditional · novelty 7.0

Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

Extraction and Analysis of Multimodal Concepts in Vision Language Models through Sparse Autoencoders

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

A new SAE-based framework extracts visual, textual, and multimodal concepts from VLMs and reports up to 45% better visual concept quality on a VQA dataset while identifying multimodal concepts.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

Scaling and evaluating sparse autoencoders

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer