Recognition: no theorem link
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Pith reviewed 2026-05-15 05:43 UTC · model grok-4.3
The pith
Gemma Scope releases JumpReLU sparse autoencoders for every layer and sub-layer of the Gemma 2 2B and 9B models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope.
What carries the argument
JumpReLU sparse autoencoders that learn a sparse decomposition of each layer's activations into interpretable features.
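A rough sketch of that mechanism in NumPy (parameter names and shapes here are illustrative assumptions, not the released implementation): the encoder produces one pre-activation per dictionary feature, the JumpReLU keeps a feature's full value only if it clears that feature's learned threshold, and the decoder rebuilds the activation from the few surviving features. Training minimizes reconstruction error plus a weighted L0 sparsity penalty, using straight-through estimators for the thresholds.

import numpy as np

def jumprelu_sae(x, W_enc, b_enc, threshold, W_dec, b_dec):
    # x: (d_model,) activation; W_enc: (d_model, d_sae); threshold: (d_sae,) learned per-feature cutoffs.
    pre_acts = x @ W_enc + b_enc
    # JumpReLU: a feature fires with its full pre-activation only above its own threshold, else it is exactly zero.
    feats = np.where(pre_acts > threshold, pre_acts, 0.0)
    # Reconstruct the activation as a sparse combination of decoder directions.
    x_hat = feats @ W_dec + b_dec
    return feats, x_hat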
If this is right
- Any researcher can now apply these SAEs to inspect or intervene on activations inside Gemma 2 without first training their own (a loading sketch follows this list).
- Direct comparisons of feature sets between the base and instruction-tuned 9B model become possible on identical architectures.
- Studies that require SAEs at every layer can now be run at 2B and 9B scale without prohibitive compute.
- Community benchmarks for SAE quality can be computed on the released metrics rather than starting from scratch.
- Safety analyses that rely on sparse feature dictionaries can use the same weights across multiple experiments.
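A minimal sketch of that workflow, assuming the params.npz layout described in the Gemma Scope tutorial; the repository name, file path, and array names below are illustrative and should be checked against the tutorial linked above.

import numpy as np
from huggingface_hub import hf_hub_download

# Illustrative choice of model, layer, dictionary width, and sparsity level.
path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path)  # assumed keys: W_enc, b_enc, threshold, W_dec, b_dec

def encode(x):
    pre_acts = x @ params["W_enc"] + params["b_enc"]
    return np.where(pre_acts > params["threshold"], pre_acts, 0.0)

def decode(feats):
    return feats @ params["W_dec"] + params["b_dec"]

# Here x would be a Gemma 2 residual-stream activation captured with a forward hook;
# encode(x) is its sparse feature vector, and decode(encode(x)) is a reconstruction
# that can be patched back into the model for intervention experiments.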
Where Pith is reading between the lines
- The release could standardize SAE-based analysis for the Gemma family and make cross-model comparisons easier once similar suites appear for other open models.
- Downstream work may discover whether the same feature sets transfer across training checkpoints or fine-tuning stages.
- Interactive demos already hosted on Neuronpedia suggest that non-experts can explore the learned features without writing code.
- If the SAEs remain stable under further model scaling, the same training recipe could be applied to future Gemma releases with modest additional effort.
Load-bearing premise
Standard reconstruction and sparsity metrics are enough to show that the released SAEs will prove useful for downstream interpretability and safety work.
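For concreteness, the two most commonly reported such metrics can be computed directly from a batch of activations and their SAE reconstructions; this sketch is illustrative, not the paper's evaluation code.

import numpy as np

def sae_metrics(x, feats, x_hat):
    # x: (n, d_model) activations; feats: (n, d_sae) feature activations; x_hat: (n, d_model) reconstructions.
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=-1))  # mean squared reconstruction error per activation
    l0 = np.mean(np.sum(feats != 0, axis=-1))         # mean number of active features per activation
    return {"reconstruction_mse": float(mse), "l0_sparsity": float(l0)}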
What would settle it
A controlled study in which features recovered by these SAEs show no reliable correlation with human-labeled concepts or produce no measurable gain on safety-related probing tasks would falsify their claimed utility.
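One minimal form of such a probing check, sketched here with hypothetical inputs (SAE feature activations, raw activations, and binary human concept labels), compares a linear probe on the features against one on the raw activations.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_auc(X, y):
    # Cross-validated AUROC of a linear probe from representations X to binary labels y.
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# feats: (n, d_sae) SAE feature activations; raw: (n, d_model) raw activations; y: (n,) human concept labels.
# If probe_auc(feats, y) sits at chance, or shows no gain over probe_auc(raw, y),
# the features' claimed utility is undermined for that concept.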
read the original abstract
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Gemma Scope, an open suite of JumpReLU sparse autoencoders trained on all layers and sub-layers of the Gemma 2 2B and 9B base models plus selected layers of the 27B model (with an additional set trained on the instruction-tuned 9B variant). It reports standard reconstruction and sparsity metrics for the released SAEs, provides the model weights, and includes a tutorial and interactive demo to support downstream safety and interpretability research.
Significance. If the reported metrics hold, the release constitutes a substantial practical contribution by removing the high computational cost of training comprehensive SAE suites on recent Gemma 2 models. The public availability of weights across nearly all layers enables reproducible, large-scale mechanistic interpretability experiments that were previously inaccessible outside industrial labs.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a concise table or bullet list enumerating the exact number of SAEs released per model size and layer type, together with the primary metrics (e.g., reconstruction MSE, L0 sparsity) used for each.
- [Evaluation] Section 4 (or equivalent) on evaluation should explicitly note whether any hyperparameter search was performed for the JumpReLU threshold or sparsity penalty, or whether values were taken directly from prior work without retuning.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of Gemma Scope and for recommending acceptance. We appreciate the recognition that releasing comprehensive, publicly available SAE weights across Gemma 2 layers removes a significant barrier to large-scale interpretability research.
Circularity Check
No significant circularity in derivation chain
full rationale
The manuscript is a resource-release paper whose central contribution is the public availability of pre-trained JumpReLU SAEs on Gemma 2 models together with their standard reconstruction and sparsity metrics. No derivation, prediction, or theoretical claim is advanced that reduces by construction to fitted parameters, self-citations, or author-defined quantities; the work consists of empirical training followed by conventional benchmarking and release. All evaluation steps are independent of any prior author-specific results and rely on externally standard metrics.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 24 Pith papers
- Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
  Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
- SLAM: Structural Linguistic Activation Marking for Language Models
  SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
- SLAM: Structural Linguistic Activation Marking for Language Models
  SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
  A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in ...
- Linear-Readout Floors and Threshold Recovery in Computation in Superposition
  Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...
- Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
  Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
- Structural Instability of Feature Composition
  Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.
- MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
  Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
  Sparse autoencoders on EEG transformers identify three regimes of clinical concept encoding and reveal entanglements such as age-pathology confounding via a new steering selectivity metric.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
- DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
  DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
- The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
  Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
- Tool Calling is Linearly Readable and Steerable in Language Models
  Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
  Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...
- Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
  Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
- Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
  Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...
- Understanding the Mechanism of Altruism in Large Language Models
  A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
- SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
  SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
- Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates
  Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.
- Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs
  Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.