pith. machine review for the scientific record.

arxiv: 2408.05147 · v2 · submitted 2024-08-09 · 💻 cs.LG · cs.AI · cs.CL

Recognition: no theorem link

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Pith reviewed 2026-05-15 05:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sparse autoencoders · Gemma 2 · model interpretability · JumpReLU · SAE · neural network features · open weights · safety research

The pith

Gemma Scope releases JumpReLU sparse autoencoders for every layer and sub-layer of the Gemma 2 2B and 9B models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Gemma Scope as a complete collection of sparse autoencoders trained on all layers and sub-layers of the Gemma 2 2B and 9B base models, plus selected layers of the 27B model. Sparse autoencoders decompose a neural network's internal activations into a sparse set of features that often correspond to human-understandable concepts. Training these models from scratch is computationally expensive, which has restricted their use to well-resourced labs. By releasing the trained weights together with standard quality metrics, the work removes that barrier and lets outside researchers run interpretability and safety experiments directly on Gemma 2. Versions trained on both the base and instruction-tuned 9B model are included for comparison.

Core claim

We introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope.

What carries the argument

JumpReLU sparse autoencoders that learn a sparse decomposition of each layer's activations into interpretable features.
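
To make the mechanism concrete, here is a minimal sketch of a JumpReLU SAE forward pass. It is an illustration under stated assumptions, not the released implementation: the parameter names (`w_enc`, `b_enc`, `w_dec`, `b_dec`, `threshold`), the toy sizes, and the random weights are all hypothetical, and the JumpReLU form used (keep a pre-activation only where it exceeds a per-feature threshold) follows the activation as usually described in the SAE literature.

```python
import numpy as np


def jumprelu(pre_acts, threshold):
    """JumpReLU: keep a pre-activation only where it exceeds its threshold."""
    return pre_acts * (pre_acts > threshold)


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, threshold):
    """Encode activations x into sparse features, then reconstruct x from them."""
    features = jumprelu(x @ w_enc + b_enc, threshold)
    reconstruction = features @ w_dec + b_dec
    return features, reconstruction


# Toy, randomly initialised parameters (hypothetical sizes, NOT the released weights).
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 8, 32, 4
w_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
w_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)
threshold = np.full(d_sae, 1.0)  # one threshold per feature

x = rng.normal(size=(n_tokens, d_model))  # stand-in for one layer's activations
features, reconstruction = sae_forward(x, w_enc, b_enc, w_dec, b_dec, threshold)
print(features.shape, reconstruction.shape)  # prints (4, 32) (4, 8)
```

Every entry of `features` is either exactly zero or larger than its threshold, which is where the sparsity of the decomposition comes from.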

If this is right

  • Any researcher can now apply these SAEs to inspect or intervene on activations inside Gemma 2 without first training their own.
  • Direct comparisons of feature sets between the base and instruction-tuned 9B model become possible on identical architectures.
  • Studies that require SAEs at every layer can now be run at 2B and 9B scale without prohibitive compute.
  • Community benchmarks for SAE quality can be computed on the released metrics rather than starting from scratch.
  • Safety analyses that rely on sparse feature dictionaries can use the same weights across multiple experiments.
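
The first bullet above, inspecting or intervening on activations, can be sketched as follows. This is a hedged illustration of two patterns common in SAE work (ablating a feature and steering along its decoder direction), not a procedure taken from the paper. The function names and toy arrays are hypothetical, and writing the edited activations back into Gemma 2 would need a hooking mechanism that is not shown.

```python
import numpy as np


def ablate_feature(x, features, w_dec, feature_idx):
    """Subtract one feature's decoder contribution from the activations x."""
    contribution = np.outer(features[:, feature_idx], w_dec[feature_idx])
    return x - contribution


def steer_with_feature(x, w_dec, feature_idx, scale):
    """Add a scaled copy of one feature's decoder direction to every token."""
    return x + scale * w_dec[feature_idx]


# Toy data (hypothetical): 4 tokens, model width 8, dictionary of 32 features.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_dec = rng.normal(size=(32, 8))
features = np.zeros((4, 32))
features[1, 5] = 2.0  # feature 5 fires only on token 1

x_ablated = ablate_feature(x, features, w_dec, feature_idx=5)
x_steered = steer_with_feature(x, w_dec, feature_idx=5, scale=3.0)
```

Ablation changes only the tokens on which the feature is active, whereas steering shifts every token by the same vector.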

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.


  • The release could standardize SAE-based analysis for the Gemma family and make cross-model comparisons easier once similar suites appear for other open models.
  • Downstream work may discover whether the same feature sets transfer across training checkpoints or fine-tuning stages.
  • Interactive demos already hosted on Neuronpedia suggest that non-experts can explore the learned features without writing code.
  • If the SAEs remain stable under further model scaling, the same training recipe could be applied to future Gemma releases with modest additional effort.

Load-bearing premise

Standard reconstruction and sparsity metrics are enough to show that the released SAEs will prove useful for downstream interpretability and safety work.
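
For readers unfamiliar with what such metrics measure, the sketch below computes three of them on toy arrays. The page does not list the exact metrics the paper reports, so the selection here is an assumption: reconstruction MSE and L0 are the two named in the referee comments, and fraction of variance unexplained is added as a commonly used normalised variant. All function names are hypothetical.

```python
import numpy as np


def reconstruction_mse(x, x_hat):
    """Mean over tokens of the squared reconstruction error."""
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=-1)))


def fraction_variance_unexplained(x, x_hat):
    """Squared error divided by the variance of x around its per-dimension mean."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return float(residual / total)


def mean_l0(features):
    """Average number of non-zero features per token."""
    return float(np.mean(np.count_nonzero(features, axis=-1)))


# Toy example: two tokens, a deliberately poor reconstruction, four features.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_hat = x + 1.0
features = np.array([[0.0, 1.5, 0.0, 2.0], [0.0, 0.0, 0.0, 3.0]])

mse = reconstruction_mse(x, x_hat)             # 2.0
fvu = fraction_variance_unexplained(x, x_hat)  # 1.0
l0 = mean_l0(features)                         # 1.5
```

None of these numbers says whether a feature is interpretable or useful downstream, which is exactly the gap the premise above points at.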

What would settle it

A controlled study in which features recovered by these SAEs show no reliable correlation with human-labeled concepts or produce no measurable gain on safety-related probing tasks would falsify their claimed utility.
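
One simple way such a study could score a single feature is sketched below: the Pearson correlation between the feature's activations and binary human labels for a concept. This is an illustrative assumption about the analysis, not a protocol from the paper; the function name and toy data are hypothetical, and a real study would also need held-out data and multiple-comparison control.

```python
import numpy as np


def label_correlation(activations, labels):
    """Pearson correlation between one feature's activations and binary labels."""
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if activations.std() == 0 or labels.std() == 0:
        return 0.0  # correlation is undefined for constant input; report none
    return float(np.corrcoef(activations, labels)[0, 1])


# Toy example: the feature fires on exactly the two examples labelled 1.
activations = [0.0, 0.0, 2.0, 3.0]
labels = [0, 0, 1, 1]
r = label_correlation(activations, labels)

# A feature that never fires carries no information about the concept.
r_dead = label_correlation([0.0, 0.0, 0.0, 0.0], labels)
```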

Original abstract

Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces Gemma Scope, an open suite of JumpReLU sparse autoencoders trained on all layers and sub-layers of the Gemma 2 2B and 9B base models plus selected layers of the 27B model (with an additional set trained on the instruction-tuned 9B variant). It reports standard reconstruction and sparsity metrics for the released SAEs, provides the model weights, and includes a tutorial and interactive demo to support downstream safety and interpretability research.

Significance. If the reported metrics hold, the release constitutes a substantial practical contribution by removing the high computational cost of training comprehensive SAE suites on recent Gemma 2 models. The public availability of weights for every layer and sub-layer of the 2B and 9B models, plus selected layers of the 27B model, enables reproducible, large-scale mechanistic interpretability experiments that were previously inaccessible outside industrial labs.

minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a concise table or bullet list enumerating the exact number of SAEs released per model size and layer type, together with the primary metrics (e.g., reconstruction MSE, L0 sparsity) used for each.
  2. [Evaluation] Section 4 (or equivalent) on evaluation should explicitly note whether any hyperparameter search was performed for the JumpReLU threshold or sparsity penalty, or whether values were taken directly from prior work without retuning.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of Gemma Scope and for recommending acceptance. We appreciate the recognition that releasing comprehensive, publicly available SAE weights across Gemma 2 layers removes a significant barrier to large-scale interpretability research.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The manuscript is a resource-release paper whose central contribution is the public availability of pre-trained JumpReLU SAEs on Gemma 2 models together with their standard reconstruction and sparsity metrics. No derivation, prediction, or theoretical claim is advanced that reduces by construction to fitted parameters, self-citations, or author-defined quantities; the work consists of empirical training followed by conventional benchmarking and release. All evaluation steps are independent of any prior author-specific results and rely on externally standard metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on the standard assumptions of sparse autoencoder training (sparsity penalty, JumpReLU activation function, reconstruction loss) but introduces no new free parameters, axioms, or invented entities beyond those already established in the SAE literature.
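
For reference, the three ingredients named above can be written as a single objective. The notation below is an assumption chosen for this page (row-vector activations $x$, per-feature threshold $\theta$, sparsity coefficient $\lambda$, Heaviside step $H$); it follows the JumpReLU SAE formulation as generally described in the literature rather than quoting the paper.

```latex
\begin{aligned}
f(x) &= \mathrm{JumpReLU}_{\theta}\!\left(x W_{\mathrm{enc}} + b_{\mathrm{enc}}\right),
  \qquad \mathrm{JumpReLU}_{\theta}(z) = z \odot H(z - \theta), \\
\hat{x}(f) &= f W_{\mathrm{dec}} + b_{\mathrm{dec}}, \\
\mathcal{L}(x) &= \lVert x - \hat{x}(f(x)) \rVert_2^2 + \lambda \, \lVert f(x) \rVert_0 .
\end{aligned}
```

The first term is the reconstruction loss and the second is the sparsity penalty. Because the $L_0$ term and the step function are not differentiable in $\theta$, training in this style is usually described as relying on straight-through gradient estimators.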

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

    cs.LG 2026-05 accept novelty 8.0

    Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  3. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  4. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  5. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  6. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  7. Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

    cs.AI 2026-05 unverdicted novelty 7.0

    A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in ...

  8. Linear-Readout Floors and Threshold Recovery in Computation in Superposition

    cs.LG 2026-05 unverdicted novelty 7.0

    Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contr...

  9. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  10. Structural Instability of Feature Composition

    cs.LG 2026-04 unverdicted novelty 7.0

    Feature composition in SAEs collapses asymptotically when the Gaussian mean width of the signal cone is exceeded, with ReLU inducing a ratchet-like accumulation of interference from correlations.

  11. MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents

    cs.LG 2026-04 conditional novelty 7.0

    Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.

  12. Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse autoencoders on EEG transformers identify three regimes of clinical concept encoding and reveal entanglements such as age-pathology confounding via a new steering selectivity metric.

  13. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 unverdicted novelty 6.0

    DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.

  14. DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

    cs.LG 2026-05 conditional novelty 6.0

    DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.

  15. The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection

    cs.AI 2026-05 conditional novelty 6.0

    Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.

  16. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  17. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

  18. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...

  19. Feature Starvation as Geometric Instability in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...

  20. Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

    cs.LG 2026-05 unverdicted novelty 6.0

    Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...

  21. Understanding the Mechanism of Altruism in Large Language Models

    econ.GN 2026-04 unverdicted novelty 6.0

    A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.

  22. SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection

    cs.CR 2026-04 unverdicted novelty 6.0

    SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.

  23. Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates

    cs.CE 2026-03 unverdicted novelty 6.0

    Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.

  24. Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

    cs.CL 2026-03 unverdicted novelty 6.0

    Hallucination neurons in LLMs are domain-specific, with cross-domain classifiers dropping from AUROC 0.783 within-domain to 0.563 across domains.
