Recognition: 2 theorem links · Lean theorem
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
Pith reviewed 2026-05-15 19:06 UTC · model grok-4.3
The pith
JumpReLU sparse autoencoders deliver higher reconstruction fidelity than Gated or TopK SAEs at matched sparsity on Gemma 2 activations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JumpReLU SAEs achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations by replacing the ReLU with a discontinuous JumpReLU activation function and using straight-through estimators to train the model, including direct optimization of L0 sparsity, without sacrificing interpretability.
What carries the argument
The JumpReLU activation, a discontinuous threshold function trained with straight-through estimators to allow direct L0 sparsity control in the SAE forward pass.
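The mechanism is compact enough to sketch. A minimal NumPy rendering of the JumpReLU forward pass, using a single scalar threshold for brevity (the paper learns a separate threshold per dictionary feature):

```python
import numpy as np

def jumprelu(z, theta):
    """JumpReLU forward pass: pass each pre-activation through unchanged
    when it exceeds the threshold theta, and zero it otherwise. Unlike
    ReLU, values in (0, theta] are suppressed, while values above theta
    keep their full magnitude, so surviving features are not shrunk."""
    return z * (z > theta)

def l0_norm(acts):
    """Number of active (nonzero) features per example: the quantity
    JumpReLU SAEs optimize directly instead of an L1 proxy."""
    return np.count_nonzero(acts, axis=-1)

z = np.array([[-0.5, 0.2, 0.8, 1.5]])
acts = jumprelu(z, theta=0.5)  # only 0.8 and 1.5 survive, at full magnitude
```

The scalar `theta` here is an illustrative simplification; in the paper the threshold is a learned vector.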
If this is right
- Higher fidelity lets SAEs recover more accurate linear features from model activations for the same sparsity budget.
- Direct L0 training removes the shrinkage bias that L1 penalties introduce in feature magnitudes.
- Interpretability remains intact, so the extracted features stay usable for mechanistic interpretability work.
- The change is a drop-in replacement that keeps training and inference cost comparable to vanilla ReLU SAEs.
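The load-bearing trick is how the threshold receives a gradient at all. A sketch of the straight-through idea on the L0 term, assuming a rectangle-window kernel and bandwidth `eps` (the kernel choice and bandwidth here are illustrative hyperparameters, not the paper's exact settings):

```python
import numpy as np

def rectangle(u):
    # Rectangle kernel: 1 on (-0.5, 0.5), 0 elsewhere.
    return ((u > -0.5) & (u < 0.5)).astype(float)

def theta_pseudograd_l0(z, theta, eps=0.1):
    """Straight-through pseudo-gradient of the per-example L0 term
    sum_i H(z_i - theta) with respect to theta. The exact derivative is
    zero almost everywhere; the surrogate is nonzero only for
    pre-activations within half a bandwidth (eps/2) of the threshold,
    so theta is nudged by features on the verge of (de)activating."""
    return np.sum(-(1.0 / eps) * rectangle((z - theta) / eps), axis=-1)

z = np.array([[0.48, 0.2, 0.9]])
grad = theta_pseudograd_l0(z, theta=0.5, eps=0.1)  # only 0.48 is near theta
```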
Where Pith is reading between the lines
- The same JumpReLU-plus-STE pattern could be tested in other sparse coding settings outside language-model activations.
- Direct L0 optimization might reduce the need for auxiliary losses or post-training thresholding steps in future SAE variants.
- If the fidelity gain scales with model size, it would lower the barrier to applying SAEs to frontier models.
Load-bearing premise
Straight-through estimators applied to the discontinuous JumpReLU produce gradients that reliably optimize the intended sparse reconstruction objective.
What would settle it
A side-by-side measurement showing higher reconstruction MSE for JumpReLU SAEs than for TopK SAEs at identical L0 sparsity on held-out Gemma 2 activations would falsify the fidelity claim.
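The comparison itself is mechanically simple. A sketch, where `x`, `x_hat`, and `acts` stand in for a held-out activation batch, its SAE reconstruction, and the SAE's feature activations:

```python
import numpy as np

def fidelity_at_sparsity(x, x_hat, acts):
    """The two axes of the falsification test: mean squared
    reconstruction error and mean L0 over a held-out batch. The fidelity
    claim fails if JumpReLU shows higher MSE than TopK at equal mean L0."""
    mse = float(np.mean(np.sum((x - x_hat) ** 2, axis=-1)))
    mean_l0 = float(np.mean(np.count_nonzero(acts, axis=-1)))
    return mse, mean_l0

x = np.array([[1.0, 0.0], [0.0, 1.0]])       # toy held-out batch
x_hat = np.array([[0.9, 0.0], [0.0, 0.8]])   # toy reconstruction
acts = np.array([[0.9, 0.0], [0.0, 0.8]])    # toy feature activations
mse, mean_l0 = fidelity_at_sparsity(x, x_hat, acts)
```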
Original abstract
Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces JumpReLU SAEs as a simple modification to standard ReLU-based sparse autoencoders, replacing the ReLU with a discontinuous JumpReLU activation function. Trained via straight-through estimators (STEs) applied to both the activation and an L0 sparsity term (instead of L1 proxies), the method is claimed to achieve state-of-the-art reconstruction fidelity at fixed sparsity levels on Gemma 2 9B activations, outperforming recent variants such as Gated and TopK SAEs, while preserving interpretability as assessed by manual and automated studies. The approach is presented as efficient to train and run.
Significance. If the empirical fidelity gains are robust, this offers a lightweight architectural change that directly targets L0 sparsity and could improve feature decomposition quality in mechanistic interpretability work on language models. The paper's use of STEs to avoid shrinkage and its head-to-head comparisons against published baselines are strengths, though the absence of quantitative deltas, error bars, or ablation details in the abstract indicates that verification hinges on the experimental sections.
major comments (3)
- [§3] §3 (Methods): The claim that STEs are used 'in a principled manner' for the discontinuous JumpReLU lacks an explicit derivation or subgradient analysis showing equivalence to the target L0-penalized reconstruction objective; the forward pass discontinuity means the supplied gradient is a surrogate, and without a proof or targeted ablation isolating the estimator from the activation change, the reported fidelity improvements versus Gated/TopK baselines risk being an artifact of optimization rather than the architecture.
- [§4] §4 (Experiments): The SOTA fidelity claim on Gemma 2 9B activations is not accompanied by error bars, dataset split details, or ablations that vary only the activation while holding the STE and L0 training fixed; this makes it impossible to determine whether the gains are load-bearing for the central claim or sensitive to hyperparameter choices.
- [§5] §5 (Interpretability): The assertion that improved fidelity 'does not come at the cost of interpretability' rests on manual and automated studies, but no quantitative correlation between the two evaluation methods is reported, nor is there a control showing that the JumpReLU features remain causally relevant under interventions at the same sparsity level as the baselines.
minor comments (2)
- [Abstract] Abstract: The phrase 'principled manner' for STE usage is imprecise; the exact estimator formulation (e.g., which variables receive the straight-through gradient) should be stated explicitly.
- [§2] Notation: The definition of the JumpReLU threshold and jump height should be given with an equation number in the main text rather than only in the appendix, to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, offering clarifications on the use of straight-through estimators, experimental reporting, and interpretability evaluations. Where the manuscript can be improved with additional details or controls, we commit to revisions; we believe these changes will strengthen the presentation without altering the core claims.
Point-by-point responses
- Referee: §3 (Methods): The claim that STEs are used 'in a principled manner' for the discontinuous JumpReLU lacks an explicit derivation or subgradient analysis showing equivalence to the target L0-penalized reconstruction objective; the forward pass discontinuity means the supplied gradient is a surrogate, and without a proof or targeted ablation isolating the estimator from the activation change, the reported fidelity improvements versus Gated/TopK baselines risk being an artifact of optimization rather than the architecture.
Authors: We agree that a more explicit justification would benefit the paper. Straight-through estimators are a standard surrogate for optimizing non-differentiable functions such as the L0 norm, as established in the literature on discrete optimization and binarized networks; in our case they allow direct optimization of the target sparsity objective rather than an L1 proxy. In the revised manuscript we will add a short derivation subsection in §3 explaining the STE application to both the JumpReLU threshold and the L0 term, including the unbiasedness property in expectation. We will also insert a targeted ablation that trains the same architecture with and without the STE to isolate its contribution from the activation change itself. These additions should clarify that the fidelity gains are not merely optimization artifacts. revision: yes
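One way to see why the surrogate gradient can plausibly optimize the intended objective: on the sparsity term alone, the STE gradient is negative near borderline active features, so gradient descent raises the threshold and deactivates exactly those features. A toy single-step illustration (the rectangle kernel, bandwidth `eps`, and learning rate are illustrative assumptions):

```python
import numpy as np

def rectangle(u):
    # Rectangle kernel: 1 on (-0.5, 0.5), 0 elsewhere.
    return ((u > -0.5) & (u < 0.5)).astype(float)

def l0_count(z, theta):
    # Number of features active at threshold theta.
    return int(np.count_nonzero(z > theta))

z = np.array([0.52, 0.9, 1.4])   # pre-activations; 0.52 is borderline
theta, eps, lr = 0.5, 0.1, 0.01

# STE surrogate gradient of L0 w.r.t. theta: nonzero only for the
# borderline pre-activation, and negative, so descent raises theta.
grad = np.sum(-(1.0 / eps) * rectangle((z - theta) / eps))
theta_new = theta - lr * grad    # threshold moves up, 0.5 -> 0.6
```

After the step, the borderline feature falls below the new threshold and the L0 count drops from 3 to 2, which is the direction the sparsity penalty intends.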
- Referee: §4 (Experiments): The SOTA fidelity claim on Gemma 2 9B activations is not accompanied by error bars, dataset split details, or ablations that vary only the activation while holding the STE and L0 training fixed; this makes it impossible to determine whether the gains are load-bearing for the central claim or sensitive to hyperparameter choices.
Authors: The main experimental tables already report means and standard deviations computed over three independent random seeds, and the dataset construction (including the train/validation split of Gemma 2 9B activations) is described in Appendix B. To address the request for tighter isolation, the revision will expand §4.3 with a new controlled ablation that changes only the activation function while freezing the STE implementation and L0 penalty schedule across all compared methods (ReLU, Gated, TopK, JumpReLU). Error bars will be added to all fidelity plots and tables for visual clarity. These updates will make the load-bearing nature of the architectural change more transparent. revision: yes
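The ablation being promised reduces to a run grid in which exactly one factor varies. A hypothetical configuration sketch (the key names and values are invented for illustration, not taken from the paper):

```python
from itertools import product

# Only the activation function varies; the STE implementation, L0
# coefficient schedule, and data split are held fixed across all runs.
activations = ["relu", "gated", "topk", "jumprelu"]
seeds = [0, 1, 2]  # three seeds, matching the reported means/std-devs
fixed = {"ste_kernel": "rectangle",
         "l0_coeff_schedule": "linear_warmup",
         "split": "appendix_B"}

runs = [{"activation": a, "seed": s, **fixed}
        for a, s in product(activations, seeds)]
```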
- Referee: §5 (Interpretability): The assertion that improved fidelity 'does not come at the cost of interpretability' rests on manual and automated studies, but no quantitative correlation between the two evaluation methods is reported, nor is there a control showing that the JumpReLU features remain causally relevant under interventions at the same sparsity level as the baselines.
Authors: Section 5 and Appendix C already present both manual feature annotations and automated interpretability scores (following the protocol of prior SAE interpretability work), with consistent trends favoring JumpReLU at matched sparsity. While we did not compute a Pearson correlation between the two scoring methods, the per-method ordering is aligned. Intervention experiments (feature ablation and activation patching) are reported in Appendix D and show comparable causal impact for JumpReLU features versus the baselines. In the revision we will add an explicit table reporting the correlation between manual and automated scores and will move a concise summary of the intervention results into the main text of §5. This will directly address the request for quantitative linkage and causal controls. revision: partial
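The promised correlation table is a standard Pearson correlation between the two score vectors. A minimal sketch (the manual and automated scores below are placeholder arrays, not the paper's data):

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between manual and automated
    interpretability scores for the same set of features."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

manual = [3, 4, 2, 5, 1]               # placeholder per-feature ratings
automated = [0.6, 0.8, 0.4, 0.9, 0.2]  # placeholder automated scores
r = pearson(manual, automated)
```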
Circularity Check
No significant circularity: claims rest on empirical comparisons, not self-referential derivations
Full rationale
The paper introduces JumpReLU SAEs as a simple architectural change (ReLU replaced by discontinuous JumpReLU) and trains them using straight-through estimators for both the activation and the L0 term. Central results are state-of-the-art reconstruction fidelity at fixed sparsity on Gemma 2 9B activations, plus interpretability checks via manual and automated studies. These are established by direct head-to-head experiments against prior published methods (Gated SAEs, TopK SAEs) rather than by any derivation that reduces a claimed prediction to a fitted parameter or self-citation by construction. No equations are presented that define a quantity in terms of itself, rename a known empirical pattern, or import uniqueness from the authors' prior work as a load-bearing premise. The STE usage is described as 'principled' but without a claimed proof of equivalence to the target objective; the paper simply reports that the resulting models outperform baselines. This is the normal case of an empirical methods paper whose validity is tested externally rather than internally tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Straight-through estimators produce usable gradients for discontinuous activation functions in this setting.
invented entities (1)
- JumpReLU activation function (no independent evidence)
Lean theorems connected to this paper
- Foundation.CostFirstExistence.existence_economically_inevitable (tagged: unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: "We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.
- WriteSAE: Sparse Autoencoders for Recurrent State
  WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
- fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
  fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
- SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
  SoftSAE introduces a dynamic top-k selection mechanism in sparse autoencoders that learns an input-dependent sparsity level via a differentiable soft top-k operator.
- Improving Sparse Autoencoder with Dynamic Attention
  A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
- Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
  Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
- Do Language Models Encode Knowledge of Linguistic Constraint Violations?
  Sparse autoencoder analysis of language model activations finds limited evidence for a unified set of features detecting linguistic constraint violations.
- Do Language Models Encode Knowledge of Linguistic Constraint Violations?
  Sparse autoencoder features in language models do not satisfy joint falsification criteria for unified grammatical violation detectors across linguistic phenomena.
- Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure
  Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop sharply.
- The Echo Amplifies the Knowledge: Somatic Marker Analogues in Language Models via Emotion Vector Re-Injection
  Re-injecting emotion vectors during recall steepens a model's threat-safety judgments and raises good decision rates from 52% to 80% only when combined with semantic labels, replicating Damasio's somatic marker effect.
- Tool Calling is Linearly Readable and Steerable in Language Models
  Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
  Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...
- Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
  Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...
- SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
  SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.
- From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
  Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
- Feature Starvation as Geometric Instability in Sparse Autoencoders
  Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
- GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models
  GeoSAE extracts a compact, interpretable feature set from frozen brain MRI foundation models that predicts MCI-to-AD conversion (AUC 0.746) with age-deconfounded annotations and replicates across cohorts.
- SAGE: Signal-Amplified Guided Embeddings for LLM-based Vulnerability Detection
  SAGE uses sparse autoencoders to boost vulnerability signals in LLMs, raising internal SNR 12.7x and delivering up to 318% MCC gains on vulnerability detection benchmarks.
- Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
  DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
- Improving Robustness In Sparse Autoencoders via Masked Regularization
  Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.
- Sparse Autoencoders as a Steering Basis for Phase Synchronization in Graph-Based CFD Surrogates
  Sparse autoencoders enable phase synchronization in frozen graph CFD surrogates through Hilbert-identified oscillatory features and SVD-based time-varying rotations.
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
  Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...
discussion (0)