arxiv: 2605.06610 · v2 · submitted 2026-05-07 · 💻 cs.LG · cs.CV

Recognition: no theorem link

SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders

Jakub St\k{e}pie\'n , Marcin Mazur , Jacek Tabor , Przemys{\l}aw Spurek

Authors on Pith no claims yet

Pith reviewed 2026-05-11 01:54 UTC · model grok-4.3

classification 💻 cs.LG cs.CV

keywords sparse autoencodersmechanistic interpretabilitydynamic sparsitytop-k selectiondifferentiable operatorsadaptive feature activation

0 comments

The pith

Sparse autoencoders can adapt the number of active features to each input's complexity instead of using one fixed count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SoftSAE to replace the fixed sparsity constraint used in standard TopK sparse autoencoders. It replaces the hard top-k selection with a differentiable Soft Top-K operator that produces an input-specific sparsity level k. This change matters because real-world inputs differ in how many independent factors they contain, so a single global K either injects noise into simple cases or fails to capture structure in complex ones. If the method succeeds, the resulting features will activate in counts that scale with the information present in each sample rather than enforcing uniform length explanations.

Core claim

SoftSAE uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input.

What carries the argument

The differentiable Soft Top-K operator, which approximates discrete top-k selection so that both the chosen features and the effective value of k can be learned from data via gradients.

Load-bearing premise

The number of relevant factors in natural data changes enough from sample to sample that a single fixed sparsity level K is noticeably suboptimal.

What would settle it

A controlled test on data where every input has the same intrinsic dimensionality, in which SoftSAE would show no improvement over a fixed-K SAE in reconstruction error or feature interpretability.

Figures

Figures reproduced from arXiv: 2605.06610 by Jacek Tabor, Jakub St\k{e}pie\'n, Marcin Mazur, Przemys{\l}aw Spurek.

**Figure 1.** Figure 1: Overview of the SoftSAE framework for adaptive mechanistic interpretability. Unlike view at source ↗

**Figure 2.** Figure 2: Diagrammatic representation of the SoftSAE architecture. The model utilizes a Dynamic view at source ↗

**Figure 3.** Figure 3: Visualization of inputs with varying embedding complexity: images with a solid background view at source ↗

**Figure 4.** Figure 4: Estimated distribution of ˆk, with examples of corresponding texts and images illustrating various levels of complexity. An additional, somewhat counterintuitive benefit of this formulation emerges in practice. Unlike ReLU, Softplus assigns a small but nonzero penalty even when E[ ˆk] < k, effectively discouraging the model from saturating the available capacity. While this may initially appear undesir… view at source ↗

**Figure 5.** Figure 5: Reconstruction quality versus sparsity level for di view at source ↗

**Figure 6.** Figure 6: Absorption Rate and Number of Split Features vs. Sparsity. SoftSAE demonstrates its view at source ↗

**Figure 7.** Figure 7: Spurious Correlation Removal and Targeted Probe Perturbation scores vs. Sparsity. SoftSAE view at source ↗

**Figure 8.** Figure 8: Examples of images from ImageNet100 with the highest estimated view at source ↗

**Figure 9.** Figure 9: Additional examples of images from ImageNet100 with the highest estimated view at source ↗

**Figure 10.** Figure 10: Examples of images from ImageNet100 with the lowest estimated view at source ↗

**Figure 11.** Figure 11: Additional examples of images from ImageNet100 with the lowest estimated view at source ↗

**Figure 12.** Figure 12: The presented results indicate that the Dynamic Sparsity MLP is sensitive not only to the view at source ↗

**Figure 13.** Figure 13: Ablation study within CLIP reconstruction experimental setup. Removing view at source ↗

read the original abstract

Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://github.com/St0pien/SoftSAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SoftSAE adds a differentiable soft top-k to let SAE sparsity vary per input, but the results do not yet show that this adaptivity improves features over fixed-k baselines.

read the letter

The main new element here is the soft top-k operator that produces an input-dependent sparsity level k while staying differentiable. The authors start from the standard observation that fixed-K TopK SAEs apply the same sparsity everywhere, then replace the hard selection with a soft version so k can rise or fall with each sample. That is a straightforward technical step beyond the fixed-k literature they cite, and releasing the code is helpful for anyone who wants to test it directly. The motivation about varying local intrinsic dimensionality on real data is reasonable and matches what people see in activations from LLMs and ViTs. If the operator works as intended, it could give shorter explanations on simple inputs and longer ones on complex ones without manual tuning of K. The experiments are said to produce meaningful features and sensible per-concept feature counts, which is at least consistent with the claim. The soft spots sit in the strength of the evidence for the adaptive part. The abstract asserts that the model selects the right number of features, but there are no reported numbers on how much k actually varies across inputs, no correlation between learned k and any independent measure of input complexity, and no ablation that keeps average sparsity fixed while turning the dynamic mechanism on and off. Without those controls it remains possible that any observed differences come from the soft approximation itself or from incidental regularization rather than from matching sparsity to data structure. The central assumption is plausible, yet the paper would be tighter if it showed the per-sample k tracking something observable like reconstruction difficulty. This work is aimed at people already running SAEs for mechanistic interpretability who are willing to experiment with new selection operators. It is coherent on its own terms and engages the existing TopK SAE literature, so it deserves a serious referee who can ask for the missing ablations and check the implementation. I would send it to review rather than desk-reject, with the expectation that the authors can supply clearer quantitative support for the adaptivity claim.

Referee Report

3 major / 2 minor

Summary. The paper proposes SoftSAE, a sparse autoencoder architecture that replaces the fixed-K top-k selection of standard TopK SAEs with a differentiable Soft Top-K operator. This operator is intended to produce an input-dependent sparsity level k that adapts to the local complexity of each sample, motivated by the observation that natural data manifolds exhibit varying intrinsic dimensionality. The authors claim that the resulting representations are more faithful, with experimental results purportedly confirming both the discovery of meaningful monosemantic features and the selection of appropriate per-concept feature counts. Source code is provided.

Significance. If the dynamic k selection can be shown to correlate with independent measures of input complexity and to yield strictly better feature decompositions than fixed-K baselines at matched average sparsity, the method would represent a meaningful advance in mechanistic interpretability tools for LLMs and ViTs. The provision of open-source code strengthens potential impact by enabling direct replication and extension.

major comments (3)

[Abstract] Abstract: the central claim that experiments 'confirm that SoftSAE ... selects the right number of features for each concept' is unsupported by any quantitative results, correlation coefficients, ablation tables, or baseline comparisons. No evidence is supplied linking the learned per-sample k values to input properties such as reconstruction error of a dense autoencoder or estimates of local intrinsic dimensionality.
[Method] Method section (Soft Top-K operator): it is unclear whether the differentiable approximation truly optimizes an input-dependent k or merely relaxes the hard top-k constraint while the effective sparsity remains controlled by a global hyperparameter. A controlled ablation that disables the input-dependence (e.g., by feeding a constant auxiliary input) while preserving total parameter count is required to isolate the contribution of adaptivity.
[Experiments] Experiments: without reported metrics showing that variance in learned k across samples exceeds what would be expected from noise in a fixed-K model, and without comparison to fixed-K SAEs trained at the same average sparsity, observed improvements could be attributable to the soft approximation or implicit regularization rather than adaptive sparsity.

minor comments (2)

[Abstract] The abstract states that 'the explanation length reflects the amount of information in the input' without defining how explanation length is measured or providing supporting statistics.
[Method] Notation for the Soft Top-K operator should be introduced with an explicit equation rather than descriptive text only.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below with clarifications from the manuscript and indicate the revisions we will make to strengthen the presentation and evidence.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that experiments 'confirm that SoftSAE ... selects the right number of features for each concept' is unsupported by any quantitative results, correlation coefficients, ablation tables, or baseline comparisons. No evidence is supplied linking the learned per-sample k values to input properties such as reconstruction error of a dense autoencoder or estimates of local intrinsic dimensionality.

Authors: We agree that the abstract claim would be more robust with explicit quantitative backing. The current experiments provide qualitative support via visualizations of per-sample k variation aligned with input complexity and monosemantic feature discovery. In the revision we will add correlation analyses between learned k and independent measures (dense autoencoder reconstruction error and local intrinsic dimensionality estimates), plus the requested ablation tables and baseline comparisons. revision: yes
Referee: [Method] Method section (Soft Top-K operator): it is unclear whether the differentiable approximation truly optimizes an input-dependent k or merely relaxes the hard top-k constraint while the effective sparsity remains controlled by a global hyperparameter. A controlled ablation that disables the input-dependence (e.g., by feeding a constant auxiliary input) while preserving total parameter count is required to isolate the contribution of adaptivity.

Authors: The Soft Top-K operator receives input-derived features from the encoder to produce per-sample selection weights, so the allocation of active features is genuinely input-dependent while a global hyperparameter only sets the overall sparsity budget. We will add the suggested controlled ablation (constant auxiliary input, matched parameter count) in the revised method and experiments sections to isolate the adaptivity contribution. revision: yes
Referee: [Experiments] Experiments: without reported metrics showing that variance in learned k across samples exceeds what would be expected from noise in a fixed-K model, and without comparison to fixed-K SAEs trained at the same average sparsity, observed improvements could be attributable to the soft approximation or implicit regularization rather than adaptive sparsity.

Authors: We will expand the experiments section to report the observed variance in learned k and compare it against the variance attributable to noise under a fixed-K regime. We will also add direct comparisons against fixed-K SAEs trained at identical average sparsity, using reconstruction fidelity and feature quality metrics, to demonstrate that gains arise from adaptivity rather than the soft operator or regularization alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces SoftSAE by proposing a differentiable Soft Top-K operator to enable input-dependent sparsity k in sparse autoencoders, building on standard SAE training objectives. No equations or derivations appear in the provided abstract, and the full text description indicates the method rests on a new operator plus empirical validation rather than any self-definitional reduction, fitted parameter renamed as prediction, or load-bearing self-citation chain. Claims about matching data complexity are presented as experimental outcomes, not tautological by construction. This is a standard case of a novel architectural proposal with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the standard SAE reconstruction-plus-sparsity objective and the assumption of varying intrinsic dimensionality.

pith-pipeline@v0.9.0 · 5603 in / 1044 out tokens · 20047 ms · 2026-05-11T01:54:41.258323+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

Sparse autoencoders find highly interpretable features in language models

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[2]

Beyond interpretability: The gains of feature monosemanticity on model robustness.arXiv preprint arXiv:2410.21331, 2024

Qi Zhang, Yifei Wang, Jingyi Cui, Xiang Pan, Qi Lei, Stefanie Jegelka, and Yisen Wang. Beyond interpretability: The gains of feature monosemanticity on model robustness.arXiv preprint arXiv:2410.21331, 2024

work page arXiv 2024
[3]

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nicholas L. Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E. Burke, Tristan Hume, Shan Carter, Tom Henighan, and Chris Olah. To...

work page 2023
[4]

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. InICLR, 2025

work page 2025
[5]

arXiv preprint arXiv:2412.06410 , year=

Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders.arXiv preprint arXiv:2412.06410, 2024

work page arXiv 2024
[6]

Tang, T., Luo, W., Huang, H., Zhang, D., Wang, X., Zhao, W

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders.arXiv preprint arXiv:2407.14435, 2024

work page arXiv 2024
[7]

Interpreting CLIP with hierar- chical sparse autoencoders

Vladimir Zaigrajew, Hubert Baniecki, and Przemyslaw Biecek. Interpreting CLIP with hierar- chical sparse autoencoders. InF orty-second International Conference on Machine Learning, 2025

work page 2025
[8]

Lidl: Local intrinsic dimension estimation using approximate likelihood

Piotr Tempczyk, Rafał Michaluk, Lukasz Garncarek, Przemysław Spurek, Jacek Tabor, and Adam Golinski. Lidl: Local intrinsic dimension estimation using approximate likelihood. In International Conference on Machine Learning, pages 21205–21231. PMLR, 2022

work page 2022
[9]

Lapsum-one method to differentiate them all: Ranking, sorting and top-k selection

Łukasz Struski, Michal B Bednarczyk, Igor T Podolak, and Jacek Tabor. Lapsum-one method to differentiate them all: Ranking, sorting and top-k selection. InInternational Conference on Machine Learning, pages 56990–57007. PMLR, 2025

work page 2025
[10]

Towards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems, 36:16318–16352, 2023

Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems, 36:16318–16352, 2023

work page 2023
[11]

arXiv preprint arXiv:2404.14082 (2024)

Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082, 2024

work page arXiv 2024
[12]

Natural language descriptions of deep visual features

Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Ja- cob Andreas. Natural language descriptions of deep visual features. InInternational Conference on Learning Representations, 2021

work page 2021
[13]

Anthropic, 2024

Adly Templeton.Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Anthropic, 2024

work page 2024
[14]

One-step is enough: Sparse autoencoders for text-to- image diffusion models.arXiv preprint arXiv:2410.22366, 2024

Viacheslav Surkov, Chris Wendler, Antonio Mari, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre, and David Bau. One-step is enough: Sparse autoencoders for text-to- image diffusion models.arXiv preprint arXiv:2410.22366, 2024

work page arXiv 2024
[15]

Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C

Ahmed Abdulaal, Hugo Fry, Nina Montaña-Brown, Ayodeji Ijishakin, Jack Gao, Stephanie Hyland, Daniel C Alexander, and Daniel C Castro. An x-ray is worth 15 features: Sparse autoencoders for interpretable radiology report generation.arXiv preprint arXiv:2410.03334, 2024. 10

work page arXiv 2024
[16]

Interpreting clip with sparse linear concept embeddings (splice).Advances in Neural Information Processing Systems, 37:84298–84328, 2024

Usha Bhalla, Alex Oesterling, Suraj Srinivas, Flavio P Calmon, and Himabindu Lakkaraju. Interpreting clip with sparse linear concept embeddings (splice).Advances in Neural Information Processing Systems, 37:84298–84328, 2024

work page 2024
[17]

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Bartosz Cywi ´nski and Kamil Deja. Saeuron: Interpretable concept unlearning in diffusion models with sparse autoencoders.arXiv preprint arXiv:2501.18052, 2025

work page arXiv 2025
[18]

Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav)

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). InInternational conference on machine learning, pages 2668–2677. PMLR, 2018

work page 2018
[19]

Interpretable basis decomposition for visual explanation

Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable basis decomposition for visual explanation. InProceedings of the European Conference on Computer Vision (ECCV), pages 119–134, 2018

work page 2018
[20]

Towards automatic concept- based explanations.Advances in neural information processing systems, 32, 2019

Amirata Ghorbani, James Wexler, James Y Zou, and Been Kim. Towards automatic concept- based explanations.Advances in neural information processing systems, 32, 2019

work page 2019
[21]

Labeling neural representations with inverse recognition.Advances in Neural Information Processing Systems, 36:24804–24828, 2023

Kirill Bykov, Laura Kopf, Shinichi Nakajima, Marius Kloft, and Marina Höhne. Labeling neural representations with inverse recognition.Advances in Neural Information Processing Systems, 36:24804–24828, 2023

work page 2023
[22]

Concept bottleneck models

Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. InInternational conference on machine learning, pages 5338–5348. PMLR, 2020

work page 2020
[23]

Concept embedding models: Beyond the accuracy-explainability trade-off.Ad- vances in neural information processing systems, 35:21400–21413, 2022

Mateo Espinosa Zarlenga, Pietro Barbiero, Gabriele Ciravegna, Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, Zohreh Shams, Frederic Precioso, Stefano Melacci, Adrian Weller, et al. Concept embedding models: Beyond the accuracy-explainability trade-off.Ad- vances in neural information processing systems, 35:21400–21413, 2022

work page 2022
[24]

Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability

Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ayonrinde, et al. Saebench: A comprehensive benchmark for sparse autoencoders in language model interpretability. In International Conference on Machine Learning, pages 29223–29264. PMLR, 2025

work page 2025
[25]

Imagenet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large- scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[26]

The fineweb datasets: Decanting the web for the finest text data at scale

Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro V on Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. InThe Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

work page 2024
[27]

Learning multi-level features with matryoshka sparse autoencoders

Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda. Learning multi-level features with matryoshka sparse autoencoders. InF orty-second International Conference on Machine Learning, 2025. 11 A Impact Statement and Declaration of LLM Usage Our work contributes to the safety and alignment of Large Language Models (LLMs) and Vision Transformers (ViTs...

work page arXiv 2025