arXiv: 2605.01829 · v1 · submitted 2026-05-03 · 💻 cs.CV · cs.AI · cs.LG

GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

Favour Nerrise (1), Lucy Yin (1), Mohammad H. Abbasi (1), Kilian M. Pohl (1), Ehsan Adeli (1) ((1) Stanford University)

Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords: sparse autoencoder · brain MRI · Alzheimer's disease · foundation models · interpretability · geometric priors · feature extraction · biomarkers

The pith

GeoSAE uses the learned manifold of brain MRI foundation models to extract a compact, replicable set of interpretable features that predict conversion from mild cognitive impairment to Alzheimer's disease.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops GeoSAE to address the difficulty of determining what clinical information brain MRI foundation models actually encode. Standard sparse autoencoders collapse into uninformative features in deeper layers, and aging confounds nearly all clinical variables so that naive annotations become unreliable. GeoSAE incorporates the model's own learned geometric structure as a prior to keep features distinct and then annotates each one through age-deconfounded partial correlations. The resulting small collection of features, drawn from only 2 percent of the embedding dimensions, predicts disease progression, holds steady across separate patient groups, and appears in brain locations consistent with known Alzheimer's pathology.

Core claim

GeoSAE is a layer-wise sparse autoencoder that takes the foundation model's learned manifold structure as a geometric prior to prevent feature collapse. Each surviving feature is then annotated by computing its age-deconfounded partial correlation with clinical variables. When applied to large collections of T1-weighted MRI scans, the method recovers a compact feature set that predicts MCI-to-AD conversion at AUC 0.746 using only 2 percent of the embedding dimensions; comorbidity-annotated features perform at chance level. These features replicate across independent cohorts without any retraining and localize to neuroanatomically distinct regions that align with Braak staging.
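
The annotation step described above can be illustrated with a first-order partial correlation. This is a minimal sketch, not the paper's implementation: the function names, the toy data, and the choice of the closed-form formula are assumptions made for illustration, and the formula removes only linear age effects.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(feature, clinical, age):
    """First-order partial correlation of feature and clinical, given age."""
    r_fc = pearson(feature, clinical)
    r_fa = pearson(feature, age)
    r_ca = pearson(clinical, age)
    return (r_fc - r_fa * r_ca) / math.sqrt((1 - r_fa ** 2) * (1 - r_ca ** 2))

# Toy data (hypothetical): age drives both variables; the leftover terms
# e1 and e2 are uncorrelated with age and with each other by construction.
age = [60, 64, 68, 72, 76, 80]
e1 = [1, -1, 0, 0, -1, 1]
e2 = [1, 1, -2, -2, 1, 1]
feature = [0.5 * a + e for a, e in zip(age, e1)]
clinical = [1.0 * a + e for a, e in zip(age, e2)]

raw = pearson(feature, clinical)            # high: both variables track age
adj = partial_corr(feature, clinical, age)  # near zero once age is removed
print(f"raw r = {raw:.3f}, age-partial r = {adj:.3f}")
```

In this toy case the raw correlation is strong purely because both variables follow age, which is the failure mode that age deconfounding is meant to catch.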

What carries the argument

GeoSAE, the geometric prior-guided layer-wise sparse autoencoder that uses the foundation model's manifold structure both to stabilize feature learning and to support annotation through partial correlations.
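
One common form of manifold regularization is a graph-Laplacian smoothness penalty over a k-NN graph, which matches the k-NN manifold graph mentioned in the Figure 1 caption. The exact GeoSAE loss is not given on this page, so the sketch below is an assumption: `knn_edges` and `laplacian_penalty` are hypothetical helper names, edges carry unit weight, and the data are invented.

```python
import math

def knn_edges(points, k):
    """Undirected k-NN edge set {(i, j) with i < j} under Euclidean distance."""
    edges = set()
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def laplacian_penalty(codes, edges):
    """Sum of squared code differences over edges.

    For unit edge weights this equals trace(Z^T L Z), with L the graph Laplacian.
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(codes[i], codes[j])) for i, j in edges
    )

# Toy embeddings (hypothetical): two well-separated clusters of three points.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
edges = knn_edges(embeddings, k=2)

# Codes that are constant within a cluster incur no penalty;
# codes that flip between neighbours are penalized.
smooth_codes = [(1.0, 0.0)] * 3 + [(0.0, 1.0)] * 3
rough_codes = [(1.0, 0.0), (0.0, 1.0)] * 3

print(laplacian_penalty(smooth_codes, edges))
print(laplacian_penalty(rough_codes, edges))
```

In training, a term of this kind would be added to the usual reconstruction and sparsity losses with its own weight; that weighting is likewise not specified here.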

If this is right

Only a small fraction of embedding dimensions carry the clinically predictive information for Alzheimer's progression.
The identified features remain stable and replicable across separate patient cohorts without retraining the underlying model.
Annotations that rely on comorbidities without age deconfounding yield no reliable signal.
Localized features align with established patterns of disease progression in the brain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same geometric guidance could be applied to foundation models trained on other imaging modalities or disease domains.
The approach implies that the internal geometry learned by these models already encodes biologically meaningful structure that can be isolated directly.
Future experiments could test whether the same prior reduces collapse in non-MRI foundation models or in multi-modal settings.

Load-bearing premise

The geometric prior taken from the foundation model's manifold structure successfully prevents feature collapse and makes the age-deconfounded correlations reflect genuine biological signals instead of residual confounds or model artifacts.

What would settle it

A new independent cohort in which the selected features show no predictive power for MCI-to-AD conversion, fail to replicate their prior correlations, or localize outside the expected neuroanatomical regions would falsify the central claim.
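
The first part of this test, whether the selected features retain any predictive power in a new cohort, reduces to checking that a held-out AUC stays clearly above 0.5. A minimal sketch, assuming binary conversion labels and one score per subject; the rank-based (Mann-Whitney) estimate is used, and the scores and labels are invented rather than taken from ADNI or AIBL.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy cohort (hypothetical): 1 = converted to AD, 0 = remained MCI.
labels = [1, 1, 1, 0, 0, 0]
informative = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
uninformative = [0.5] * 6

print(auc(informative, labels))
print(auc(uninformative, labels))
```

A score that carries no information lands at 0.5, which is the outcome that would count against the central claim.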

Figures

Figures reproduced from arXiv: 2605.01829 by Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, and Ehsan Adeli (Stanford University).

**Figure 1.** Overview of GeoSAE. (a) Geometric prior analysis of a frozen brain MRI foundation model (FM) selects the SAE activation function and constructs a k-NN manifold graph. (b) GeoSAE training uses manifold regularization to prevent feature collapse. (c) Age-deconfounded feature annotation assigns each alive feature to a clinical category for downstream tasks.

**Figure 2.** Cross-layer analysis of GeoSAE across 12 BrainIAC layers. (a) Stacked bars show alive features by clinical category; the line shows clinical annotation rate. Features consolidate with depth while clinical specificity increases. (b) Conversion AUC peaks at layer 9, then declines as scanner features dominate.

**Figure 3.** GeoSAE annotation (layer 9). |ρ_{jc·a}| for the 3 strongest features per category (AD, Sex, Genetic); values where FDR p<0.05 …

**Figure 4.** Brain region localization of the top-4 conversion-predictive …

**Figure 5.** Cross-cohort replication of GeoSAE feature annotations. Each point is an SAE feature alive in both ADNI and AIBL, colored by clinical category. Age-partial diagnosis correlations replicate strongly (r=0.89, p<10^-10) despite different cohort variables. GeoSAE features replicate in AIBL (100% replication rate).

Original abstract

Brain MRI foundation models learn rich representations of anatomy, but interpreting what clinical information they encode remains an open problem. Standard sparse autoencoders (SAEs) suffer from severe feature collapse in deep transformer layers, and in Alzheimer's disease (AD) research, aging confounds nearly every clinical variable, making naive annotation unreliable. We propose GeoSAE, a geometry-guided SAE framework that uses the foundation model's learned manifold structure to prevent feature collapse and annotates each surviving feature via age-deconfounded partial correlations. Applied to ~14k T1-weighted MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging, Biomarkers and Lifestyle (AIBL) datasets, GeoSAE identifies a compact, fully interpretable feature set that predicts mild cognitive impairment (MCI)-to-AD conversion (AUC 0.746) using only 2% of the embedding dimensions, while comorbidity-annotated features achieve only chance-level performance. The identified features replicate across cohorts without retraining (r=0.97) and localize to neuroanatomically distinct regions consistent with Braak staging. This shows that geometry-guided SAEs can extract interpretable biomarkers from frozen brain MRI foundation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

GeoSAE stabilizes SAEs with geometric priors from the model manifold and uses age-deconfounded partial correlations to label a small feature set that predicts MCI-to-AD conversion at AUC 0.746. Comorbidity-annotated control features perform at chance, while the selected features replicate across cohorts at r=0.97 without retraining and localize to Braak-consistent areas. The data come from ADNI and AIBL, about 14k scans in total. That is the core result worth noting first.

Referee Report

3 major / 2 minor

Summary. The paper proposes GeoSAE, a geometry-guided sparse autoencoder framework that leverages the foundation model's learned manifold to prevent feature collapse in deep layers and annotates surviving features via age-deconfounded partial correlations. Applied to ~14k T1-weighted MRI scans from ADNI and AIBL, it extracts a compact interpretable feature set (2% of embedding dimensions) that predicts MCI-to-AD conversion (AUC 0.746), with comorbidity-annotated features at chance level; the features replicate across cohorts without retraining (r=0.97) and localize to regions consistent with Braak staging.

Significance. If the central results hold after addressing annotation validity, this provides a practical route to extracting compact, biologically grounded biomarkers from frozen brain MRI foundation models. The cross-cohort replication without retraining and the comorbidity control condition are clear strengths that support specificity to AD-related signals rather than generic confounds or artifacts.

major comments (3)

[Abstract / Methods] Abstract and Methods: the reported AUC 0.746 and r=0.97 are presented without detail on the feature selection procedure (e.g., threshold for partial correlations), multiple-testing correction across the large embedding space, or confirmation that the downstream prediction model was trained and evaluated on data held out from the annotation step; these omissions make it impossible to assess whether the performance reflects true generalization or data leakage.
[Methods] Annotation procedure (Methods): age-deconfounded partial correlations are used to label features, but this only removes linear age effects; given that foundation-model embeddings are expected to encode nonlinear age-related structure and that age is a strong confound for AD progression, residual confounds could inflate both the AUC and the apparent Braak-stage localization. The geometric prior addresses collapse but does not mitigate this annotation validity risk.
[Results] Results: the claim that identified features localize to neuroanatomically distinct regions consistent with Braak staging is presented as post-hoc validation; without a pre-specified spatial correspondence test or quantitative overlap metric against Braak maps, this remains qualitative and does not independently corroborate that the partial-correlation labels capture AD-specific biology.
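
Major comment 1 asks how multiple testing was handled, and the Figure 3 caption reports FDR-thresholded values. The Benjamini-Hochberg step-up procedure is the standard way to control the false discovery rate; that the paper uses exactly this variant is an assumption here, and the p-values below are invented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Toy p-values (hypothetical), one per feature-variable pair.
p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p, q=0.05))
```

Four of these six values fall below an uncorrected 0.05 threshold, but only the two smallest survive the step-up rule, which is why the choice of correction matters for how many features end up annotated.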

minor comments (2)

[Abstract] The abstract states 'comorbidity-annotated features achieve only chance-level performance' but does not specify which comorbidities were used or how the control annotation was performed; adding this detail would clarify the strength of the negative control.
[Methods] Notation for the geometric prior and the exact form of the layer-wise SAE loss should be introduced earlier and used consistently to aid readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, with honest assessment of where revisions are needed and why.

Referee: [Abstract / Methods] Abstract and Methods: the reported AUC 0.746 and r=0.97 are presented without detail on the feature selection procedure (e.g., threshold for partial correlations), multiple-testing correction across the large embedding space, or confirmation that the downstream prediction model was trained and evaluated on data held out from the annotation step; these omissions make it impossible to assess whether the performance reflects true generalization or data leakage.

Authors: We agree that these details were omitted and are essential for evaluating generalization. In the revised manuscript we have added a new subsection in Methods that specifies: the partial-correlation threshold and FDR correction (q < 0.05) applied across the embedding space for feature selection; the exact data split ensuring the annotation step used only a training subset; and explicit confirmation that the MCI-to-AD classifier was trained and evaluated exclusively on held-out data never seen during annotation or feature selection. These changes directly address the leakage concern. revision: yes
Referee: [Methods] Annotation procedure (Methods): age-deconfounded partial correlations are used to label features, but this only removes linear age effects; given that foundation-model embeddings are expected to encode nonlinear age-related structure and that age is a strong confound for AD progression, residual confounds could inflate both the AUC and the apparent Braak-stage localization. The geometric prior addresses collapse but does not mitigate this annotation validity risk.

Authors: The referee correctly identifies that only linear age effects are removed. We have added supplementary analyses in the revision showing that the retained features have negligible correlation with quadratic and cubic age terms, and we have expanded the Discussion to acknowledge residual nonlinear confounds as a limitation. The comorbidity control condition (chance-level performance) and cross-cohort replication provide supporting evidence of specificity, but we note that complete removal of nonlinear age structure would require additional techniques (e.g., kernel partial correlations) outside the current scope. revision: partial
Referee: [Results] Results: the claim that identified features localize to neuroanatomically distinct regions consistent with Braak staging is presented as post-hoc validation; without a pre-specified spatial correspondence test or quantitative overlap metric against Braak maps, this remains qualitative and does not independently corroborate that the partial-correlation labels capture AD-specific biology.

Authors: We agree the original localization was qualitative. The revised manuscript now includes a quantitative spatial-overlap analysis: Dice coefficients between high-activation feature maps and standard Braak-stage atlases, together with a permutation-based significance test. This pre-specified metric has been added to the Results and provides an independent, quantitative corroboration of the biological relevance of the annotations. revision: yes
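
The overlap analysis promised in the last response can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: masks are flattened binary vectors, `permutation_p` is a hypothetical helper, and the null distribution comes from shuffling voxels independently.

```python
import random

def dice(a, b):
    """Dice coefficient between two binary masks (equal-length 0/1 sequences)."""
    inter = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * inter / total if total else 1.0

def permutation_p(a, b, n_perm=2000, seed=0):
    """One-sided p-value: how often a shuffled copy of `a` overlaps `b` as well."""
    rng = random.Random(seed)
    observed = dice(a, b)
    shuffled = list(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        hits += dice(shuffled, b) >= observed
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy masks (hypothetical): 20 voxels, feature map vs. atlas region.
feature_mask = [1] * 5 + [0] * 15
atlas_mask = [1] * 4 + [0] * 16

print(dice(feature_mask, atlas_mask))
print(permutation_p(feature_mask, atlas_mask))
```

Shuffling voxels independently ignores spatial smoothness, so for real activation maps this null is probably too lenient; a spatially constrained permutation scheme would be the more conservative choice.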

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper derives GeoSAE by applying a geometric prior (from the foundation model's manifold) to train layer-wise SAEs that avoid collapse, then annotates surviving features using age-deconfounded partial correlations computed on the input embeddings and clinical variables. These annotated features are subsequently evaluated for predictive utility on MCI-to-AD conversion in cross-cohort settings (ADNI/AIBL) with reported replication (r=0.97) without retraining. No equation or step reduces the reported AUC or Braak-consistent localization to the annotation inputs by construction; the downstream prediction performance is an independent statistical evaluation rather than a tautological restatement of the partial-correlation selection. Self-citations, if present, are not load-bearing for the core claim, and no fitted parameter is relabeled as a prediction. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the foundation model's internal manifold provides an unbiased geometric prior and that partial correlation removes all relevant confounds. No explicit free parameters or invented entities are stated in the abstract; the method itself is the primary addition.

axioms (2)

domain assumption: The foundation model's learned manifold structure supplies a valid geometric prior that prevents feature collapse without introducing new bias.
Invoked to justify the GeoSAE design in the abstract.
domain assumption: Age-deconfounded partial correlations isolate biologically meaningful feature labels.
Used to annotate surviving features and to claim clinical relevance.
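
The second axiom can be stress-tested by checking whether a feature still tracks nonlinear age terms after linear deconfounding, the concern raised in the referee report. A minimal sketch with invented data; `residualize` and `corr` are hypothetical helpers, and the feature is built to contain a quadratic age effect on purpose.

```python
import math

def corr(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residualize(y, x):
    """Residuals of y after least-squares regression on x with an intercept."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    return [(b - my) - beta * (a - mx) for a, b in zip(x, y)]

# Toy feature (hypothetical): a linear plus a quadratic dependence on age.
age = [55, 60, 65, 70, 75, 80, 85]
feature = [0.1 * a + (a - 70) ** 2 / 100 for a in age]

resid = residualize(feature, age)       # linear age effect removed
age_sq = [(a - 70) ** 2 for a in age]   # centered quadratic age term

print(f"residual vs age:   {corr(resid, age):.3f}")
print(f"residual vs age^2: {corr(resid, age_sq):.3f}")
```

Here the residual is uncorrelated with age yet almost perfectly correlated with the quadratic term, so a linear partial correlation alone would not flag the remaining confound.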
