VFUSE: Virulent Feature Understanding with Sparse autoEncoders

Matthew L. Olson; Michael Yu

arxiv: 2606.10080 · v1 · pith:VJ7TOEZSnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI· q-bio.QM

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

Michael Yu , Matthew L. Olson This is my paper

Pith reviewed 2026-06-27 16:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AIq-bio.QM

keywords sparse autoencodersprotein designhazardous proteinsmechanistic interpretabilitydiffusion transformersvirulence detectionRoseTTAFoldRFDiffusion

0 comments

The pith

Sparse autoencoders on protein model activations identify monosemantic features specific to hazardous designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that training sparse autoencoders on activations from diffusion transformer models used for protein design allows linear probes to detect hazardous designs more effectively than when using the models' original representations. It further identifies individual SAE features that activate only on hazardous designs. This matters to a sympathetic reader because generative protein models can produce dangerous outputs, and the approach offers an internal audit method that leaves model performance intact. The work applies the method to RoseTTAFold3 and RFDiffusion3 as examples of open-weight protein folding and synthesis models.

Core claim

VFUSE trains sparse autoencoders on diffusion-transformer activations to audit protein models for hazard-aware features. For certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations. The approach also identifies monosemantic features from the SAE that fire only on hazardous designs at up to AUROC 0.84.

What carries the argument

Sparse autoencoders trained on diffusion-transformer activations to extract monosemantic features for auditing virulence in protein design models.

If this is right

Linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations.
Monosemantic features from the SAE fire only on hazardous designs at up to AUROC 0.84.
The SAE approach improves interpretability on all-atom diffusion models without sacrificing model performance.
This constitutes the first feature-level virulence audit of a protein design model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If these features prove causal, targeted edits to the SAE latents might suppress generation of hazardous proteins.
The technique could extend to auditing generative models in other domains for safety-related features.
Validation on broader sets of designs might allow the method to guide safety filters in protein design pipelines.

Load-bearing premise

The labeled hazardous designs used for probing and feature analysis are both representative and correctly classified, and that improved linear probe performance in SAE space reflects genuine mechanistic understanding rather than dataset artifacts.

What would settle it

A held-out test set of protein designs where linear probes trained on SAE latents show no significant improvement over probes trained on original activations, or where the high-AUROC SAE features activate at comparable rates on non-hazardous designs.

Figures

Figures reproduced from arXiv: 2606.10080 by Matthew L. Olson, Michael Yu.

**Figure 1.** Figure 1: Top-firing residues for four leading RFD3 block-12 SAE features on partial-diffusion output structures. Red = highest activation, orange = moderate, gray = inactive. Feature 639’s top-3 residues (99, 104, 109) cluster in a single alpha-helix of a viper neurotoxin. The other features fire on proteins from three other hazard classes, with similarly focal but less structurally concentrated activation. Matryos… view at source ↗

**Figure 2.** Figure 2: AUROC gap (SAE − raw) per block and split. The only consistent positive spike is at RFD3 block 12, cluster split (+0.054) [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Random−homology AUROC gap. Larger = more family memorization. Block 6 peaks; block-12 SAE is smallest. 4.1. SAE features beat raw activations at block 12 [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: Top-10 hazard-firing SAE features per (model, block), ranked by univariate AUROC with BH-corrected q-values. Feature quality (peak AUROC and number of discoveries) grows with depth in RFD3, with block-12 features reaching AUROC ≥0.80. where class-discriminative, family-generalizable structure lives. RFD3’s earlier blocks underperform under homologyclustering while RF3’s do not, suggesting that RFD3 devel… view at source ↗

**Figure 5.** Figure 5: Reproducibility check and additional feature. Same color scale as [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

read the original abstract

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space over the original model's representations: improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VFUSE applies SAEs to protein diffusion models and reports hazard-linked features at AUROC 0.84, but all results rest on unvalidated labels.

read the letter

The main point is that this paper trains sparse autoencoders on activations from RoseTTAFold3 and RFDiffusion3, then shows linear probes work better in the SAE space and identifies monosemantic features that fire on hazardous designs with AUROC up to 0.84 and strong q-values. It positions itself as the first SAE on an all-atom diffusion model and the first feature-level virulence audit in this setting.

The work does something useful by taking mechanistic interpretability tools into protein design, where open-weight models already raise misuse questions. The reported probe improvement and the existence of apparently monosemantic features are concrete observations worth checking.

The load-bearing problem is the hazardous labels. Every quantitative result depends on them, yet the abstract supplies no information on how the labels were generated, whether they were checked against known toxins or experimental data, or how inter-rater agreement was assessed. If the labels come from a computational filter that already correlates with the diffusion-transformer activations, the SAE could simply be sparsifying that same signal rather than revealing genuine hazard-aware mechanisms. Without methods on data splits, SAE training details, or controls for multiple testing, the AUROC and q-value numbers cannot be evaluated properly.

This paper is for people working on interpretability in biological generative models or on practical auditing for biosecurity. A reader who wants to see SAE techniques tried in this domain will find a starting point, but anyone needing reproducible evidence will have to wait for the full methods and label validation.

I would send it to peer review. The domain is important enough that referees should see the full paper and ask for the missing validation steps.

Referee Report

3 major / 2 minor

Summary. The paper introduces VFUSE, a method that trains sparse autoencoders (SAEs) on activations from diffusion-transformer components of protein design models (RoseTTAFold3 and RFDiffusion3) to identify monosemantic, hazard-aware features. It reports that linear probes for hazardous vs. non-hazardous designs achieve higher performance when trained in SAE latent space than in the original model representations, and identifies individual SAE features that fire selectively on hazardous designs with AUROC up to 0.84 (q < 10^{-13}). The work positions itself as the first SAE application to an all-atom diffusion model and the first feature-level virulence audit of such models.

Significance. If the quantitative claims hold after addressing label validation, the result would be moderately significant for mechanistic interpretability in generative biology: it demonstrates that SAE latents can surface more interpretable signals for a safety-relevant downstream task without apparent loss in probe accuracy, and supplies the first concrete example of monosemantic features tied to virulence in diffusion-based protein generators. This could inform auditing pipelines, though the absolute performance numbers remain modest and the approach has not yet been shown to generalize beyond the specific models and label set examined.

major comments (3)

[Abstract and §3] Abstract and §3 (Methods): the binary hazardous/non-hazardous labels used for both the linear-probe AUROC comparisons and the monosemantic-feature AUROC calculations lack any reported provenance, inter-rater reliability, external validation (e.g., sequence similarity to known toxins, experimental assays, or expert curation), or description of how they were generated. Because every reported statistic (probe AUROC gains, feature AUROC 0.84, q < 10^{-13}) is computed against these labels, the absence of validation is load-bearing; if labels were produced by a simple computational filter already correlated with the diffusion-transformer activations, the SAE could merely sparsify that same signal rather than reveal genuine mechanistic hazard understanding.
[§4] §4 (Results) and any supplementary tables reporting AUROC/q-values: no description is given of the train/test splits used for the linear probes, whether the same data were used to select the reported SAE features, or any correction for multiple testing across the large number of SAE latents examined. Without these details it is impossible to assess whether the reported q < 10^{-13} reflects genuine feature selectivity or post-hoc selection bias.
[§2 and §3] §2 (Background) and §3: the manuscript supplies no equations or pseudocode for the SAE training objective, the precise definition of the diffusion-transformer activations being reconstructed, or the criteria used to declare a feature “monosemantic.” This prevents evaluation of whether the reported probe improvement is an artifact of the SAE fitting procedure itself.

minor comments (2)

[Figures] Figure captions and axis labels should explicitly state the number of hazardous and non-hazardous designs in each evaluation set and whether the AUROC values are computed on held-out data.
[Abstract and §4] The claim that the approach improves “interpretability without sacrificing model performance” should be supported by a quantitative comparison of original-model vs. SAE-probe accuracy on the same held-out set, not merely a qualitative statement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our manuscript. We address each major comment point-by-point below and will incorporate revisions to address the identified gaps.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Methods): the binary hazardous/non-hazardous labels used for both the linear-probe AUROC comparisons and the monosemantic-feature AUROC calculations lack any reported provenance, inter-rater reliability, external validation (e.g., sequence similarity to known toxins, experimental assays, or expert curation), or description of how they were generated. Because every reported statistic (probe AUROC gains, feature AUROC 0.84, q < 10^{-13}) is computed against these labels, the absence of validation is load-bearing; if labels were produced by a simple computational filter already correlated with the diffusion-transformer activations, the SAE could merely sparsify that same signal rather than reveal genuine mechanistic hazard understanding.

Authors: We agree that the provenance and validation details for the hazardous/non-hazardous labels must be explicitly reported. In the revised manuscript we will add a dedicated paragraph in §3 describing the exact procedure used to generate the labels (including any computational filters or curation steps) along with available supporting evidence such as sequence similarity checks against known toxins. This addition will allow readers to evaluate the extent to which the reported SAE features reflect mechanistic signals versus label-generation artifacts. revision: yes
Referee: [§4] §4 (Results) and any supplementary tables reporting AUROC/q-values: no description is given of the train/test splits used for the linear probes, whether the same data were used to select the reported SAE features, or any correction for multiple testing across the large number of SAE latents examined. Without these details it is impossible to assess whether the reported q < 10^{-13} reflects genuine feature selectivity or post-hoc selection bias.

Authors: We acknowledge that these experimental and statistical details were omitted. The revised §4 and supplementary material will specify the train/test split protocol for the linear probes, clarify whether SAE feature selection occurred on held-out data, and document the multiple-testing correction procedure (including the exact method and adjusted significance thresholds) applied to the q-values. revision: yes
Referee: [§2 and §3] §2 (Background) and §3: the manuscript supplies no equations or pseudocode for the SAE training objective, the precise definition of the diffusion-transformer activations being reconstructed, or the criteria used to declare a feature “monosemantic.” This prevents evaluation of whether the reported probe improvement is an artifact of the SAE fitting procedure itself.

Authors: We will expand §2 and §3 to include the full SAE training objective (with equations), pseudocode for the training loop, the exact layer and token positions from which diffusion-transformer activations are extracted, and the quantitative criteria (sparsity level, selectivity threshold, and any additional metrics) used to designate a feature as monosemantic. revision: yes

Circularity Check

0 steps flagged

No significant circularity; SAE training and label-based evaluation remain independent.

full rationale

The paper trains SAEs in an unsupervised manner on diffusion-transformer activations from RoseTTAFold3 and RFDiffusion3, then computes linear probe AUROCs and identifies monosemantic features by their correlation with separately provided hazardous/non-hazardous labels. No equations, fitting procedures, or self-citations are described that would make the reported AUROC 0.84 or the probe improvement reduce to the SAE inputs or labels by construction. The derivation chain is self-contained: unsupervised sparsity objective on activations is distinct from post-hoc supervised evaluation against external labels, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities can be extracted. Standard SAE sparsity and reconstruction assumptions are implicit but unstated.

pith-pipeline@v0.9.1-grok · 5711 in / 1054 out tokens · 22970 ms · 2026-06-27T16:55:53.848713+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 14 canonical work pages · 8 internal anchors

[1]

and Hume, Tristan and Carter, Shan and Henighan, Tom and Olah, Christopher , title =

Bricken, Trenton and Templeton, Adly and Batson, Joshua and Chen, Brian and Jermyn, Adam and Conerly, Tom and Turner, Nick and Anil, Cem and Denison, Carson and Askell, Amanda and Lasenby, Robert and Wu, Yifan and Kravec, Shauna and Schiefer, Nicholas and Maxwell, Tim and Joseph, Nicholas and Hatfield-Dodds, Zac and Tamkin, Alex and Nguyen, Karina and McL...
[2]

Journal of Molecular Biology , volume =

A Brief History of De Novo Protein Design: Minimal, Rational, and Computational , author =. Journal of Molecular Biology , volume =. 2021 , month = oct, doi =

2021
[3]

CVPR 2025 Workshop on Mechanistic Interpretability in Vision (MIV) , year =

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video , author =. CVPR 2025 Workshop on Mechanistic Interpretability in Vision (MIV) , year =. 2504.19475 , archivePrefix =

work page arXiv 2025
[4]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , author =. Workshop at the International Conference on Learning Representations (ICLR) , year =. 1312.6034 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , month =

Ben Melech Stan, Gabriela and Aflalo, Estelle and Rohekar, Raanan Yehezkel and Bhiwandiwalla, Anahita and Tseng, Shao-Yen and Olson, Matthew Lyle and Gurwicz, Yaniv and Wu, Chenfei and Duan, Nan and Lal, Vasudev , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , month =. 2024 , pages =

2024
[6]

k-Sparse Autoencoders

Makhzani, Alireza and Frey, Brendan , title =. International Conference on Learning Representations , year =. 1312.5663 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Toy Models of Superposition

Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan, Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah, Christopher , title =. Transformer Circuits Thread , year =...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Cunningham, Hoagy and Ewart, Aidan and Riggs, Logan and Huben, Robert and Sharkey, Lee , title =. International Conference on Learning Representations , year =. 2309.08600 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Transformer Circuits Thread , year =

Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jermyn, Adam and Olsson, Catherine and Gopal, Anand and Hauksson, Yahoshua and Hatfield-Dodds, Zac and Schiefer, Nicholas and Brown, Tom and Kaplan, Jared and Batson, Joshua and Chughtai, ...
[10]

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders , booktitle =

Rajamanoharan, Senthooran and Conmy, Arthur and Smith, Lewis and Lieberum, Tom and Varma, Vikrant and Kram\'. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders , booktitle =. 2024 , eprint =

2024
[11]

Scaling and Evaluating Sparse Autoencoders , booktitle =

Gao, Leo and Dupr\'. Scaling and Evaluating Sparse Autoencoders , booktitle =. 2025 , eprint =

2025
[12]

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Rajamanoharan, Senthooran and Lieberum, Tom and Sonnerat, Nicolas and Conmy, Arthur and Varma, Vikrant and Kram\'. Jumping Ahead: Improving Reconstruction Fidelity with. arXiv preprint arXiv:2407.14435 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Bart Bussmann, Patrick Leask, and Neel Nanda

Bussmann, Bart and Leask, Patrick and Nanda, Neel , title =. NeurIPS Workshop on Attributing Model Behavior at Scale , year =. 2412.06410 , archivePrefix =

work page arXiv
[14]

Learning multi-level features with matryoshka sparse autoencoders.arXiv preprint arXiv:2503.17547,

Bussmann, Bart and Nabeshima, Noa and Karvonen, Adam and Nanda, Neel , title =. ICLR Workshop on Building Trust in Language Models and Applications , year =. 2503.17547 , archivePrefix =

work page arXiv
[15]

Highly accurate protein structure prediction with

Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and. Highly accurate protein structure prediction with. Nature , volume =
[16]

Science , volume =

Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Salvatore and Rives, Alexander , title =. Science , volume =
[17]

and Juergens, David and Bennett, Nathaniel R

Watson, Joseph L. and Juergens, David and Bennett, Nathaniel R. and Trippe, Brian L. and Yim, Jason and Eisenach, Helen E. and Ahern, Woody and Borst, Andrew J. and Ragotte, Robert J. and Milles, Lukas F. and Wicky, Basile I. M. and Hanikel, Nikita and Pellock, Samuel J. and Courbet, Alexis and Sheffler, William and Wang, Jue and Venkatesh, Preetham and S...
[18]

and Anishchenko, Ivan and Humphreys, Ian R

Krishna, Rohith and Wang, Jue and Ahern, Woody and Sturmfels, Pascal and Venkatesh, Preetham and Kalvet, Indrek and Lee, Gyu Rie and Morey-Burrows, Felix S. and Anishchenko, Ivan and Humphreys, Ian R. and McHugh, Ryan and Vafeados, Dionne and Li, Xinting and Sutherland, George A. and Hitchcock, Andrew and Hunter, C. Neil and Kang, Alex and Brackenbrough, ...
[19]

and Bambrick, Joshua and Bodenstein, Sebastian W

Abramson, Josh and Adler, Jonas and Dunger, Jack and Evans, Richard and Green, Tim and Pritzel, Alexander and Ronneberger, Olaf and Willmore, Lindsay and Ballard, Andrew J. and Bambrick, Joshua and Bodenstein, Sebastian W. and Evans, David A. and Hung, Chia-Chun and O'Neill, Michael and Reiman, David and Tunyasuvunakool, Kathryn and Wu, Zachary and Bapst,...
[20]

Butcher, Jasper and Krishna, Rohith and Mitra, Raktim and Brent, Rafael I. and Li, Yanjing and Corley, Nathaniel and Kim, Paul and Funk, Jonathan and Mathis, Simon and Salike, Saman and Muraishi, Aiko and Eisenach, Helen and Thompson, Tuscan Rock and Chen, Jie and Politanska, Yuliya and Sehgal, Enisha and Coventry, Brian and Zhang, Odin and Qiang, Bo and ...
[21]

and Milles, Lukas F

Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J. and Milles, Lukas F. and Wicky, Basile I. M. and Courbet, Alexis and de Haas, Rob J. and Bethel, Neville and Leung, Philip J. Y. and Huddy, Timothy F. and Pellock, Sam and Tischer, Doug and Chan, Frederick and Koepnick, Brian and Nguyen, Hannah and Kang, Alex...
[22]

Biopolymers , volume =

Kabsch, Wolfgang and Sander, Christian , title =. Biopolymers , volume =
[23]

and Tumescheit, Charlotte and Mirdita, Milot and Lee, Jeongjae and Gilchrist, Cameron L

van Kempen, Michel and Kim, Stephanie S. and Tumescheit, Charlotte and Mirdita, Milot and Lee, Jeongjae and Gilchrist, Cameron L. M. and S\"oding, Johannes and Steinegger, Martin , title =. Nature Biotechnology , volume =
[24]

Nature Methods , volume =

Simon, Elana and Zou, James , title =. Nature Methods , volume =. 2025 , doi =

2025
[25]

International Conference on Machine Learning , year =

Adams, Etowah and Bai, Liam and Lee, Minji and Yu, Yiyang and AlQuraishi, Mohammed , title =. International Conference on Machine Learning , year =
[26]

and Yang, John J

Parsan, Nithin and Yang, David J. and Yang, John J. , title =. ICLR Workshop on Generative and Experimental Perspectives for Biomolecular Design , year =. 2503.08764 , archivePrefix =

work page arXiv
[27]

dictionary

Marks, Samuel and Karvonen, Adam and Mueller, Aaron , year =. dictionary
[28]

BMC Bioinformatics , volume =

Garg, Aarti and Gupta, Dinesh , title =. BMC Bioinformatics , volume =
[29]

Computers in Biology and Medicine , volume =

Singh, Shreya and Le, Nguyen Quoc Khanh and Wang, Cheng , title =. Computers in Biology and Medicine , volume =
[30]

Genes , volume =

Sun, Jiawei and Yin, Hongbo and Ju, Chenxiao and Wang, Yongheng and Yang, Zhiyuan , title =. Genes , volume =
[31]

Briefings in Bioinformatics , volume =

Chen, Chen and Xu, Yong and Ouyang, Jian and Xiong, Xiangyi and. Briefings in Bioinformatics , volume =
[32]

Advances in Neural Information Processing Systems , year =

Ho, Jonathan and Jain, Ajay and Abbeel, Pieter , title =. Advances in Neural Information Processing Systems , year =
[33]

Classifier-Free Diffusion Guidance

Ho, Jonathan and Salimans, Tim , title =. arXiv preprint arXiv:2207.12598 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Nature Biotechnology , volume =

Steinegger, Martin and S\". Nature Biotechnology , volume =
[35]

Journal of the Royal Statistical Society: Series B , volume =

Benjamini, Yoav and Hochberg, Yosef , title =. Journal of the Royal Statistical Society: Series B , volume =
[36]

and Whitney, Donald R

Mann, Henry B. and Whitney, Donald R. , title =. Annals of Mathematical Statistics , volume =
[37]

Nucleic Acids Research , volume =
[38]

NeurIPS Workshop on Biosafe GenAI , year =

Fan, Jigang and Zhou, Zhenghong and Zhang, Zaixi and Jin, Ruofan and Cong, Le and Wang, Mengdi , title =. NeurIPS Workshop on Biosafe GenAI , year =
[39]

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

Zarzecki, Wojciech and Szymczak, Paulina and Szczurek, Ewa and Deja, Kamil , title =. arXiv preprint arXiv:2511.22519 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[40]

Goodfire Research , year =

Gorton, Liv and Wang, Nicholas and Nguyen, Nam and Deng, Myra and Ho, Eric and Balsam, Daniel and McGrath, Thomas , title =. Goodfire Research , year =. doi:10.5281/zenodo.14895891 , url =

work page doi:10.5281/zenodo.14895891
[41]

arXiv preprint arXiv:2502.04382 , year =

Movva, Rajiv and Peng, Kenny and Garg, Nikhil and Kleinberg, Jon and Pierson, Emma , title =. arXiv preprint arXiv:2502.04382 , year =

work page arXiv
[42]

Steering Language Models With Activation Engineering

Turner, Alex and Thiergart, Lisa and Udell, David and Leech, Gavin and Mini, Ulisse and MacDiarmid, Monte , title =. arXiv preprint arXiv:2308.10248 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Advances in Neural Information Processing Systems , year =

Li, Kenneth and Patel, Oam and Vi\'egas, Fernanda and Pfister, Hanspeter and Wattenberg, Martin , title =. Advances in Neural Information Processing Systems , year =
[44]

Goodfire Research , year =

Hazra, Dron and Bissell, Mark and Balsam, Daniel and Kolluru, Adeesh and McGrath, Delia and Colindres, Jorge , title =. Goodfire Research , year =
[45]

Rathore, Anand Singh and Choudhury, Shubham and Arora, Akanksha and Tijare, Purva and Raghava, Gajendra P. S. , title =. Computers in Biology and Medicine , volume =
[46]

Nucleic Acids Research , volume =

Liu, Bo and Zheng, Dandan and Jin, Qi and Chen, Lihong and Yang, Jian , title =. Nucleic Acids Research , volume =
[47]

ICML Workshop on Mechanistic Interpretability , year =

Karvonen, Adam and Wright, Benjamin and Rager, Can and Angell, Rico and Brinkmann, Jannik and Smith, Logan and Mayrink Verdun, Claudio and Bau, David and Marks, Samuel , title =. ICML Workshop on Mechanistic Interpretability , year =
[48]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Probing the representational power of sparse autoencoders in vision models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[49]

bioRxiv , year=

Accelerating biomolecular modeling with atomworks and rf3 , author=. bioRxiv , year=
[50]

Nature Methods , pages=

Atomic context-conditioned protein sequence design using LigandMPNN , author=. Nature Methods , pages=. 2025 , publisher=

2025

[1] [1]

and Hume, Tristan and Carter, Shan and Henighan, Tom and Olah, Christopher , title =

Bricken, Trenton and Templeton, Adly and Batson, Joshua and Chen, Brian and Jermyn, Adam and Conerly, Tom and Turner, Nick and Anil, Cem and Denison, Carson and Askell, Amanda and Lasenby, Robert and Wu, Yifan and Kravec, Shauna and Schiefer, Nicholas and Maxwell, Tim and Joseph, Nicholas and Hatfield-Dodds, Zac and Tamkin, Alex and Nguyen, Karina and McL...

[2] [2]

Journal of Molecular Biology , volume =

A Brief History of De Novo Protein Design: Minimal, Rational, and Computational , author =. Journal of Molecular Biology , volume =. 2021 , month = oct, doi =

2021

[3] [3]

CVPR 2025 Workshop on Mechanistic Interpretability in Vision (MIV) , year =

Prisma: An Open Source Toolkit for Mechanistic Interpretability in Vision and Video , author =. CVPR 2025 Workshop on Mechanistic Interpretability in Vision (MIV) , year =. 2504.19475 , archivePrefix =

work page arXiv 2025

[4] [4]

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps , author =. Workshop at the International Conference on Learning Representations (ICLR) , year =. 1312.6034 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , month =

Ben Melech Stan, Gabriela and Aflalo, Estelle and Rohekar, Raanan Yehezkel and Bhiwandiwalla, Anahita and Tseng, Shao-Yen and Olson, Matthew Lyle and Gurwicz, Yaniv and Wu, Chenfei and Duan, Nan and Lal, Vasudev , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , month =. 2024 , pages =

2024

[6] [6]

k-Sparse Autoencoders

Makhzani, Alireza and Frey, Brendan , title =. International Conference on Learning Representations , year =. 1312.5663 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Toy Models of Superposition

Elhage, Nelson and Hume, Tristan and Olsson, Catherine and Schiefer, Nicholas and Henighan, Tom and Kravec, Shauna and Hatfield-Dodds, Zac and Lasenby, Robert and Drain, Dawn and Chen, Carol and Grosse, Roger and McCandlish, Sam and Kaplan, Jared and Amodei, Dario and Wattenberg, Martin and Olah, Christopher , title =. Transformer Circuits Thread , year =...

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Cunningham, Hoagy and Ewart, Aidan and Riggs, Logan and Huben, Robert and Sharkey, Lee , title =. International Conference on Learning Representations , year =. 2309.08600 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Transformer Circuits Thread , year =

Templeton, Adly and Conerly, Tom and Marcus, Jonathan and Lindsey, Jack and Bricken, Trenton and Chen, Brian and Pearce, Adam and Citro, Craig and Ameisen, Emmanuel and Jermyn, Adam and Olsson, Catherine and Gopal, Anand and Hauksson, Yahoshua and Hatfield-Dodds, Zac and Schiefer, Nicholas and Brown, Tom and Kaplan, Jared and Batson, Joshua and Chughtai, ...

[10] [10]

Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders , booktitle =

Rajamanoharan, Senthooran and Conmy, Arthur and Smith, Lewis and Lieberum, Tom and Varma, Vikrant and Kram\'. Improving Sparse Decomposition of Language Model Activations with Gated Sparse Autoencoders , booktitle =. 2024 , eprint =

2024

[11] [11]

Scaling and Evaluating Sparse Autoencoders , booktitle =

Gao, Leo and Dupr\'. Scaling and Evaluating Sparse Autoencoders , booktitle =. 2025 , eprint =

2025

[12] [12]

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Rajamanoharan, Senthooran and Lieberum, Tom and Sonnerat, Nicolas and Conmy, Arthur and Varma, Vikrant and Kram\'. Jumping Ahead: Improving Reconstruction Fidelity with. arXiv preprint arXiv:2407.14435 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Bart Bussmann, Patrick Leask, and Neel Nanda

Bussmann, Bart and Leask, Patrick and Nanda, Neel , title =. NeurIPS Workshop on Attributing Model Behavior at Scale , year =. 2412.06410 , archivePrefix =

work page arXiv

[14] [14]

Learning multi-level features with matryoshka sparse autoencoders.arXiv preprint arXiv:2503.17547,

Bussmann, Bart and Nabeshima, Noa and Karvonen, Adam and Nanda, Neel , title =. ICLR Workshop on Building Trust in Language Models and Applications , year =. 2503.17547 , archivePrefix =

work page arXiv

[15] [15]

Highly accurate protein structure prediction with

Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and. Highly accurate protein structure prediction with. Nature , volume =

[16] [16]

Science , volume =

Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and dos Santos Costa, Allan and Fazel-Zarandi, Maryam and Sercu, Tom and Candido, Salvatore and Rives, Alexander , title =. Science , volume =

[17] [17]

and Juergens, David and Bennett, Nathaniel R

Watson, Joseph L. and Juergens, David and Bennett, Nathaniel R. and Trippe, Brian L. and Yim, Jason and Eisenach, Helen E. and Ahern, Woody and Borst, Andrew J. and Ragotte, Robert J. and Milles, Lukas F. and Wicky, Basile I. M. and Hanikel, Nikita and Pellock, Samuel J. and Courbet, Alexis and Sheffler, William and Wang, Jue and Venkatesh, Preetham and S...

[18] [18]

and Anishchenko, Ivan and Humphreys, Ian R

Krishna, Rohith and Wang, Jue and Ahern, Woody and Sturmfels, Pascal and Venkatesh, Preetham and Kalvet, Indrek and Lee, Gyu Rie and Morey-Burrows, Felix S. and Anishchenko, Ivan and Humphreys, Ian R. and McHugh, Ryan and Vafeados, Dionne and Li, Xinting and Sutherland, George A. and Hitchcock, Andrew and Hunter, C. Neil and Kang, Alex and Brackenbrough, ...

[19] [19]

and Bambrick, Joshua and Bodenstein, Sebastian W

Abramson, Josh and Adler, Jonas and Dunger, Jack and Evans, Richard and Green, Tim and Pritzel, Alexander and Ronneberger, Olaf and Willmore, Lindsay and Ballard, Andrew J. and Bambrick, Joshua and Bodenstein, Sebastian W. and Evans, David A. and Hung, Chia-Chun and O'Neill, Michael and Reiman, David and Tunyasuvunakool, Kathryn and Wu, Zachary and Bapst,...

[20] [20]

Butcher, Jasper and Krishna, Rohith and Mitra, Raktim and Brent, Rafael I. and Li, Yanjing and Corley, Nathaniel and Kim, Paul and Funk, Jonathan and Mathis, Simon and Salike, Saman and Muraishi, Aiko and Eisenach, Helen and Thompson, Tuscan Rock and Chen, Jie and Politanska, Yuliya and Sehgal, Enisha and Coventry, Brian and Zhang, Odin and Qiang, Bo and ...

[21] [21]

and Milles, Lukas F

Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J. and Milles, Lukas F. and Wicky, Basile I. M. and Courbet, Alexis and de Haas, Rob J. and Bethel, Neville and Leung, Philip J. Y. and Huddy, Timothy F. and Pellock, Sam and Tischer, Doug and Chan, Frederick and Koepnick, Brian and Nguyen, Hannah and Kang, Alex...

[22] [22]

Biopolymers , volume =

Kabsch, Wolfgang and Sander, Christian , title =. Biopolymers , volume =

[23] [23]

and Tumescheit, Charlotte and Mirdita, Milot and Lee, Jeongjae and Gilchrist, Cameron L

van Kempen, Michel and Kim, Stephanie S. and Tumescheit, Charlotte and Mirdita, Milot and Lee, Jeongjae and Gilchrist, Cameron L. M. and S\"oding, Johannes and Steinegger, Martin , title =. Nature Biotechnology , volume =

[24] [24]

Nature Methods , volume =

Simon, Elana and Zou, James , title =. Nature Methods , volume =. 2025 , doi =

2025

[25] [25]

International Conference on Machine Learning , year =

Adams, Etowah and Bai, Liam and Lee, Minji and Yu, Yiyang and AlQuraishi, Mohammed , title =. International Conference on Machine Learning , year =

[26] [26]

and Yang, John J

Parsan, Nithin and Yang, David J. and Yang, John J. , title =. ICLR Workshop on Generative and Experimental Perspectives for Biomolecular Design , year =. 2503.08764 , archivePrefix =

work page arXiv

[27] [27]

dictionary

Marks, Samuel and Karvonen, Adam and Mueller, Aaron , year =. dictionary

[28] [28]

BMC Bioinformatics , volume =

Garg, Aarti and Gupta, Dinesh , title =. BMC Bioinformatics , volume =

[29] [29]

Computers in Biology and Medicine , volume =

Singh, Shreya and Le, Nguyen Quoc Khanh and Wang, Cheng , title =. Computers in Biology and Medicine , volume =

[30] [30]

Genes , volume =

Sun, Jiawei and Yin, Hongbo and Ju, Chenxiao and Wang, Yongheng and Yang, Zhiyuan , title =. Genes , volume =

[31] [31]

Briefings in Bioinformatics , volume =

Chen, Chen and Xu, Yong and Ouyang, Jian and Xiong, Xiangyi and. Briefings in Bioinformatics , volume =

[32] [32]

Advances in Neural Information Processing Systems , year =

Ho, Jonathan and Jain, Ajay and Abbeel, Pieter , title =. Advances in Neural Information Processing Systems , year =

[33] [33]

Classifier-Free Diffusion Guidance

Ho, Jonathan and Salimans, Tim , title =. arXiv preprint arXiv:2207.12598 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Nature Biotechnology , volume =

Steinegger, Martin and S\". Nature Biotechnology , volume =

[35] [35]

Journal of the Royal Statistical Society: Series B , volume =

Benjamini, Yoav and Hochberg, Yosef , title =. Journal of the Royal Statistical Society: Series B , volume =

[36] [36]

and Whitney, Donald R

Mann, Henry B. and Whitney, Donald R. , title =. Annals of Mathematical Statistics , volume =

[37] [37]

Nucleic Acids Research , volume =

[38] [38]

NeurIPS Workshop on Biosafe GenAI , year =

Fan, Jigang and Zhou, Zhenghong and Zhang, Zaixi and Jin, Ruofan and Cong, Le and Wang, Mengdi , title =. NeurIPS Workshop on Biosafe GenAI , year =

[39] [39]

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

Zarzecki, Wojciech and Szymczak, Paulina and Szczurek, Ewa and Deja, Kamil , title =. arXiv preprint arXiv:2511.22519 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

Goodfire Research , year =

Gorton, Liv and Wang, Nicholas and Nguyen, Nam and Deng, Myra and Ho, Eric and Balsam, Daniel and McGrath, Thomas , title =. Goodfire Research , year =. doi:10.5281/zenodo.14895891 , url =

work page doi:10.5281/zenodo.14895891

[41] [41]

arXiv preprint arXiv:2502.04382 , year =

Movva, Rajiv and Peng, Kenny and Garg, Nikhil and Kleinberg, Jon and Pierson, Emma , title =. arXiv preprint arXiv:2502.04382 , year =

work page arXiv

[42] [42]

Steering Language Models With Activation Engineering

Turner, Alex and Thiergart, Lisa and Udell, David and Leech, Gavin and Mini, Ulisse and MacDiarmid, Monte , title =. arXiv preprint arXiv:2308.10248 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Advances in Neural Information Processing Systems , year =

Li, Kenneth and Patel, Oam and Vi\'egas, Fernanda and Pfister, Hanspeter and Wattenberg, Martin , title =. Advances in Neural Information Processing Systems , year =

[44] [44]

Goodfire Research , year =

Hazra, Dron and Bissell, Mark and Balsam, Daniel and Kolluru, Adeesh and McGrath, Delia and Colindres, Jorge , title =. Goodfire Research , year =

[45] [45]

Rathore, Anand Singh and Choudhury, Shubham and Arora, Akanksha and Tijare, Purva and Raghava, Gajendra P. S. , title =. Computers in Biology and Medicine , volume =

[46] [46]

Nucleic Acids Research , volume =

Liu, Bo and Zheng, Dandan and Jin, Qi and Chen, Lihong and Yang, Jian , title =. Nucleic Acids Research , volume =

[47] [47]

ICML Workshop on Mechanistic Interpretability , year =

Karvonen, Adam and Wright, Benjamin and Rager, Can and Angell, Rico and Brinkmann, Jannik and Smith, Logan and Mayrink Verdun, Claudio and Bau, David and Marks, Samuel , title =. ICML Workshop on Mechanistic Interpretability , year =

[48] [48]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Probing the representational power of sparse autoencoders in vision models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[49] [49]

bioRxiv , year=

Accelerating biomolecular modeling with atomworks and rf3 , author=. bioRxiv , year=

[50] [50]

Nature Methods , pages=

Atomic context-conditioned protein sequence design using LigandMPNN , author=. Nature Methods , pages=. 2025 , publisher=

2025