Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adam Jermyn; Adam Pearce; Adly Templeton; Alex Tamkin; Andy Jones; Brian Chen; Callum McDougall; C. Daniel Freeman; Chris Olah; Craig Citro

arxiv: 2605.29358 · v1 · pith:DUG2N5UNnew · submitted 2026-05-28 · 💻 cs.AI

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

Pith reviewed 2026-06-29 07:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords sparse autoencoders · dictionary learning · model interpretability · feature extraction · residual stream · monosemanticity · multimodal generalization

The pith

Sparse autoencoders extract up to 34 million interpretable features from a production-scale language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sparse autoencoders scale to a large language model by training them on residual stream activations to recover millions of features aligned with understandable concepts. These features respond to both specific entities and abstract ideas, generalize across languages and to images despite text-only training, and permit steering of model outputs in directions consistent with the feature meanings. The work also locates features tied to behaviors such as deception, power-seeking, and bias, demonstrating that activation or suppression of those features changes model generations accordingly. This matters because it provides a concrete route to decompose the internal representations of advanced models into parts that humans can inspect and adjust. The authors emphasize that the recovered set remains incomplete and that no rigorous tests yet confirm the features match the model's actual computations rather than autoencoder artifacts.

Core claim

What carries the argument

Sparse autoencoders trained on middle-layer residual stream activations to produce a dictionary of features.
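
To make the mechanism concrete, here is a minimal sketch of a sparse autoencoder forward pass and training loss on a single activation vector. It is an illustration under stated assumptions, not the authors' implementation: the function names, the ReLU encoder, and the choice to weight the L1 penalty by each decoder row's norm are assumptions made for this example, and the code uses plain Python lists so that it runs without dependencies.

```python
import math

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode one activation vector x into feature activations f, then decode.

    W_enc and W_dec are lists with one row per feature; each row has the same
    length as x (hypothetical layout chosen for this sketch).
    """
    # Encoder: linear map plus bias, followed by ReLU so activations are non-negative.
    f = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_enc, b_enc)]
    # Decoder: bias plus a sum of decoder directions weighted by feature activation.
    x_hat = list(b_dec)
    for f_i, direction in zip(f, W_dec):
        for j, d in enumerate(direction):
            x_hat[j] += f_i * d
    return f, x_hat

def sae_loss(x, f, x_hat, W_dec, l1_coeff):
    """Squared reconstruction error plus an L1 sparsity penalty.

    Weighting each activation by its decoder row norm is an assumption here.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(f_i * math.sqrt(sum(d * d for d in row))
                   for f_i, row in zip(f, W_dec))
    return recon + l1_coeff * sparsity

# Toy example: a 2-dimensional "residual stream" and 3 dictionary features.
x = [1.0, -2.0]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
f, x_hat = sae_forward(x, W, [0.0, 0.0, 0.0], W, [0.0, 0.0])
loss = sae_loss(x, f, x_hat, W, l1_coeff=0.5)
```

In the toy example only the first feature fires, so the negative second coordinate is not reconstructed. That gap is the reconstruction error which training trades off against sparsity.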

If this is right

Features generalize to image inputs even though training used only text.
Activation of harm-related features such as those for deception or bias alters model generations in the direction predicted by the feature interpretation.
The same feature can respond to both concrete examples and abstract discussion of the same concept.
Geometric and functional analyses of the learned features reveal additional regularities in how the model organizes its representations.
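
The steering claim can be illustrated with a small sketch. One plausible way to intervene, assumed here rather than taken from the paper, is to clamp a single feature to a chosen value, decode, and add back the autoencoder's original reconstruction error so that everything the dictionary did not capture is left unchanged. All names below (`decode`, `steer`, `feature_idx`, `clamp_value`) are hypothetical.

```python
def decode(f, W_dec, b_dec):
    """Reconstruct an activation vector from feature activations f."""
    x_hat = list(b_dec)
    for f_i, direction in zip(f, W_dec):
        for j, d in enumerate(direction):
            x_hat[j] += f_i * d
    return x_hat

def steer(x, f, W_dec, b_dec, feature_idx, clamp_value):
    """Return a modified activation with one feature clamped to clamp_value.

    x is the original activation and f its feature activations, both supplied
    by the caller. The reconstruction error is preserved.
    """
    x_hat = decode(f, W_dec, b_dec)
    error = [a - b for a, b in zip(x, x_hat)]      # what the dictionary missed
    f_new = list(f)
    f_new[feature_idx] = clamp_value               # the intervention itself
    x_hat_new = decode(f_new, W_dec, b_dec)
    return [a + e for a, e in zip(x_hat_new, error)]

# Toy example: clamp feature 1 from 0.0 up to 3.0.
W_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
steered = steer([1.0, -2.0], [1.0, 0.0, 0.0], W_dec, [0.0, 0.0], 1, 3.0)
```

Under this construction the change to the activation is `(clamp_value - f[feature_idx])` times that feature's decoder direction. Clamping a feature to its current value therefore returns the original activation.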

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the features prove faithful, the same method could be applied to additional layers to build a more complete map of model behavior.
Similar dictionary learning could be tested on non-transformer architectures to check whether comparable monosemantic features appear.
A direct test would be to train independent autoencoders on the same model activations and measure how consistently the same concepts are recovered.
The ability to steer outputs via individual features suggests a route to targeted modification of model tendencies without retraining the entire system.
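
The consistency test suggested above could be sketched as follows. This is one simple way to compare two independently trained dictionaries, assumed for illustration: for each decoder direction in the first dictionary, take its best cosine-similarity match in the second, then report the fraction of directions whose best match exceeds a threshold. The function names and the 0.9 threshold are hypothetical choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def max_cosine_matches(dict_a, dict_b):
    """For each direction in dict_a, the best cosine similarity found in dict_b."""
    return [max(cosine(a, b) for b in dict_b) for a in dict_a]

def fraction_recovered(dict_a, dict_b, threshold=0.9):
    """Share of dict_a directions with a match in dict_b at or above threshold."""
    scores = max_cosine_matches(dict_a, dict_b)
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Toy example: the first direction is recovered exactly (up to scale),
# the second only partially.
dict_a = [[1.0, 0.0], [0.0, 1.0]]
dict_b = [[2.0, 0.0], [1.0, 1.0]]
scores = max_cosine_matches(dict_a, dict_b)
share = fraction_recovered(dict_a, dict_b)
```

The brute-force comparison is quadratic in dictionary size, so at millions of features it would need batching or approximate nearest-neighbor search. It also compares directions only, not whether the matched features carry the same human-readable meaning.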

Load-bearing premise

The features recovered by the autoencoders correspond to the language model's actual internal computations rather than arising as side effects of how the autoencoders were trained.

What would settle it

An intervention experiment in which activating a feature labeled as representing deception produces no measurable increase in deceptive outputs on a held-out set of prompts.
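
One way to score such an experiment is a one-sided permutation test on per-prompt labels. The sketch below assumes each held-out prompt yields a binary outcome (1 if the generation was judged deceptive, 0 otherwise) from some external judge that is not specified here; the function names, the number of permutations, and the toy labels are all hypothetical.

```python
import random

def rate(outcomes):
    """Fraction of outcomes equal to 1."""
    return sum(outcomes) / len(outcomes)

def permutation_p_value(steered, baseline, n_perm=2000, seed=0):
    """One-sided permutation test for rate(steered) > rate(baseline).

    Returns the observed rate difference and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = rate(steered) - rate(baseline)
    pooled = list(steered) + list(baseline)
    n = len(steered)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = rate(pooled[:n]) - rate(pooled[n:])
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic labels for illustration only, not results from the paper.
effect, p_strong = permutation_p_value([1] * 20, [0] * 20)
no_effect, p_null = permutation_p_value([0, 1, 0, 1], [0, 1, 0, 1])
```

A large p-value on a well-powered held-out set would be the "no measurable increase" outcome described above. A small one supports the feature label without settling the broader faithfulness question.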

Original abstract

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

They scaled SAEs to 34M features on Claude 3 Sonnet with steering on safety concepts, but the paper itself flags that we still lack tests for whether those features match the model's computations rather than SAE artifacts.

The core result is that sparse autoencoders trained on the residual stream of a production model can yield millions of features that humans can interpret and that allow causal steering, including on deception, bias, and power-seeking. This extends earlier dictionary learning work from toy models to something closer to deployed scale.

What stands out is the practical scaling: they used scaling laws to choose hyperparameters, reached 34 million features, and observed generalization to images and multiple languages despite text-only training. The steering experiments on harm-related features are concrete and show output changes consistent with the feature labels. They also surface features for entities, code errors, and sarcasm.

The main limitation is the one the authors state outright: no rigorous method exists yet to confirm the features reflect the model's internal computations instead of being shaped by the autoencoder's objective and sparsity penalty. Human inspection and steering results are consistent with faithfulness but do not rule out training artifacts. The feature set is also acknowledged as incomplete.

This is useful for groups working on mechanistic interpretability or safety analysis of large models. The empirical demonstration of scale is solid enough to merit referee time even with the open faithfulness question; the work is grounded in direct training and intervention rather than circular claims.

Referee Report

2 major / 0 minor

Summary. The paper claims that sparse autoencoders (SAEs) with up to 34 million features can be trained on the middle-layer residual stream of Claude 3 Sonnet using scaling laws for hyperparameter selection, yielding features that are multilingual and multimodal (despite text-only training), respond to both concrete and abstract concepts, enable causal steering consistent with human interpretations, and include features for entities, sarcasm, code errors, and safety-relevant behaviors such as deception, power-seeking, sycophancy, and bias. It presents analyses of feature interpretability, geometry, and computational function while explicitly noting that the feature suite is incomplete and that rigorous methods for evaluating whether features faithfully capture model computations are lacking.

Significance. If the features are shown to be faithful to the model's computations, the work would be a significant advance in mechanistic interpretability by providing the first large-scale demonstration that dictionary learning scales to production frontier models, enabling systematic discovery of concepts and causal interventions on behaviors including those relevant to AI safety.

major comments (2)

[Abstract] Abstract: The central scaling claim—that SAEs extract interpretable features from a production-scale model and thereby address whether dictionary learning works beyond small transformers—requires that the discovered features reflect the model's actual internal computations rather than primarily reflecting SAE training dynamics (L1 penalty, reconstruction loss, or initialization). The abstract states that no rigorous evaluation methods exist for this faithfulness question, and the reported evidence (human inspection, steering results, multilingual/multimodal generalization) is consistent with but does not establish model-native features.
[Abstract] Abstract and limitations discussion: The explicit acknowledgment that the feature suite is incomplete and that faithfulness cannot be rigorously evaluated means the causal steering results and interpretability claims remain provisional; without a concrete test distinguishing model computations from SAE artifacts, the scaling demonstration does not yet fully resolve the open question posed in the abstract.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading and for emphasizing the distinction between SAE artifacts and model-native features. We respond to each major comment below.

Referee: [Abstract] Abstract: The central scaling claim—that SAEs extract interpretable features from a production-scale model and thereby address whether dictionary learning works beyond small transformers—requires that the discovered features reflect the model's actual internal computations rather than primarily reflecting SAE training dynamics (L1 penalty, reconstruction loss, or initialization). The abstract states that no rigorous evaluation methods exist for this faithfulness question, and the reported evidence (human inspection, steering results, multilingual/multimodal generalization) is consistent with but does not establish model-native features.

Authors: We agree that the evidence presented (human interpretability judgments, causal steering results, and cross-lingual/cross-modal generalization) is consistent with model-native features but does not constitute rigorous proof that the features are free of SAE training artifacts. The manuscript already states this limitation explicitly in both the abstract and the limitations discussion. Our central claim is narrower: that dictionary learning can be scaled to a production model while producing features that exhibit the reported properties under current evaluation methods. The use of scaling laws for hyperparameter selection and the consistency of steering outcomes with independent interpretations provide additional support beyond what was available for smaller models, even if a definitive faithfulness test remains unavailable.
Revision: no
Referee: [Abstract] Abstract and limitations discussion: The explicit acknowledgment that the feature suite is incomplete and that faithfulness cannot be rigorously evaluated means the causal steering results and interpretability claims remain provisional; without a concrete test distinguishing model computations from SAE artifacts, the scaling demonstration does not yet fully resolve the open question posed in the abstract.

Authors: We concur that the results are provisional precisely because no rigorous faithfulness test exists, as the paper states. The abstract is deliberately structured to pose the scaling question and then immediately qualify the claims with the acknowledged limitations. We maintain that demonstrating successful training and interpretable, steerable features at 34 million scale on a frontier model constitutes progress on the open question, even while the field lacks methods to fully separate model computations from SAE artifacts. The incompleteness of the feature suite is likewise already noted and does not undermine the scaling result for the features that were recovered.
Revision: no

standing simulated objections not resolved

A concrete test distinguishing model computations from SAE artifacts

Circularity Check

0 steps flagged

Empirical scaling demonstration with no circular derivation steps

The paper reports direct training of sparse autoencoders on Claude 3 Sonnet activations, followed by empirical observations of feature interpretability via inspection, steering, and generalization tests. No equations, predictions, or first-principles claims are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The acknowledged limitation on faithfulness evaluation is an explicit open question rather than a hidden circularity. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from dictionary learning literature that sparse autoencoders recover monosemantic features from neural activations; no new entities are introduced.

free parameters (2)

number of features (up to 34 million): chosen using scaling laws that guided hyperparameter selection
sparsity hyperparameter: selected using scaling laws for effective feature extraction
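
As a rough illustration of how a scaling law can guide such choices, the sketch below fits a power law `y = coef * x ** exponent` by ordinary least squares in log-log space and extrapolates it. This is a generic technique, not the authors' procedure; the function names are hypothetical and the data points are synthetic values generated from an exact power law, not measurements from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(coef) + exponent * log(x).

    Requires strictly positive xs and ys. Returns (coef, exponent).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    exponent = sxy / sxx
    coef = math.exp(mean_y - exponent * mean_x)
    return coef, exponent

def predict(coef, exponent, x):
    """Evaluate the fitted power law at x."""
    return coef * x ** exponent

# Synthetic data: loss = 3 * compute ** -0.5 at four compute budgets.
compute = [1.0, 4.0, 16.0, 64.0]
loss = [3.0 * c ** -0.5 for c in compute]
coef, exponent = fit_power_law(compute, loss)
extrapolated = predict(coef, exponent, 256.0)
```

In practice one would fit such curves to losses from small training runs at several settings and use the extrapolation to pick values, such as dictionary size or training steps, for the expensive run.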

axioms (2)

domain assumption: Sparse autoencoders recover interpretable, disentangled features from model residual streams. Core premise invoked for the entire extraction process.
domain assumption: The middle layer residual stream activations contain semantically meaningful information suitable for dictionary learning. Justifies the specific choice of layer and input representation.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evidence for feature-specific error correction in LLMs
cs.LG 2026-06 unverdicted novelty 6.0

Perturbation experiments across six LLMs show activation robustness follows L^p norm with p>2 for feature directions (contrastive, MELBO, SAE) but p≈2 for random/PCA controls, indicating feature-specific error correction.
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
cs.CL 2026-06 unverdicted novelty 6.0

HydraHead hybridizes full and linear attention along the head dimension via interpretability-driven selection and scale-normalized fusion, matching layer-wise hybrids at higher linear ratios after 15B-token training.

Reference graph

Works this paper leans on

78 extracted references · 35 canonical work pages · cited by 2 Pith papers · 19 internal anchors

[1]

Research report: Sparse autoencoders find only 9/180 board state fea- tures in othellogpt, 2024

Robert AIZI. Research report: Sparse autoencoders find only 9/180 board state fea- tures in othellogpt, 2024. URL https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/ research-report-sparse-autoencoders-find-only-9-180-board

2024
[2]

Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, and Jessica N. Howard. Sparse autoen- coders find composed features in small toy models, 2024. URL https://www.lesswrong.com/posts/ a5wwqza2cY3W7L9cj/sparse-autoencoders-find-composed-features-in-small-toy

2024
[3]

Linear algebraic structure of word senses, with applications to polysemy

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018. URL https://aclanthology.org/Q18-1034.pdf

2018
[4]

Using features for easy circuit identification, 2024

Joshua Batson, Brian Chen, and Andy Jones. Using features for easy circuit identification, 2024. URL https://transformer-circuits.pub/2024/march-update/index.html#feature-heads

2024
[5]

Leace: Perfect linear concept erasure in closed form, 2023

Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Bider- man. Leace: Perfect linear concept erasure in closed form, 2023. URL https://arxiv.org/pdf/2306. 03819

2023
[6]

Representation Learning: A Review and New Perspectives

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence , 35(8):1798–1828, 2013. URL https://arxiv.org/pdf/1206.5538

work page internal anchor Pith review Pith/arXiv arXiv 2013
[7]

Language models can explain neurons in language models, 2023

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

2023
[8]

Open source sparse autoencoders for all residual stream layers of gpt2-small, 2024

Joseph Bloom. Open source sparse autoencoders for all residual stream layers of gpt2-small, 2024. URL https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/ open-source-sparse-autoencoders-for-all-residual-stream . 55

2024
[9]

Man is to computer programmer as woman is to homemaker? debiasing word embeddings

Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems , 29, 2016. URL https://proceedings.neurips.cc/paper_files/ paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf

2016
[10]

Identifying functionally im- portant features with end-to-end sparse dictionary learning, 2024

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. Identifying functionally im- portant features with end-to-end sparse dictionary learning, 2024. URL https://publications. apolloresearch.ai/end_to_end_sparse_dictionary_learning.pdf

2024
[11]

Towards monosemanticity: Decomposing language models with dictionary learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

2023
[12]

Discovering Latent Knowledge in Language Models Without Supervision

Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827 , 2022. URL https://arxiv.org/pdf/ 2212.03827

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Infogan: Interpretable representation learning by information maximizing generative adversarial nets

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems , 29, 2016. URL https://proceedings.neurips.cc/paper_ files/paper/2016/file/7c9d0b1f96aebd7b5eca8c3edaa19ebb-Paper.pdf

2016
[14]

Eliciting latent knowledge: How to tell if your eyes deceive you

Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge: How to tell if your eyes deceive you. Google Docs, December , 2021. URL https://docs.google.com/document/d/1WwsnJQstPq91_ Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?tab=t.0#heading=h.kkaua0hwmp1d

2021
[15]

Activation steering with saes,

Arthur Conmy and Neel Nanda. Activation steering with saes,
[16]

URL https://www.lesswrong.com/posts/C5KAZQib3bzzpeyrg/ full-post-progress-update-1-from-the-gdm-mech-interp-team#Activation_Steering_with_ SAEs
[17]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Smith, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable model directions. arXiv preprint arXiv:2309.08600 , 2023. URL https:// arxiv.org/pdf/2309.08600

work page internal anchor Pith review Pith/arXiv arXiv 2023
[18]

On measuring and mitigating biased inferences of word embeddings

Sunipa Dev, Tao Li, Jeff M Phillips, and Vivek Srikumar. On measuring and mitigating biased inferences of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 34, pages 7659–7666, 2020. URL https://ojs.aaai.org/index.php/AAAI/article/view/6267/6123

2020
[19]

Transcoders find interpretable llm feature circuits

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits. Advances in Neural Information Processing Systems , 37:24375–24410, 2025. URL https://arxiv.org/ abs/2406.11944

work page arXiv 2025
[20]

Sparse and redundant representations: from theory to applications in signal and image processing, volume 2

Michael Elad. Sparse and redundant representations: from theory to applications in signal and image processing, volume 2. Springer, 2010. 56

2010
[21]

Softmax linear units

Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Ka- mal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav...

2022
[22]

Toy models of superposition.Trans- former Circuits Thread , 2022

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Trans- former Circuits Thread , 2022. URL https://transformer-circuits.pub/...

2022
[23]

Privileged bases in the transformer resid- ual stream

Nelson Elhage, Robert Lasenby, and Christopher Olah. Privileged bases in the transformer resid- ual stream. Transformer Circuits Thread , 2023. URL https://transformer-circuits.pub/2023/ privileged-basis/index.html

2023
[24]

Sparse Overcomplete Word Vector Representations

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004 , 2015. URL https://arxiv.org/pdf/ 1506.02004

work page internal anchor Pith review Pith/arXiv arXiv 2015
[25]

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and eﬀicient sparsity.arXiv preprint arXiv:2101.03961, 2021. URL https://arxiv. org/pdf/2101.03961

work page internal anchor Pith review Pith/arXiv arXiv 2021
[26]

Common crawl

The Common Crawl Foundation. Common crawl. URL https://commoncrawl.org
[27]

Towards multimodal interpretability: Learning sparse interpretable features in vision transformers, 2024

Hugo Fry. Towards multimodal interpretability: Learning sparse interpretable features in vision transformers, 2024. URL https://www.lesswrong.com/posts/bCtbuWraqYTDtuARg/ towards-multimodal-interpretability-learning-sparse-2

2024
[28]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/pdf/2101.00027

work page internal anchor Pith review Pith/arXiv arXiv 2020
[29]

Decoding the thought vector, 2016

Gabriel Goh. Decoding the thought vector, 2016. URLhttps://gabgoh.github.io/ThoughtVectors/

2016
[30]

Sae reconstruction errors are (empirically) pathologi- cal, 2024

Wes Gurnee. Sae reconstruction errors are (empirically) pathologi- cal, 2024. URL https://www.lesswrong.com/posts/rZPiuFxESMxCDHe4B/ sae-reconstruction-errors-are-empirically-pathological

2024
[31]

Language models represent space and time, 2024

Wes Gurnee and Max Tegmark. Language models represent space and time, 2024. URL https: //arxiv.org/pdf/2310.02207

work page arXiv 2024
[32]

Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello- gpt

Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, and Xipeng Qiu. Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello- gpt. arXiv preprint arXiv:2402.12201 , 2024. URL https://arxiv.org/pdf/2402.12201. 57

work page arXiv 2024
[33]

Superposition, memorization, and double descent.Transformer Circuits Thread, 2023

Tom Henighan, Shan Carter, Tristan Hume, Nelson Elhage, Robert Lasenby, Stanislav Fort, Nicholas Schiefer, and Christopher Olah. Superposition, memorization, and double descent.Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/toy-double-descent/index.html

2023
[34]

Natural language descriptions of deep visual features

Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. InInternational Conference on Learning Representations, 2021. URL https://arxiv.org/pdf/2201.11114

work page arXiv 2021
[35]

beta-vae: Learning basic visual concepts with a constrained varia- tional framework

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained varia- tional framework. 2016. URL https://openreview.net/pdf?id=Sy2fzU9gl

2016
[36]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Ruther- ford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute- optimal large language models. arXiv preprint arXiv:2203.15556 , 2022. URL https://arxiv.org/ pdf/2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Rad- hakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shaun...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

On the ”steerability” of generative adversarial networks

Ali Jahanian, Lucy Chai, and Phillip Isola. On the ”steerability” of generative adversarial networks. arXiv preprint arXiv:1907.07171 , 2019. URL https://arxiv.org/pdf/1907.07171

work page arXiv 1907
[39]

Features in an 8-layer model, 2024

Adam Jermyn, Tom Conerly, Trenton Bricken, and Adly Templeton. Features in an 8-layer model, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html#dict-learning

2024
[40]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 , 2022. URL https://arxiv.org/pdf/2207.05221

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/pdf/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2001
[42]

Disentangling by factorising

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Ma- chine Learning, pages 2649–2658. PMLR, 2018. URL http://proceedings.mlr.press/v80/kim18b/ kim18b.pdf

2018
[43]

Sparse autoencoders work on attention layer outputs, 2024

Connor Kissane, robertzk, Arthur Conmy, and Neel Nanda. Sparse autoencoders work on attention layer outputs, 2024. URL https://www.lesswrong.com/posts/DtdzGwFh9dCfsekZZ/ sparse-autoencoders-work-on-attention-layer-outputs . 58

2024
[44]

Atp*: An eﬀicient and scalable method for localizing llm behaviour to components

János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. Atp*: An eﬀicient and scalable method for localizing llm behaviour to components. arXiv preprint arXiv:2403.00745 , 2024. URL https: //arxiv.org/pdf/2403.00745

work page arXiv 2024
[45]

Emergent world representations: Exploring a sequence model trained on a synthetic task.arXiv preprint arXiv:2210.13382, 2022

Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task.arXiv preprint arXiv:2210.13382, 2022. URL https://arxiv.org/pdf/2210.13382

work page arXiv 2022
[46]

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023. URL https://arxiv.org/pdf/ 2306.03341

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

How strongly do dictionary learning features influence model behavior?, 2024

Jack Lindsey. How strongly do dictionary learning features influence model behavior?, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html#ablation-exps

2024
[48]

Simple probes can catch sleeper agents, 2024

Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duve- naud, Sam Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, and Evan Hub- inger. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/ probes-catch-sleeper-agents

2024
[49]

Eliciting latent knowledge from quirky language models

Alex Mallen and Nora Belrose. Eliciting latent knowledge from quirky language models. arXiv preprint arXiv:2312.01037, 2023. URL https://arxiv.org/pdf/2312.01037

work page arXiv 2023
[50]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824 , 2023. URL https: //arxiv.org/pdf/2310.06824

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

dictionary_learning github repository, 2024

Samuel Marks, Adam Karvonen, and Aaron Mueller. dictionary_learning github repository, 2024. URL https://github.com/saprmarks/dictionary_learning

2024
[52]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models.arXiv preprint arXiv:2403.19647, 2024. URL https://arxiv.org/pdf/2403.19647

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Linguistic regularities in continuous space word representations

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013. URL https://aclanthology.org/N13-1090.pdf

2013
[54]

Transformer debugger, 2024

Dan Mossing, Steven Bills, Henk Tillman, Tom Dupré la Tour, Nick Cammarata, Leo Gao, Joshua Achiam, Catherine Yeh, Jan Leike, Jeff Wu, and William Saunders. Transformer debugger, 2024. URL https://github.com/openai/transformer-debugger

2024
[55]

Actually, othello-gpt has a linear emergent world representation, 2023

Neel Nanda. Actually, othello-gpt has a linear emergent world representation, 2023. URL https://www.neelnanda.io/mechanistic-interpretability/othello

2023
[56]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. URL https://arxiv.org/pdf/2112.00114

[57]

Zoom in: An introduction to circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in

[58]

Distributed representations: Composition & superposition, 2023

Christopher Olah. Distributed representations: Composition & superposition, 2023. URL https://transformer-circuits.pub/2023/superposition-composition/index.html

2023
[59]

Sparse coding with an overcomplete basis set: A strategy employed by V1?

Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997. doi: 10.1016/S0042-6989(97)00169-7. URL https://www.sciencedirect.com/science/article/pii/S0042698997001697

[60]

Mlp neurons - 40l preliminary investigation [rough early thoughts]

Catherine Olsson, Nelson Elhage, and Chris Olah. Mlp neurons - 40l preliminary investigation [rough early thoughts]. URL https://www.youtube.com/watch?v=8wYNsoycM1U
[61]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. URL https://arxiv.org/pdf/1511.06434

[62]

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024. URL https://arxiv.org/pdf/2404.16014

[63]

Improving sae’s by sqrt()-ing l1 & removing lowest activating features, 2024

Logan Riggs and Jannik Brinkmann. Improving sae’s by sqrt()-ing l1 & removing lowest activating features, 2024. URL https://www.lesswrong.com/posts/YiGs8qJ8aNBgwt2YN/improving-sae-s-by-sqrt-ing-l1-and-removing-lowest

2024
[64]

Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition, 2024. URL https://arxiv.org/pdf/2312.06681

[65]

Polysemanticity and capacity in neural networks

Adam Scherlis, Kshitij Sachan, Adam S Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022. URL https://arxiv.org/pdf/2210.01892

[66]

Spine: Sparse interpretable neural embeddings

Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. Spine: Sparse interpretable neural embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. URL https://cdn.aaai.org/ojs/11935/11935-13-15463-1-2-20201228.pdf

2018
[67]

Attribution patching outperforms automated circuit discovery

Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. arXiv preprint arXiv:2310.10348, 2023. URL https://arxiv.org/pdf/2310.10348

[68]

Codebook features: Sparse and discrete interpretability for neural networks

Alex Tamkin, Mohammad Taufeeque, and Noah D Goodman. Codebook features: Sparse and discrete interpretability for neural networks. arXiv preprint arXiv:2310.17230, 2023. URL https://arxiv.org/pdf/2310.17230

[69]

Predicting future activations, 2024

Adly Templeton, Joshua Batson, Adam Jermyn, and Chris Olah. Predicting future activations, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html#predict-future

2024
[70]

Do sparse autoencoders find "true features"?, 2024

Demian Till. Do sparse autoencoders find "true features"?, 2024. URL https://www.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features

2024
[71]

Function vectors in large language models

Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. arXiv preprint arXiv:2310.15213, 2023. URL https://arxiv.org/pdf/2310.15213

[72]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2023. URL https://arxiv.org/pdf/2308.10248

[73]

Deep feature interpolation for image content changes

Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and Kilian Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017. URL https://openaccess.thecvf.com/content_cvpr_2017/papers/Upchurch_Deep_Feature_Interpolat...

2017
[74]

Toward a mathematical framework for computation in superposition, 2024

Dmitry Vaintrob, Jake Mendel, and Kaarel Hänni. Toward a mathematical framework for computation in superposition, 2024. URL https://www.lesswrong.com/posts/2roZtSr5TGmLjXMnT/toward-a-mathematical-framework-for-computation-in

2024
[75]

Addressing feature suppression in saes, 2024

Benjamin Wright and Lee Sharkey. Addressing feature suppression in saes, 2024. URL https://www.lesswrong.com/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes

2024
[76]

Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021. URL https://arxiv.org/pdf/2103.15949

[77]

Word embedding visualization via dictionary learning

Juexiao Zhang, Yubei Chen, Brian Cheung, and Bruno A Olshausen. Word embedding visualization via dictionary learning. arXiv preprint arXiv:1910.03833, 2019. URL https://arxiv.org/pdf/1910.03833

[78]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023. URL https://arxiv.org/pdf/2310.01405


[46] [46]

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023. URL https://arxiv.org/pdf/ 2306.03341

work page internal anchor Pith review Pith/arXiv arXiv 2023

[47] [47]

How strongly do dictionary learning features influence model behavior?, 2024

Jack Lindsey. How strongly do dictionary learning features influence model behavior?, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html#ablation-exps

2024

[48] [48]

Simple probes can catch sleeper agents, 2024

Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duve- naud, Sam Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, and Evan Hub- inger. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/ probes-catch-sleeper-agents

2024

[49] [49]

Eliciting latent knowledge from quirky language models

Alex Mallen and Nora Belrose. Eliciting latent knowledge from quirky language models. arXiv preprint arXiv:2312.01037, 2023. URL https://arxiv.org/pdf/2312.01037

work page arXiv 2023

[50] [50]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824 , 2023. URL https: //arxiv.org/pdf/2310.06824

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

dictionary_learning github repository, 2024

Samuel Marks, Adam Karvonen, and Aaron Mueller. dictionary_learning github repository, 2024. URL https://github.com/saprmarks/dictionary_learning

2024

[52] [52]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models.arXiv preprint arXiv:2403.19647, 2024. URL https://arxiv.org/pdf/2403.19647

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

Linguistic regularities in continuous space word representations

Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies , pages 746–751, 2013. URL https:// aclanthology.org/N13-1090.pdf

2013

[54] [54]

Transformer debugger, 2024

Dan Mossing, Steven Bills, Henk Tillman, Tom Dupré la Tour, Nick Cammarata, Leo Gao, Joshua Achiam, Catherine Yeh, Jan Leike, Jeff Wu, and William Saunders. Transformer debugger, 2024. URL https://github.com/openai/transformer-debugger

2024

[55] [55]

Actually, othello-gpt has a linear emergent world representation, 2023

Neel Nanda. Actually, othello-gpt has a linear emergent world representation, 2023. URL https: //www.neelnanda.io/mechanistic-interpretability/othello

2023

[56] [56]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratch- pads for intermediate computation with language models. arXiv preprint arXiv:2112.00114 , 2021. URL https://arxiv.org/pdf/2112.00114. 59

work page internal anchor Pith review Pith/arXiv arXiv 2021

[57] [57]

Zoom in: An introduction to circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. URL https://distill. pub/2020/circuits/zoom-in. https://distill.pub/2020/circuits/zoom-in

work page doi:10.23915/distill.00024.001 2020

[58] [58]

Distributed representations: Composition & superposition, 2023

Christopher Olah. Distributed representations: Composition & superposition, 2023. URL https: //transformer-circuits.pub/2023/superposition-composition/index.html

2023

[59] [59]

Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997

Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997. doi: 10.1016/S0042-6989(97)00169-7. URL https://www.sciencedirect.com/science/article/pii/S0042698997001697

work page doi:10.1016/s0042-6989(97)00169-7 1997

[60] [60]

Mlp neurons - 40l preliminary investigation [rough early thoughts]

Catherine Olsson, Nelson Elhage, and Chris Olah. Mlp neurons - 40l preliminary investigation [rough early thoughts]. URL https://www.youtube.com/watch?v=8wYNsoycM1U

[61] [61]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 , 2015. URL https: //arxiv.org/pdf/1511.06434

work page internal anchor Pith review Pith/arXiv arXiv 2015

[62] [62]

Improving Dictionary Learning with Gated Sparse Autoencoders

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014 , 2024. URL https://arxiv.org/pdf/2404.16014

work page internal anchor Pith review Pith/arXiv arXiv 2024

[63] [63]

Improving sae’s by sqrt()-ing l1 & removing low- est activating features, 2024

Logan Riggs and Jannik Brinkmann. Improving sae’s by sqrt()-ing l1 & removing low- est activating features, 2024. URL https://www.lesswrong.com/posts/YiGs8qJ8aNBgwt2YN/ improving-sae-s-by-sqrt-ing-l1-and-removing-lowest

2024

[64] [64]

Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition, 2024. URLhttps://arxiv.org/pdf/2312.06681

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Polysemanticity and capacity in neural networks

Adam Scherlis, Kshitij Sachan, Adam S Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892 , 2022. URL https://arxiv.org/pdf/ 2210.01892

work page arXiv 2022

[66] [66]

Spine: Sparse interpretable neural embeddings

Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. Spine: Sparse interpretable neural embeddings. In Proceedings of the AAAI Con- ference on Artificial Intelligence , volume 32, 2018. URL https://cdn.aaai.org/ojs/11935/ 11935-13-15463-1-2-20201228.pdf

2018

[67] [67]

Attribution patching outperforms automated circuit discovery

Aaquib Syed, Can Rager, and Arthur Conmy. attribution patching outperforms automated circuit discovery. arXiv preprint arXiv:2310.10348 , 2023. URL https://arxiv.org/pdf/2310.10348

work page arXiv 2023

[68] [68]

Codebook features: Sparse and discrete interpretability for neural networks.arXiv preprint arXiv:2310.17230, 2023

Alex Tamkin, Mohammad Taufeeque, and Noah D Goodman. Codebook features: Sparse and discrete interpretability for neural networks.arXiv preprint arXiv:2310.17230, 2023. URL https://arxiv.org/ pdf/2310.17230

work page arXiv 2023

[69] [69]

Predicting future activations, 2024

Adly Templeton, Joshua Batson, Adam Jermyn, and Chris Olah. Predicting future activations, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html#predict-future

2024

[70] [70]

Do sparse autoencoders find ”true features”?, 2024

Demian Till. Do sparse autoencoders find ”true features”?, 2024. URL https://www.lesswrong.com/ posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features . 60

2024

[71] [71]

Function vectors in large language models

Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. arXiv preprint arXiv:2310.15213 , 2023. URL https:// arxiv.org/pdf/2310.15213

work page arXiv 2023

[72] [72]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2023. URL https://arxiv.org/ pdf/2308.10248

2023

[73]

Deep feature interpolation for image content changes

Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and Kilian Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7064–7073, 2017. URL https://openaccess.thecvf.com/content_cvpr_2017/papers/Upchurch_Deep_Feature_Interpolat...

2017

[74]

Toward a mathematical framework for computation in superposition

Dmitry Vaintrob, Jake Mendel, and Kaarel Hänni. Toward a mathematical framework for computation in superposition, 2024. URL https://www.lesswrong.com/posts/2roZtSr5TGmLjXMnT/toward-a-mathematical-framework-for-computation-in

2024

[75]

Addressing feature suppression in SAEs

Benjamin Wright and Lee Sharkey. Addressing feature suppression in SAEs, 2024. URL https://www.lesswrong.com/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes

2024

[76]

Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021. URL https://arxiv.org/pdf/2103.15949

2021

[77]

Word embedding visualization via dictionary learning

Juexiao Zhang, Yubei Chen, Brian Cheung, and Bruno A Olshausen. Word embedding visualization via dictionary learning. arXiv preprint arXiv:1910.03833, 2019. URL https://arxiv.org/pdf/1910.03833

2019

[78]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023. URL https://arxiv.org/pdf/2310.01405

2023

A Author Contributions

A.1 Infrastructure, Tooling, and ...