arxiv: 2605.29358 · v1 · pith:DUG2N5UNnew · submitted 2026-05-28 · 💻 cs.AI

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Pith reviewed 2026-06-29 07:47 UTC · model grok-4.3

classification 💻 cs.AI
keywords sparse autoencoders · dictionary learning · model interpretability · feature extraction · residual stream · monosemanticity · multimodal generalization

The pith

Sparse autoencoders extract up to 34 million interpretable features from a production-scale language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sparse autoencoders scale to a large language model by training them on residual stream activations to recover millions of features aligned with understandable concepts. These features respond to both specific entities and abstract ideas, generalize across languages and to images despite text-only training, and permit steering of model outputs in directions consistent with the feature meanings. The work also locates features tied to behaviors such as deception, power-seeking, and bias, demonstrating that activation or suppression of those features changes model generations accordingly. This matters because it provides a concrete route to decompose the internal representations of advanced models into parts that humans can inspect and adjust. The authors emphasize that the recovered set remains incomplete and that no rigorous tests yet confirm the features match the model's actual computations rather than autoencoder artifacts.

Core claim

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal, respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code.

What carries the argument

Sparse autoencoders trained on middle-layer residual stream activations to produce a dictionary of features.
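
To make the mechanism concrete, here is a minimal NumPy sketch of a sparse autoencoder forward pass and its training objective: a reconstruction term plus an L1 sparsity penalty, the two ingredients the referee report names. The function names, the ReLU encoder, the toy dimensions, and the value of `l1_coeff` are assumptions chosen for illustration. This is not the paper's implementation, and the sizes are far below the 34 million features it reports.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    f = np.maximum(0.0, x @ W_enc + b_enc)  # feature activations, (batch, n_features)
    x_hat = f @ W_dec + b_dec               # reconstruction, (batch, d_model)
    return f, x_hat

def sae_loss(x, f, x_hat, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return recon + l1_coeff * sparsity

# Toy stand-in for residual stream activations (random, not real model data).
rng = np.random.default_rng(0)
d_model, n_features, batch = 16, 64, 8
x = rng.normal(size=(batch, d_model))

# Hypothetical, untrained parameters; a real run would optimize these.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, l1_coeff=5.0)
```

Each row of `W_dec` plays the role of one dictionary feature: a direction in activation space whose coefficient in `f` says how strongly that feature is present.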

If this is right

  • Features generalize to image inputs even though training used only text.
  • Activation of harm-related features such as those for deception or bias alters model generations in the direction predicted by the feature interpretation.
  • The same feature can respond to both concrete examples and abstract discussion of the same concept.
  • Geometric and functional analyses of the learned features reveal additional regularities in how the model organizes its representations.
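
The steering claim in this list can be pictured with a small sketch. One simple way to "activate" a feature is to add a scaled copy of its decoder direction to the residual stream at every position. The helper `steer`, the unit normalization, and the toy arrays are assumptions for illustration; the paper's own intervention procedure may differ in detail.

```python
import numpy as np

def steer(resid, W_dec, feature_idx, scale):
    """Add a scaled, unit-normalized feature direction to every position."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return resid + scale * direction

# Toy stand-ins (random, not real model activations or a trained dictionary).
rng = np.random.default_rng(0)
resid = rng.normal(size=(5, 16))   # (positions, d_model)
W_dec = rng.normal(size=(64, 16))  # (n_features, d_model)

steered = steer(resid, W_dec, feature_idx=3, scale=4.0)

# The projection onto the feature direction moves by exactly `scale`.
unit = W_dec[3] / np.linalg.norm(W_dec[3])
shift = (steered - resid) @ unit
```

A negative `scale` corresponds to suppression under the same assumptions.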

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the features prove faithful, the same method could be applied to additional layers to build a more complete map of model behavior.
  • Similar dictionary learning could be tested on non-transformer architectures to check whether comparable monosemantic features appear.
  • A direct test would be to train independent autoencoders on the same model activations and measure how consistently the same concepts are recovered.
  • The ability to steer outputs via individual features suggests a route to targeted modification of model tendencies without retraining the entire system.
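
The consistency test suggested in this list could be scored, for instance, by matching each feature direction from one autoencoder to its closest direction in another and reading off the cosine similarity. The sketch below assumes that metric and a threshold of 0.9, neither of which comes from the paper, and it simulates the "second run" by permuting and lightly perturbing the first dictionary.

```python
import numpy as np

def max_cosine_match(dict_a, dict_b):
    """For each row of dict_a, the best cosine similarity to any row of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

rng = np.random.default_rng(0)
dict_a = rng.normal(size=(32, 16))  # hypothetical decoder from run A

# Simulated run B: same directions in a different order, plus small noise.
perm = rng.permutation(32)
dict_b = dict_a[perm] + 0.01 * rng.normal(size=(32, 16))

scores = max_cosine_match(dict_a, dict_b)
recovered_fraction = float(np.mean(scores > 0.9))  # threshold is an assumption
```

With real, independently trained autoencoders the interesting quantity is how far `recovered_fraction` falls below 1, and which concepts are the ones that fail to reappear.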

Load-bearing premise

The features recovered by the autoencoders correspond to the language model's actual internal computations rather than arising as side effects of how the autoencoders were trained.

What would settle it

An intervention experiment in which activating a feature labeled as representing deception produces no measurable increase in deceptive outputs on a held-out set of prompts.
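
One way to make "no measurable increase" operational is a paired sign-flip permutation test over per-prompt labels. The sketch assumes each held-out prompt is scored 0 or 1 for deceptive output, once without and once with the feature activated; the scoring procedure, the function name, and the placeholder labels below are all hypothetical and are not results from the paper.

```python
import numpy as np

def paired_sign_flip_test(baseline, steered, n_perm=10000, seed=0):
    """One-sided test of whether steering raises the mean label across prompts."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(steered, dtype=float) - np.asarray(baseline, dtype=float)
    observed = diffs.mean()
    # Under the null, the sign of each per-prompt difference is arbitrary.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    # Small tolerance guards against floating-point ties.
    p_value = (np.sum(null >= observed - 1e-12) + 1) / (n_perm + 1)
    return float(observed), float(p_value)

# Placeholder labels for 20 prompts (made up purely to exercise the code).
baseline = np.zeros(20)
steered = np.array([1.0] * 15 + [0.0] * 5)

effect, p_value = paired_sign_flip_test(baseline, steered)
no_effect, p_null = paired_sign_flip_test(baseline, baseline)
```

A large `p_value` on real held-out prompts would be the outcome described above: activation of the feature without a detectable rise in the labeled behavior.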

Original abstract

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that sparse autoencoders (SAEs) with up to 34 million features can be trained on the middle-layer residual stream of Claude 3 Sonnet using scaling laws for hyperparameter selection, yielding features that are multilingual and multimodal (despite text-only training), respond to both concrete and abstract concepts, enable causal steering consistent with human interpretations, and include features for entities, sarcasm, code errors, and safety-relevant behaviors such as deception, power-seeking, sycophancy, and bias. It presents analyses of feature interpretability, geometry, and computational function while explicitly noting that the feature suite is incomplete and that rigorous methods for evaluating whether features faithfully capture model computations are lacking.

Significance. If the features are shown to be faithful to the model's computations, the work would be a significant advance in mechanistic interpretability by providing the first large-scale demonstration that dictionary learning scales to production frontier models, enabling systematic discovery of concepts and causal interventions on behaviors including those relevant to AI safety.

major comments (2)
  1. [Abstract] Abstract: The central scaling claim—that SAEs extract interpretable features from a production-scale model and thereby address whether dictionary learning works beyond small transformers—requires that the discovered features reflect the model's actual internal computations rather than primarily reflecting SAE training dynamics (L1 penalty, reconstruction loss, or initialization). The abstract states that no rigorous evaluation methods exist for this faithfulness question, and the reported evidence (human inspection, steering results, multilingual/multimodal generalization) is consistent with but does not establish model-native features.
  2. [Abstract] Abstract and limitations discussion: The explicit acknowledgment that the feature suite is incomplete and that faithfulness cannot be rigorously evaluated means the causal steering results and interpretability claims remain provisional; without a concrete test distinguishing model computations from SAE artifacts, the scaling demonstration does not yet fully resolve the open question posed in the abstract.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading and for emphasizing the distinction between SAE artifacts and model-native features. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central scaling claim—that SAEs extract interpretable features from a production-scale model and thereby address whether dictionary learning works beyond small transformers—requires that the discovered features reflect the model's actual internal computations rather than primarily reflecting SAE training dynamics (L1 penalty, reconstruction loss, or initialization). The abstract states that no rigorous evaluation methods exist for this faithfulness question, and the reported evidence (human inspection, steering results, multilingual/multimodal generalization) is consistent with but does not establish model-native features.

    Authors: We agree that the evidence presented (human interpretability judgments, causal steering results, and cross-lingual/cross-modal generalization) is consistent with model-native features but does not constitute rigorous proof that the features are free of SAE training artifacts. The manuscript already states this limitation explicitly in both the abstract and the limitations discussion. Our central claim is narrower: that dictionary learning can be scaled to a production model while producing features that exhibit the reported properties under current evaluation methods. The use of scaling laws for hyperparameter selection and the consistency of steering outcomes with independent interpretations provide additional support beyond what was available for smaller models, even if a definitive faithfulness test remains unavailable.

    revision: no

  2. Referee: [Abstract] Abstract and limitations discussion: The explicit acknowledgment that the feature suite is incomplete and that faithfulness cannot be rigorously evaluated means the causal steering results and interpretability claims remain provisional; without a concrete test distinguishing model computations from SAE artifacts, the scaling demonstration does not yet fully resolve the open question posed in the abstract.

    Authors: We concur that the results are provisional precisely because no rigorous faithfulness test exists, as the paper states. The abstract is deliberately structured to pose the scaling question and then immediately qualify the claims with the acknowledged limitations. We maintain that demonstrating successful training and interpretable, steerable features at the 34-million-feature scale on a frontier model constitutes progress on the open question, even while the field lacks methods to fully separate model computations from SAE artifacts. The incompleteness of the feature suite is likewise already noted and does not undermine the scaling result for the features that were recovered.

    revision: no

standing simulated objections not resolved
  • A concrete test distinguishing model computations from SAE artifacts

Circularity Check

0 steps flagged

Empirical scaling demonstration with no circular derivation steps

Full rationale

The paper reports direct training of sparse autoencoders on Claude 3 Sonnet activations, followed by empirical observations of feature interpretability via inspection, steering, and generalization tests. No equations, predictions, or first-principles claims are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The acknowledged limitation on faithfulness evaluation is an explicit open question rather than a hidden circularity. This is self-contained empirical work.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions from the dictionary learning literature, chiefly that sparse autoencoders recover monosemantic features from neural activations; no new entities are introduced.

free parameters (2)
  • number of features = up to 34 million
    Autoencoders with up to 34 million features were trained, with scaling laws guiding hyperparameter selection
  • sparsity hyperparameter
    Selected using scaling laws for effective feature extraction
axioms (2)
  • domain assumption Sparse autoencoders recover interpretable, disentangled features from model residual streams
    Core premise invoked for the entire extraction process
  • domain assumption The middle layer residual stream activations contain semantically meaningful information suitable for dictionary learning
    Justifies the specific choice of layer and input representation
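
The ledger says both free parameters were set with the help of scaling laws but does not give their functional form. As a generic illustration only, a power law y = a * x^b can be fitted by linear regression in log-log space and then extrapolated. The data below are synthetic and generated from a known power law, and `fit_power_law` is a hypothetical helper, so nothing here reflects the paper's actual fits or numbers.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by least squares on log(x), log(y); return (a, b)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)

# Synthetic points drawn from a known power law (not measurements).
compute = np.array([1e15, 1e16, 1e17, 1e18])
loss = 3.0 * compute ** -0.05

a, b = fit_power_law(compute, loss)

# Extrapolate the fitted curve one order of magnitude further.
predicted = a * 1e19 ** b
```

In practice one would fit such curves for several candidate settings and pick the setting whose curve predicts the lowest loss at the target budget; whether the paper proceeds this way is not stated in the review.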



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evidence for feature-specific error correction in LLMs

    cs.LG 2026-06 unverdicted novelty 6.0

    Perturbation experiments across six LLMs show activation robustness follows L^p norm with p>2 for feature directions (contrastive, MELBO, SAE) but p≈2 for random/PCA controls, indicating feature-specific error correction.

  2. HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

    cs.CL 2026-06 unverdicted novelty 6.0

    HydraHead hybridizes full and linear attention along the head dimension via interpretability-driven selection and scale-normalized fusion, matching layer-wise hybrids at higher linear ratios after 15B-token training.

Reference graph

Works this paper leans on

78 extracted references · 35 canonical work pages · cited by 2 Pith papers · 19 internal anchors

  1. [1]

    Research report: Sparse autoencoders find only 9/180 board state fea- tures in othellogpt, 2024

    Robert AIZI. Research report: Sparse autoencoders find only 9/180 board state fea- tures in othellogpt, 2024. URL https://www.lesswrong.com/posts/BduCMgmjJnCtc7jKc/ research-report-sparse-autoencoders-find-only-9-180-board

  2. [2]

    Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, and Jessica N. Howard. Sparse autoen- coders find composed features in small toy models, 2024. URL https://www.lesswrong.com/posts/ a5wwqza2cY3W7L9cj/sparse-autoencoders-find-composed-features-in-small-toy

  3. [3]

    Linear algebraic structure of word senses, with applications to polysemy

    Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018. URL https://aclanthology.org/Q18-1034.pdf

  4. [4]

    Using features for easy circuit identification, 2024

    Joshua Batson, Brian Chen, and Andy Jones. Using features for easy circuit identification, 2024. URL https://transformer-circuits.pub/2024/march-update/index.html#feature-heads

  5. [5]

    Leace: Perfect linear concept erasure in closed form, 2023

    Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Bider- man. Leace: Perfect linear concept erasure in closed form, 2023. URL https://arxiv.org/pdf/2306. 03819

  6. [6]

    Representation Learning: A Review and New Perspectives

    Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence , 35(8):1798–1828, 2013. URL https://arxiv.org/pdf/1206.5538

  7. [7]

    Language models can explain neurons in language models, 2023

    Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can explain neurons in language models, 2023. URL https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

  8. [8]

    Open source sparse autoencoders for all residual stream layers of gpt2-small, 2024

    Joseph Bloom. Open source sparse autoencoders for all residual stream layers of gpt2-small, 2024. URL https://www.lesswrong.com/posts/f9EgfLSurAiqRJySD/ open-source-sparse-autoencoders-for-all-residual-stream . 55

  9. [9]

    Man is to computer programmer as woman is to homemaker? debiasing word embeddings

    Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems , 29, 2016. URL https://proceedings.neurips.cc/paper_files/ paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf

  10. [10]

    Identifying functionally im- portant features with end-to-end sparse dictionary learning, 2024

    Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. Identifying functionally im- portant features with end-to-end sparse dictionary learning, 2024. URL https://publications. apolloresearch.ai/end_to_end_sparse_dictionary_learning.pdf

  11. [11]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

  12. [12]

    Discovering Latent Knowledge in Language Models Without Supervision

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827 , 2022. URL https://arxiv.org/pdf/ 2212.03827

  13. [13]

    Infogan: Interpretable representation learning by information maximizing generative adversarial nets

    Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems , 29, 2016. URL https://proceedings.neurips.cc/paper_ files/paper/2016/file/7c9d0b1f96aebd7b5eca8c3edaa19ebb-Paper.pdf

  14. [14]

    Eliciting latent knowledge: How to tell if your eyes deceive you

    Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge: How to tell if your eyes deceive you. Google Docs, December , 2021. URL https://docs.google.com/document/d/1WwsnJQstPq91_ Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?tab=t.0#heading=h.kkaua0hwmp1d

  15. [15]

    Activation steering with saes,

    Arthur Conmy and Neel Nanda. Activation steering with saes,

  16. [16]

    URL https://www.lesswrong.com/posts/C5KAZQib3bzzpeyrg/ full-post-progress-update-1-from-the-gdm-mech-interp-team#Activation_Steering_with_ SAEs

  17. [17]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Smith, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable model directions. arXiv preprint arXiv:2309.08600 , 2023. URL https:// arxiv.org/pdf/2309.08600

  18. [18]

    On measuring and mitigating biased inferences of word embeddings

    Sunipa Dev, Tao Li, Jeff M Phillips, and Vivek Srikumar. On measuring and mitigating biased inferences of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 34, pages 7659–7666, 2020. URL https://ojs.aaai.org/index.php/AAAI/article/view/6267/6123

  19. [19]

    Transcoders find interpretable llm feature circuits

    Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits. Advances in Neural Information Processing Systems , 37:24375–24410, 2025. URL https://arxiv.org/ abs/2406.11944

  20. [20]

    Sparse and redundant representations: from theory to applications in signal and image processing, volume 2

    Michael Elad. Sparse and redundant representations: from theory to applications in signal and image processing, volume 2. Springer, 2010. 56

  21. [21]

    Softmax linear units

    Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Ka- mal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav...

  22. [22]

    Toy models of superposition.Trans- former Circuits Thread , 2022

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition.Trans- former Circuits Thread , 2022. URL https://transformer-circuits.pub/...

  23. [23]

    Privileged bases in the transformer resid- ual stream

    Nelson Elhage, Robert Lasenby, and Christopher Olah. Privileged bases in the transformer resid- ual stream. Transformer Circuits Thread , 2023. URL https://transformer-circuits.pub/2023/ privileged-basis/index.html

  24. [24]

    Sparse Overcomplete Word Vector Representations

    Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004 , 2015. URL https://arxiv.org/pdf/ 1506.02004

  25. [25]

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.arXiv preprint arXiv:2101.03961, 2021. URL https://arxiv. org/pdf/2101.03961

  26. [26]

    Common crawl

    The Common Crawl Foundation. Common crawl. URL https://commoncrawl.org

  27. [27]

    Towards multimodal interpretability: Learning sparse interpretable features in vision transformers, 2024

    Hugo Fry. Towards multimodal interpretability: Learning sparse interpretable features in vision transformers, 2024. URL https://www.lesswrong.com/posts/bCtbuWraqYTDtuARg/ towards-multimodal-interpretability-learning-sparse-2

  28. [28]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/pdf/2101.00027

  29. [29]

    Decoding the thought vector, 2016

    Gabriel Goh. Decoding the thought vector, 2016. URLhttps://gabgoh.github.io/ThoughtVectors/

  30. [30]

    Sae reconstruction errors are (empirically) pathologi- cal, 2024

    Wes Gurnee. Sae reconstruction errors are (empirically) pathologi- cal, 2024. URL https://www.lesswrong.com/posts/rZPiuFxESMxCDHe4B/ sae-reconstruction-errors-are-empirically-pathological

  31. [31]

    Language models represent space and time, 2024

    Wes Gurnee and Max Tegmark. Language models represent space and time, 2024. URL https: //arxiv.org/pdf/2310.02207

  32. [32]

    Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello- gpt

    Zhengfu He, Xuyang Ge, Qiong Tang, Tianxiang Sun, Qinyuan Cheng, and Xipeng Qiu. Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello- gpt. arXiv preprint arXiv:2402.12201 , 2024. URL https://arxiv.org/pdf/2402.12201. 57

  33. [33]

    Superposition, memorization, and double descent.Transformer Circuits Thread, 2023

    Tom Henighan, Shan Carter, Tristan Hume, Nelson Elhage, Robert Lasenby, Stanislav Fort, Nicholas Schiefer, and Christopher Olah. Superposition, memorization, and double descent.Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/toy-double-descent/index.html

  34. [34]

    Natural language descriptions of deep visual features

    Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, and Jacob Andreas. Natural language descriptions of deep visual features. InInternational Conference on Learning Representations, 2021. URL https://arxiv.org/pdf/2201.11114

  35. [35]

    beta-vae: Learning basic visual concepts with a constrained varia- tional framework

    Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained varia- tional framework. 2016. URL https://openreview.net/pdf?id=Sy2fzU9gl

  36. [36]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Ruther- ford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute- optimal large language models. arXiv preprint arXiv:2203.15556 , 2022. URL https://arxiv.org/ pdf/2203.15556

  37. [37]

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Rad- hakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shaun...

  38. [38]

    On the ”steerability” of generative adversarial networks

    Ali Jahanian, Lucy Chai, and Phillip Isola. On the ”steerability” of generative adversarial networks. arXiv preprint arXiv:1907.07171 , 2019. URL https://arxiv.org/pdf/1907.07171

  39. [39]

    Features in an 8-layer model, 2024

    Adam Jermyn, Tom Conerly, Trenton Bricken, and Adly Templeton. Features in an 8-layer model, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html#dict-learning

  40. [40]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 , 2022. URL https://arxiv.org/pdf/2207.05221

  41. [41]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/pdf/2001.08361

  42. [42]

    Disentangling by factorising

    Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Ma- chine Learning, pages 2649–2658. PMLR, 2018. URL http://proceedings.mlr.press/v80/kim18b/ kim18b.pdf

  43. [43]

    Sparse autoencoders work on attention layer outputs, 2024

    Connor Kissane, robertzk, Arthur Conmy, and Neel Nanda. Sparse autoencoders work on attention layer outputs, 2024. URL https://www.lesswrong.com/posts/DtdzGwFh9dCfsekZZ/ sparse-autoencoders-work-on-attention-layer-outputs . 58

  44. [44]

    Atp*: An efficient and scalable method for localizing llm behaviour to components

    János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. Atp*: An efficient and scalable method for localizing llm behaviour to components. arXiv preprint arXiv:2403.00745 , 2024. URL https: //arxiv.org/pdf/2403.00745

  45. [45]

    Emergent world representations: Exploring a sequence model trained on a synthetic task.arXiv preprint arXiv:2210.13382, 2022

    Kenneth Li, Aspen K Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task.arXiv preprint arXiv:2210.13382, 2022. URL https://arxiv.org/pdf/2210.13382

  46. [46]

    Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model, 2023. URL https://arxiv.org/pdf/ 2306.03341

  47. [47]

    How strongly do dictionary learning features influence model behavior?, 2024

    Jack Lindsey. How strongly do dictionary learning features influence model behavior?, 2024. URL https://transformer-circuits.pub/2024/april-update/index.html#ablation-exps

  48. [48]

    Simple probes can catch sleeper agents, 2024

    Monte MacDiarmid, Timothy Maxwell, Nicholas Schiefer, Jesse Mu, Jared Kaplan, David Duve- naud, Sam Bowman, Alex Tamkin, Ethan Perez, Mrinank Sharma, Carson Denison, and Evan Hub- inger. Simple probes can catch sleeper agents, 2024. URL https://www.anthropic.com/news/ probes-catch-sleeper-agents

  49. [49]

    Eliciting latent knowledge from quirky language models

    Alex Mallen and Nora Belrose. Eliciting latent knowledge from quirky language models. arXiv preprint arXiv:2312.01037, 2023. URL https://arxiv.org/pdf/2312.01037

  50. [50]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824 , 2023. URL https: //arxiv.org/pdf/2310.06824

  51. [51]

    dictionary_learning github repository, 2024

    Samuel Marks, Adam Karvonen, and Aaron Mueller. dictionary_learning github repository, 2024. URL https://github.com/saprmarks/dictionary_learning

  52. [52]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models.arXiv preprint arXiv:2403.19647, 2024. URL https://arxiv.org/pdf/2403.19647

  53. [53]

    Linguistic regularities in continuous space word representations

    Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies , pages 746–751, 2013. URL https:// aclanthology.org/N13-1090.pdf

  54. [54]

    Transformer debugger, 2024

    Dan Mossing, Steven Bills, Henk Tillman, Tom Dupré la Tour, Nick Cammarata, Leo Gao, Joshua Achiam, Catherine Yeh, Jan Leike, Jeff Wu, and William Saunders. Transformer debugger, 2024. URL https://github.com/openai/transformer-debugger

  55. [55]

    Actually, othello-gpt has a linear emergent world representation, 2023

    Neel Nanda. Actually, othello-gpt has a linear emergent world representation, 2023. URL https: //www.neelnanda.io/mechanistic-interpretability/othello

  56. [56]

    Show Your Work: Scratchpads for Intermediate Computation with Language Models

    Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratch- pads for intermediate computation with language models. arXiv preprint arXiv:2112.00114 , 2021. URL https://arxiv.org/pdf/2112.00114. 59

  57. [57]

    Zoom in: An introduction to circuits

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. URL https://distill. pub/2020/circuits/zoom-in. https://distill.pub/2020/circuits/zoom-in

  58. [58]

    Distributed representations: Composition & superposition, 2023

    Christopher Olah. Distributed representations: Composition & superposition, 2023. URL https://transformer-circuits.pub/2023/superposition-composition/index.html

  59. [59]

    Sparse coding with an overcomplete basis set: A strategy employed by v1?

    Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997. doi: 10.1016/S0042-6989(97)00169-7. URL https://www.sciencedirect.com/science/article/pii/S0042698997001697

  60. [60]

    MLP neurons - 40L preliminary investigation [rough early thoughts]

    Catherine Olsson, Nelson Elhage, and Chris Olah. MLP neurons - 40L preliminary investigation [rough early thoughts]. URL https://www.youtube.com/watch?v=8wYNsoycM1U

  61. [61]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015. URL https://arxiv.org/pdf/1511.06434

  62. [62]

    Improving Dictionary Learning with Gated Sparse Autoencoders

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. arXiv preprint arXiv:2404.16014, 2024. URL https://arxiv.org/pdf/2404.16014

  63. [63]

    Improving sae’s by sqrt()-ing l1 & removing lowest activating features, 2024

    Logan Riggs and Jannik Brinkmann. Improving sae’s by sqrt()-ing l1 & removing lowest activating features, 2024. URL https://www.lesswrong.com/posts/YiGs8qJ8aNBgwt2YN/improving-sae-s-by-sqrt-ing-l1-and-removing-lowest

  64. [64]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition, 2024. URL https://arxiv.org/pdf/2312.06681

  65. [65]

    Polysemanticity and capacity in neural networks

    Adam Scherlis, Kshitij Sachan, Adam S Jermyn, Joe Benton, and Buck Shlegeris. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022. URL https://arxiv.org/pdf/2210.01892

  66. [66]

    Spine: Sparse interpretable neural embeddings

    Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. Spine: Sparse interpretable neural embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018. URL https://cdn.aaai.org/ojs/11935/11935-13-15463-1-2-20201228.pdf

  67. [67]

    Attribution patching outperforms automated circuit discovery

    Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. arXiv preprint arXiv:2310.10348, 2023. URL https://arxiv.org/pdf/2310.10348

  68. [68]

    Codebook features: Sparse and discrete interpretability for neural networks

    Alex Tamkin, Mohammad Taufeeque, and Noah D Goodman. Codebook features: Sparse and discrete interpretability for neural networks. arXiv preprint arXiv:2310.17230, 2023. URL https://arxiv.org/pdf/2310.17230

  69. [69]

    Predicting future activations, 2024

    Adly Templeton, Joshua Batson, Adam Jermyn, and Chris Olah. Predicting future activations, 2024. URL https://transformer-circuits.pub/2024/jan-update/index.html#predict-future

  70. [70]

    Do sparse autoencoders find “true features”?, 2024

    Demian Till. Do sparse autoencoders find “true features”?, 2024. URL https://www.lesswrong.com/posts/QoR8noAB3Mp2KBA4B/do-sparse-autoencoders-find-true-features

  71. [71]

    Function vectors in large language models

    Eric Todd, Millicent L Li, Arnab Sen Sharma, Aaron Mueller, Byron C Wallace, and David Bau. Function vectors in large language models. arXiv preprint arXiv:2310.15213, 2023. URL https://arxiv.org/pdf/2310.15213

  72. [72]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2023. URL https://arxiv.org/pdf/2308.10248

  73. [73]

    Deep feature interpolation for image content changes

    Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and Kilian Weinberger. Deep feature interpolation for image content changes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7064–7073, 2017. URL https://openaccess.thecvf.com/content_cvpr_2017/papers/Upchurch_Deep_Feature_Interpolat...

  74. [74]

    Toward a mathematical framework for computation in superposition, 2024

    Dmitry Vaintrob, Jake Mendel, and Kaarel Hänni. Toward a mathematical framework for computation in superposition, 2024. URL https://www.lesswrong.com/posts/2roZtSr5TGmLjXMnT/toward-a-mathematical-framework-for-computation-in

  75. [75]

    Addressing feature suppression in saes, 2024

    Benjamin Wright and Lee Sharkey. Addressing feature suppression in saes, 2024. URL https://www.lesswrong.com/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes

  76. [76]

    Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

    Zeyu Yun, Yubei Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021. URL https://arxiv.org/pdf/2103.15949

  77. [77]

    Word embedding visualization via dictionary learning

    Juexiao Zhang, Yubei Chen, Brian Cheung, and Bruno A Olshausen. Word embedding visualization via dictionary learning. arXiv preprint arXiv:1910.03833, 2019. URL https://arxiv.org/pdf/1910.03833

  78. [78]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023. URL https://arxiv.org/pdf/2310.01405

A Author Contributions

A.1 Infrastructure, Tooling, and ...