Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

Aman Taxali; Chandra Sripada; Joyce Chai; Mike Angstadt; Miles Gilberti; Ruixuan Deng; Shane Storks; Xiaoyang Hu

arXiv: 2506.18141 · v3 · submitted 2025-06-22 · cs.CL · cs.AI

Ruixuan Deng, Xiaoyang Hu, Miles Gilberti, Shane Storks, Aman Taxali, Mike Angstadt, Chandra Sripada, Joyce Chai

Pith reviewed 2026-05-19 07:59 UTC · model grok-4.3

classification cs.CL · cs.AI

keywords sparse autoencoders · large language models · semantic modules · feature coactivation · causal interventions · knowledge editing · concept relation tasks

The pith

Coactivation of sparse features identifies causal semantic modules for concepts and relations in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that groups of sparse autoencoder features which activate together across a small number of prompts form coherent components tied to specific concepts such as countries and relations such as capital cities. Interventions that remove these components cause the model to lose the corresponding predictions in targeted ways, while boosting them produces consistent but incorrect counterfactual answers. Combining a relation component with a concept component produces answers that reflect both changes at the same time. A sympathetic reader would care because the results suggest that model knowledge can be located and adjusted at the level of these reusable pieces rather than through broad changes to the entire network.

Core claim

We identify semantically coherent, context-consistent network components in large language models using coactivation of sparse autoencoder features collected from just a handful of prompts. Focusing on concept-relation prediction tasks, we show that ablating these components for concepts and relations changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and concept components yields compound counterfactual outputs. Further analysis reveals that while most concept components emerge from the very first layer, more abstract relation components are concentrated in later layers. Lastly, extracted components capture concepts and relations more comprehensively than individual features while maintaining specificity.

What carries the argument

Coactivation patterns among sparse autoencoder features collected from a handful of prompts, which identify the causal semantic modules for concepts and relations.
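To make the idea of a coactivation-derived component concrete, here is a minimal sketch under stated assumptions. It is not the authors' procedure: the function name, the `max_density` and `min_shared` thresholds, and the toy feature names are all hypothetical. It assumes each prompt has already been reduced to the set of SAE features that are active on it, drops features that are active on most unrelated background prompts, links features that are active together on at least `min_shared` seed prompts, and reads off connected components.

```python
from collections import defaultdict
from itertools import combinations


def coactivation_components(active_sets, background_sets, max_density=0.5, min_shared=2):
    """Group features that are active together into connected components.

    active_sets: one set of feature ids per seed prompt.
    background_sets: feature-id sets from unrelated prompts, used only to
        estimate how dense (context-independent) each feature is.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    # 1. Drop dense features: those active on too many background prompts.
    density = defaultdict(int)
    for feats in background_sets:
        for f in feats:
            density[f] += 1
    limit = max_density * max(len(background_sets), 1)
    kept = [{f for f in feats if density[f] <= limit} for feats in active_sets]

    # 2. Count how many seed prompts activate each pair of features.
    shared = defaultdict(int)
    for feats in kept:
        for a, b in combinations(sorted(feats), 2):
            shared[(a, b)] += 1

    # 3. Link pairs that coactivate often enough.
    graph = defaultdict(set)
    for (a, b), n in shared.items():
        if n >= min_shared:
            graph[a].add(b)
            graph[b].add(a)

    # 4. Connected components via iterative depth-first search.
    seen, components = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(graph[node] - seen)
        components.append(comp)
    return components


# Toy data (invented): four prompts crossing two countries with two relations.
seed_prompts = [
    {"china_a", "china_b", "capital_a", "capital_b", "dense"},
    {"nigeria_a", "nigeria_b", "capital_a", "capital_b", "dense"},
    {"china_a", "china_b", "currency_a", "currency_b", "dense"},
    {"nigeria_a", "nigeria_b", "currency_a", "currency_b", "dense"},
]
background = [{"dense"}, {"dense", "other"}, {"dense"}]
components = coactivation_components(seed_prompts, background)
```

In this toy setup each country and each relation ends up in its own component, because features for the same concept share two prompts while a concept feature and a relation feature share only one. Real SAE activations are graded rather than binary, so any practical version would also need an activation threshold.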

If this is right

Ablating the identified components for a concept or relation changes the model's outputs for related predictions in a predictable manner.
Amplifying the components leads to the model producing counterfactual responses.
Composing components for a relation and a concept results in compound counterfactual outputs.
Concept components tend to appear in early layers while more abstract relation components appear in later layers.
The extracted components capture the relevant concepts and relations more comprehensively than individual features while remaining specific.
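The three interventions listed above (ablation, amplification, composition) can be sketched on SAE feature activations as follows. This is an illustration under assumptions, not the paper's implementation: it assumes activations are a mapping from feature id to a scalar, that each feature has a decoder direction, and that the edit is applied by adding the resulting change to the residual stream. The helper names, the amplification values, and the four-dimensional toy decoder are hypothetical.

```python
def ablate(acts, component):
    """Set every feature in `component` to zero; leave the rest unchanged."""
    return {f: (0.0 if f in component else a) for f, a in acts.items()}


def amplify(acts, component, value):
    """Force every feature in `component` to `value`, even if it was inactive."""
    out = dict(acts)
    for f in component:
        out[f] = value
    return out


def residual_delta(before, after, decoder):
    """Change in the decoded residual implied by editing feature activations."""
    dim = len(next(iter(decoder.values())))
    delta = [0.0] * dim
    for f in set(before) | set(after):
        diff = after.get(f, 0.0) - before.get(f, 0.0)
        for i, w in enumerate(decoder[f]):
            delta[i] += diff * w
    return delta


# Toy decoder with one orthogonal direction per feature (invented).
decoder = {
    "china": [1.0, 0.0, 0.0, 0.0],
    "nigeria": [0.0, 1.0, 0.0, 0.0],
    "capital": [0.0, 0.0, 1.0, 0.0],
    "currency": [0.0, 0.0, 0.0, 1.0],
}
acts = {"china": 2.0, "capital": 3.0}  # e.g. a prompt about the capital of China

# Compose a concept swap (China -> Nigeria) with a relation swap (capital -> currency).
edited = ablate(acts, {"china"})
edited = amplify(edited, {"nigeria"}, 2.0)
edited = ablate(edited, {"capital"})
edited = amplify(edited, {"currency"}, 3.0)
delta = residual_delta(acts, edited, decoder)
```

Setting amplified features to a fixed value rather than multiplying matters here: a feature that is inactive on the original prompt has activation zero, so scaling it would leave it at zero.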

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same coactivation approach could be applied to other kinds of knowledge by selecting suitable prompt sets.
If the modules operate independently, targeted changes to one piece of knowledge would leave unrelated pieces unaffected.
The early appearance of concept modules and later appearance of relation modules suggests a possible order in which the model assembles factual information.
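The layer claim behind the last point can be checked with a simple summary of where a component's features sit. The sketch below assumes feature ids are `(layer, index)` pairs; the function names and the two toy components are invented for illustration and do not reproduce any numbers from the paper.

```python
from collections import Counter


def layer_profile(component):
    """Fraction of a component's features found in each layer."""
    counts = Counter(layer for layer, _ in component)
    total = sum(counts.values())
    return {layer: n / total for layer, n in sorted(counts.items())}


def mean_layer(component):
    """Average layer index over the features of a component."""
    layers = [layer for layer, _ in component]
    return sum(layers) / len(layers)


# Toy components (invented): a concept-like one and a relation-like one.
concept_component = {(0, 101), (0, 102), (0, 103), (1, 7)}
relation_component = {(18, 5), (20, 9), (20, 11), (22, 3)}

concept_profile = layer_profile(concept_component)
concept_mean = mean_layer(concept_component)
relation_mean = mean_layer(relation_component)
```

Comparing `concept_mean` with `relation_mean` across many components would be one way to test whether the early/late split holds beyond a few examples.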

Load-bearing premise

The patterns of which features turn on together across prompts mark actual causal units in the model's processing of meaning rather than mere statistical coincidences.

What would settle it

The claim would be undermined if ablating the extracted component for a specific set of countries failed to selectively impair the model's answers about their capitals while leaving unrelated knowledge intact.

Figures

Figures reproduced from arXiv: 2506.18141 by Aman Taxali, Chandra Sripada, Joyce Chai, Mike Angstadt, Miles Gilberti, Ruixuan Deng, Shane Storks, Xiaoyang Hu.

**Figure 2.** We extract components from LLM queries about the capital, currency, and language

**Figure 3.** China components extracted from capital and currency prompts are identical.

**Figure 4.** Language components extracted from China and Nigeria prompts are nearly identical.

**Figure 5.** KL divergence between pre- and post-ablation output token distributions for each node in

**Figure 6.** KL divergence between pre- and post-ablation output token distributions for each node

**Figure 7.** Word clouds for LLM-generated descriptions of SAE features within the China, capital,

**Figure 8.** Gemma 2 9B China components extracted from capital and currency prompts.

**Figure 9.** Gemma 2 9B language components extracted from China and Nigeria prompts.
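Figures 5 and 6 report KL divergence between pre- and post-ablation output token distributions. A minimal sketch of that metric is given below. The direction KL(pre || post), the natural-log units, the `eps` smoothing term, and the three-token toy logits are assumptions made for illustration; the captions do not specify them.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy next-token logits over a three-token vocabulary (invented numbers).
pre_logits = [4.0, 1.0, 0.0]   # before ablation: confident in token 0
post_logits = [1.0, 1.0, 0.0]  # after ablation: token 0 no longer favoured

pre = softmax(pre_logits)
post = softmax(post_logits)
kl_unchanged = kl_divergence(pre, pre)
kl_ablated = kl_divergence(pre, post)
```

A selective ablation would be expected to give a large value on prompts that involve the ablated concept or relation and a value near zero on unrelated prompts.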

Original abstract

We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts. Focusing on concept-relation prediction tasks, we show that ablating these components for concepts (e.g., countries and words) and relations (e.g., capital city and translation language) changes model outputs in predictable ways, while amplifying these components induces counterfactual responses. Notably, composing relation and concept components yields compound counterfactual outputs. Further analysis reveals that while most concept components emerge from the very first layer, more abstract relation components are concentrated in later layers. Lastly, we show that extracted components more comprehensively capture concepts and relations than individual features while maintaining specificity. Overall, our findings suggest a modular organization of knowledge and advance methods for efficient, targeted LLM manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Coactivation of a few SAE features on a handful of prompts yields groups that ablate and compose like semantic modules, but the causal status of the grouping itself is not yet nailed down.

The core observation is that features from sparse autoencoders that fire together on a small set of concept-relation prompts can be grouped into components whose joint ablation shifts model outputs in expected directions and whose amplification produces counterfactuals. Composing a relation component with a concept component then gives compound effects. Concept components tend to appear early while relation components sit later, and the grouped components cover the target semantics more fully than single features while staying specific. That compositional result and the layer split are the parts that feel fresh relative to prior SAE work on individual features.

Referee Report

2 major / 2 minor

Summary. The paper claims to identify semantically coherent, context-consistent network components in LLMs from the coactivation of sparse autoencoder (SAE) features, collected on a handful of prompts, for concepts (e.g., countries, words) and relations (e.g., capital city, translation language). Ablating these components produces predictable output changes, amplifying them induces counterfactual responses, and composing concept and relation components yields compound counterfactuals. Concept components concentrate in early layers while relation components appear later; the extracted components capture concepts and relations more comprehensively than individual features while retaining specificity.

Significance. If the causal claims hold after appropriate controls, the work would provide evidence for modular organization of semantic knowledge in LLMs and a practical method for targeted, compositional manipulation using limited prompts. The compositional counterfactual results and layer-wise distribution analysis are potentially valuable contributions to mechanistic interpretability, especially if they demonstrate effects beyond what individual high-magnitude features already achieve.

major comments (2)

[Experiments / Ablation and Amplification Results] The experimental design lacks a control that compares the coactivation-derived component against a matched set of features chosen by activation magnitude or by random sampling from the same SAE layer while preserving total intervention strength. Without this, it remains unclear whether the reported predictable changes and compositional effects arise from a genuine modular structure identified by coactivation or simply from intervening on individually important features (as the paper already shows for single features). This directly bears on the central claim that coactivation reveals causal semantic modules.
[Methods / Prompt and Feature Selection] The prompts used to compute coactivation patterns for feature grouping appear to overlap with or closely resemble the prompts used to evaluate ablation, amplification, and composition effects. This setup risks the observed output changes being specific to the selection examples rather than demonstrating context-consistent causal modules across varied inputs.
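The control requested in the first major comment could be set up roughly as follows. This is a sketch under assumptions: `layer_acts` is taken to be a mapping from every feature in one SAE layer to its mean activation on the evaluation prompts, controls are drawn only from features outside the component, and intervention strength is matched by rescaling to equal summed activation. All names and the toy activations are hypothetical.

```python
import random


def matched_controls(component, layer_acts, seed=0):
    """Return (top_k, random_k): two control sets with the component's size.

    Both controls exclude the component's own features (an assumption, so
    that the comparison is against disjoint feature sets).
    """
    k = len(component)
    candidates = sorted(f for f in layer_acts if f not in component)
    ranked = sorted(candidates, key=lambda f: layer_acts[f], reverse=True)
    top_k = set(ranked[:k])
    rng = random.Random(seed)  # fixed seed for a reproducible control
    random_k = set(rng.sample(candidates, k))
    return top_k, random_k


def strength_scale(component, control, layer_acts):
    """Factor that rescales the control so its summed activation matches the component's."""
    target = sum(layer_acts[f] for f in component)
    actual = sum(layer_acts[f] for f in control)
    return target / actual


# Toy layer with ten features whose mean activation equals their index (invented).
layer_acts = {f"f{i}": float(i) for i in range(10)}
component = {"f2", "f3", "f4"}

top_k, random_k = matched_controls(component, layer_acts)
scale = strength_scale(component, top_k, layer_acts)
```

If the coactivation-derived component still produces more targeted output changes than both controls after rescaling, that would support the grouping itself rather than raw feature importance.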

minor comments (2)

[Methods] Clarify the exact coactivation threshold or grouping criterion (mentioned as a free parameter) and report sensitivity analyses showing how results vary with this choice.
[Results] Add explicit statistical tests, baseline comparisons (e.g., against random or magnitude-based interventions), and details on variance across runs or models to support the causal claims.
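One simple way to report the sensitivity analysis requested in the first minor comment is to compare the component obtained at each threshold against a reference threshold using Jaccard overlap. The sketch assumes the components have already been extracted at several thresholds; the threshold values and feature names are invented.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability(components_by_threshold, reference):
    """Overlap of the component at each threshold with the reference component."""
    ref = components_by_threshold[reference]
    return {t: jaccard(c, ref) for t, c in sorted(components_by_threshold.items())}


# Toy result of extracting "the same" component at three thresholds (invented).
components_by_threshold = {
    1: {"a", "b", "c", "d"},
    2: {"a", "b", "c"},
    3: {"a", "b"},
}
overlap = stability(components_by_threshold, reference=2)
```

Values that stay close to 1.0 across a range of thresholds would indicate that the component is not an artifact of one particular setting.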

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Revisions have been made to address the concerns where possible.

Referee: [Experiments / Ablation and Amplification Results] The experimental design lacks a control that compares the coactivation-derived component against a matched set of features chosen by activation magnitude or by random sampling from the same SAE layer while preserving total intervention strength. Without this, it remains unclear whether the reported predictable changes and compositional effects arise from a genuine modular structure identified by coactivation or simply from intervening on individually important features (as the paper already shows for single features). This directly bears on the central claim that coactivation reveals causal semantic modules.

Authors: We agree that a matched control for intervention strength is important for isolating the contribution of coactivation-based grouping. The original manuscript compared effects to single high-magnitude features but did not include a multi-feature baseline matched by magnitude or random selection. In the revised manuscript, we have added these control experiments: for each coactivation-derived component, we constructed matched sets consisting of the top-k features by activation magnitude and a random sample of the same cardinality from the same SAE layer, with total intervention strength preserved (via equivalent summed activation or feature count). Results show that coactivation components produce more consistent, semantically coherent, and context-generalizable output changes than either baseline. These new results are reported in an expanded Experiments section with additional figures, directly supporting the claim of modular structure identified via coactivation. revision: yes
Referee: [Methods / Prompt and Feature Selection] The prompts used to compute coactivation patterns for feature grouping appear to overlap with or closely resemble the prompts used to evaluate ablation, amplification, and composition effects. This setup risks the observed output changes being specific to the selection examples rather than demonstrating context-consistent causal modules across varied inputs.

Authors: We thank the referee for highlighting this methodological point. The coactivation patterns were derived from a small number of seed prompts (typically 5–10 per concept or relation) chosen to reliably elicit the target behavior. Evaluation of ablation, amplification, and composition used a substantially larger and more varied collection of test prompts, including many that differ in phrasing and content from the seed set. To strengthen the demonstration of context-consistency, the revised manuscript now explicitly documents the separation between selection and evaluation prompts, includes the complete prompt lists in an appendix, and reports additional results on a set of fully held-out prompts never used for component identification. These held-out evaluations reproduce the original effects, indicating that the modules generalize beyond the selection examples. revision: yes
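The separation between selection and evaluation prompts described in the second response can be made explicit in code. The sketch below is one possible scheme, not the authors' protocol: seed prompts use a designated set of templates and subjects, and a prompt counts as fully held out only if both its template and its subject are absent from the seed set. The templates, subjects, and function name are hypothetical.

```python
from itertools import product


def build_split(templates, subjects, seed_templates, seed_subjects):
    """Return (seed, held_out) prompt lists with no shared template or subject."""
    seed, held_out = [], []
    for template, subject in product(templates, subjects):
        prompt = template.format(subject=subject)
        if template in seed_templates and subject in seed_subjects:
            seed.append(prompt)
        elif template not in seed_templates and subject not in seed_subjects:
            held_out.append(prompt)
        # Prompts sharing only a template or only a subject with the seed set
        # are discarded, so the held-out set is unseen on both axes.
    return seed, held_out


templates = [
    "The capital of {subject} is",
    "Q: What is the capital city of {subject}? A:",
    "{subject}'s capital is called",
]
subjects = ["China", "Nigeria", "France", "Japan"]

seed, held_out = build_split(
    templates,
    subjects,
    seed_templates={templates[0]},
    seed_subjects={"China", "Nigeria"},
)
```

Components would then be extracted from `seed` only, with ablation, amplification, and composition effects reported on `held_out`.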

Circularity Check

0 steps flagged

No circularity: claims rest on ablation experiments independent of component selection

The paper defines components via coactivation of SAE features on a small prompt set for specific concepts and relations, then reports that ablating or amplifying these groups produces predictable output changes and compositional counterfactuals. This is an empirical intervention result, not a definitional equivalence or a fitted parameter renamed as a prediction. No equations or steps reduce the central causal-modularity claim to the input coactivation data by construction. No self-citation chains, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing justification. The derivation is therefore self-contained against external benchmarks of intervention effects.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The work depends on the assumption that SAE features are semantically meaningful and that coactivation from limited prompts suffices to isolate causal modules; it introduces the notion of semantic modules as extracted entities without independent falsifiable handles outside the reported experiments.

free parameters (2)

number of prompts
Described only as 'a handful'; the exact count and selection criteria are not specified and affect which coactivations are observed.
coactivation threshold or grouping criterion
Implicit rule used to define 'components' from raw coactivation data; not quantified in the abstract.

axioms (1)

domain assumption: Sparse autoencoder features capture interpretable semantic information
Invoked when treating SAE features as the basis for identifying coherent semantic modules.

invented entities (1)

causal semantic modules (no independent evidence)
purpose: To account for the observed predictable effects of ablation and amplification on model outputs
Postulated on the basis of coactivation and intervention results; no independent evidence outside the paper is provided.

pith-pipeline@v0.9.0 · 5690 in / 1364 out tokens · 43669 ms · 2026-05-19T07:59:52.813969+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We identify semantically coherent, context-consistent network components in large language models (LLMs) using coactivation of sparse autoencoder (SAE) features collected from just a handful of prompts.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

Paper passage: By filtering out dense features that activate across diverse contexts, we identify semantically coherent connected components within these networks.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Minimizing Collateral Damage in Activation Steering
cs.LG · 2026-05 · unverdicted · novelty 6.0

Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

work page 2025
[2]

Finding transformer circuits with edge pruning

Adithya Bhaskar, Alexander Wettig, Dan Friedman, and Danqi Chen. Finding transformer circuits with edge pruning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[3]

Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. Saelens. https://github. com/jbloomAus/SAELens, 2024

work page 2024
[4]

Identifying functionally important features with end-to-end sparse dictionary learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. Identifying functionally important features with end-to-end sparse dictionary learning. Advances in Neural Information Processing Systems, 37:107286–107325, 2024

work page 2024
[5]

Towards monosemanticity: Decomposing language models with dictionary learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

work page 2023
[6]

Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[7]

Knowledge neurons in pretrained transformers

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, 2022

work page 2022
[8]

Editing factual knowledge in language models

Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506, Online and Punta Cana, Dominican Republic, November 2021. Association for C...

work page 2021
[9]

Transcoders find interpretable LLM feature circuits

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 9

work page 2024
[10]

Toy models of superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread , 2022. https://transformer- circuits.pub/2022...

work page 2022
[11]

A mathematical framework for transformer circuits

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

work page
[12]

https://transformer-circuits.pub/2021/framework/index.html

work page 2021
[13]

Dissecting recall of factual associations in auto-regressive language models

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023

work page 2023
[14]

Transformer feed-forward layers are key-value memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic, November ...

work page 2021
[15]

Localizing model behavior with path patching, 2023

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching, 2023

work page 2023
[16]

How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

Michael Hanna, Ollie Liu, and Alexandre Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Advances in Neural Information Processing Systems, 36:76033–76060, 2023

work page 2023
[17]

Linearity of relation decoding in transformer language models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[18]

Efficient automated circuit discovery in transformers using contextual decomposition

Aliyah R Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Odisho, Pe- ter R Carroll, and Bin Yu. Efficient automated circuit discovery in transformers using contextual decomposition. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[19]

Sparse autoencoders find highly interpretable features in language models

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[20]

Michaud, David D

Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark. The geometry of concepts: Sparse autoencoder feature structure. Entropy, 27(4), 2025

work page 2025
[21]

Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024

work page 2024
[22]

Neuronpedia: Interactive reference and tooling for analyzing neural networks,

Johnny Lin. Neuronpedia: Interactive reference and tooling for analyzing neural networks,

work page
[23]

Software available from neuronpedia.org

work page
[24]

Sparse feature circuits: Discovering and editing interpretable causal graphs in language models

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025

work page 2025
[25]

Locating and editing factual associations in GPT

Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

work page 2022
[26]

Language models implement simple Word2Vec-style vector arithmetic

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple Word2Vec-style vector arithmetic. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5030–504...

work page 2024
[27]

Fast model editing at scale

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. Fast model editing at scale. In International Conference on Learning Representations, 2022

work page 2022
[28]

Transformerlens

Neel Nanda and Joseph Bloom. Transformerlens. https://github.com/ TransformerLensOrg/TransformerLens, 2022

work page 2022
[29]

Zoom in: An introduction to circuits

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in

work page 2020
[30]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander M ˛ adry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, A...

work page 2024
[32]

How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training, 2025

Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, and Huajun Chen. How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training, 2025

work page 2025
[33]

Improving dictionary learning with gated sparse autoencoders, 2024

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders, 2024

work page 2024
[34]

Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders, 2024

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders, 2024

work page 2024
[35]

Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, and Tom McGrath

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, D...

work page 2025
[36]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...

work page 2024
[37]

Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023

work page 2023
[38]

Transformers: State-of-the-art natural language processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

work page 2020
[39]

Knowledge circuits in pretrained transformers

Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, and Huajun Chen. Knowledge circuits in pretrained transformers. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 118571–118602. Curran Associates, Inc., 2024. 13 Country ...

work page 2024

[1] [1]

Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L. Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, Thomas Henighan, Adam Jermyn, Andy Jones, Andrew Persic, Zhenyi Qi, T. Ben Thompson, Sam Zimmerman, Kelley Rivo...

work page 2025

[2] [2]

Finding transformer circuits with edge pruning

Adithya Bhaskar, Alexander Wettig, Dan Friedman, and Danqi Chen. Finding transformer circuits with edge pruning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024

[3] [3]

Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin. Saelens. https://github. com/jbloomAus/SAELens, 2024

work page 2024

[4] [4]

Identifying functionally important features with end-to-end sparse dictionary learning

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, and Lee Sharkey. Identifying functionally important features with end-to-end sparse dictionary learning. Advances in Neural Information Processing Systems, 37:107286–107325, 2024

work page 2024

[5] [5]

Towards monosemanticity: Decomposing language models with dictionary learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

[6]

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

[7]

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, 2022

[8]

Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506, Online and Punta Cana, Dominican Republic, November 2021. Association for C...

[9]

Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable LLM feature circuits. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[10]

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022...

[11]

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

https://transformer-circuits.pub/2021/framework/index.html

[13]

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023

[14]

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic, November ...

[15]

Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching, 2023

[16]

Michael Hanna, Ollie Liu, and Alexandre Variengien. How does gpt-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. Advances in Neural Information Processing Systems, 36:76033–76060, 2023

[17]

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. In The Twelfth International Conference on Learning Representations, 2024

[18]

Aliyah R Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Odisho, Peter R Carroll, and Bin Yu. Efficient automated circuit discovery in transformers using contextual decomposition. In The Thirteenth International Conference on Learning Representations, 2025

[19]

Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024

[20]

Yuxiao Li, Eric J. Michaud, David D. Baek, Joshua Engels, Xiaoqing Sun, and Max Tegmark. The geometry of concepts: Sparse autoencoder feature structure. Entropy, 27(4), 2025

[21]

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024

[22]

Johnny Lin. Neuronpedia: Interactive reference and tooling for analyzing neural networks. Software available from neuronpedia.org

[24]

Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025

[25]

Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022

[26]

Jack Merullo, Carsten Eickhoff, and Ellie Pavlick. Language models implement simple Word2Vec-style vector arithmetic. In Kevin Duh, Helena Gomez, and Steven Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5030–504...

[27]

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. Fast model editing at scale. In International Conference on Learning Representations, 2022

[28]

Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022

[29]

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in

[30]

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

[31]

OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, Alex Renzin, Alex Tachard Passos, Alexander Kirillov, Alexi Christakis, A...

[32]

Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, and Huajun Chen. How do llms acquire new knowledge? a knowledge circuits perspective on continual pre-training, 2025

[33]

Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders, 2024

[34]

Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders, 2024

[35]

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, D...

[36]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...

[37]

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023

[38]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. Transformers: State-of-the-art na...

[39]

Yunzhi Yao, Ningyu Zhang, Zekun Xi, Mengru Wang, Ziwen Xu, Shumin Deng, and Huajun Chen. Knowledge circuits in pretrained transformers. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 118571–118602. Curran Associates, Inc., 2024.
