From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

Caleb Munigety

arxiv: 2605.22462 · v1 · pith:4H7JSQ6Dnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-22 06:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: causal feature analysis · transformer language models · sparse autoencoders · activation patching · indirect object identification · interpretability · robustness testing · deployment evaluation

The pith

A five-stage methodology applied to GPT-2 shows that sparse autoencoder features for indirect object identification are only partially causal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a five-stage process to analyze features in transformer language models by moving from detection to causal validation and practical deployment. It demonstrates the full pipeline on GPT-2 small performing the indirect object identification task, recovering the known circuit through activation patching and extracting selective features via sparse autoencoder. Causal tests reveal that ablating fifteen of these features leaves the model correct on 98 percent of prompts while they account for only 31 percent of activation variance. Robustness checks under distribution shifts show the circuit holds but feature effects weaken, and a cost model finds substantial monitoring savings. The combined stages expose limits that isolated techniques miss.

Core claim

The paper claims that its five-stage methodology, run end-to-end on the indirect object identification task in GPT-2 small, shows that fifteen per-name selective features recovered by a sparse autoencoder are specifically but only partially causal for the task. Ablating them leaves the model accurate on 98 percent of prompts; the features explain only 31 percent of activation variance, compared with 99.7 percent for the full SAE; and selectivity ratio anticorrelates with causal force (r = -0.56). The circuit transfers cleanly under three distribution shifts while feature ablation effects degrade, and an optimal cost-based monitor yields a 99.1 percent saving against the baseline.
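The 31 percent versus 99.7 percent comparison is a fraction-of-variance-explained (FVE) measurement. This page does not give the paper's exact formula, so the following is a minimal sketch assuming the common definition: one minus the ratio of squared reconstruction error to total variance around the mean activation. The function name `fraction_variance_explained` and the toy vectors are illustrative, not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def fraction_variance_explained(acts: List[Vector], recons: List[Vector]) -> float:
    """FVE = 1 - sum ||x - x_hat||^2 / sum ||x - mean(x)||^2 (assumed definition)."""
    n, dim = len(acts), len(acts[0])
    mean = [sum(x[j] for x in acts) / n for j in range(dim)]
    residual = sum((x[j] - r[j]) ** 2 for x, r in zip(acts, recons) for j in range(dim))
    total = sum((x[j] - mean[j]) ** 2 for x in acts for j in range(dim))
    return 1.0 - residual / total


# Toy activations (hypothetical): three 2-d vectors whose mean is [1.0, 1.0].
acts = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
full_recon = [list(x) for x in acts]   # perfect reconstruction
mean_recon = [[1.0, 1.0]] * 3          # every point replaced by the mean

print(fraction_variance_explained(acts, full_recon))
print(fraction_variance_explained(acts, mean_recon))
```

Under this definition, a reconstruction built from only a subset of SAE features (for example the fifteen selective ones) can be passed as `recons` to obtain a partial FVE that is comparable with the full-dictionary value.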

What carries the argument

The five-stage methodology of probe design, feature extraction via sparse autoencoder, causal validation by activation patching and ablation, robustness testing under distribution shifts, and deployment integration with cost evaluation.

If this is right

Activation patching recovers the canonical IOI circuit, in which layer-9 head 9 alone produces a recovery of +1.02.
Sparse autoencoders extract per-name selective features with effect sizes between 30 and 50 activation units.
The circuit transfers cleanly under three distribution shifts but feature ablation effects degrade substantially.
A cost-based deployment evaluation with assumed costs of 50 dollars per false negative and 0.42 dollars per false positive yields an optimal monitor at 8.96 dollars per 1000 queries.
Optimal monitor composition varies with cost ratio and base error rate, and the conjunction of all stages produces findings unreachable by any single stage.
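The recovery figure quoted for layer-9 head 9 is a patching score. This page does not define it, so the sketch below assumes the normalization commonly used for IOI patching: the logit difference (IO name minus S name) of the patched run, rescaled so that 0 matches the corrupted run and 1 matches the clean run. The function names and the example numbers are hypothetical.

```python
def logit_diff(logit_io: float, logit_s: float) -> float:
    """IOI metric: logit of the indirect-object name minus logit of the subject name."""
    return logit_io - logit_s


def patching_recovery(ld_clean: float, ld_corrupt: float, ld_patched: float) -> float:
    """Normalized recovery: 0 = same as corrupted run, 1 = same as clean run."""
    denom = ld_clean - ld_corrupt
    if denom == 0:
        raise ValueError("clean and corrupted logit differences coincide")
    return (ld_patched - ld_corrupt) / denom


# Hypothetical logits, not values from the paper.
ld_clean = logit_diff(5.0, 2.0)      # clean prompt favours the IO name
ld_corrupt = logit_diff(1.0, 2.0)    # corrupted prompt favours the S name
ld_patched = logit_diff(3.0, 2.0)    # one component patched in from the clean run

print(patching_recovery(ld_clean, ld_corrupt, ld_patched))
```

With this convention a score above 1 means the patched run favours the IO name slightly more than the clean run does, and a negative score means patching pushed the model further toward the S name.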

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The negative correlation between selectivity and causal force may indicate that the most selective features are not the ones driving the largest behavioral effects in the circuit.
Extending the same staged validation to other common circuits could show whether partial causality is typical rather than specific to the IOI task.
The observed gap between circuit robustness and feature robustness under shifts suggests that interpretability work should track both levels separately when building monitors.
Cost-sensitive deployment calculations can still favor using even partially causal features when false-negative costs are high relative to false-positive costs.
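The anticorrelation discussed above is a Pearson coefficient between a per-feature selectivity ratio and a per-feature causal drop under ablation. A minimal sketch of that computation follows; the two lists are made-up numbers chosen only to show a negative trend, and they are not the paper's data.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical per-feature values, NOT taken from the paper.
selectivity_ratio = [12.0, 9.5, 8.0, 6.0, 4.5, 3.0]
causal_drop = [0.4, 0.9, 0.7, 1.5, 1.2, 1.8]

r = pearson_r(selectivity_ratio, causal_drop)
print(f"r = {r:.2f}")
```

A negative `r` here means that features with a higher selectivity ratio tend to produce a smaller behavioral drop when ablated, which is the pattern the paper reports.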

Load-bearing premise

The three chosen distribution shifts adequately represent the range of real-world changes that could affect feature causality and circuit robustness.

What would settle it

Re-running the full five-stage pipeline on a different model or task and finding that ablating the identified selective features drops accuracy by far more than 2 percent, or that the selectivity ratio correlates positively rather than negatively with causal force.

Figures

Figures reproduced from arXiv: 2605.22462 by Caleb Munigety.

**Figure 1.** Residual stream patching across all 12 layers and 15 token positions. Three patterns are visible: the IO name position carries positive recovery throughout the network; the S name position shows large negative recovery in early layers that diminishes by mid-network (the signature of S-inhibition); and the END position becomes the active prediction site by layers 10 and 11.

**Figure 2.** Per-head patching at the END position. Layer 9 head 9 alone recovers …

**Figure 3.** Top ten IO-vs-S name-selective SAE features at GPT-2 layer 9, END position. Each feature …

**Figure 4.** Single-feature ablation: each feature’s preferred name’s logit-difference drops below …

**Figure 5.** FVE of the SAE feature reconstruction as a function of the number of active features …

**Figure 6.** Selectivity ratio vs. causal drop under ablation. Pearson …

**Figure 7.** Stage 4 results. (a) The canonical name-mover and backup-head circuit replicates across …

**Figure 8.** Stage 5 results. (a) The SAE features as monitors achieve perfect ROC-AUC in-distribution, …

**Figure 9.** Stage 5d deployment evaluation. (a) Expected cost per 1000 queries as a function of …

Original abstract

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed $50/FN, $0.42/FP, 2% error rate) finds an optimal monitor configuration yielding $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.
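The ablation step in the abstract can be pictured as removing the contribution of selected SAE features from an activation vector. The abstract does not say how the paper implements this, so the following is only a sketch of one common approach, assuming a ReLU encoder and a linear decoder: compute the feature activations, then subtract each ablated feature's activation times its decoder direction. All weights and names here are hypothetical.

```python
from typing import List, Sequence


def encode(x: Sequence[float], w_enc: Sequence[Sequence[float]],
           b_enc: Sequence[float]) -> List[float]:
    """SAE feature activations: f_i = relu(w_enc[i] . x + b_enc[i])."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(w_enc, b_enc)]


def ablate_features(x: Sequence[float], w_enc: Sequence[Sequence[float]],
                    b_enc: Sequence[float], w_dec: Sequence[Sequence[float]],
                    ablate: Sequence[int]) -> List[float]:
    """Subtract the decoder contribution of the ablated features from x.

    Editing x directly, instead of swapping in the SAE reconstruction,
    leaves the SAE's reconstruction error untouched, so only the chosen
    features change.
    """
    feats = encode(x, w_enc, b_enc)
    out = list(x)
    for i in ablate:
        for j in range(len(out)):
            out[j] -= feats[i] * w_dec[i][j]
    return out


# Toy SAE (hypothetical weights): 3 features over a 2-d activation.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -10.0]            # third feature stays inactive on this input
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
x = [3.0, 2.0]

print(encode(x, w_enc, b_enc))
print(ablate_features(x, w_enc, b_enc, w_dec, ablate=[0]))
```

In a real experiment the edited vector would be written back into the model at the layer and position where the SAE was trained, and task accuracy would be measured on the resulting outputs.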

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper sequences existing tools into a five-stage pipeline for causal feature analysis and reports concrete partial-causality numbers plus a robustness gap, but that gap depends on how representative the three shifts really are.

The main point for you is that the author has assembled a five-stage methodology that takes feature analysis in transformers from correlation to causal validation and then to practical deployment checks. Applied to the indirect object identification task in GPT-2 small, it recovers the known circuit via activation patching and extracts selective features with a sparse autoencoder. The causal validation step shows these features are only partially responsible: ablating fifteen of them still leaves the model at 98 percent accuracy on the prompts. Two further evaluations, which the abstract calls NLA-inspired, sharpen this: the selective features account for just 31 percent of the activation variance compared with 99.7 percent for the entire SAE, and selectivity ratio correlates negatively with causal force (r = -0.56). The robustness testing under three distribution shifts is where the paper highlights a gap, with the circuit holding up but the feature ablation effects dropping off. Finally, a cost-based evaluation with assumed false-negative and false-positive costs and a base error rate finds an optimal monitor setup costing $8.96 per 1000 queries against a $1000 baseline.

What is actually new here is the end-to-end sequencing of these stages into one methodology, along with the specific combination of partial causality findings, the variance comparison, the anticorrelation, and the cost deployment angle. The paper does well in grounding its claims with reported experimental outcomes such as the accuracy retention, variance percentages, and the correlation value. These give a tangible sense of the effect sizes and support the idea that single-stage approaches miss some of the picture.

The soft spots are mostly around the robustness testing. The central distinction between circuit robustness and feature robustness depends on how well those three distribution shifts capture real-world variations that might affect causality. If the shifts are limited in diversity or too similar to the original setup, the observed degradation could be particular to this test rather than a general property. The abstract mentions the outcomes but lacks details on error bars or exact selection procedures, so there is some uncertainty about whether choices were made after seeing the results. That said, the overall numbers align with the partial causality story, and the circularity burden appears low because the work relies on established prior methods without claiming to derive everything from scratch.

This paper is for researchers in mechanistic interpretability who are looking for structured ways to validate features beyond correlations, particularly those working with SAEs or circuit discovery on tasks like IOI. A reader who wants to see how these tools can be chained together for more reliable insights and even some applied evaluation would find it worthwhile. I think it deserves serious peer review. The work has enough concrete experiments and a clear proposal to benefit from referee feedback on the methodology and especially on strengthening the robustness claims with more varied tests.

Referee Report

1 major / 2 minor

Summary. The paper proposes a five-stage methodology (probe design, feature extraction, causal validation, robustness testing, deployment integration) for causal feature analysis in transformer LMs and demonstrates it end-to-end on GPT-2 small for the IOI task. Activation patching recovers the canonical IOI circuit; an SAE extracts per-name selective features; ablation of fifteen such features leaves 98% accuracy; the features explain only 31% of activation variance (vs. SAE's 99.7%) with selectivity ratio anticorrelating with causal force (r=-0.56); the circuit transfers across three distribution shifts while feature-ablation effects degrade; a cost-based monitor yields $8.96 per 1000 queries vs. $1000 baseline.

Significance. If the reported gap between circuit-level and feature-level causal robustness holds, the work supplies a concrete, multi-stage pipeline that moves beyond isolated correlational or causal probes and yields practical deployment metrics. The end-to-end demonstration, concrete numbers (98% post-ablation accuracy, 31% variance, r=-0.56), and cost-based evaluation are strengths that make the findings falsifiable and directly usable.

major comments (1)

[Robustness testing stage] The central claim that the five-stage pipeline exposes a reliable gap between detection robustness and causal robustness rests on the contrast between clean circuit transfer and degraded feature-ablation effects under three distribution shifts. The manuscript provides no explicit justification, diversity metrics, or sensitivity analysis showing that these shifts adequately sample the space of real-world changes that could affect feature causality; if the shifts are mild or correlated with the original IOI distribution, the observed degradation may be an artifact of the test regime rather than a general property.

minor comments (2)

[Abstract and Causal validation section] The abstract and methods should report error bars or confidence intervals on the 98% accuracy, 31% variance, and r=-0.56 figures.
[Causal validation section] Clarify the exact definition of 'selectivity ratio' and how it is computed from the SAE features before correlating it with causal force.
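The request for error bars can be made concrete with two standard interval estimates: a Wilson score interval for an accuracy proportion and a Fisher z interval for a Pearson correlation. The sketch below is illustrative only. The prompt count of 500 is invented, and treating the correlation as computed over 15 features is an assumption, since this page states neither sample size.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def fisher_z_interval(r: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate interval for a Pearson r via the Fisher z-transform (needs n > 3)."""
    zr = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(zr - z * se), math.tanh(zr + z * se)


# Hypothetical sample sizes: 490 of 500 prompts correct, r over 15 features.
acc_lo, acc_hi = wilson_interval(490, 500)
r_lo, r_hi = fisher_z_interval(-0.56, 15)

print(f"accuracy interval: [{acc_lo:.3f}, {acc_hi:.3f}]")
print(f"correlation interval: [{r_lo:.2f}, {r_hi:.2f}]")
```

With small feature counts the correlation interval is wide, which is why the referee's request matters for interpreting r = -0.56.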

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment on the robustness testing stage below, agreeing that additional documentation is warranted.

Referee: Robustness testing stage: the central claim that the five-stage pipeline exposes a reliable gap between detection robustness and causal robustness rests on the contrast between clean circuit transfer and degraded feature-ablation effects under three distribution shifts. The manuscript provides no explicit justification, diversity metrics, or sensitivity analysis showing that these shifts adequately sample the space of real-world changes that could affect feature causality; if the shifts are mild or correlated with the original IOI distribution, the observed degradation may be an artifact of the test regime rather than a general property.

Authors: We agree that the manuscript would benefit from greater transparency on this point. The three shifts were chosen to vary lexical items (e.g., name substitutions drawn from different frequency bands), syntactic framing, and prompt length while preserving the core IOI structure, but these design choices were not fully articulated. In the revision we will add a dedicated subsection that (i) justifies each shift with reference to potential real-world distributional changes that could affect feature selectivity, (ii) reports quantitative diversity metrics including token-level KL divergence and type-token ratio differences between the original and shifted sets, and (iii) includes a sensitivity analysis that perturbs shift parameters and re-measures the degradation in ablation effects. These additions will make the claim that the observed robustness gap is not an artifact of the test regime more defensible while leaving the empirical contrast between circuit transfer and feature degradation intact.

Revision: yes
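The two diversity metrics named in the rebuttal, token-level KL divergence and type-token ratio, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure the revision would use: tokens come from whitespace splitting instead of GPT-2's BPE tokenizer, distributions are add-one-smoothed unigram counts over a shared vocabulary, and both prompts are invented examples.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def unigram_dist(tokens: Sequence[str], vocab: Sequence[str],
                 alpha: float = 1.0) -> Dict[str, float]:
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; the smoothing above guarantees q[t] > 0."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


# Hypothetical prompts standing in for the original and shifted sets.
original = "When Mary and John went to the store , John gave a drink to".split()
shifted = "After Priya and Tomas walked into the office , Tomas handed a folder to".split()

vocab = sorted(set(original) | set(shifted))
p, q = unigram_dist(original, vocab), unigram_dist(shifted, vocab)

print(kl_divergence(q, p))
print(type_token_ratio(original), type_token_ratio(shifted))
```

A KL divergence near zero between original and shifted sets would support the referee's worry that the shifts are mild; larger values indicate the shifted prompts differ lexically from the originals.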

Circularity Check

0 steps flagged

No significant circularity; methodology applies standard external techniques

The paper proposes a five-stage pipeline and applies it to the IOI task using activation patching (recovering the known circuit) and SAEs (recovering selective features). Causal validation, variance explained (31% vs 99.7%), selectivity correlation (r = -0.56), and robustness testing across three shifts are direct empirical measurements, not quantities defined from fitted parameters within the paper or reduced by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation. The central claim of a robustness gap follows from the observed contrast between circuit transfer and feature degradation; the representativeness of the shifts is an external assumption, not a circular reduction. The work is self-contained against prior benchmarks and does not rename known results as novel derivations.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The paper builds on established interpretability techniques as background assumptions and introduces explicit cost and error-rate values as free parameters in the final deployment stage. No new physical or mathematical entities are postulated.

free parameters (3)

Cost per false negative = $50
Assumed monetary value used in the cost-based deployment evaluation.
Cost per false positive = $0.42
Assumed monetary value used in the cost-based deployment evaluation.
Base error rate = 2%
Assumed percentage used to compute savings against the baseline.
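These three parameters are enough to check the headline arithmetic. The $1000 baseline is consistent with 1000 queries × 2% error rate × $50 per uncaught error, and $8.96 against $1000 is a 99.1% saving. The sketch below encodes that check and adds a generic expected-cost formula for a monitor. The formula and the `tpr` and `fpr` values are assumptions for illustration; the paper's actual cost model and operating point are not given on this page.

```python
def baseline_cost(n_queries: int, error_rate: float, cost_fn: float) -> float:
    """No monitor: every model error goes uncaught and costs one false negative."""
    return n_queries * error_rate * cost_fn


def monitor_cost(n_queries: int, error_rate: float, cost_fn: float,
                 cost_fp: float, tpr: float, fpr: float) -> float:
    """Assumed model: missed errors cost cost_fn, false alarms cost cost_fp."""
    errors = n_queries * error_rate
    correct = n_queries - errors
    return errors * (1.0 - tpr) * cost_fn + correct * fpr * cost_fp


def saving(cost: float, baseline: float) -> float:
    """Fractional saving of a monitored deployment relative to the baseline."""
    return 1.0 - cost / baseline


base = baseline_cost(1000, 0.02, 50.0)
print(f"baseline per 1000 queries: ${base:.2f}")
print(f"saving at $8.96: {saving(8.96, base):.1%}")

# Hypothetical operating point, NOT the paper's monitor.
print(monitor_cost(1000, 0.02, 50.0, 0.42, tpr=0.95, fpr=0.01))
```

Because false negatives are priced roughly 119 times higher than false positives here, the expected cost is dominated by the miss rate, which is consistent with the statement that the optimal composition shifts with the cost ratio and base rate.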

axioms (2)

Domain assumption: Activation patching provides a valid causal intervention for recovering model circuits.
Invoked in the feature extraction and causal validation stages to recover the IOI circuit.
Domain assumption: Sparse autoencoders extract interpretable and selective features from transformer activations.
Central to the feature extraction stage and subsequent selectivity and causality analyses.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The five-stage methodology... causal validation... robustness testing... deployment integration... cost-based evaluation (assumed $50/FN, $0.42/FP)"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "selectivity ratio anticorrelates with causal force (r = -0.56)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 9 internal anchors

[1] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in Language Models Is Mediated by a Single Direction. arXiv:2406.11717, 2024. https://arxiv.org/abs/2406.11717

[2] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ... Transformer Circuits Thread, 2023. https://transformer-circuits

[3] Haozhe Chen, Carl Vondrick, and Chengzhi Mao. SelfIE: Self-Interpretation of Large Language Model Embeddings. ICML, 2024. https://arxiv.org/abs/2403.10949

[4] Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards Automated Circuit Discovery for Mechanistic Interpretability. NeurIPS, 2023. https://arxiv.org/abs/2304.14997

[5] Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. ACL, 2018. https://arxiv.org/abs/1805.01070

[6] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse Autoencoders Find Highly Interpretable Features in Language Models. ICLR, 2024. https://arxiv.org/abs/2309.08600

[7] Kit Fraser-Taliente, Subhash Kantamneni, Euan Ong, Dan Mossing, Christina Lu, Paul C. Bogdan, Emmanuel Ameisen, James Chen, Dzmitry Kishylau, Adam Pearce, Julius Tarng, Alex Wu, Jeff Wu, Yang Zhang, Daniel M. Ziegler, Evan Hubinger, Joshua Batson, Jack Lindsey, Samuel Zimmerman, and Samuel Marks. Natural Language Autoencoders Produce Unsupervised Explanati... 2026.

[8] Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models. ICML, 2024. https://arxiv.org/abs/2401.06102

[9] John Hewitt and Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. NAACL-HLT, 2019. https://aclanthology.org/N19-1419/

[10] Adam Karvonen, James Chua, Conor Dumas, Kit Fraser-Taliente, Subhash Kantamneni, Julian Minder, Euan Ong, Aditya S. Sharma, Daniel Wen, Owain Evans, and Samuel Marks. Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers. arXiv:2512.15674, 2025. https://arxiv.org/abs/2512.15674

[11] Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. arXiv:2403.19647, 2024. https://arxiv.org/abs/2403.19647

[12] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. NeurIPS, 2022. https://arxiv.org/abs/2202.05262

[13] Neel Nanda and Joseph Bloom. TransformerLens. GitHub, 2022. https://github.com/TransformerLensOrg/TransformerLens

[14] Christopher Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom In: An Introduction to Circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in/

[15] Catherine Olsson, Nelson Elhage, Neel Nanda, et al. In-context Learning and Induction Heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads

[16] Alexander Pan, Lijun Chen, and Jacob Steinhardt. LatentQA: Teaching LLMs to Decode Activations Into Natural Language. ICLR, 2025. https://arxiv.org/abs/2412.08686

[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI Technical Report, 2019. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[18] Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv:2404.16014, 2024. https://arxiv.org/abs/2404.16014

[19] Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L. Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling Monosema... 2024.

[20] Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation Addition: Steering Language Models Without Optimization. arXiv:2308.10248, 2023. https://arxiv.org/abs/2308.10248

[21] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. NeurIPS, 2020. https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html

[22] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. ICLR, 2023. https://arxiv.org/abs/2211.00593

[23] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation Engineering: A Top-Down Approach to ... 2023.
