arxiv: 2605.22462 · v1 · pith:4H7JSQ6Dnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

Pith reviewed 2026-05-22 06:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords causal feature analysis · transformer language models · sparse autoencoders · activation patching · indirect object identification · interpretability · robustness testing · deployment evaluation

The pith

A five-stage methodology applied to GPT-2 shows that sparse autoencoder features for indirect object identification are only partially causal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a five-stage process for analyzing features in transformer language models, moving from detection to causal validation and practical deployment. It demonstrates the full pipeline on GPT-2 small performing the indirect object identification (IOI) task, recovering the known circuit through activation patching and extracting selective features with a sparse autoencoder (SAE). Causal tests show that ablating fifteen of these features leaves the model correct on 98 percent of prompts, and that the features account for only 31 percent of activation variance. Robustness checks under three distribution shifts show that the circuit holds but feature effects weaken, and a cost model finds an optimal monitor costing $8.96 per 1000 queries against a $1000 baseline. The combined stages expose limits that isolated techniques miss.

Core claim

The paper claims that its five-stage methodology, run end-to-end on the indirect object identification task in GPT-2 small, establishes that fifteen per-name selective features recovered by a sparse autoencoder are specifically but only partially causal for the task. The reported results are: ablating the features leaves the model accurate on 98 percent of prompts; the features explain only 31 percent of activation variance, compared with 99.7 percent for the full SAE; selectivity ratio anticorrelates with causal force (r = -0.56); the circuit transfers cleanly under three distribution shifts while feature ablation effects degrade; and an optimal cost-based monitor yields a 99.1 percent saving against baseline.
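The variance comparison in this claim can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the paper's implementation: it uses the conventional definition of fraction of variance explained (one minus residual over total sum of squares), a toy orthonormal decoder, and random non-negative latents. The names `fraction_of_variance_explained` and `reconstruct` are hypothetical.

```python
import numpy as np

def fraction_of_variance_explained(x, x_hat):
    """FVE = 1 - residual sum of squares / total sum of squares around the mean."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - residual / total

def reconstruct(latents, decoder, keep=None):
    """Decode latents; if `keep` is given, zero every feature not listed in it."""
    if keep is not None:
        mask = np.zeros(latents.shape[1], dtype=bool)
        mask[keep] = True
        latents = latents * mask
    return latents @ decoder

# Toy setup (assumption): 64 features with orthonormal decoder directions.
rng = np.random.default_rng(0)
n_prompts, n_features, d_model = 200, 64, 96
q, _ = np.linalg.qr(rng.normal(size=(d_model, n_features)))
decoder = q.T                                   # shape (n_features, d_model)
latents = np.maximum(rng.normal(size=(n_prompts, n_features)), 0.0)
activations = latents @ decoder                 # synthetic "model activations"

selective = np.arange(15)                       # stand-in for 15 selected features
fve_all = fraction_of_variance_explained(
    activations, reconstruct(latents, decoder))
fve_subset = fraction_of_variance_explained(
    activations, reconstruct(latents, decoder, keep=selective))
print(f"all features: {fve_all:.3f}, subset of 15: {fve_subset:.3f}")
```

In this toy case the full dictionary reconstructs the activations exactly, while the subset leaves most of the variance unexplained. That is the shape of the comparison the paper reports, although the numbers here are synthetic.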

What carries the argument

The five-stage methodology of probe design, feature extraction via sparse autoencoder, causal validation by activation patching and ablation, robustness testing under distribution shifts, and deployment integration with cost evaluation.

If this is right

  • Activation patching recovers the canonical IOI circuit, in which layer-9 head 9 alone produces a recovery of +1.02.
  • Sparse autoencoders extract per-name selective features with effect sizes between 30 and 50 activation units.
  • The circuit transfers cleanly under three distribution shifts but feature ablation effects degrade substantially.
  • A cost-based deployment evaluation with assumed costs of $50 per false negative and $0.42 per false positive, and an assumed 2 percent error rate, yields an optimal monitor at $8.96 per 1000 queries.
  • Optimal monitor composition varies with cost ratio and base error rate, and the conjunction of all stages produces findings unreachable by any single stage.
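To make the "recovery" figure in the first bullet easier to read, here is a minimal sketch of one common way to normalize activation-patching results on IOI. It assumes the logit-difference metric (indirect-object logit minus subject logit) and the usual clean/corrupted normalization; the summary above does not state the paper's exact definition, and all logit values below are made up.

```python
import numpy as np

def logit_diff(logits, io_token, s_token):
    """Logit of the indirect-object name minus logit of the subject name."""
    return float(logits[io_token] - logits[s_token])

def patching_recovery(clean_ld, corrupted_ld, patched_ld):
    """Normalized recovery: 0 matches the corrupted run, 1 matches the clean run."""
    return (patched_ld - corrupted_ld) / (clean_ld - corrupted_ld)

# Made-up final-position logits over a 5-token toy vocabulary.
IO, S = 1, 3
clean = np.array([0.0, 5.0, 0.0, 1.0, 0.0])      # logit diff  4.0
corrupted = np.array([0.0, 1.0, 0.0, 3.0, 0.0])  # logit diff -2.0
patched = np.array([0.0, 3.0, 0.0, 2.0, 0.0])    # logit diff  1.0

recovery = patching_recovery(
    logit_diff(clean, IO, S),
    logit_diff(corrupted, IO, S),
    logit_diff(patched, IO, S),
)
print(recovery)
```

Under this normalization a value above 1, such as the reported +1.02, would mean the patched run slightly exceeds the clean run's logit difference.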

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The negative correlation between selectivity and causal force may indicate that the most selective features are not the ones driving the largest behavioral effects in the circuit.
  • Extending the same staged validation to other common circuits could show whether partial causality is typical rather than specific to the IOI task.
  • The observed gap between circuit robustness and feature robustness under shifts suggests that interpretability work should track both levels separately when building monitors.
  • Cost-sensitive deployment calculations can still favor using even partially causal features when false-negative costs are high relative to false-positive costs.

Load-bearing premise

The three chosen distribution shifts adequately represent the range of real-world changes that could affect feature causality and circuit robustness.

What would settle it

Re-running the full five-stage pipeline on a different model or task and finding that ablating the identified selective features drops accuracy by far more than 2 percent, or that the selectivity ratio correlates positively rather than negatively with causal force.
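The two conditions above can be written as a small check. This is a sketch under assumptions: `drop_threshold` is a hypothetical stand-in for "far more than 2 percent" (set here to 10 percentage points), and the accuracies, selectivity ratios, and causal drops are synthetic placeholders rather than results from any model.

```python
import numpy as np

def settling_check(base_acc, ablated_acc, selectivity, causal_drop,
                   drop_threshold=0.10):
    """Evaluate the two falsification conditions on one replication run."""
    accuracy_drop = base_acc - ablated_acc
    r = float(np.corrcoef(selectivity, causal_drop)[0, 1])
    large_drop = accuracy_drop > drop_threshold
    positive_r = r > 0
    return {
        "accuracy_drop": accuracy_drop,
        "pearson_r": r,
        "large_drop": large_drop,
        "positive_r": positive_r,
        "contradicts_paper": large_drop or positive_r,
    }

# Synthetic per-feature values for 15 features (not data from the paper).
selectivity = np.linspace(1.0, 15.0, 15)
drop_falling = 2.0 - 0.1 * selectivity   # causal drop falls as selectivity rises
drop_rising = 0.5 + 0.1 * selectivity    # causal drop rises with selectivity

consistent = settling_check(1.00, 0.98, selectivity, drop_falling)
contrary = settling_check(1.00, 0.70, selectivity, drop_rising)
print(consistent["contradicts_paper"], contrary["contradicts_paper"])
```

The threshold is a judgment call that a replication would need to fix in advance. The sketch only shows that both conditions reduce to quantities the pipeline already measures.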

Figures

Figures reproduced from arXiv: 2605.22462 by Caleb Munigety.

Figure 1: Residual stream patching across all 12 layers and 15 token positions. Three patterns are visible: the IO name position carries positive recovery throughout the network; the S name position shows large negative recovery in early layers that diminishes by mid-network (the signature of S-inhibition); and the END position becomes the active prediction site by layers 10 and 11.
Figure 2: Per-head patching at the END position. Layer 9 head 9 alone recovers …
Figure 3: Top ten IO-vs-S name-selective SAE features at GPT-2 layer 9, END position. Each feature …
Figure 4: Single-feature ablation: each feature's preferred name's logit-difference drops below …
Figure 5: FVE of the SAE feature reconstruction as a function of the number of active features …
Figure 6: Selectivity ratio vs. causal drop under ablation. Pearson …
Figure 7: Stage 4 results. (a) The canonical name-mover and backup-head circuit replicates across …
Figure 8: Stage 5 results. (a) The SAE features as monitors achieve perfect ROC-AUC in-distribution, …
Figure 9: Stage 5d deployment evaluation. (a) Expected cost per 1000 queries as a function of …
Original abstract

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed $50/FN, $0.42/FP, 2% error rate) finds an optimal monitor configuration yielding $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a five-stage methodology (probe design, feature extraction, causal validation, robustness testing, deployment integration) for causal feature analysis in transformer LMs and demonstrates it end-to-end on GPT-2 small for the IOI task. Activation patching recovers the canonical IOI circuit; an SAE extracts per-name selective features; ablation of fifteen such features leaves 98% accuracy; the features explain only 31% of activation variance (vs. SAE's 99.7%) with selectivity ratio anticorrelating with causal force (r=-0.56); the circuit transfers across three distribution shifts while feature-ablation effects degrade; a cost-based monitor yields $8.96 per 1000 queries vs. $1000 baseline.

Significance. If the reported gap between circuit-level and feature-level causal robustness holds, the work supplies a concrete, multi-stage pipeline that moves beyond isolated correlational or causal probes and yields practical deployment metrics. The end-to-end demonstration, concrete numbers (98% post-ablation accuracy, 31% variance, r=-0.56), and cost-based evaluation are strengths that make the findings falsifiable and directly usable.

major comments (1)
  1. [Robustness testing stage] The central claim that the five-stage pipeline exposes a reliable gap between detection robustness and causal robustness rests on the contrast between clean circuit transfer and degraded feature-ablation effects under three distribution shifts. The manuscript provides no explicit justification, diversity metrics, or sensitivity analysis showing that these shifts adequately sample the space of real-world changes that could affect feature causality. If the shifts are mild or correlated with the original IOI distribution, the observed degradation may be an artifact of the test regime rather than a general property.
minor comments (2)
  1. [Abstract and Causal validation section] The abstract and methods should report error bars or confidence intervals on the 98% accuracy, 31% variance, and r=-0.56 figures.
  2. [Causal validation section] Clarify the exact definition of 'selectivity ratio' and how it is computed from the SAE features before correlating it with causal force.
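As an illustration of the first minor comment, the sketch below shows two standard ways such intervals could be reported: a Wilson score interval for an accuracy proportion and a percentile bootstrap interval for a Pearson correlation. The prompt count (500) and the fifteen data points are hypothetical, since the summary above gives neither the sample size nor the per-feature values.

```python
import math
import numpy as np

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def bootstrap_pearson_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson r, resampling (x, y) pairs."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    idx = rng.integers(0, len(x), size=(n_resamples, len(x)))
    rs = np.array([np.corrcoef(x[i], y[i])[0, 1] for i in idx])
    lo, hi = np.quantile(rs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical: 490 of 500 prompts still correct after ablation (98%).
acc_lo, acc_hi = wilson_interval(490, 500)

# Synthetic stand-ins for selectivity ratio and causal drop over 15 features.
rng = np.random.default_rng(1)
sel = rng.normal(size=15)
drop = -0.6 * sel + 0.8 * rng.normal(size=15)
r_lo, r_hi = bootstrap_pearson_ci(sel, drop)
print(f"accuracy 95% CI: [{acc_lo:.3f}, {acc_hi:.3f}]")
print(f"Pearson r 95% CI: [{r_lo:.2f}, {r_hi:.2f}]")
```

With only fifteen features, a bootstrap interval on r is likely to be wide, which is part of why the referee's request matters for the r = -0.56 figure.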

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address the major comment on the robustness testing stage below, agreeing that additional documentation is warranted.

Point-by-point responses
  1. Referee: Robustness testing stage: the central claim that the five-stage pipeline exposes a reliable gap between detection robustness and causal robustness rests on the contrast between clean circuit transfer and degraded feature-ablation effects under three distribution shifts. The manuscript provides no explicit justification, diversity metrics, or sensitivity analysis showing that these shifts adequately sample the space of real-world changes that could affect feature causality; if the shifts are mild or correlated with the original IOI distribution, the observed degradation may be an artifact of the test regime rather than a general property.

    Authors: We agree that the manuscript would benefit from greater transparency on this point. The three shifts were chosen to vary lexical items (e.g., name substitutions drawn from different frequency bands), syntactic framing, and prompt length while preserving the core IOI structure, but these design choices were not fully articulated. In the revision we will add a dedicated subsection that (i) justifies each shift with reference to potential real-world distributional changes that could affect feature selectivity, (ii) reports quantitative diversity metrics including token-level KL divergence and type-token ratio differences between the original and shifted sets, and (iii) includes a sensitivity analysis that perturbs shift parameters and re-measures the degradation in ablation effects. These additions will make the claim that the observed robustness gap is not an artifact of the test regime more defensible while leaving the empirical contrast between circuit transfer and feature degradation intact.

    Revision: yes
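The diversity metrics named in the rebuttal can be sketched as follows. This is one plausible reading, not the authors' planned analysis: it compares smoothed unigram distributions with KL divergence and computes a type-token ratio per prompt set. Whitespace tokenization, add-one smoothing, and the example prompts are all assumptions made for illustration; a real analysis would presumably use the model's own tokenizer.

```python
import math
from collections import Counter

def tokenize(prompts):
    """Lowercased whitespace tokens (a simplification of model tokenization)."""
    return [tok for p in prompts for tok in p.lower().split()]

def unigram_distribution(prompts, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokenize(prompts))
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def type_token_ratio(prompts):
    """Number of distinct tokens divided by total tokens."""
    tokens = tokenize(prompts)
    return len(set(tokens)) / len(tokens)

# Made-up IOI-style prompts; the shifted set swaps the names.
original = ["When Mary and John went to the store, John gave a drink to",
            "When Tom and Anna went to the park, Anna gave a ball to"]
shifted = ["When Priya and Kenji went to the store, Kenji gave a drink to",
           "When Omar and Ingrid went to the park, Ingrid gave a ball to"]

vocab = sorted(set(tokenize(original)) | set(tokenize(shifted)))
p = unigram_distribution(original, vocab)
q = unigram_distribution(shifted, vocab)
kl = kl_divergence(p, q)
ttr_gap = type_token_ratio(shifted) - type_token_ratio(original)
print(f"KL(original || shifted) = {kl:.4f}, TTR difference = {ttr_gap:.4f}")
```

Smoothing keeps the divergence finite when a token appears in only one set, which is the typical case for name substitutions.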

Circularity Check

0 steps flagged

No significant circularity; methodology applies standard external techniques

Full rationale

The paper proposes a five-stage pipeline and applies it to the IOI task using activation patching (recovering the known circuit) and SAEs (recovering selective features). Causal validation, variance explained (31% vs 99.7%), selectivity correlation (r = -0.56), and robustness testing across three shifts are direct empirical measurements, not quantities defined from fitted parameters within the paper or reduced by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation. The central claim of a robustness gap follows from the observed contrast between circuit transfer and feature degradation; the representativeness of the shifts is an external assumption, not a circular reduction. The work is self-contained against prior benchmarks and does not rename known results as novel derivations.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The paper builds on established interpretability techniques as background assumptions and introduces explicit cost and error-rate values as free parameters in the final deployment stage. No new physical or mathematical entities are postulated.

free parameters (3)
  • Cost per false negative = $50
    Assumed monetary value used in the cost-based deployment evaluation.
  • Cost per false positive = $0.42
    Assumed monetary value used in the cost-based deployment evaluation.
  • Base error rate = 2%
    Assumed percentage used to compute savings against the baseline.
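The arithmetic behind these three parameters can be checked with a short sketch. It assumes a simple expected-cost model in which an unmonitored system pays the false-negative cost for every error. That reading reproduces the $1000 baseline and 99.1% saving quoted in the abstract, but the paper's exact cost model is not shown here, and the function name and signature are hypothetical.

```python
def expected_cost_per_1000(base_rate, recall, false_positive_rate,
                           cost_fn=50.0, cost_fp=0.42, n_queries=1000):
    """Expected cost of missed errors plus false alarms over n_queries."""
    missed_errors = n_queries * base_rate * (1.0 - recall)
    false_alarms = n_queries * (1.0 - base_rate) * false_positive_rate
    return missed_errors * cost_fn + false_alarms * cost_fp

# No monitor: nothing is caught, so every error counts as a false negative.
baseline = expected_cost_per_1000(base_rate=0.02, recall=0.0,
                                  false_positive_rate=0.0)

# Saving implied by the reported optimal monitor cost of $8.96 per 1000 queries.
reported_optimum = 8.96
saving = 1.0 - reported_optimum / baseline

print(f"baseline: ${baseline:.2f} per 1000 queries")
print(f"saving:   {saving:.1%}")
```

Because the baseline scales linearly with both the base error rate and the false-negative cost, changing either assumption moves the saving figure directly, which is consistent with the paper's remark that the optimal composition varies with cost ratio and base rate.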
axioms (2)
  • [domain assumption] Activation patching provides a valid causal intervention for recovering model circuits.
    Invoked in the feature extraction and causal validation stages to recover the IOI circuit.
  • [domain assumption] Sparse autoencoders extract interpretable and selective features from transformer activations.
    Central to the feature extraction stage and subsequent selectivity and causality analyses.

pith-pipeline@v0.9.0 · 5779 in / 1652 out tokens · 79661 ms · 2026-05-22T06:51:37.009569+00:00 · methodology


