Beyond Importance: Interchange-Sobol Sensitivity Reveals Task-Specific Content Channels in Transformer Components

Jin-Hong Du; Xiang Chen; Yifeng Guo

arXiv: 2606.20678 · v1 · pith:B6Y2GAUGnew · submitted 2026-06-12 · 📊 stat.ML · cs.LG · stat.ME


Pith reviewed 2026-06-27 04:01 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.ME

keywords: mechanistic interpretability, transformer components, sensitivity analysis, content channels, factual recall, Sobol decomposition, intervention methods, activation patching


The pith

Interchange-Group Sobol Decomposition separates content transport from computational degradation in transformer components by comparing replacement and ablation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard importance scores in mechanistic interpretability combine two distinct effects: a component may matter because it carries task-relevant information or because its removal disrupts the forward pass. The paper presents Interchange-Group Sobol Decomposition (IGSD), which applies matched activation replacement and zero ablation to the same component, computes two Sobol-style variance indices, and takes their signed difference to isolate the transport role. Intervention validity is monitored with a symmetric off-manifold diagnostic: an estimated index above one ($\widehat{\mathrm{ST}} > 1$) flags the intervention as off-manifold. On factual recall in GPT-2 small and Qwen2.5-1.5B, the method locates an early-layer channel that moves relation-frame content and that ordinary importance rankings underestimate. Late attention heads instead handle subject retrieval, and late-layer clamping shows the early signal reaches the output through downstream transformations rather than direct residual pass-through.

Core claim

IGSD estimates two Sobol-style variance indices from paired replacement and ablation interventions on the same component, uses their signed difference to separate content-transport contributions from computational-degradation contributions, and monitors intervention validity with the symmetric off-manifold diagnostic $\widehat{\mathrm{ST}} > 1$. In factual recall, it identifies an early-layer content channel that transports relation-frame content, while late attention transports subject-retrieval content, refining at head level to Attn_L9H8; the early signal is expressed through later transformations rather than residual pass-through.
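The paired estimate described above can be sketched on a toy problem. This is a minimal illustration, not the paper's estimator: it assumes a Jansen-style normalization (mean squared output change divided by twice the clean output variance), a scalar output that is linear in the component's written vector, and unmatched random donors. The names `total_index` and `igsd_summary` are hypothetical, and flagging when either index exceeds one is an assumed reading of "symmetric".

```python
import numpy as np

def total_index(y_clean, y_intervened):
    """Jansen-style total-effect estimate (assumed form, not taken from the paper)."""
    y_clean = np.asarray(y_clean, dtype=float)
    y_intervened = np.asarray(y_intervened, dtype=float)
    return float(np.mean((y_clean - y_intervened) ** 2) / (2.0 * np.var(y_clean)))

def igsd_summary(y_clean, y_swap, y_zero):
    """Two indices from paired interventions, their signed difference, and a flag."""
    st_swap = total_index(y_clean, y_swap)
    st_zero = total_index(y_clean, y_zero)
    return {
        "st_swap": st_swap,
        "st_zero": st_zero,
        "delta": st_swap - st_zero,
        # Assumption: flag the group if either index exceeds one.
        "off_manifold": max(st_swap, st_zero) > 1.0,
    }

# Toy stand-in for a model: the output reads a component vector h plus everything else.
rng = np.random.default_rng(0)
n = 4000
w = np.array([1.0, -0.5, 0.25])      # hypothetical linear readout
h_base = rng.normal(size=(n, 3))     # component output on base prompts
h_donor = rng.normal(size=(n, 3))    # component output on donor prompts
rest = rng.normal(size=n)            # contribution of the rest of the forward pass

y_clean = h_base @ w + rest
y_swap = h_donor @ w + rest          # swap: replace h with the donor activation
y_zero = rest                        # zero: replace h with the zero vector

summary = igsd_summary(y_clean, y_swap, y_zero)
print(summary)
```

In this toy the component's vector varies across prompts around a zero mean, so the swap index exceeds the zero index and the signed difference is positive.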

What carries the argument

Interchange-Group Sobol Decomposition (IGSD), a paired-intervention framework that compares matched activation replacement against zero ablation on the same component and separates their variance contributions via signed difference.
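To make the signed difference concrete, here is a worked toy calculation under strong assumptions that are not the paper's: a scalar output linear in the component's written vector, $Y = w^\top h + R$, with $R$ unaffected by the intervention; donors $h_B$ drawn independently from the same distribution as the base activation $h_A$; and a Jansen-style normalization of each index.

```latex
\begin{align*}
\mathrm{ST}_{\mathrm{swap}}
  &= \frac{\mathbb{E}\big[(Y - Y^{\mathrm{swap}})^2\big]}{2\,\mathrm{Var}(Y)}
   = \frac{\mathbb{E}\big[(w^\top h_A - w^\top h_B)^2\big]}{2\,\mathrm{Var}(Y)}
   = \frac{\mathrm{Var}(w^\top h)}{\mathrm{Var}(Y)}, \\
\mathrm{ST}_{\mathrm{zero}}
  &= \frac{\mathbb{E}\big[(Y - Y^{\mathrm{zero}})^2\big]}{2\,\mathrm{Var}(Y)}
   = \frac{\mathrm{Var}(w^\top h) + \big(\mathbb{E}[w^\top h]\big)^2}{2\,\mathrm{Var}(Y)}, \\
\delta
  &= \mathrm{ST}_{\mathrm{swap}} - \mathrm{ST}_{\mathrm{zero}}
   = \frac{\mathrm{Var}(w^\top h) - \big(\mathbb{E}[w^\top h]\big)^2}{2\,\mathrm{Var}(Y)}.
\end{align*}
```

In this toy, $\delta > 0$ when the written vector varies across prompts by more than its average shift of the output (content-like behavior), and $\delta < 0$ when it acts mostly as a constant offset whose removal disrupts the computation. Matched donors and nonlinear downstream processing fall outside this calculation.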

If this is right

In factual recall, an early-layer channel carries relation content that standard importance methods underestimate.
Late attention heads carry subject-retrieval content while the early channel carries relation-frame content.
The early signal is expressed through downstream transformations rather than residual pass-through.
Replacement and deletion interventions are not interchangeable and their divergence supplies a diagnostic for content transport.
Head-granularity analysis isolates specific heads such as Attn_L9H8 as subject-retrieval carriers.
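The distinction between downstream expression and residual pass-through in the list above can be illustrated with a two-module toy residual stream. This is a sketch under invented assumptions (hand-picked 2-D weights, a linear readout, hypothetical names `forward` and `swap_effects`), not the paper's clamping experiment: it swaps the early write, then additionally clamps the late module's write to its clean value, and compares the output change.

```python
import numpy as np

W_EARLY = np.array([[1.5, 0.0], [0.0, 0.0]])   # early module: reads dim 0, writes dim 0
W_LATE = np.array([[0.0, 0.0], [2.0, 0.0]])    # late module: reads dim 0, writes dim 1

def forward(x, readout, early_write=None, late_write=None):
    """Two-module residual stream; an override replaces a module's written vector."""
    r0 = np.asarray(x, dtype=float)
    h_early = np.tanh(W_EARLY @ r0) if early_write is None else early_write
    r1 = r0 + h_early                      # early write-back onto the spine
    m_late = np.tanh(W_LATE @ r1) if late_write is None else late_write
    r2 = r1 + m_late                       # late write-back onto the spine
    return float(readout @ r2), h_early, m_late

def swap_effects(x_base, x_donor, readout):
    """Output change under early swap, and under early swap plus late clamp."""
    y_clean, _, m_clean = forward(x_base, readout)
    _, h_donor, _ = forward(x_donor, readout)
    y_swap, _, _ = forward(x_base, readout, early_write=h_donor)
    y_clamped, _, _ = forward(x_base, readout, early_write=h_donor, late_write=m_clean)
    return abs(y_swap - y_clean), abs(y_clamped - y_clean)

x_base, x_donor = np.array([0.5, 0.1]), np.array([-0.7, 0.3])
# Readout sees only what the late module writes: the swap effect is fully mediated.
mediated = swap_effects(x_base, x_donor, readout=np.array([0.0, 1.0]))
# Readout sees the early write directly: the swap effect rides the residual skip path.
pass_through = swap_effects(x_base, x_donor, readout=np.array([1.0, 0.0]))
print("mediated readout     (swap, swap+clamp):", mediated)
print("pass-through readout (swap, swap+clamp):", pass_through)
```

With the mediated readout the clamp removes the swap effect; with the pass-through readout the clamp leaves it unchanged. The first pattern is the one the summary attributes to the early channel.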

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The divergence between replacement and ablation could serve as a general test for whether a component functions as a content carrier in other tasks.
Importance rankings that ignore this distinction may systematically misattribute functional roles to components that mainly affect computation.
The method suggests testing whether early content channels appear in non-recall tasks or in models of different scales.

Load-bearing premise

The symmetric off-manifold diagnostic reliably confirms intervention validity and the signed difference between the two variance indices cleanly isolates transport from degradation.
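A toy example may help show why an estimated index above one can signal an off-manifold swap. The sketch below is not the paper's diagnostic: it assumes two correlated scalar inputs, the function $f(x_1, x_2) = x_1 - x_2$, a Jansen-style normalization, and a crude nearest-rank matching rule. Swapping $x_1$ with a random donor breaks the correlation and inflates the index past one, while a donor matched on $x_2$ keeps it below one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 20000, 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
x = rng.multivariate_normal(np.zeros(2), cov, size=n)   # correlated inputs (x1, x2)

def f(x1, x2):
    return x1 - x2

def total_index(y_clean, y_swapped):
    """Jansen-style total-effect estimate (assumed normalization)."""
    return float(np.mean((y_clean - y_swapped) ** 2) / (2.0 * np.var(y_clean)))

y = f(x[:, 0], x[:, 1])

# Random donor: x1 comes from an unrelated sample, so (x1', x2) leaves the data manifold.
random_donor = rng.permutation(x[:, 0])
st_random = total_index(y, f(random_donor, x[:, 1]))

# Matched donor: pair each sample with its neighbour in x2-rank and exchange x1.
order = np.argsort(x[:, 1])
partner = np.empty(n, dtype=int)
partner[order[0::2]] = order[1::2]
partner[order[1::2]] = order[0::2]
st_matched = total_index(y, f(x[partner, 0], x[:, 1]))

print(f"random donor:  ST = {st_random:.2f}")
print(f"matched donor: ST = {st_matched:.2f}")
```

For this toy the random-donor index tends to $1/(2(1-\rho))$ and the matched-donor index to roughly $(1+\rho)/2$, so only the former crosses one.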

What would settle it

If the signed difference between replacement-based and ablation-based Sobol indices shows no systematic alignment with independent activation-patching measures of relation-frame content across layers, the separation between transport and degradation roles would be falsified.
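One simple way to run the alignment check described above is a rank correlation across layers between the signed difference and an independent patching measure. The sketch below is an assumption about how such a check could look, not a procedure from the paper; the per-layer values are invented placeholders and `spearman` is a hypothetical helper that assumes no ties.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation for 1-D arrays without ties."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Placeholder per-layer values, invented for illustration only.
delta_hat = np.array([0.30, 0.12, 0.05, -0.02, -0.08, -0.15])
patching_effect = np.array([0.90, 0.40, 0.35, 0.10, 0.12, 0.01])

rho = spearman(delta_hat, patching_effect)
print(f"Spearman rank correlation across layers: {rho:.3f}")
```

A correlation near zero on real measurements would be the kind of evidence against the transport reading that this section describes.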

Figures

Figures reproduced from arXiv: 2606.20678 by Jin-Hong Du, Xiang Chen, Yifeng Guo.

**Figure 1.** Motivating phenomenon. (A) Swapping MLP_L0 at the answer position with a matched donor activation changes the predicted token distribution and demotes the correct answer. (B) On factual recall, standard importance methods rank MLP_L0 between #9 and #13 of 24 layer-local groups, whereas IGSD ranks it #1.

**Figure 2.** IGSD method. (a) At layer $\ell$, attention and MLP modules write additive vectors into the residual stream; IGSD targets one written vector $h_g(x, t)$. (b) For the same base prompt $x_A$, the swap intervention replaces $h_g(x_A, t_A)$ by a matched donor activation $h_g(x_B, t_B)$, while the zero intervention replaces it by the zero vector. … keep the donor activation close to the empirical activation manifold by matching outpu…

**Figure 3.** Layer-stratified content dissociation in factual recall. (A) At MLP_L0, holding the relation fixed sharply reduces the swap response, while holding the subject fixed does not. (B) At Attn_L9, the pattern reverses: holding the subject fixed sharply reduces the swap response. Bars show 95% paired-bootstrap CIs, with n = 141 pairs per bucket. … late attention, late MLP, or both to their clean outputs at the swapp…

**Figure 4.** Task-specific swap-versus-zero profiles. Per-group $\widehat{\delta} = \widehat{\mathrm{ST}}_{\mathrm{swap}} - \widehat{\mathrm{ST}}_{\mathrm{zero}}$ for GPT-2 small. Red indicates $\widehat{\delta} > 0$, blue indicates $\widehat{\delta} < 0$, and grey indicates CIs crossing zero. Hatched groups are flagged off-manifold by $\widehat{\mathrm{ST}} > 1$ and excluded from mechanistic interpretation.
[Plot text: "Factual IGSD focal: MLP_L0"; Prompt A: "Yahoo! Tech is owned by"; correct token ' Yahoo'; axis p(token).]

**Figure 5.** Prompt-level focal components. Top-5 predictions under clean, swap-MLP_L0, and swap-Attn_L9 interventions. Factual recall and IOI reverse which component is focal. Attn_L9 now injects donor-name interference, while MLP_L0 leaves the answer largely intact. This prompt-level reversal mirrors the task-specific layer profiles in …

**Figure 6.** Empirical error vs. Proposition 1's bound. (A) In-regime configs (0 bound violations of 16): empirical error $|\mathrm{ST}_k - \mathrm{ST}^\star_k|$ vs. the Proposition 1 bound (diamonds, using $N^\star \le 2V^\star$ via $\mathrm{ST}^\star_k \le 1$) and a data-dependent first-order envelope (circles, replacing $N^\star$ with $\widehat{N}_k$). All points lie below $y = x$. (B) Scaling in $\varepsilon_{\mathrm{dep}}$ at $a = 0$: empirical points cluster around the Proposition 1 slope $3/v_0$. … k-nearest-neighbor matche…

**Figure 7.** Off-manifold mechanism in IGSD-style swap. (A) Active-subspace projection of $X$ vs. $h_A$: joint (blue) lies on a correlated curve, random-donor swap (orange) fills the orthogonal directions. (B) Cross-statistic distribution under joint vs. random donor (KDE). (C) Output shift $|Y - Y^{\mathrm{swap}}|$ vs. standardized off-distribution score, with binned conditional mean. (D) $\widehat{\mathrm{ST}}_A$ as a function of the matched-donor pool size …

**Figure 8.** Induction off-manifold MLP_L0 swap, real-word example (target “ refining”). Swap-MLP_L0 collapses the prediction (rank #3408, top-5 dominated by donor tokens), consistent with the $\widehat{\mathrm{ST}} > 1$ flag; swap-Attn_L9 leaves the answer at rank #1.

**Figure 9.** Architecture comparison. Both models use the pre-norm block of Eq. (2); differences are normalization (LayerNorm vs. RMSNorm), positional encoding (learned absolute vs. RoPE), attention (single-KV vs. grouped-query), depth (12 vs. 28), and width (d = 768 vs. 1536). … F.1 Architectures. F.2 Algorithm. Algorithm 1 is the layer-local IGSD pipeline as used throughout the paper. The two inner loops sample matched pai…

**Figure 10.** Qwen2.5-1.5B factual layer profile (56 groups; conventions as in …

**Figure 11.** Residual-stream causal DAG. (a) Continuous blue residual spine $r_0 \to \tilde{r}_0 \to \cdots \to r_L \to Y$; modules read from the spine and write back via + nodes. (b) Late-layer clamp $\mathrm{do}(M_{\mathrm{late}} = M_{\mathrm{late,clean}})$: late modules' input branches severed (gray dotted) and write-backs replaced by clean updates (gray dashed); the spine remains continuous, preserving the residual-skip path while removing the swap-dependent mediated …
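The Figure 3 caption reports 95% paired-bootstrap CIs over n = 141 pairs per bucket. A minimal sketch of a percentile paired bootstrap follows; the statistic (mean paired difference), the resample count, and the synthetic data are assumptions for illustration, and `paired_bootstrap_ci` is a hypothetical helper rather than the authors' code.

```python
import numpy as np

def paired_bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean paired difference, resampling whole pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must contain the same pairs")
    rng = np.random.default_rng(seed)
    n = len(x)
    idx = rng.integers(0, n, size=(n_boot, n))   # each row resamples pair indices
    boot = np.mean(x[idx] - y[idx], axis=1)      # statistic on each resample
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(x - y)), float(lo), float(hi)

# Synthetic stand-in for one bucket: swap response with and without a held-fixed factor.
rng = np.random.default_rng(1)
response_free = rng.normal(1.0, 0.3, size=141)
response_fixed = response_free - 0.5 + rng.normal(0.0, 0.2, size=141)

estimate, ci_lo, ci_hi = paired_bootstrap_ci(response_free, response_fixed)
print(f"mean reduction = {estimate:.3f}, 95% CI = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Resampling pairs rather than individual values keeps the within-pair dependence intact, which is what makes the interval appropriate for a matched design.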

Original abstract

Mechanistic interpretability methods summarize a transformer component by a single importance score, conflating two distinct roles: a component may matter because it transports task-relevant content, or because the forward computation degrades when its contribution is removed. We introduce \emph{Interchange-Group Sobol Decomposition} (IGSD), a paired-intervention framework that compares matched activation replacement with zero ablation on the same component, estimates two Sobol-style variance indices, and uses their signed difference to separate the two roles, with intervention validity monitored by a symmetric off-manifold diagnostic $\widehat{\mathrm{ST}}>1$. In factual recall, IGSD identifies an early-layer content channel in both GPT-2 small and Qwen2.5-1.5B that standard importance methods underestimate. A controlled subject and relation donor design shows that the early channel transports relation-frame content while late attention transports subject-retrieval content, refining at head granularity to the known $\mathrm{Attn}_{L9H8}$ head. Late-layer clamping confirms that the early signal is expressed through downstream transformations rather than residual pass-through. These results show that replacement and deletion are not interchangeable controls and their divergence provides a practical statistical diagnostic for content transport in transformer components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IGSD offers a paired-intervention Sobol approach to separate transport from degradation effects, but the abstract leaves the invariance of the signed difference unproven and the validation thin.


The core takeaway is that this paper introduces Interchange-Group Sobol Decomposition to distinguish whether a transformer component transports task content or simply causes degradation when removed. The authors run matched replacement and zero-ablation interventions on the same component, estimate two variance indices, and use their signed difference as the diagnostic, with a symmetric off-manifold check for validity. Applied to factual recall, the method flags an early-layer channel in GPT-2 small and Qwen2.5-1.5B that standard importance scores underestimate. Donor designs then show that the early channel carries relation-frame content while late attention handles subject retrieval, down to a specific head, and clamping rules out simple pass-through.

The new piece is the paired Sobol setup itself and the claim that replacement and ablation are not interchangeable. The controlled subject-relation donor experiments and the head-level refinement add some concreteness that goes beyond the abstract method.

The soft spots sit in the validation and the central attribution step. The abstract supplies no derivation showing the signed difference stays invariant to residual-stream interactions or downstream compensation, and the stress-test concern about overlapping distribution shifts from the two interventions is not addressed in the provided text. Without error analysis, variance numbers, or checks on whether the difference reflects net effect size rather than the intended semantic split, the separation claim rests on an assumption that needs explicit support. The soundness score from the reader report matches what is visible here.

This is for mechanistic interpretability researchers who already run activation interventions and want a statistical tool to refine them. A reader focused on factual recall circuits or editing methods could extract a practical distinction if the math checks out.

It deserves peer review so the derivations, implementation details, and quantitative results can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces Interchange-Group Sobol Decomposition (IGSD), a paired-intervention framework that estimates two Sobol-style variance indices from matched activation replacement versus zero ablation on the same transformer component. Their signed difference is used to separate content-transport roles from computational-degradation roles, with validity checked by a symmetric off-manifold diagnostic $\widehat{\mathrm{ST}} > 1$. Applied to factual-recall tasks, IGSD identifies an early-layer content channel in GPT-2 small and Qwen2.5-1.5B that standard importance scores underestimate; controlled donor experiments attribute relation-frame content to this channel and subject-retrieval content to late attention heads (refining to Attn_L9H8), with late-layer clamping confirming downstream expression rather than residual pass-through.

Significance. If the signed-difference attribution holds, the work supplies a concrete statistical diagnostic that refines mechanistic interpretability beyond scalar importance scores. The paired-intervention design and explicit off-manifold validity check are methodological strengths; the empirical separation of transport versus degradation roles in factual recall, together with head-level refinement, would be a useful addition to the interpretability toolkit.

major comments (2)

[Abstract] Abstract and method description: the central claim that the signed difference of the two variance indices cleanly isolates content transport from degradation is load-bearing, yet the manuscript supplies no derivation showing that this difference is invariant to downstream residual-stream interactions or compensatory mechanisms. The skeptic's concern that replacement and ablation may induce partially overlapping distribution shifts therefore remains unaddressed.
[§4] §4 (empirical results) and the description of $\widehat{\mathrm{ST}} > 1$: the symmetric off-manifold diagnostic is presented as validating the interventions, but no quantitative threshold justification, sensitivity analysis, or counter-example test is given to establish that $\widehat{\mathrm{ST}} > 1$ reliably rules out the confounding effects raised in the stress-test note.

minor comments (1)

[Abstract] Notation: the symbol $\widehat{\mathrm{ST}}$ is introduced without an explicit equation reference in the abstract; a forward pointer to its definition would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the two major comments below and indicate where revisions will be made to strengthen the manuscript.


Referee: [Abstract] Abstract and method description: the central claim that the signed difference of the two variance indices cleanly isolates content transport from degradation is load-bearing, yet the manuscript supplies no derivation showing that this difference is invariant to downstream residual-stream interactions or compensatory mechanisms. The skeptic's concern that replacement and ablation may induce partially overlapping distribution shifts therefore remains unaddressed.

Authors: The paired design of IGSD is intended to control for many of these effects by using matched interventions on the same component. However, we acknowledge the absence of an explicit derivation for invariance under residual-stream interactions. In the revision, we will add a theoretical section deriving the signed difference under a linear residual-stream model and discuss the conditions under which overlapping distribution shifts are minimized. We believe this addresses the core concern without altering the empirical findings. revision: yes
Referee: [§4] §4 (empirical results) and the description of $\widehat{\mathrm{ST}} > 1$: the symmetric off-manifold diagnostic is presented as validating the interventions, but no quantitative threshold justification, sensitivity analysis, or counter-example test is given to establish that $\widehat{\mathrm{ST}} > 1$ reliably rules out the confounding effects raised in the stress-test note.

Authors: We agree that additional validation for the $\widehat{\mathrm{ST}} > 1$ threshold would improve the paper. The threshold is chosen based on the point where the diagnostic indicates the intervention is off-manifold, but we will include a sensitivity analysis across different thresholds and a counter-example test using synthetic data in the revised §4. This will provide quantitative justification and demonstrate robustness against the noted confounding effects. revision: yes

Circularity Check

0 steps flagged

No circularity: IGSD framework is externally defined and applied without reduction to inputs or self-citations

full rationale

The paper introduces Interchange-Group Sobol Decomposition (IGSD) as a new paired-intervention method that computes two Sobol-style variance indices from matched replacement and zero-ablation interventions on the same component, then takes their signed difference to separate transport from degradation roles while using an off-manifold diagnostic for validity. This construction is defined from standard sensitivity-analysis primitives and the paper's own intervention design; it does not reduce by equation to fitted parameters, self-referential definitions, or prior author work. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the abstract or method description, and the empirical findings on early-layer channels in GPT-2 and Qwen2.5 are presented as applications rather than derivations that presuppose the target result. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full text unavailable so ledger entries are inferred at high level from described method.

axioms (2)

domain assumption: The signed difference of Sobol indices from replacement versus ablation cleanly isolates content transport.
Core premise of IGSD stated in abstract.
domain assumption: The off-manifold diagnostic $\widehat{\mathrm{ST}} > 1$ validates intervention quality.
Used to monitor validity per abstract.

pith-pipeline@v0.9.1-grok · 5755 in / 1207 out tokens · 64154 ms · 2026-06-27T04:01:17.575927+00:00 · methodology



[1] [1]

Mathematics and computers in simulation , volume=

Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates , author=. Mathematics and computers in simulation , volume=. 2001 , publisher=

2001

[2] [2]

Design and estimator for the total sensitivity index , author=

Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index , author=. Computer physics communications , volume=. 2010 , publisher=

2010

[3] [3]

Computer Physics Communications , volume=

Analysis of variance designs for model output , author=. Computer Physics Communications , volume=. 1999 , publisher=

1999

[4] [4]

Transformer Circuits Thread , volume=

A mathematical framework for transformer circuits , author=. Transformer Circuits Thread , volume=

[5] [5]

In-context Learning and Induction Heads

In-context learning and induction heads , author=. arXiv preprint arXiv:2209.11895 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Advances in Neural Information Processing Systems , volume=

Towards automated circuit discovery for mechanistic interpretability , author=. Advances in Neural Information Processing Systems , volume=

[7] [7]

Journal of Machine Learning Research , volume=

Causal abstraction: A theoretical foundation for mechanistic interpretability , author=. Journal of Machine Learning Research , volume=

[8] [8]

Uncertainty in Artificial Intelligence , pages=

Approximate causal abstractions , author=. Uncertainty in Artificial Intelligence , pages=. 2020 , organization=

2020

[9] [9]

Advances in Neural Information Processing Systems , volume=

Interpretability at scale: Identifying causal mechanisms in alpaca , author=. Advances in Neural Information Processing Systems , volume=

[10] [10]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

[11] [11]

Interpretability in the Wild: a Circuit for Indirect Object Identification in

Kevin Ro Wang and Alexandre Variengien and Arthur Conmy and Buck Shlegeris and Jacob Steinhardt , booktitle=. Interpretability in the Wild: a Circuit for Indirect Object Identification in. 2023 , url=

2023

[12] [12]

Advances in Neural Information Processing Systems , volume=

Investigating gender bias in language models using causal mediation analysis , author=. Advances in Neural Information Processing Systems , volume=

[13] [13]

How to use and interpret activation patching

How to use and interpret activation patching , author=. arXiv preprint arXiv:2404.15255 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

The Twelfth International Conference on Learning Representations , year=

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods , author=. The Twelfth International Conference on Learning Representations , year=

[15] [15]

arXiv preprint arXiv:2403.00745 , year=

Atp*: An efficient and scalable method for localizing llm behaviour to components , author=. arXiv preprint arXiv:2403.00745 , year=

work page arXiv

[16] [16]

Locating and Editing Factual Associations in

Kevin Meng and David Bau and Alex J Andonian and Yonatan Belinkov , booktitle=. Locating and Editing Factual Associations in. 2022 , url=

2022

[17] [17]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Dissecting recall of factual associations in auto-regressive language models , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023

[18] [18]

The Twelfth International Conference on Learning Representations , year=

Linearity of Relation Decoding in Transformer Language Models , author=. The Twelfth International Conference on Learning Representations , year=

[19] Interpretability of language models via task spaces. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

[20] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.

[21] Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. 2024.

[22] Extracting interpretable task-specific circuits from large language models for faster inference. Proceedings of the AAAI Conference on Artificial Intelligence.

[23] Sparse Autoencoders Find Highly Interpretable Features in Language Models. The Twelfth International Conference on Learning Representations.

[24] Asymptotic statistics. 2000.

[25] TransformerLens. 2022.

[26] Circuit Insights: Towards Interpretability Beyond Activations. The Fourteenth International Conference on Learning Representations.

[27] Language models are unsupervised multitask learners. OpenAI blog.

[28] Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186.

[29] Conditional Feature Importance revisited: Double Robustness, Efficiency and Inference. arXiv preprint arXiv:2501.17520.

[30] Detecting hallucinations in large language models using semantic entropy. Nature, 2024.

[31] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. The Eleventh International Conference on Learning Representations.

[32] Leonard Bereska and Stratis Gavves. Mechanistic Interpretability for. 2024.

[33] Bridging the black box: A survey on mechanistic interpretability in AI. ACM Computing Surveys, 2026.

[34] Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 2012.

[35] Optimal matching for observational studies. Journal of the American Statistical Association, 1989.

[36] Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 2012.

[37] Paul R. Rosenbaum. 2002.

[38] Partial identification of probability distributions. 2003.

[39] Confidence intervals for partially identified parameters. Econometrica, 2004.

[40] Mass-Editing Memory in a Transformer. The Eleventh International Conference on Learning Representations.

[41] Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems.

[42] Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405.

[43] A simple and effective pruning approach for large language models. International Conference on Learning Representations.

[44] SparseGPT: Massive language models can be accurately pruned in one-shot. International Conference on Machine Learning, 2023.