pith · machine review for the scientific record

arxiv: 2605.06480 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CL

Recognition: unknown

Patch-Effect Graph Kernels for LLM Interpretability

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 09:39 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords patch-effect graphs · activation patching · graph kernels · mechanistic interpretability · causal circuits · transformer models · indirect object identification · influence graphs

The pith

Patch-effect graphs built from activation patching preserve structural signals that graph kernels classify using localized edge features better than global shape descriptors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes activation patching results in transformers as graphs over model components. Three construction methods encode influence patterns: direct causal mediation, partial correlation, and co-influence. Graph kernels then extract features from these graphs to compare patching profiles across prompts and tasks. Localized edge-slot features achieve higher accuracy in distinguishing different patching scenarios than measures of overall graph shape. Paired validation experiments further indicate that edges selected by the co-influence and partial-correlation methods produce stronger activation effects than random or low-rank alternatives, all measured against prompt-only and raw tensor baselines.
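As a concrete illustration, a co-influence construction of this kind can be sketched as correlating components' patch-effect profiles across prompts. The function name, the threshold value, and the use of plain correlation are illustrative assumptions here, not the paper's exact construction:

```python
import numpy as np

def co_influence_graph(effects, threshold=0.5):
    """Sketch of a co-influence patch-effect graph (hypothetical construction).

    effects: (n_prompts, n_components) array of patching effect sizes,
             one row per prompt, one column per model component.
    Returns a weighted adjacency matrix whose (i, j) entry is the absolute
    correlation of components i and j across prompts, zeroed below
    `threshold` and on the diagonal.
    """
    corr = np.corrcoef(effects, rowvar=False)  # component-by-component correlation
    adj = np.abs(corr)
    adj[adj < threshold] = 0.0                 # sparsify weak associations
    np.fill_diagonal(adj, 0.0)                 # no self-loops
    return adj

# Toy example: 8 prompts, 4 components; components 0 and 1 co-vary strongly.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 1))
effects = np.hstack([base, base + 0.01 * rng.normal(size=(8, 1)),
                     rng.normal(size=(8, 2))])
adj = co_influence_graph(effects)
print(adj.shape)        # (4, 4)
print(adj[0, 1] > 0.9)  # True: the co-varying pair gets a heavy edge
```

The direct-influence variant would instead fill each entry with an interventional effect size from causal mediation rather than a statistic computed over prompts.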

Core claim

Patch-effect graphs represent activation-patching profiles as structures with nodes for model components and edges weighted by influence measures from causal mediation, partial correlation, or co-influence. When graph kernels analyze these graphs, localized features tied to specific edge slots yield higher classification accuracy for task-specific patching patterns than global descriptors of graph shape. A screened paired-patching test confirms that candidate edges from co-influence and partial-correlation constructions correspond to interventions with measurably larger activation-influence effects than random or low-rank selections. The evaluation makes explicit that graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim must address.

What carries the argument

Patch-effect graphs over model components with edges from direct-influence, partial-correlation, or co-influence constructions, analyzed by graph kernels to extract discriminative structural features such as localized edge slots.
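To make the contrast concrete, here is a minimal sketch of why localized edge-slot features can separate graphs that global shape descriptors cannot. The feature definitions are hypothetical stand-ins; the paper's specific kernels are not reproduced here:

```python
import numpy as np

def edge_slot_features(adj):
    """Localized features: the weight sitting in each fixed edge slot."""
    i, j = np.triu_indices(adj.shape[0], k=1)
    return adj[i, j]  # one feature per (i, j) slot

def global_shape_features(adj):
    """Global descriptors: position-free summaries of overall graph shape."""
    degrees = adj.sum(axis=1)
    return np.array([adj.sum() / 2,   # total edge weight
                     degrees.mean(),  # mean weighted degree
                     degrees.std()])  # degree dispersion

# Two graphs with identical shape statistics but different edge placement:
a = np.zeros((3, 3)); a[0, 1] = a[1, 0] = 1.0
b = np.zeros((3, 3)); b[1, 2] = b[2, 1] = 1.0
print(np.allclose(global_shape_features(a), global_shape_features(b)))  # True
print(np.array_equal(edge_slot_features(a), edge_slot_features(b)))     # False
```

The two toy graphs are indistinguishable to the global descriptors yet trivially separated by the edge-slot vector, which is the kind of asymmetry the classification result turns on.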

If this is right

  • Graph representations compress high-dimensional patching data into structures that support systematic comparison across prompts and tasks.
  • Localized edge features outperform global shape descriptors for identifying task-discriminative patterns in patching results.
  • Edges selected via co-influence and partial-correlation methods exhibit stronger activation-influence effects in targeted paired tests.
  • Explicit prompt-only and raw patch-effect baselines clarify the scope of evidence for circuit-level claims.
  • The pipeline separates robust slice-discriminative signals from stronger task-general causal assertions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could support automated circuit search by treating high-discrimination graphs as targets for optimization.
  • Testing on larger models would show whether the advantage of localized edge features scales with model size.
  • The baseline controls imply that other interpretability methods should adopt similar explicit comparisons to raw data to strengthen causal claims.
  • Graph kernels could be swapped for alternative feature extractors to test whether the structural signal is method-specific.

Load-bearing premise

The three graph construction methods accurately encode the causal structure of activation-patching interventions without introducing artifacts that the kernel analysis then exploits.

What would settle it

If graphs constructed from randomized or non-causal patching data produce the same classification accuracies and validation scores as the actual patching-derived graphs, the claim that the graphs preserve discriminative causal signals would fail.

Figures

Figures reproduced from arXiv: 2605.06480 by David N. Olivieri, Ruben Fernandez-Boullon.

Figure 1. End-to-end pipeline from activation patching to kernel-based circuit classification.

Figure 2. Mean patch-effect heatmaps for GPT-2 Small residual-stream nodes.

Figure 3. Feature dimensionality as the analyzed node set grows.

Figure 4. Left: accuracy vs. log10 feature dimension for graph representations on GPT-2 (circles) and DistilGPT-2 (squares) from the compression sweep; the Pareto frontier is dominated by localised edge-slot features, while global shape descriptors lie well below it. Right: 2D PCA projection of WL subtree embeddings for the 2 × 32 bootstrap graphs (32 per slice); colours indicate corruption type (abba vs. name_swap).
Original abstract

Mechanistic interpretability aims to reverse-engineer transformer computations by identifying causal circuits through activation patching. However, scaling these interventions across diverse prompts and task families produces high-dimensional, unstructured datasets that are difficult to compare systematically. We propose a framework that reframes mechanistic analysis as a graph machine-learning problem by representing activation-patching profiles as patch-effect graphs over model components. We introduce three graph-construction methods (direct-influence via causal mediation, partial-correlation, and co-influence) and apply graph kernels to analyze the resulting structures. Evaluating this approach on GPT-2 Small using Indirect Object Identification (IOI) and related tasks, we find that patch-effect graphs preserve discriminative structural signals. Specifically, localized edge-slot features provide higher classification accuracy than global graph-shape descriptors. A screened paired-patching validation suggests that CI- and PC-selected candidate edges correspond to stronger activation-influence effects than random or low-rank candidates. Crucially, by evaluating these representations against rigorous prompt-only and raw patch-effect controls, we make the evidential scope of the benchmark explicit: graph features compress structured patching signal, while raw tensors and surface cues define strong baselines that any circuit-level claim should address. Ultimately, our framework provides a compression and evaluation pipeline for comparing patching-derived structures under controlled baselines, separating robust slice-discriminative evidence from stronger task-general causal-circuit claims.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes reframing activation-patching results in transformers as patch-effect graphs constructed via three methods (direct-influence through causal mediation, partial-correlation, and co-influence). Graph kernels are then applied to these graphs on GPT-2 Small for IOI and related tasks. The central claims are that patch-effect graphs preserve discriminative structural signals, with localized edge-slot features yielding higher classification accuracy than global graph-shape descriptors, and that a screened paired-patching validation shows CI/PC-selected edges correspond to stronger activation-influence effects than random or low-rank baselines. The work explicitly benchmarks against prompt-only and raw patch-effect controls to bound the evidential scope.

Significance. If the results hold under the stated controls, the framework supplies a systematic compression and comparison pipeline for high-dimensional patching data across prompts and tasks. The explicit prompt-only and raw-tensor baselines are a clear strength, as they prevent overclaiming causal-circuit discovery while still allowing structured-signal analysis. The approach could help mechanistic interpretability move from case-by-case circuit identification toward reproducible, graph-based evaluation of patching-derived structures.

major comments (2)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the claim that localized edge-slot features provide higher classification accuracy than global descriptors depends on the three graph-construction methods faithfully encoding causal activation-patching structure. Direct-influence is causal by construction, but partial-correlation and co-influence are statistical; the paired-patching validation shows stronger effects for selected edges yet does not isolate whether the accuracy gains arise from causal circuits rather than correlational artifacts that the kernels later exploit.
  2. [Methods] Methods (graph construction): no explicit test is reported that compares the discriminative power of the three constructions against a purely interventional (causal-mediation-only) baseline while holding the kernel fixed. Without this, it remains possible that the reported advantage of localized features is driven by the statistical methods' sensitivity to non-causal associations rather than by preservation of patching-derived causal structure.
minor comments (3)
  1. [Methods] Provide the exact definitions and formulas for 'edge-slot features' versus 'global graph-shape descriptors' (including which kernels are used for each) so that the classification comparison can be reproduced.
  2. [Evaluation] Report raw accuracy numbers, standard errors, and the precise statistical test used for the 'higher classification accuracy' claim rather than qualitative statements.
  3. [Evaluation] Clarify how the 'screened' paired-patching validation avoids post-hoc selection bias when comparing CI/PC edges against random and low-rank candidates.
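The screened paired-patching comparison the referee probes in minor comment 3 can be made concrete as a per-prompt paired test between candidate and random edges. The effect values below are simulated purely for illustration, and the pairing is idealized:

```python
import numpy as np

def paired_edge_validation(selected_effects, random_effects):
    """Paired comparison sketch: for each prompt, the patching effect of a
    candidate (CI/PC-selected) edge versus a matched random edge.

    Returns the mean paired difference and the fraction of prompts where
    the candidate edge produces the larger effect (a sign-test statistic).
    """
    diff = selected_effects - random_effects
    return float(diff.mean()), float((diff > 0).mean())

# Simulated per-prompt effect sizes for 100 prompts (illustrative only):
rng = np.random.default_rng(2)
sel = rng.normal(1.0, 0.5, 100)  # selected edges: larger effects on average
rnd = rng.normal(0.2, 0.5, 100)  # random edges: smaller effects
mean_diff, win_rate = paired_edge_validation(sel, rnd)
print(mean_diff, win_rate)
```

The selection-bias worry is that screening picks `sel` using the same data that later scores it; holding out the prompts used for screening from those used for the paired test is the standard remedy.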

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and positive assessment of the framework's potential. We address the major comments point by point below, indicating the revisions we will make to clarify the causal aspects of our analysis.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the claim that localized edge-slot features provide higher classification accuracy than global descriptors depends on the three graph-construction methods faithfully encoding causal activation-patching structure. Direct-influence is causal by construction, but partial-correlation and co-influence are statistical; the paired-patching validation shows stronger effects for selected edges yet does not isolate whether the accuracy gains arise from causal circuits rather than correlational artifacts that the kernels later exploit.

    Authors: We agree that it is important to distinguish the causal nature of the constructions. The direct-influence method relies on causal mediation analysis, which is interventional by design. In contrast, partial-correlation and co-influence are statistical associations derived from the patching data. Our results show that localized edge-slot features outperform global graph-shape descriptors consistently across all three methods, including the causal direct-influence construction. This suggests that the advantage is not limited to statistical artifacts. The screened paired-patching validation further supports that CI and PC edges align with stronger influence effects. To strengthen the manuscript, we will revise the abstract and evaluation section to explicitly state that the performance advantage holds for the causal construction and discuss the complementary role of the statistical methods. We will also add a sentence noting the evidential scope regarding causality. revision: yes

  2. Referee: [Methods] Methods (graph construction): no explicit test is reported that compares the discriminative power of the three constructions against a purely interventional (causal-mediation-only) baseline while holding the kernel fixed. Without this, it remains possible that the reported advantage of localized features is driven by the statistical methods' sensitivity to non-causal associations rather than by preservation of patching-derived causal structure.

    Authors: We acknowledge that we did not report a direct comparison isolating the causal-mediation-only graphs against the others with the kernel held fixed. Our current evaluation presents results for each graph construction independently, demonstrating the localized feature advantage in each case. To address this concern rigorously, we will perform and include in the revised manuscript an additional experiment: we will fix the graph kernel and compare classification accuracies using only direct-influence graphs versus the partial-correlation and co-influence graphs. This will help isolate whether the gains stem from causal structure preservation or statistical sensitivities. The results of this analysis will be added to the Evaluation section, along with updated figures if necessary. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims rest on external benchmarks and empirical validation

Full rationale

The paper frames activation-patching data as patch-effect graphs via three explicit construction methods (direct-influence, partial-correlation, co-influence), then applies graph kernels and reports classification accuracies plus screened paired-patching results. All load-bearing claims are evaluated against prompt-only and raw patch-effect controls, which are independent of the authors' fitted graph features. No equation or derivation reduces a reported quantity to a parameter defined from the same quantity; no self-citation chain or uniqueness theorem is invoked to force the outcome. The derivation chain therefore remains self-contained against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not introduce new physical entities or fitted constants. The framework rests on standard assumptions from causal mediation analysis and graph theory that are treated as background rather than derived or invented here.

pith-pipeline@v0.9.0 · 5534 in / 1222 out tokens · 73870 ms · 2026-05-08T09:39:18.229379+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1] Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.

  2. [2] Arthur Conmy, Augustine N Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adria Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems, 2023.

  3. [3] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.

  4. [4] Nicholas Goldowsky-Dill, Chris MacLeod, Buck Shlegeris, and Nate Bhatt. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.

  5. [5] Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems, 2023.

  6. [6] Risi Imre Kondor and John Lafferty. Diffusion kernels on graphs and other discrete input spaces. In International Conference on Machine Learning, 2002.

  7. [7] Nils M Kriege, Fredrik D Johansson, and Christopher Morris. A survey on graph kernels. Applied Network Science, 5(1):1–42, 2020.

  8. [8] Neel Nanda. Attribution patching: activation patching at industrial scale. Alignment Forum, 2023. URL https://www.neelnanda.io/mechanistic-interpretability/attribution-patching.

  9. [9] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.

  10. [10] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2009.

  11. [11] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.

  12. [12] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

  13. [13] Nino Shervashidze, SVN Vishwanathan, Tobias Petri, Kurt Mehlhorn, and Karsten M Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.

  14. [14] Nino Shervashidze, Pascal Schweitzer, Erik Jan Van Leeuwen, Kurt Mehlhorn, and Karsten M Borgwardt. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.

  15. [15] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

  16. [16] Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations, 2023.