Data-driven Circuit Discovery for Interpretability of Language Models
Pith reviewed 2026-05-12 02:16 UTC · model grok-4.3
The pith
Existing methods for discovering circuits in language models recover subgraphs tied to the dataset rather than to the intended task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that existing circuit discovery methods recover circuits tied to the specific dataset rather than to the general task, as evidenced by low cross-faithfulness between circuits from semantically similar but distinct datasets and by the ability of a single circuit to achieve high faithfulness on mixed-task data. It introduces Data-driven Circuit Discovery, which drops the single-circuit assumption by clustering dataset examples based on the model's internal processing patterns and then extracting a dedicated circuit for each cluster.
What carries the argument
Data-driven Circuit Discovery (DCD), a framework that first clusters examples by similarity of model processing and then locates a separate circuit for each resulting group.
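The two-stage pipeline described above (cluster by processing similarity, then discover one circuit per cluster) can be sketched in a few lines. Everything below is hypothetical scaffolding: the signature function, the tiny k-means, and `discover_circuit` are illustrative stand-ins, not the paper's actual metric or discovery algorithm.

```python
# Hypothetical sketch of the DCD framework: compute a per-example
# "processing signature", cluster signatures, then run circuit discovery
# once per cluster. All names and choices here are assumptions.
import random

def processing_signature(example):
    # Stand-in for model internals (e.g., layer activations); a toy
    # deterministic feature keeps the sketch self-contained.
    return [float(len(example)), float(sum(ord(c) for c in example) % 7)]

def kmeans(points, k, iters=20):
    # Tiny k-means; a real run would use sklearn.cluster.KMeans on
    # activation vectors instead.
    random.seed(0)
    centers = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else centers[i]
                   for i, g in enumerate(groups)]
    return groups

def discover_circuit(group):
    # Placeholder for any circuit-discovery algorithm (e.g., ACDC or
    # attribution patching) run only on this cluster's examples.
    return {"edges": len(group)}

examples = ["John gave Mary a book", "2 + 2 =", "Paris is in", "5 * 3 ="]
points = [processing_signature(e) for e in examples]
clusters = kmeans(points, k=2)
circuits = [discover_circuit(g) for g in clusters if g]
```

The key structural point is that circuit discovery runs per cluster, so distinct mechanisms are never forced into one subgraph.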
If this is right
- A single human-defined task can be implemented by several distinct circuits within the same model.
- Circuits recovered by existing methods are more faithful to the particular dataset used than to the broader task.
- Grouping examples by processing similarity allows distinct mechanisms to be analyzed separately instead of being merged.
- Each per-group circuit explains model behavior on its examples more accurately than any unified circuit explains the full dataset.
Where Pith is reading between the lines
- Interpretability research may need to shift from assuming one circuit per task to identifying and comparing multiple mechanisms that models actually use.
- The clustering step could be tested on other model families or modalities to check whether similar hidden structure appears outside language models.
- If the clusters correspond to observable behavioral differences, the method might help diagnose when models switch strategies on edge cases within a task.
Load-bearing premise
That clustering examples according to how the model processes them will isolate genuinely distinct computational mechanisms rather than arbitrary or spurious groupings.
What would settle it
Run DCD on a mixed dataset of two tasks whose separate circuits already show near-zero cross-faithfulness; if the resulting per-cluster circuits do not each achieve higher faithfulness on their own group than a single circuit achieves on the whole mixed set, the core claim would be refuted.
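The refutation criterion above reduces to a simple comparison, which can be written down explicitly. The faithfulness values here are invented for illustration, and `dcd_refuted` is a hypothetical helper, not anything from the paper.

```python
# Decision rule for the settling experiment: the core claim fails if any
# per-cluster circuit does not beat the single circuit's faithfulness on
# the whole mixed set. Numbers below are made up for illustration.
def dcd_refuted(per_cluster_faithfulness, single_circuit_faithfulness):
    return not all(f > single_circuit_faithfulness
                   for f in per_cluster_faithfulness)

# Two clusters whose circuits score 0.92 and 0.88 on their own groups,
# versus a single circuit scoring 0.70 on the mixed dataset.
print(dcd_refuted([0.92, 0.88], 0.70))  # False: the claim survives this test
```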
Original abstract
Circuit discovery aims to explain how language models (LMs) implement a specific task by localizing and interpreting a circuit, a computational subgraph responsible for the LM's behavior. Existing circuit discovery methods are hypothesis-driven; they first informally define a task with a dataset, and then apply a circuit discovery algorithm over that dataset to obtain a single circuit. This imposes two strong assumptions: that the LM implements the task with a single circuit, and that the dataset adequately represents the task as humans understand it. We systematically test these assumptions across four previously studied tasks and find that even minor dataset variations that preserve task semantics can produce circuits with low edge overlap and cross-dataset faithfulness. More strikingly, when applied to a mixed dataset with two distinct tasks whose separately discovered circuits have near-zero cross-faithfulness, existing methods still return a single circuit with high faithfulness across both tasks. This indicates that current methods discover dataset-specific circuits, rather than general task circuits. We propose Data-driven Circuit Discovery (DCD), a new discovery framework that drops both assumptions: instead of returning a single circuit for a dataset, DCD first clusters examples in the dataset by how similarly the model processes them and discovers a separate circuit for each group. This allows distinct mechanisms to appear separately rather than merged into a single circuit; each circuit explains its group, not the full task. Experiments show that DCD discovers multiple circuits per dataset, each more faithful to its group than a single circuit discovered by existing methods. Broadly, DCD lets the data reveal mechanistic structure within LMs, rather than relying on human-defined task boundaries that may not align with how models organize their computation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that existing circuit discovery methods for language models rest on two assumptions—that a task is implemented by a single circuit and that a given dataset adequately represents the task—and tests these by showing low edge overlap and cross-faithfulness even for minor dataset variations that preserve semantics. On mixed datasets combining two tasks whose separate circuits have near-zero cross-faithfulness, standard methods still recover a single high-faithfulness circuit, suggesting they capture dataset-specific rather than task-general structure. The authors introduce Data-driven Circuit Discovery (DCD), which clusters examples by similarity of model processing and extracts a separate circuit per cluster; experiments indicate each such circuit is more faithful to its subgroup than a single circuit found by prior methods.
Significance. If the clustering step reliably isolates distinct computational mechanisms rather than surface-feature correlations, DCD would represent a meaningful shift from hypothesis-driven to data-driven interpretability, allowing mechanistic structure to emerge from the data instead of human task definitions. The mixed-dataset experiment provides a concrete demonstration that current methods can merge distinct circuits, and the per-group faithfulness gains are a falsifiable, quantitative claim. The work also supplies a practical framework that could be applied to other interpretability pipelines.
Major comments (3)
- [Methods] Clustering procedure (Methods section): the claim that DCD isolates distinct mechanisms rests on the unstated similarity metric used to cluster examples 'by how similarly the model processes them.' Without an explicit definition (e.g., cosine similarity on which layer activations, gradient alignment, or logit differences) and controls for confounds such as sequence length or token overlap, it is possible that clusters reflect spurious correlations rather than shared computational pathways; this directly affects whether the reported per-group faithfulness improvement is mechanistic or an artifact of reduced variance in smaller subproblems.
- [Experiments] Mixed-dataset experiment (Experiments section): the central empirical result—that existing methods return a single circuit with high faithfulness on a mixture whose separately discovered circuits have near-zero cross-faithfulness—requires the exact faithfulness values, the models and datasets used, and any statistical controls or ablation studies. The abstract states the qualitative outcome but does not report quantitative thresholds or variance estimates, making it difficult to assess whether the 'high faithfulness' is robust or merely consistent with a merged but still dataset-specific circuit.
- [Experiments] Faithfulness evaluation on subgroups (Experiments section): when faithfulness is measured on the derived clusters rather than the original task distribution, it is necessary to verify that the metric remains well-calibrated (e.g., via comparison to random or size-matched baselines). If the per-group circuits simply solve easier subproblems, the reported improvement over a single circuit could be expected by construction rather than evidence of better mechanistic discovery.
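To make the first major comment concrete, here is a minimal sketch of one plausible similarity metric: cosine similarity between per-example activation vectors, computed on length-matched inputs so that sequence length cannot drive the grouping. The layer choice and the control are assumptions, not the paper's stated procedure.

```python
# Cosine similarity between two per-example activation vectors. The
# activations below are toy values; a real pipeline would extract them
# from a chosen model layer, with inputs matched on sequence length.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "activations" for two examples of equal sequence length, so that
# length itself cannot inflate the similarity.
act_a = [0.2, 0.9, 0.1, 0.4]
act_b = [0.1, 0.8, 0.2, 0.5]
sim = cosine(act_a, act_b)
```

A confound check in this spirit would also compare clusters against groupings induced by surface features (length, token overlap) to rule out spurious structure.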
Minor comments (2)
- Notation for circuits and edges should be standardized across figures and text; currently the abstract refers to 'edge overlap' without defining the precise edge representation used in the overlap calculation.
- The paper would benefit from an explicit statement of the four previously studied tasks and the models (e.g., GPT-2, Llama) on which all experiments were run, preferably in a table in the experimental setup.
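One standard way to pin down "edge overlap", as the first minor comment requests, is Jaccard similarity over sets of (source, target) edges. Whether the paper uses this representation is not stated, so treat the sketch and the node names as assumptions.

```python
# Jaccard edge overlap between two circuits represented as sets of
# (source, target) edges. Node names are hypothetical.
def edge_overlap(circuit_a, circuit_b):
    a, b = set(circuit_a), set(circuit_b)
    return len(a & b) / len(a | b) if a | b else 1.0

c1 = {("attn.0.head3", "mlp.1"), ("mlp.1", "attn.2.head0")}
c2 = {("attn.0.head3", "mlp.1"), ("mlp.3", "logits")}
print(edge_overlap(c1, c2))  # 1 shared edge out of 3 distinct: 0.333...
```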
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Methods] Clustering procedure (Methods section): the claim that DCD isolates distinct mechanisms rests on the unstated similarity metric used to cluster examples 'by how similarly the model processes them.' Without an explicit definition (e.g., cosine similarity on which layer activations, gradient alignment, or logit differences) and controls for confounds such as sequence length or token overlap, it is possible that clusters reflect spurious correlations rather than shared computational pathways; this directly affects whether the reported per-group faithfulness improvement is mechanistic or an artifact of reduced variance in smaller subproblems.
Authors: We agree that an explicit definition of the similarity metric and controls for confounds are required for the claim to be fully substantiated. The current Methods section describes clustering by similarity of model processing but does not provide the precise implementation details or confound controls. In the revised manuscript we will add a complete specification of the metric (cosine similarity on selected layer activations), the clustering algorithm, and ablation experiments that control for sequence length and token overlap. These additions will allow readers to evaluate whether the clusters reflect shared computational pathways. revision: yes
Referee: [Experiments] Mixed-dataset experiment (Experiments section): the central empirical result—that existing methods return a single circuit with high faithfulness on a mixture whose separately discovered circuits have near-zero cross-faithfulness—requires the exact faithfulness values, the models and datasets used, and any statistical controls or ablation studies. The abstract states the qualitative outcome but does not report quantitative thresholds or variance estimates, making it difficult to assess whether the 'high faithfulness' is robust or merely consistent with a merged but still dataset-specific circuit.
Authors: We acknowledge that the abstract presents only a qualitative summary. The Experiments section already specifies the models, datasets, and reports faithfulness values together with variance across random seeds and statistical controls. We will revise the abstract to include the key quantitative thresholds and will add any missing ablation details to the main text so that the robustness of the mixed-dataset result is fully documented. revision: yes
Referee: [Experiments] Faithfulness evaluation on subgroups (Experiments section): when faithfulness is measured on the derived clusters rather than the original task distribution, it is necessary to verify that the metric remains well-calibrated (e.g., via comparison to random or size-matched baselines). If the per-group circuits simply solve easier subproblems, the reported improvement over a single circuit could be expected by construction rather than evidence of better mechanistic discovery.
Authors: We agree that calibration of the faithfulness metric on the derived subgroups must be demonstrated. In the revised manuscript we will include additional baselines: random circuits of comparable size and single circuits discovered on size-matched random subsets of the same clusters. These comparisons will show whether the observed faithfulness gains exceed what would be expected from reduced problem variance alone. revision: yes
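The size-matched baseline the authors promise can be sketched as follows. The scoring function, edge names, and "important edges" ground truth are toy stand-ins; a real check would score the faithfulness of random circuits of the same size on the same cluster.

```python
# Calibration check: compare a discovered circuit's score against the
# mean score of size-matched random circuits. All specifics below are
# hypothetical placeholders for the paper's actual faithfulness metric.
import random

def random_baseline(all_edges, circuit_size, score_fn, trials=100, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sample = rng.sample(sorted(all_edges), circuit_size)
        scores.append(score_fn(set(sample)))
    return sum(scores) / len(scores)

# Toy scoring: a circuit is "faithful" in proportion to how many of the
# edges that matter it contains (hypothetical ground truth).
IMPORTANT = {"e1", "e2", "e3"}
score = lambda c: len(c & IMPORTANT) / len(IMPORTANT)

all_edges = {f"e{i}" for i in range(10)}
baseline = random_baseline(all_edges, circuit_size=3, score_fn=score)
discovered = score({"e1", "e2", "e9"})
```

A per-group faithfulness gain is only evidence of better mechanistic discovery if `discovered` clears this baseline by a meaningful margin.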
Circularity Check
No circularity: DCD defined and evaluated independently
Full rationale
The paper defines DCD as an independent procedure (cluster dataset examples by model processing similarity, then run circuit discovery per cluster), followed by separate faithfulness evaluation on the resulting groups. No step reduces by construction to its own inputs, fitted parameters, or self-citations; the clustering metric and faithfulness scores are distinct from the circuit outputs, and the central claim (multiple circuits outperforming a single merged one) is tested empirically rather than assumed. The evaluation is grounded in measurements external to the method's own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: language models implement tasks via localized computational subgraphs (circuits).