Multi-component Causal Tracing in Large Language Models

Ali Tajer; Dennis Wei; Dmitriy A. Katz; Prasanna Sattigeri; Zirui Yan

arxiv: 2606.03085 · v1 · pith:4LCLSZHMnew · submitted 2026-06-02 · 💻 cs.LG · cs.CL

Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer

Pith reviewed 2026-06-28 11:50 UTC · model grok-4.3

keywords: causal tracing, large language models, multi-component intervention, attention heads, MLP neurons, model interpretability, causal pathways, optimization

The pith

A new algorithm identifies groups of LLM components that causally drive target metrics by converting discrete selection into continuous optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework that traces causal effects across multiple internal components of large language models at the same time. It targets subsets of attention heads and MLP neurons that most strongly influence metrics such as accuracy or fairness. The method applies flexible interventions and transforms the combinatorial selection task into a continuous optimization problem that yields binary component choices. This produces an efficient search that the authors show outperforms prior single-component or baseline approaches. A sympathetic reader would care because it supplies a practical route to locate the internal pathways responsible for specific model behaviors.

Core claim

The paper presents a unified framework for multi-component causal tracing that systematically identifies the subsets of model components most critical to a target performance metric. The framework supports flexible interventions across a wide range of metrics. Its algorithm combines soft interventions with a metric transformation to convert the combinatorial search into a continuous problem that can be solved efficiently under constraints and that yields binary decisions for selecting components.

What carries the argument

The efficient algorithm that applies soft interventions and a metric transformation to convert combinatorial multi-component selection into continuous optimization under constraints.
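To make the idea of a continuous relaxation concrete, here is a minimal sketch. It is not the paper's algorithm: the abstract does not specify the metric transformation or the constraints, so everything below is an assumption chosen for illustration. The toy metric (`WEIGHTS`, `PAIR`, `PAIR_GAIN`), the sigmoid-parameterized soft mask, and the two penalty weights (`lam_sparse`, `lam_binary`) are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical toy metric over 6 components: additive effects plus one
# pairwise interaction. These numbers are invented for illustration.
WEIGHTS = [0.9, 0.05, 0.7, 0.02, 0.01, 0.6]
PAIR = (1, 3)        # two components that only matter jointly
PAIR_GAIN = 0.5

def metric(m):
    """Toy target metric evaluated at a (soft or hard) mask m."""
    i, j = PAIR
    return sum(w * x for w, x in zip(WEIGHTS, m)) + PAIR_GAIN * m[i] * m[j]

def metric_grad(m):
    """Analytic gradient of the toy metric with respect to m."""
    g = list(WEIGHTS)
    i, j = PAIR
    g[i] += PAIR_GAIN * m[j]
    g[j] += PAIR_GAIN * m[i]
    return g

def relax_and_round(lam_sparse=0.2, lam_binary=0.1, lr=1.0, steps=500):
    """Gradient ascent on a soft mask m = sigmoid(theta), then thresholding.

    Assumed objective (not taken from the paper):
        metric(m) - lam_sparse * sum(m) - lam_binary * sum(m * (1 - m))
    """
    theta = [0.0] * len(WEIGHTS)          # every soft mask entry starts at 0.5
    for _ in range(steps):
        m = [sigmoid(t) for t in theta]
        g = metric_grad(m)
        for k in range(len(theta)):
            # Derivative of the assumed objective with respect to m[k].
            dm = g[k] - lam_sparse - lam_binary * (1.0 - 2.0 * m[k])
            # Chain rule through the sigmoid: dm/dtheta = m * (1 - m).
            theta[k] += lr * dm * m[k] * (1.0 - m[k])
    m = [sigmoid(t) for t in theta]
    selected = [k for k, x in enumerate(m) if x > 0.5]
    return m, selected

soft_mask, selected = relax_and_round()
hard_mask = [1.0 if k in selected else 0.0 for k in range(len(soft_mask))]
print(selected, metric(hard_mask))
```

In this toy setup, components 1 and 3 have individual effects below the sparsity penalty, yet both end up selected because of their joint term. That is the kind of multi-component effect a one-at-a-time ranking would miss. Whether the paper's relaxation behaves this way on real models is exactly what the review below questions.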

If this is right

Subsets of attention heads and MLP neurons can be traced simultaneously for their joint effect on a metric.
The approach works with flexible interventions across a range of target metrics including accuracy and fairness.
The continuous relaxation yields binary selection decisions that are more efficient than exhaustive search.
Experimental results show the selected subsets have higher impact on the target metric than those found by existing baselines.
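The efficiency point rests on how quickly exhaustive search grows. As a rough illustration, the snippet below counts candidate subsets for the attention heads of GPT-2 small, taking the standard configuration of 12 layers with 12 heads each. The helper names are hypothetical and the subset sizes are arbitrary choices, not values from the paper.

```python
import math

def subsets_of_size(n, k):
    """Number of distinct size-k subsets of n components."""
    return math.comb(n, k)

def all_subsets(n):
    """Number of all subsets of n components (the 2^n search space)."""
    return 2 ** n

n_heads = 12 * 12  # GPT-2 small: 12 layers x 12 attention heads

# Each candidate subset would need its own intervention run in a brute-force search.
for k in (1, 2, 5, 10):
    print(f"size {k}: {subsets_of_size(n_heads, k)} subsets")
print(f"all subsets: {all_subsets(n_heads)}")
```

Even at a handful of heads per subset the count is far beyond what can be evaluated one intervention at a time, which is the motivation for a relaxation in the first place.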

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transformation could be applied to trace components that affect safety-related metrics such as refusal behavior.
Once high-impact subsets are located, targeted fine-tuning or editing could be restricted to those components rather than the full model.
The continuous formulation may generalize to other discrete selection problems in neural network analysis beyond transformers.
Repeated application across different prompts could map how component importance shifts with input distribution.

Load-bearing premise

The soft interventions combined with the metric transformation accurately reflect true causal contributions without introducing bias or losing critical information from the original combinatorial structure.

What would settle it

A controlled test in which the subsets selected by the algorithm are intervened upon yet produce no measurable change in the target metric, or in which the method fails to outperform standard baselines on held-out examples.

Figures

Figures reproduced from arXiv: 2606.03085 by Ali Tajer, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Zirui Yan.

**Figure 2.** Counterfactual intervention in LLMs. We denote the subset of components selected for treatment by H ⊆ C. To specify the components selected for treatment, we define {m_i : i ∈ [N]} such that m_i ≜ 1{c_i ∈ H}, where 1 is the indicator function. Accordingly, we define m ≜ (m_1, …, m_N). In this context, a treatment involves intervening in these components by replacing specific attention weights or neuron…

**Figure 3.** Results of attention heads from GPT2-small on the WinoBias dataset.

**Figure 4.** Results of selecting MLP neurons on the Professions dataset with GPT2-medium.

**Figure 5.** Left: Results of Factual locating measure vs. number of MLP neurons on the CounterFact dataset with distilGPT2. Right: Execution time for different algorithms on Professions and CounterFact datasets.

**Figure 6.** Results on the VBD dataset under two com…

**Figure 7.** Results of an ablation study when removing…

**Figure 8.** Results of selecting attention heads from GPT2-small on the WinoGender dataset.

**Figure 9.** Results of selecting attention heads from GPT2-medium on the WinoGender dataset.

**Figure 10.** Results of selecting attention heads from GPT2-medium on the WinoBias dataset.

**Figure 11.** Results of selecting attention heads on the WinoGender dataset

**Figure 12.** Results of selecting attention heads on the WinoBias dataset: Gender bias measure vs. number of neurons.

**Figure 13.** Results of selecting MLP neurons on the Professions dataset with GPT2-small.

**Figure 14.** Results of selecting MLP neurons on the Professions dataset: Gender bias measure vs. number of neurons.

**Figure 15.** Results of Factual locating measure vs. num…

**Figure 16.** Comparison of properties of m of attentions on WinoGender dataset.

**Figure 17.** Properties of m of attentions on WinoGender dataset. Left: Sparsity S/N. Right: Violation of binary m(1 − m).

**Figure 18.** Results of sparsity vs. λ1 on the WinoGender dataset with GPT2-small.

**Figure 19.** Results of accuracy on the VBD dataset on…
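The notation in the captions of Figures 2 and 17 can be read as simple operations on a mask vector. The sketch below builds the indicator mask m from a selected subset H and computes the two diagnostics named in Figure 17. It rests on assumptions that the captions do not confirm: S is taken to be the sum of the mask entries, and the binary violation is aggregated as a sum of m_i(1 − m_i). The function names are hypothetical.

```python
def indicator_mask(components, selected):
    """m_i = 1 if component c_i is in the selected subset H, else 0."""
    chosen = set(selected)
    return [1 if c in chosen else 0 for c in components]

def sparsity(mask):
    """S/N, assuming S is the total mask mass sum(m) and N = len(m)."""
    return sum(mask) / len(mask)

def binary_violation(mask):
    """Sum of m_i * (1 - m_i); zero exactly when every entry is 0 or 1.

    The aggregation by summation is an assumption for illustration.
    """
    return sum(x * (1 - x) for x in mask)

components = ["h0", "h1", "h2", "h3"]   # hypothetical component labels
hard = indicator_mask(components, {"h1", "h3"})
soft = [0.5, 0.9, 0.1, 0.5]             # a soft mask partway through optimization

print(hard, sparsity(hard), binary_violation(hard))
print(soft, sparsity(soft), binary_violation(soft))
```

A hard mask has zero violation by construction, so a violation curve that falls toward zero over training epochs would indicate that the soft mask is converging to a binary selection.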

Original abstract

Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi-component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a continuous-relaxation algorithm for picking multiple LLM components at once, but the justification that the relaxed solution matches the true discrete causal subsets is missing.

The main takeaway is a practical algorithm that extends single-component causal tracing to simultaneous selection of multiple attention heads and MLP neurons. It turns the combinatorial search over subsets into a continuous optimization by using soft interventions plus a metric transformation, then rounds to binary decisions under constraints.

What is new is the unified framework that handles multi-component selection for target metrics such as accuracy or fairness, rather than tracing one component or layer at a time. The experiments claim the method finds higher-impact subsets than the baselines it compares against, and the code is released.

The soft spot is exactly the one flagged in the stress-test note. The abstract states that the transformation produces proper binary decisions, but supplies no derivation showing the fixed point of the relaxed objective coincides with the argmax over the original 2^n hard-intervention objective. Without that step or supporting error analysis, it is unclear whether the reported subsets reflect genuine causal impact or artifacts of the surrogate. The soundness rating in the reader report looks accurate on this point.

The work is aimed at people already working on LLM interpretability who need a tool for multi-component analysis. A reader focused on downstream uses like editing or fairness checks might extract usable ideas from the experiments, provided the relaxation holds up under closer inspection.

I would send it to peer review so the full derivations and validation can be checked; the core idea is straightforward enough that referees could evaluate it quickly.

Referee Report

3 major / 2 minor

Summary. The paper presents a unified framework for multi-component causal tracing in LLMs. It extends single-component tracing by identifying subsets of components (attention heads, MLP neurons) most critical to target metrics (accuracy, fairness) via flexible interventions. To handle combinatorial complexity, it introduces an efficient algorithm using soft interventions and a metric transformation that converts the discrete search into a continuous optimization problem solved under constraints to yield binary component selections. Experiments claim the method efficiently finds high-impact subsets and outperforms baselines, with code released.

Significance. If the soft-intervention relaxation and metric transformation provably recover the same high-impact subsets as exhaustive hard interventions, the framework would offer a scalable extension of causal tracing to multi-component settings, enabling more systematic interpretability and editing of LLMs. Reproducibility via the linked code repository is a positive factor.

major comments (3)

[Abstract; Method (algorithm description)] Abstract and Method section: the central efficiency claim rests on the assertion that soft interventions plus the unspecified metric transformation, 'under proper constraints,' produce binary decisions whose causal impact matches exhaustive hard interventions over the original 2^n combinatorial objective. No derivation is supplied showing that the fixed point of the relaxed objective coincides with the argmax of the discrete problem; any mismatch would render the reported subsets artifacts of the surrogate rather than true causal drivers.
[Experiments] Experiments section: the claim of outperformance over baselines is presented without reported validation that the continuous relaxation recovers the same component subsets as brute-force hard interventions on small-scale cases (e.g., models with <10 components where 2^n enumeration is feasible). This leaves open whether the efficiency gain comes at the cost of correctness.
[Method] Method section: the 'carefully designed metric transformation' is described only at a high level; without an explicit statement of the transformation (or its fixed-point properties), it is impossible to assess whether it introduces bias or loses information from the original combinatorial structure, as required by the weakest assumption in the reader's report.
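The small-scale check requested in the second major comment can be sketched as follows. This is an illustration of the validation idea only, with a hypothetical toy metric standing in for a real hard-intervention effect. `exhaustive_best`, `agreement`, and `toy_metric` are names invented here, and the Jaccard overlap is one possible agreement measure, not one the paper or the report specifies.

```python
from itertools import combinations

def exhaustive_best(n, k, metric):
    """Enumerate every subset of size <= k among n components and return the best."""
    best_set, best_val = (), metric(())
    for size in range(1, k + 1):
        for subset in combinations(range(n), size):
            val = metric(subset)
            if val > best_val:
                best_set, best_val = subset, val
    return set(best_set), best_val

def agreement(candidate, optimum):
    """Jaccard overlap between a candidate subset and the exhaustive optimum."""
    candidate, optimum = set(candidate), set(optimum)
    union = candidate | optimum
    return len(candidate & optimum) / len(union) if union else 1.0

def toy_metric(subset):
    """Hypothetical hard-intervention effect with one strong joint term."""
    chosen = set(subset)
    effects = {0: 0.30, 1: 0.05, 2: 0.25, 3: 0.05}
    value = sum(effects.get(i, 0.0) for i in chosen)
    if {1, 3} <= chosen:
        value += 0.60   # components 1 and 3 matter mostly together
    return value

optimum, value = exhaustive_best(n=5, k=2, metric=toy_metric)
top_two_singles = {0, 2}   # what ranking components one at a time would pick
print(optimum, value, agreement(top_two_singles, optimum))
```

In this toy case the exhaustive optimum is the interacting pair, and a single-component ranking has zero overlap with it. A subset returned by a relaxation-based method could be passed to `agreement` in the same way, which is the comparison the referee asks to see reported.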

minor comments (2)

[Abstract] The abstract refers to 'existing baseline approaches' without naming them; a brief enumeration in the introduction or related-work section would improve clarity.
[Method] Notation for the soft-intervention parameters and the transformed metric should be introduced with explicit symbols rather than descriptive phrases to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address each major point below and will incorporate the requested clarifications and validations into a revised manuscript.

Referee: [Abstract; Method (algorithm description)] Abstract and Method section: the central efficiency claim rests on the assertion that soft interventions plus the unspecified metric transformation, 'under proper constraints,' produce binary decisions whose causal impact matches exhaustive hard interventions over the original 2^n combinatorial objective. No derivation is supplied showing that the fixed point of the relaxed objective coincides with the argmax of the discrete problem; any mismatch would render the reported subsets artifacts of the surrogate rather than true causal drivers.

Authors: We agree that the manuscript would be strengthened by an explicit derivation establishing equivalence between the relaxed continuous problem and the original discrete objective. In the revision we will add a dedicated subsection in the Method section that derives the fixed-point properties of the metric transformation under the stated constraints and proves that the binary solutions recovered by the continuous optimizer coincide with the argmax of the combinatorial objective. revision: yes
Referee: [Experiments] Experiments section: the claim of outperformance over baselines is presented without reported validation that the continuous relaxation recovers the same component subsets as brute-force hard interventions on small-scale cases (e.g., models with <10 components where 2^n enumeration is feasible). This leaves open whether the efficiency gain comes at the cost of correctness.

Authors: We concur that empirical verification on enumerable small instances is necessary to confirm correctness of the relaxation. The revised manuscript will include new experiments that enumerate all 2^n subsets for models with fewer than 10 components, directly compare the subsets recovered by the continuous method against the exhaustive optimum, and report agreement rates. revision: yes
Referee: [Method] Method section: the 'carefully designed metric transformation' is described only at a high level; without an explicit statement of the transformation (or its fixed-point properties), it is impossible to assess whether it introduces bias or loses information from the original combinatorial structure, as required by the weakest assumption in the reader's report.

Authors: The current description intentionally keeps the transformation at a high level for readability, but we recognize that an explicit formulation is required for rigorous evaluation. The revision will state the precise functional form of the metric transformation, derive its fixed-point properties, and show how the transformation preserves the ranking of component subsets from the original discrete objective. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic relaxation presented as independent contribution

The paper describes a new algorithmic framework that converts a combinatorial multi-component selection problem into a continuous optimization via soft interventions and a metric transformation, solved under constraints to yield binary decisions. No equations, fitted parameters, or self-citations are quoted in the provided text that would make any claimed 'high-impact subset' equivalent by construction to the input metric values or prior results. The central claim rests on the design of the relaxation and empirical outperformance versus baselines, which is an independent algorithmic assertion rather than a definitional or self-referential reduction. This is the normal case of a self-contained method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are described. The method implicitly relies on standard assumptions of causal tracing in neural networks.
