pith. machine review for the scientific record.

arxiv: 2404.15255 · v1 · submitted 2024-04-23 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

How to use and interpret activation patching

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation patching · mechanistic interpretability · causal interventions · model circuits · metric selection · baseline inputs · interpretability pitfalls

The pith

Activation patching can provide misleading evidence about circuits if metrics and baselines are not chosen carefully.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper gives practical guidance on how to run and read activation patching experiments, a method that swaps model activations across different inputs to test causal roles inside neural networks. It reviews multiple ways to apply the swaps and focuses on what the outcomes actually reveal about circuits, with special attention to metric selection and the traps that arise from poor choices. A sympathetic reader would care because the technique is common for reverse-engineering how models work, yet misapplied experiments can suggest causal contributions that are not really there. The advice centers on making sure results reflect the intended intervention rather than artifacts from baselines or measurement choices.
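As a rough illustration of the mechanism, the sketch below patches one layer's activation from a clean run into a corrupted run using PyTorch forward hooks. The toy two-layer MLP, the inputs, and the layer choice are stand-ins for illustration, not the paper's setup; real experiments hook attention heads or MLP blocks inside a language model.

    # Minimal sketch of activation patching with PyTorch forward hooks.
    # The toy MLP stands in for a transformer component; real experiments
    # hook attention heads or MLP blocks in a language model.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(
        nn.Linear(8, 16), nn.ReLU(),   # "early" component whose output we patch
        nn.Linear(16, 4),              # readout producing logits
    )
    target_layer = model[1]            # patch the post-ReLU activation

    clean_input = torch.randn(1, 8)      # input where the behavior of interest occurs
    corrupted_input = torch.randn(1, 8)  # baseline input where it does not

    # 1. Cache the target activation from the clean run.
    cache = {}
    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    handle = target_layer.register_forward_hook(save_hook)
    clean_logits = model(clean_input)
    handle.remove()

    # 2. Rerun on the corrupted input, swapping in the cached activation.
    #    Returning a tensor from a forward hook overrides the layer's output.
    def patch_hook(module, inputs, output):
        return cache["act"]

    handle = target_layer.register_forward_hook(patch_hook)
    patched_logits = model(corrupted_input)
    handle.remove()

    corrupted_logits = model(corrupted_input)  # unpatched baseline for comparison
    print(clean_logits, corrupted_logits, patched_logits)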

Core claim

Activation patching experiments supply evidence about circuits only when the chosen metric accurately captures the causal effect and the baseline inputs minimize interference from other model parts. Different patching variants exist, but all require careful interpretation to avoid overclaiming what the patched activations control. The paper stresses that the evidence for a circuit depends directly on how the effect is quantified and on whether the baseline setup cleanly isolates the target component.
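To make the metric dependence concrete, the sketch below scores the same hypothetical patched run with a logit-difference metric and with answer probability, then converts each into a normalized restoration score. The logit values, token indices, and the restoration formula (patched minus corrupted, over clean minus corrupted) are illustrative assumptions drawn from common practice in the patching literature, not figures from the paper.

    # Sketch of two common patching metrics and a normalized restoration score.
    # The logit tensors and token indices are made up for illustration.
    import torch

    answer_tok, distractor_tok = 2, 0  # hypothetical correct / competing tokens
    clean_logits = torch.tensor([[0.1, 0.2, 3.0, 0.0]])
    corrupted_logits = torch.tensor([[2.5, 0.1, 0.3, 0.0]])
    patched_logits = torch.tensor([[1.0, 0.1, 2.0, 0.0]])

    def logit_diff(logits):
        # Answer logit minus a competing logit: roughly linear in the residual
        # stream, so partial restoration remains visible.
        return (logits[0, answer_tok] - logits[0, distractor_tok]).item()

    def answer_prob(logits):
        # Softmax probability of the answer: saturates near 0 or 1 and can hide
        # or exaggerate the contribution of the patched component.
        return torch.softmax(logits[0], dim=-1)[answer_tok].item()

    def restoration(metric):
        # 0 = patch leaves the corrupted behavior unchanged,
        # 1 = patch fully restores the clean-run behavior.
        c, x, p = metric(clean_logits), metric(corrupted_logits), metric(patched_logits)
        return (p - x) / (c - x)

    print("logit-diff restoration:", round(restoration(logit_diff), 3))
    print("probability restoration:", round(restoration(answer_prob), 3))

The two metrics can disagree noticeably on the same patched run, which is the kind of divergence the core claim warns about.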

What carries the argument

Activation patching, the replacement of specific activations computed on one input with those from another input to measure resulting changes in model output.

If this is right

  • Researchers obtain more trustworthy maps of which activations causally affect specific outputs.
  • Pitfalls from metric choice are reduced, improving the reproducibility of circuit claims.
  • Interpretations of patching results become consistent enough to compare across studies.
  • Standardized application methods make it easier to rule out setup artifacts in future work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cautions could apply when adapting patching to other causal interventions used in interpretability.
  • In very large models the interference risk may grow, requiring extra baseline checks.
  • Some earlier circuit identifications might shift if re-run with the recommended metric practices.
  • A direct test would compare patching outcomes on a known circuit across several baseline distributions.

Load-bearing premise

That activation patching can isolate the causal contribution of specific activations without substantial interference from other model components or from the choice of baseline inputs.

What would settle it

An experiment in which patching the same activations with two different baselines yields opposite conclusions about whether those activations are necessary for a given behavior.
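A minimal version of that settling experiment might look like the sketch below: the same cached activation is patched into runs under two different baselines, and the resulting verdict is compared. The toy model, the logit-difference metric, the 0.5 threshold, and the particular baseline choices are assumptions for illustration only, not the paper's protocol.

    # Sketch of the settling experiment: one cached activation is patched into
    # runs under two different baselines, and the verdicts are compared.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
    layer = model[1]
    answer_tok, distractor_tok = 2, 0

    def logit_diff(logits):
        return (logits[0, answer_tok] - logits[0, distractor_tok]).item()

    def run(x, patch=None):
        # Forward pass, optionally overriding the layer's output with `patch`.
        if patch is None:
            return model(x)
        handle = layer.register_forward_hook(lambda m, i, o: patch)
        try:
            return model(x)
        finally:
            handle.remove()

    # Cache the activation from the clean run.
    clean_x = torch.randn(1, 8)
    cache = {}
    handle = layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
    clean_score = logit_diff(run(clean_x))
    handle.remove()

    # Evaluate the same patch against two different baseline inputs.
    baselines = {
        "corrupted_prompt": torch.randn(1, 8),  # minimally different input
        "zeros": torch.zeros(1, 8),             # out-of-distribution baseline
    }
    for name, corrupted_x in baselines.items():
        corrupted_score = logit_diff(run(corrupted_x))
        patched_score = logit_diff(run(corrupted_x, patch=cache["act"]))
        restored = (patched_score - corrupted_score) / (clean_score - corrupted_score)
        verdict = "looks causally important" if restored > 0.5 else "looks dispensable"
        print(f"{name}: restoration {restored:.3f} -> {verdict}")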

read the original abstract

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript is a practical guide summarizing advice and best practices for applying activation patching in mechanistic interpretability. It covers different ways to apply the technique and discusses interpretation of results, with a focus on the evidence patching experiments provide about circuits and on pitfalls associated with metric choice and baseline selection.

Significance. If the described advice holds, the paper provides a useful consolidation of experience-based guidance for a widely used technique. This can help standardize practices, reduce misinterpretation risks from metric effects and interference, and improve the reliability of circuit identification claims in interpretability research.

minor comments (1)
  1. [Overview of application methods] The overview of application methods would benefit from a short concrete example illustrating how baseline input choice affects the patching metric, to make the described pitfalls more actionable for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review. The referee's summary accurately reflects the paper's focus on practical guidance for activation patching, including its applications, interpretation of results regarding circuits, and pitfalls related to metrics and baselines. There are no major comments requiring response or revision.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is a descriptive practical guide summarizing experience-based advice on applying and interpreting activation patching experiments. It contains no equations, derivations, fitted parameters, or formal claims that reduce to self-referential inputs. All content consists of qualitative guidance on metrics, baselines, and evidence for circuits, without any load-bearing steps that equate outputs to inputs by construction or via self-citation chains. The central discussion of evidence and pitfalls is self-contained and externally falsifiable through replication of the described experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central advice rests on the domain assumption that activation patching measures causal effects in a circuit-like model; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Activation patching isolates causal contributions of model components to outputs.
    This premise is invoked throughout the discussion of what patching experiments evidence about circuits.

pith-pipeline@v0.9.0 · 5357 in / 1031 out tokens · 27141 ms · 2026-05-16T20:31:21.488340+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

    cs.AI 2026-05 unverdicted novelty 7.0

    Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

  2. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  3. The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction-tuned language models stabilize their next-token predictions later in the forward pass than pretrained models, with late MLP layers providing the strongest tested control point under matched histories.

  4. Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 7.0

    Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

  5. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  6. Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

    cs.AI 2026-05 unverdicted novelty 7.0

    Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

  7. Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

    cs.CV 2026-04 accept novelty 7.0

    Zero-ablation overstates register content dependence in DINO ViTs because mean, noise, and cross-image shuffle replacements preserve performance while zeroing does not.

  8. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  9. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  10. When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

  11. From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers

    cs.CV 2026-04 unverdicted novelty 6.0

    A classification-trained ViT encodes patch boundaries at layers 5-6 and depth at layer 8, with causal interventions showing the depth signal is actively re-derived rather than passively carried.

  12. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  13. From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

    cs.CL 2026-04 unverdicted novelty 6.0

    LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.

  14. Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

    cs.AI 2026-04 conditional novelty 6.0

    Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

  15. Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Hallucination is an early trajectory commitment in transformers governed by asymmetric attractor dynamics, with prompt encoding selecting the basin and correction needing multi-step intervention.

  16. Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...

  17. On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models

    cs.CL 2026-02 unverdicted novelty 6.0

    Suggestive evidence indicates language models develop interconnected social world models by functionally integrating theory of mind and pragmatic reasoning.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Causal scrubbing: A method for rigorously testing interpretability hypotheses

Chan, Lawrence, Adria Garriga-Alonso, Nix Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas (2022). Causal scrubbing: A method for rigorously testing interpretability hypotheses. Alignment Forum. URL: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

  2. [2]

    Towards Automated Circuit Discovery for Mechanistic Interpretability

Conmy, Arthur, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso (Apr. 2023). “Towards Automated Circuit Discovery for Mechanistic Interpretability”. In: arXiv e-prints, arXiv:2304.14997. DOI: 10.48550/arXiv.2304.14997. arXiv: 2304.14997 [cs.LG]

  3. [3]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey (Sept. 2023). “Sparse Autoencoders Find Highly Interpretable Features in Language Models”. In: arXiv e-prints, arXiv:2309.08600, arXiv:2309.08600. DOI: 10.48550/arXiv.2309.08600. arXiv: 2309.08600 [cs.LG]

  4. [4]

    How do Language Models Bind Entities in Context?

    Feng, Jiahai and Jacob Steinhardt (Oct. 2023). “How do Language Models Bind Entities in Context?” In: arXiv e-prints, arXiv:2310.17191, arXiv:2310.17191. DOI: 10.48550/arXiv.2310.17191. arXiv: 2310.17191 [cs.LG]

  5. [5]

    Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

Finlayson, Matthew, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov (June 2021). “Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models”. In: arXiv e-prints, arXiv:2106.06087. DOI: 10.48550/arXiv.2106.06087. arXiv: 2106.06087 [cs.CL]

  6. [6]

    Causal Abstractions of Neural Networks

Geiger, Atticus, Hanson Lu, Thomas Icard, and Christopher Potts (June 2021a). “Causal Abstractions of Neural Networks”. In: arXiv e-prints, arXiv:2106.02997. DOI: 10.48550/arXiv.2106.02997. arXiv: 2106.02997 [cs.AI]

  7. [7]

    Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

Geiger, Atticus, Kyle Richardson, and Christopher Potts (Apr. 2020). “Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation”. In: arXiv e-prints, arXiv:2004.14623. DOI: 10.48550/arXiv.2004.14623. arXiv: 2004.14623 [cs.CL]

  8. [8]

Inducing Causal Structure for Interpretable Neural Networks

Geiger, Atticus, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, and Christopher Potts (Dec. 2021b). “Inducing Causal Structure for Interpretable Neural Networks”. In: arXiv e-prints, arXiv:2112.00826. DOI: 10.48550/arXiv.2112.00826. arXiv: 2112.00826 [cs.LG]

  9. [9]

    Dissecting Recall of Factual Associations in Auto-Regressive Language Models

    Geva, Mor, Jasmijn Bastings, Katja Filippova, and Amir Globerson (Apr. 2023). “Dissecting Recall of Factual Associations in Auto-Regressive Language Models”. In: arXiv e-prints, arXiv:2304.14767, arXiv:2304.14767. DOI: 10.48550/arXiv.2304.14767. arXiv: 2304.14767 [cs.CL]

  10. [10]

    Localizing Model Behavior with Path Patching

    Goldowsky-Dill, Nicholas, Chris MacLeod, Lucas Sato, and Aryaman Arora (Apr. 2023). “Localizing Model Behavior with Path Patching”. In: arXiv e-prints, arXiv:2304.05969, arXiv:2304.05969. DOI: 10.48550/arXiv.2304.05969. arXiv: 2304.05969 [cs.LG]

  11. [11]

    How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

Hanna, Michael, Ollie Liu, and Alexandre Variengien (Apr. 2023). “How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model”. In: arXiv e-prints, arXiv:2305.00586. DOI: 10.48550/arXiv.2305.00586. arXiv: 2305.00586 [cs.CL]

  12. [12]

    Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Hase, Peter, Mohit Bansal, Been Kim, and Asma Ghandeharioun (Jan. 2023). “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models”. In: arXiv e-prints, arXiv:2301.04213. DOI: 10.48550/arXiv.2301.04213. arXiv: 2301.04213 [cs.LG]

  13. [13]

    The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

    Hase, Peter, Harry Xie, and Mohit Bansal (June 2021). “The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations”. In: arXiv e-prints, arXiv:2106.00786, arXiv:2106.00786. DOI: 10.48550/arXiv.2106.00786. arXiv: 2106.00786 [cs.LG]

  14. [14]

A circuit for Python docstrings in a 4-layer attention-only transformer

Heimersheim, Stefan and Jett Janiak (2023). A circuit for Python docstrings in a 4-layer attention-only transformer. URL: https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only

  15. [15]

    In-Context Learning Creates Task Vectors

Hendel, Roee, Mor Geva, and Amir Globerson (Oct. 2023). “In-Context Learning Creates Task Vectors”. In: arXiv e-prints, arXiv:2310.15916. DOI: 10.48550/arXiv.2310.15916. arXiv: 2310.15916 [cs.CL]

  16. [16]

    Rigorously Assessing Natural Language Explanations of Neurons

    Huang, Jing, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, and Christopher Potts (Sept. 2023). “Rigorously Assessing Natural Language Explanations of Neurons”. In: arXiv e-prints, arXiv:2309.10312, arXiv:2309.10312. DOI: 10.48550/arXiv.2309.10312. arXiv: 2309.10312 [cs.CL]

  17. [17]

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Lieberum, Tom, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik (July 2023). “Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”. In: arXiv e-prints, arXiv:2307.09458. DOI: 10.48550/arXiv.2307.09458. arXiv: 2307.09458 [cs.LG]

  18. [18]

    Copy Suppression: Comprehensively Understanding an Attention Head

    McDougall, Callum, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda (Oct. 2023). “Copy Suppression: Comprehensively Understanding an Attention Head”. In: arXiv e-prints, arXiv:2310.04625, arXiv:2310.04625. DOI: 10.48550/arXiv.2310.04625. arXiv: 2310.04625 [cs.LG]

  19. [19]

    The Hydra Effect: Emergent Self-repair in Language Model Computations

McGrath, Thomas, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, and Shane Legg (July 2023). “The Hydra Effect: Emergent Self-repair in Language Model Computations”. In: arXiv e-prints, arXiv:2307.15771. DOI: 10.48550/arXiv.2307.15771. arXiv: 2307.15771 [cs.LG]

  20. [20]

    Locating and Editing Factual Associations in GPT

    Meng, Kevin, David Bau, Alex Andonian, and Yonatan Belinkov (Feb. 2022). “Locating and Editing Factual Associations in GPT”. In: arXiv e-prints, arXiv:2202.05262, arXiv:2202.05262. DOI: 10.48550/arXiv.2202.05262. arXiv: 2202.05262 [cs.CL]

  21. [21]

    Circuit Component Reuse Across Tasks in Transformer Language Models

Merullo, Jack, Carsten Eickhoff, and Ellie Pavlick (Oct. 2023). “Circuit Component Reuse Across Tasks in Transformer Language Models”. In: arXiv e-prints, arXiv:2310.08744. DOI: 10.48550/arXiv.2310.08744. arXiv: 2310.08744 [cs.CL]

  22. [22]

    How to Think About Activation Patching

Nanda, Neel (2023). Attribution Patching: Activation Patching At Industrial Scale. Blogpost. Section “How to Think About Activation Patching”. URL: https://www.neelnanda.io/mechanistic-interpretability/attribution-patching

  23. [23]

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Nanda, Neel, SenR, János Kramár, and Rohin Shah (2023). Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1). Alignment Forum. URL: https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

  24. [24]

Discovering the Compositional Structure of Vector Representations with Role Learning Networks

Soulos, Paul, Tom McCoy, Tal Linzen, and Paul Smolensky (Oct. 2019). “Discovering the Compositional Structure of Vector Representations with Role Learning Networks”. In: arXiv e-prints, arXiv:1910.09113. DOI: 10.48550/arXiv.1910.09113. arXiv: 1910.09113 [cs.LG]

  25. [25]

A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis

Stolfo, Alessandro, Yonatan Belinkov, and Mrinmaya Sachan (May 2023). “A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis”. In: arXiv e-prints, arXiv:2305.15054. DOI: 10.48550/arXiv.2305.15054. arXiv: 2305.15054 [cs.CL]

  26. [26]

    Linear Representations of Sentiment in Large Language Models

Tigges, Curt, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda (Oct. 2023). “Linear Representations of Sentiment in Large Language Models”. In: arXiv e-prints, arXiv:2310.15154. DOI: 10.48550/arXiv.2310.15154. arXiv: 2310.15154 [cs.LG]

  27. [27]

    Function Vectors in Large Language Models

Todd, Eric, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau (Oct. 2023). “Function Vectors in Large Language Models”. In: arXiv e-prints, arXiv:2310.15213. DOI: 10.48550/arXiv.2310.15213. arXiv: 2310.15213 [cs.CL]

  28. [28]

    Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias

Vig, Jesse, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber (Apr. 2020). “Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias”. In: arXiv e-prints, arXiv:2004.12265. DOI: 10.48550/arXiv.2004.12265. arXiv: 2004.12265 [cs.CL]

  29. [29]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt (Nov. 2022). “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small”. In: arXiv e-prints, arXiv:2211.00593, arXiv:2211.00593. DOI: 10.48550/arXiv.2211.00593. arXiv: 2211.00593 [cs.LG]

  30. [30]

    Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Zhang, Fred and Neel Nanda (Sept. 2023). “Towards Best Practices of Activation Patching in Language Models: Metrics and Methods”. In: arXiv e-prints, arXiv:2309.16042. DOI: 10.48550/arXiv.2309.16042. arXiv: 2309.16042 [cs.LG]