pith. machine review for the scientific record.

arxiv: 2404.15255 · v1 · submitted 2024-04-23 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

How to use and interpret activation patching

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation patching · mechanistic interpretability · causal interventions · model circuits · metric selection · baseline inputs · interpretability pitfalls

The pith

Activation patching can provide misleading evidence about circuits if metrics and baselines are not chosen carefully.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper gives practical guidance on how to run and read activation patching experiments, a method that swaps model activations across different inputs to test causal roles inside neural networks. It reviews multiple ways to apply the swaps and focuses on what the outcomes actually reveal about circuits, with special attention to metric selection and the traps that arise from poor choices. A sympathetic reader would care because the technique is common for reverse-engineering how models work, yet misapplied experiments can suggest causal contributions that are not really there. The advice centers on making sure results reflect the intended intervention rather than artifacts from baselines or measurement choices.
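As a rough illustration of the mechanism, the sketch below patches one layer's activation from a clean run into a corrupted run using PyTorch forward hooks. The toy two-layer MLP, the inputs, and the layer choice are stand-ins for illustration, not the paper's setup; real experiments hook attention heads or MLP blocks inside a language model.

    # Minimal sketch of activation patching with PyTorch forward hooks.
    # The toy MLP stands in for a transformer component; real experiments
    # hook attention heads or MLP blocks in a language model.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(
        nn.Linear(8, 16), nn.ReLU(),   # "early" component whose output we patch
        nn.Linear(16, 4),              # readout producing logits
    )
    target_layer = model[1]            # patch the post-ReLU activation

    clean_input = torch.randn(1, 8)      # input where the behavior of interest occurs
    corrupted_input = torch.randn(1, 8)  # baseline input where it does not

    # 1. Cache the target activation from the clean run.
    cache = {}
    def save_hook(module, inputs, output):
        cache["act"] = output.detach()

    handle = target_layer.register_forward_hook(save_hook)
    clean_logits = model(clean_input)
    handle.remove()

    # 2. Rerun on the corrupted input, swapping in the cached activation.
    #    Returning a tensor from a forward hook overrides the layer's output.
    def patch_hook(module, inputs, output):
        return cache["act"]

    handle = target_layer.register_forward_hook(patch_hook)
    patched_logits = model(corrupted_input)
    handle.remove()

    corrupted_logits = model(corrupted_input)  # unpatched baseline for comparison
    print(clean_logits, corrupted_logits, patched_logits)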

Core claim

Activation patching experiments supply evidence about circuits only when the chosen metric accurately captures the causal effect and the baseline inputs minimize interference from other model parts. Different patching variants exist, but all require careful interpretation to avoid overclaiming what the patched activations control. The paper stresses that the evidence for a circuit depends directly on how the effect is quantified and on whether the baseline setup cleanly isolates the target component.
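To make the metric dependence concrete, the sketch below scores the same hypothetical patched run with a logit-difference metric and with answer probability, then converts each into a normalized restoration score. The logit values, token indices, and the restoration formula (patched minus corrupted, over clean minus corrupted) are illustrative assumptions drawn from common practice in the patching literature, not figures from the paper.

    # Sketch of two common patching metrics and a normalized restoration score.
    # The logit tensors and token indices are made up for illustration.
    import torch

    answer_tok, distractor_tok = 2, 0  # hypothetical correct / competing tokens
    clean_logits = torch.tensor([[0.1, 0.2, 3.0, 0.0]])
    corrupted_logits = torch.tensor([[2.5, 0.1, 0.3, 0.0]])
    patched_logits = torch.tensor([[1.0, 0.1, 2.0, 0.0]])

    def logit_diff(logits):
        # Answer logit minus a competing logit: roughly linear in the residual
        # stream, so partial restoration remains visible.
        return (logits[0, answer_tok] - logits[0, distractor_tok]).item()

    def answer_prob(logits):
        # Softmax probability of the answer: saturates near 0 or 1 and can hide
        # or exaggerate the contribution of the patched component.
        return torch.softmax(logits[0], dim=-1)[answer_tok].item()

    def restoration(metric):
        # 0 = patch leaves the corrupted behavior unchanged,
        # 1 = patch fully restores the clean-run behavior.
        c, x, p = metric(clean_logits), metric(corrupted_logits), metric(patched_logits)
        return (p - x) / (c - x)

    print("logit-diff restoration:", round(restoration(logit_diff), 3))
    print("probability restoration:", round(restoration(answer_prob), 3))

The two metrics can disagree noticeably on the same patched run, which is the kind of divergence the core claim warns about.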

What carries the argument

Activation patching, the replacement of specific activations computed on one input with those from another input to measure resulting changes in model output.

If this is right

  • Researchers obtain more trustworthy maps of which activations causally affect specific outputs.
  • Pitfalls from metric choice are reduced, improving the reproducibility of circuit claims.
  • Interpretations of patching results become consistent enough to compare across studies.
  • Standardized application methods make it easier to rule out setup artifacts in future work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cautions could apply when adapting patching to other causal interventions used in interpretability.
  • In very large models the interference risk may grow, requiring extra baseline checks.
  • Some earlier circuit identifications might shift if re-run with the recommended metric practices.
  • A direct test would compare patching outcomes on a known circuit across several baseline distributions.

Load-bearing premise

That activation patching can isolate the causal contribution of specific activations without substantial interference from other model components or from the choice of baseline inputs.

What would settle it

An experiment in which patching the same activations with two different baselines yields opposite conclusions about whether those activations are necessary for a given behavior.
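A minimal version of that settling experiment might look like the sketch below: the same cached activation is patched into runs under two different baselines, and the resulting verdict is compared. The toy model, the logit-difference metric, the 0.5 threshold, and the particular baseline choices are assumptions for illustration only, not the paper's protocol.

    # Sketch of the settling experiment: one cached activation is patched into
    # runs under two different baselines, and the verdicts are compared.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
    layer = model[1]
    answer_tok, distractor_tok = 2, 0

    def logit_diff(logits):
        return (logits[0, answer_tok] - logits[0, distractor_tok]).item()

    def run(x, patch=None):
        # Forward pass, optionally overriding the layer's output with `patch`.
        if patch is None:
            return model(x)
        handle = layer.register_forward_hook(lambda m, i, o: patch)
        try:
            return model(x)
        finally:
            handle.remove()

    # Cache the activation from the clean run.
    clean_x = torch.randn(1, 8)
    cache = {}
    handle = layer.register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
    clean_score = logit_diff(run(clean_x))
    handle.remove()

    # Evaluate the same patch against two different baseline inputs.
    baselines = {
        "corrupted_prompt": torch.randn(1, 8),  # minimally different input
        "zeros": torch.zeros(1, 8),             # out-of-distribution baseline
    }
    for name, corrupted_x in baselines.items():
        corrupted_score = logit_diff(run(corrupted_x))
        patched_score = logit_diff(run(corrupted_x, patch=cache["act"]))
        restored = (patched_score - corrupted_score) / (clean_score - corrupted_score)
        verdict = "looks causally important" if restored > 0.5 else "looks dispensable"
        print(f"{name}: restoration {restored:.3f} -> {verdict}")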

read the original abstract

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript is a practical guide summarizing advice and best practices for applying activation patching in mechanistic interpretability. It covers different ways to apply the technique and discusses interpretation of results, with a focus on the evidence patching experiments provide about circuits and on pitfalls associated with metric choice and baseline selection.

Significance. If the described advice holds, the paper provides a useful consolidation of experience-based guidance for a widely used technique. This can help standardize practices, reduce misinterpretation risks from metric effects and interference, and improve the reliability of circuit identification claims in interpretability research.

minor comments (1)
  1. [Overview of application methods] The overview of application methods would benefit from a short concrete example illustrating how baseline input choice affects the patching metric, to make the described pitfalls more actionable for readers.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review. The referee's summary accurately reflects the paper's focus on practical guidance for activation patching, including its applications, interpretation of results regarding circuits, and pitfalls related to metrics and baselines. There are no major comments requiring response or revision.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is a descriptive practical guide summarizing experience-based advice on applying and interpreting activation patching experiments. It contains no equations, derivations, fitted parameters, or formal claims that reduce to self-referential inputs. All content consists of qualitative guidance on metrics, baselines, and evidence for circuits, without any load-bearing steps that equate outputs to inputs by construction or via self-citation chains. The central discussion of evidence and pitfalls is self-contained and externally falsifiable through replication of the described experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central advice rests on the domain assumption that activation patching measures causal effects in a circuit-like model; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Activation patching isolates causal contributions of model components to outputs.
    This premise is invoked throughout the discussion of what patching experiments evidence about circuits.

pith-pipeline@v0.9.0 · 5357 in / 1031 out tokens · 27141 ms · 2026-05-16T20:31:21.488340+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

    cs.AI 2026-05 unverdicted novelty 7.0

    Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

  2. Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

  3. The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass

    cs.LG 2026-05 unverdicted novelty 7.0

    Instruction-tuned language models stabilize their next-token predictions later in the forward pass than pretrained models, with late MLP layers providing the strongest tested control point under matched histories.

  4. Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 7.0

    Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

  5. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  6. Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

    cs.AI 2026-05 unverdicted novelty 7.0

    Llama-3.1-8B computes sums for cyclic concepts using base-10 addition via task-agnostic Fourier features with periods 2, 5, and 10 rather than modular arithmetic in the concept period.

  7. Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

    cs.CV 2026-04 accept novelty 7.0

    Zero-ablation overstates register content dependence in DINO ViTs because mean, noise, and cross-image shuffle replacements preserve performance while zeroing does not.

  8. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  9. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  10. When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

  11. From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers

    cs.CV 2026-04 unverdicted novelty 6.0

    A classification-trained ViT encodes patch boundaries at layers 5-6 and depth at layer 8, with causal interventions showing the depth signal is actively re-derived rather than passively carried.

  12. How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs implement a second-order confidence architecture where the PANL activation encodes both error likelihood and the ability to correct it, beyond verbal confidence or log-probabilities.

  13. From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

    cs.CL 2026-04 unverdicted novelty 6.0

    LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.

  14. Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

    cs.AI 2026-04 conditional novelty 6.0

    Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

  15. Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Hallucination is an early trajectory commitment in transformers governed by asymmetric attractor dynamics, with prompt encoding selecting the basin and correction needing multi-step intervention.

  16. Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...

  17. On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models

    cs.CL 2026-02 unverdicted novelty 6.0

    Suggestive evidence indicates language models develop interconnected social world models by functionally integrating theory of mind and pragmatic reasoning.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 17 Pith papers · 6 internal anchors

  1. [1]

    Causal scrubbing: A method for rigorously testing interpretability hypotheses

Chan, Lawrence, Adria Garriga-Alonso, Nix Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas (2022). Causal scrubbing: A method for rigorously testing interpretability hypotheses. Alignment Forum. URL: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

  2. [2]

    Towards Automated Circuit Discovery for Mechanistic Interpretability

Conmy, Arthur, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso (Apr. 2023). “Towards Automated Circuit Discovery for Mechanistic Interpretability”. In: arXiv e-prints, arXiv:2304.14997. DOI: 10.48550/arXiv.2304.14997. arXiv: 2304.14997 [cs.LG]

  3. [3]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey (Sept. 2023). “Sparse Autoencoders Find Highly Interpretable Features in Language Models”. In: arXiv e-prints, arXiv:2309.08600, arXiv:2309.08600. DOI: 10.48550/arXiv.2309.08600. arXiv: 2309.08600 [cs.LG]

  4. [4]

    How do Language Models Bind Entities in Context?

    Feng, Jiahai and Jacob Steinhardt (Oct. 2023). “How do Language Models Bind Entities in Context?” In: arXiv e-prints, arXiv:2310.17191, arXiv:2310.17191. DOI: 10.48550/arXiv.2310.17191. arXiv: 2310.17191 [cs.LG]

  5. [5]

    Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

Finlayson, Matthew, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov (June 2021). “Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models”. In: arXiv e-prints, arXiv:2106.06087. DOI: 10.48550/arXiv.2106.06087. arXiv: 2106.06087 [cs.CL]

  6. [6]

    Causal Abstractions of Neural Networks

Geiger, Atticus, Hanson Lu, Thomas Icard, and Christopher Potts (June 2021a). “Causal Abstractions of Neural Networks”. In: arXiv e-prints, arXiv:2106.02997. DOI: 10.48550/arXiv.2106.02997. arXiv: 2106.02997 [cs.AI]

  7. [7]

    Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation

Geiger, Atticus, Kyle Richardson, and Christopher Potts (Apr. 2020). “Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation”. In: arXiv e-prints, arXiv:2004.14623. DOI: 10.48550/arXiv.2004.14623. arXiv: 2004.14623 [cs.CL]

  8. [8]

Inducing Causal Structure for Interpretable Neural Networks

Geiger, Atticus, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah D. Goodman, and Christopher Potts (Dec. 2021b). “Inducing Causal Structure for Interpretable Neural Networks”. In: arXiv e-prints, arXiv:2112.00826. DOI: 10.48550/arXiv.2112.00826. arXiv: 2112.00826 [cs.LG]

  9. [9]

    Dissecting Recall of Factual Associations in Auto-Regressive Language Models

    Geva, Mor, Jasmijn Bastings, Katja Filippova, and Amir Globerson (Apr. 2023). “Dissecting Recall of Factual Associations in Auto-Regressive Language Models”. In: arXiv e-prints, arXiv:2304.14767, arXiv:2304.14767. DOI: 10.48550/arXiv.2304.14767. arXiv: 2304.14767 [cs.CL]

  10. [10]

    Localizing Model Behavior with Path Patching

    Goldowsky-Dill, Nicholas, Chris MacLeod, Lucas Sato, and Aryaman Arora (Apr. 2023). “Localizing Model Behavior with Path Patching”. In: arXiv e-prints, arXiv:2304.05969, arXiv:2304.05969. DOI: 10.48550/arXiv.2304.05969. arXiv: 2304.05969 [cs.LG]

  11. [11]

    How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

Hanna, Michael, Ollie Liu, and Alexandre Variengien (Apr. 2023). “How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model”. In: arXiv e-prints, arXiv:2305.00586. DOI: 10.48550/arXiv.2305.00586. arXiv: 2305.00586 [cs.CL]

  12. [12]

    Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Hase, Peter, Mohit Bansal, Been Kim, and Asma Ghandeharioun (Jan. 2023). “Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models”. In: arXiv e-prints, arXiv:2301.04213. DOI: 10.48550/arXiv.2301.04213. arXiv: 2301.04213 [cs.LG]

  13. [13]

    The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

    Hase, Peter, Harry Xie, and Mohit Bansal (June 2021). “The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations”. In: arXiv e-prints, arXiv:2106.00786, arXiv:2106.00786. DOI: 10.48550/arXiv.2106.00786. arXiv: 2106.00786 [cs.LG]

  14. [14]

A circuit for Python docstrings in a 4-layer attention-only transformer

Heimersheim, Stefan and Jett Janiak (2023). A circuit for Python docstrings in a 4-layer attention-only transformer. URL: https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only

  15. [15]

    In-Context Learning Creates Task Vectors

Hendel, Roee, Mor Geva, and Amir Globerson (Oct. 2023). “In-Context Learning Creates Task Vectors”. In: arXiv e-prints, arXiv:2310.15916. DOI: 10.48550/arXiv.2310.15916. arXiv: 2310.15916 [cs.CL]

  16. [16]

    Rigorously Assessing Natural Language Explanations of Neurons

    Huang, Jing, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, and Christopher Potts (Sept. 2023). “Rigorously Assessing Natural Language Explanations of Neurons”. In: arXiv e-prints, arXiv:2309.10312, arXiv:2309.10312. DOI: 10.48550/arXiv.2309.10312. arXiv: 2309.10312 [cs.CL]

  17. [17]

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Lieberum, Tom, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, and Vladimir Mikulik (July 2023). “Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla”. In: arXiv e-prints, arXiv:2307.09458. DOI: 10.48550/arXiv.2307.09458. arXiv: 2307.09458 [cs.LG]

  18. [18]

    Copy Suppression: Comprehensively Understanding an Attention Head

    McDougall, Callum, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda (Oct. 2023). “Copy Suppression: Comprehensively Understanding an Attention Head”. In: arXiv e-prints, arXiv:2310.04625, arXiv:2310.04625. DOI: 10.48550/arXiv.2310.04625. arXiv: 2310.04625 [cs.LG]

  19. [19]

    The Hydra Effect: Emergent Self-repair in Language Model Computations

McGrath, Thomas, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, and Shane Legg (July 2023). “The Hydra Effect: Emergent Self-repair in Language Model Computations”. In: arXiv e-prints, arXiv:2307.15771. DOI: 10.48550/arXiv.2307.15771. arXiv: 2307.15771 [cs.LG]

  20. [20]

    Locating and Editing Factual Associations in GPT

    Meng, Kevin, David Bau, Alex Andonian, and Yonatan Belinkov (Feb. 2022). “Locating and Editing Factual Associations in GPT”. In: arXiv e-prints, arXiv:2202.05262, arXiv:2202.05262. DOI: 10.48550/arXiv.2202.05262. arXiv: 2202.05262 [cs.CL]

  21. [21]

    Circuit Component Reuse Across Tasks in Transformer Language Models

Merullo, Jack, Carsten Eickhoff, and Ellie Pavlick (Oct. 2023). “Circuit Component Reuse Across Tasks in Transformer Language Models”. In: arXiv e-prints, arXiv:2310.08744. DOI: 10.48550/arXiv.2310.08744. arXiv: 2310.08744 [cs.CL]

  22. [22]

    How to Think About Activation Patching

Nanda, Neel (2023). Attribution Patching: Activation Patching At Industrial Scale. Blogpost. Section “How to Think About Activation Patching”. URL: https://www.neelnanda.io/mechanistic-interpretability/attribution-patching

  23. [23]

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Nanda, Neel, SenR, János Kramár, and Rohin Shah (2023). Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1). Alignment Forum. URL: https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall

  24. [24]

Discovering the Compositional Structure of Vector Representations with Role Learning Networks

Soulos, Paul, Tom McCoy, Tal Linzen, and Paul Smolensky (Oct. 2019). “Discovering the Compositional Structure of Vector Representations with Role Learning Networks”. In: arXiv e-prints, arXiv:1910.09113. DOI: 10.48550/arXiv.1910.09113. arXiv: 1910.09113 [cs.LG]

  25. [25]

A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis

Stolfo, Alessandro, Yonatan Belinkov, and Mrinmaya Sachan (May 2023). “A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis”. In: arXiv e-prints, arXiv:2305.15054. DOI: 10.48550/arXiv.2305.15054. arXiv: 2305.15054 [cs.CL]

  26. [26]

    Linear Representations of Sentiment in Large Language Models

Tigges, Curt, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda (Oct. 2023). “Linear Representations of Sentiment in Large Language Models”. In: arXiv e-prints, arXiv:2310.15154. DOI: 10.48550/arXiv.2310.15154. arXiv: 2310.15154 [cs.LG]

  27. [27]

    Function Vectors in Large Language Models

Todd, Eric, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau (Oct. 2023). “Function Vectors in Large Language Models”. In: arXiv e-prints, arXiv:2310.15213. DOI: 10.48550/arXiv.2310.15213. arXiv: 2310.15213 [cs.CL]

  28. [28]

    Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias

Vig, Jesse, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Simas Sakenis, Jason Huang, Yaron Singer, and Stuart Shieber (Apr. 2020). “Causal Mediation Analysis for Interpreting Neural NLP: The Case of Gender Bias”. In: arXiv e-prints, arXiv:2004.12265. DOI: 10.48550/arXiv.2004.12265. arXiv: 2004.12265 [cs.CL]

  29. [29]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt (Nov. 2022). “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small”. In: arXiv e-prints, arXiv:2211.00593, arXiv:2211.00593. DOI: 10.48550/arXiv.2211.00593. arXiv: 2211.00593 [cs.LG]

  30. [30]

    Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Zhang, Fred and Neel Nanda (Sept. 2023). “Towards Best Practices of Activation Patching in Language Models: Metrics and Methods”. In: arXiv e-prints, arXiv:2309.16042. DOI: 10.48550/arXiv.2309.16042. arXiv: 2309.16042 [cs.LG]