pith. machine review for the scientific record.

arxiv: 2304.05969 · v2 · submitted 2023-04-12 · 💻 cs.LG

Localizing Model Behavior with Path Patching

Pith reviewed 2026-05-16 19:34 UTC · model grok-4.3

keywords: path patching, mechanistic interpretability, causal localization, induction heads, transformer circuits, activation interventions, GPT-2 analysis

The pith

Path patching lets researchers test whether a neural network's behavior is localized to a specific set of paths through its components.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces path patching as a method to express and quantitatively evaluate hypotheses that particular model behaviors arise from interactions along defined paths rather than from the full network. This approach moves beyond qualitative inspection by allowing direct causal interventions that measure how much a hypothesized set of paths contributes to an output. If the method works, it supplies a reproducible way to confirm or refute claims about where computation happens inside transformers and similar models. The authors demonstrate it by refining an existing account of induction heads and by characterizing a behavior in GPT-2. The technique also comes with an open-source implementation to support further experiments of the same kind.

Core claim

Path patching replaces activations along selected paths while leaving other activations unchanged, thereby isolating the causal effect of those paths on the model's output. This provides a quantitative test for the claim that a given behavior is localized to the chosen paths rather than distributed across the network. The method is used to sharpen the description of induction heads and to examine a concrete behavior in GPT-2, showing that the localization hypotheses can be stated and measured with greater precision than before.
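
The idea of patching one path while leaving the others alone can be illustrated with a toy example. The sketch below is not the paper's framework or its open-source implementation; the two-component residual model, the functions `comp_a`, `comp_b`, and `forward`, and all numbers are assumptions made up for illustration. Component A reaches the output along two paths (directly, and through component B), and each path can be fed A's output from an alternative input independently of the other.

```python
# Toy residual model with two hypothetical components, A and B.
# Output = x + A(x) + B(x + A(x)), so A influences the output along
# two paths: A -> output (direct) and A -> B -> output (indirect).

def comp_a(x):
    return 2.0 * x

def comp_b(resid):
    # ReLU-like nonlinearity applied to what B reads from the residual stream.
    return max(0.0, resid - 1.0)

def forward(x_base, x_alt, patch_direct=False, patch_via_b=False):
    """Run on x_base, but feed A's output from x_alt along the patched paths."""
    a_base = comp_a(x_base)
    a_alt = comp_a(x_alt)
    # Path A -> B: which version of A's output does B read?
    a_seen_by_b = a_alt if patch_via_b else a_base
    b_out = comp_b(x_base + a_seen_by_b)
    # Path A -> output: which version of A's output does the readout see?
    a_seen_by_out = a_alt if patch_direct else a_base
    return x_base + a_seen_by_out + b_out

baseline = forward(1.0, 0.0)
direct_only = forward(1.0, 0.0, patch_direct=True)
via_b_only = forward(1.0, 0.0, patch_via_b=True)
both_paths = forward(1.0, 0.0, patch_direct=True, patch_via_b=True)

effects = {
    "A -> output": baseline - direct_only,
    "A -> B -> output": baseline - via_b_only,
    "all paths from A": baseline - both_paths,
}
print(effects)
```

With these made-up numbers, each single path accounts for a change of 2.0 in the output and patching both paths accounts for 4.0. Patching both paths is the same as replacing A's activation everywhere, which is why a per-path intervention gives a finer-grained test than patching a whole component.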

What carries the argument

Path patching, an intervention that swaps activations along a hypothesized set of paths to measure their isolated causal contribution to behavior.

If this is right

  • Researchers can state localization hypotheses in terms of explicit paths and obtain numerical evidence for or against them.
  • Existing qualitative accounts of induction heads can be refined by measuring the exact contribution of the relevant paths.
  • The same procedure can be applied to other behaviors in GPT-2 or similar models to produce comparable localization results.
  • An open implementation lowers the cost of running additional path-patching experiments on new hypotheses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique could be extended to compare competing localization hypotheses by pitting their path sets against each other in the same experiment.
  • If path patching proves reliable, it might serve as a building block for automated search over possible localizations rather than manual hypothesis construction.
  • Similar path-based interventions could be tried on models outside the transformer family to check whether the localization pattern holds more generally.

Load-bearing premise

Changing activations only along the chosen paths does not create unintended side effects or interactions that would alter the model's behavior through other routes.

What would settle it

Run path patching on a set of paths hypothesized to produce a specific output and observe whether the output changes as predicted while activations that are not downstream of the patched paths remain unchanged.

Original abstract

Localizing behaviors of neural networks to a subset of the network's components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces path patching, a technique for expressing and quantitatively testing hypotheses that neural network behaviors are localized to specific sets of paths through components. It applies the method to refine prior explanations of induction heads, characterizes a behavior in GPT-2, and releases an open-source framework for similar experiments.

Significance. If the isolation property holds, path patching would provide a useful quantitative tool for mechanistic interpretability, moving beyond ad-hoc qualitative localization claims. The open-sourced framework is a clear strength that supports reproducibility and extension by others.

major comments (1)
  1. [§3] §3 (Path Patching): The central claim that the intervention isolates causal contributions along hypothesized paths requires that residual-stream interactions with non-path components remain unchanged. No explicit measurement or ablation of cross-path leakage or downstream interference is reported, which is load-bearing for the quantitative evaluation of the induction-head and GPT-2 results.
minor comments (2)
  1. [Abstract] Abstract: The claim that the method 'refines' the induction-head explanation is stated without specifying the concrete change relative to prior work (e.g., what new quantitative evidence is added).
  2. [Experiments] Experiments: Figure captions and tables would benefit from explicit reporting of the exact quantitative metric (e.g., logit difference or accuracy delta) used to assess localization success.
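
The two metrics named in the second minor comment can be written down in a few lines. This is a minimal sketch with made-up logits, predictions, and token names (`tok_a`, `tok_b`); it does not reproduce any number from the paper, and the helper names `logit_diff` and `accuracy` are hypothetical.

```python
def logit_diff(logits, correct, incorrect):
    """Logit of the correct token minus logit of a chosen incorrect token."""
    return logits[correct] - logits[incorrect]

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Made-up logits for one prompt, before and after an intervention.
clean_logits = {"tok_a": 3.5, "tok_b": 1.0}
patched_logits = {"tok_a": 2.0, "tok_b": 1.5}

clean_ld = logit_diff(clean_logits, "tok_a", "tok_b")
patched_ld = logit_diff(patched_logits, "tok_a", "tok_b")
ld_drop = clean_ld - patched_ld

# Made-up predictions over four prompts, before and after an intervention.
labels = [1, 1, 0, 1]
clean_preds = [1, 1, 0, 1]
patched_preds = [1, 0, 0, 0]
acc_delta = accuracy(clean_preds, labels) - accuracy(patched_preds, labels)

print(ld_drop, acc_delta)
```

Logit difference is continuous, so it registers partial effects that leave the top prediction unchanged; accuracy delta only moves when a prediction flips. Stating which one a figure reports matters for that reason.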

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. The concern about explicitly verifying the isolation property is substantive, and we have revised the manuscript to include new measurements addressing cross-path leakage and downstream interference.

point-by-point responses
  1. Referee: [§3] §3 (Path Patching): The central claim that the intervention isolates causal contributions along hypothesized paths requires that residual-stream interactions with non-path components remain unchanged. No explicit measurement or ablation of cross-path leakage or downstream interference is reported, which is load-bearing for the quantitative evaluation of the induction-head and GPT-2 results.

    Authors: We agree that an explicit check on residual-stream interactions strengthens the quantitative claims. Path patching replaces activations only along the hypothesized path while running the remainder of the forward pass on the clean input; by construction this keeps non-path component inputs identical to the clean run except for the direct contributions arriving via the patched path. Nevertheless, to address the referee's point we have added Section 3.4, which reports an ablation measuring L2-norm changes to activations of all non-path components before versus after patching. For the induction-head experiments the median change is below 4% and does not alter the reported effect sizes; analogous results hold for the GPT-2 behavior. We also include a short discussion of why downstream interference is already captured by the path-patching metric itself. These additions are now load-bearing for the revised quantitative claims. revision: yes
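
    The kind of leakage check described in this simulated response could look roughly like the following. It is a sketch under stated assumptions: the response does not say how the L2-norm change is normalized, so the code assumes a relative change (norm of the difference divided by the norm of the clean activation), and the component names and activation values are invented for illustration.

```python
import math
from statistics import median

def l2(vec):
    return math.sqrt(sum(v * v for v in vec))

def relative_l2_change(before, after):
    """Assumed normalization: ||after - before|| / ||before||."""
    diff = [a - b for a, b in zip(after, before)]
    return l2(diff) / l2(before)

# Invented activations of non-path components, before and after patching.
clean_acts = {
    "head_0": [3.0, 4.0],
    "mlp_0": [0.0, 5.0],
    "head_1": [6.0, 8.0],
}
patched_acts = {
    "head_0": [3.0, 4.0],
    "mlp_0": [0.0, 5.1],
    "head_1": [6.3, 8.4],
}

changes = {
    name: relative_l2_change(clean_acts[name], patched_acts[name])
    for name in clean_acts
}
median_change = median(changes.values())
print(changes, median_change)
```

    A median hides the tail, so a check like this would be more informative if it also reported the maximum change across components.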

Circularity Check

0 steps flagged

Path patching introduced as independent experimental technique with no derivation chain

full rationale

The paper presents path patching as a new methodological tool for expressing and testing localization hypotheses in neural networks. No equations, parameters, or results are derived from prior fitted values or self-referential definitions. The abstract and description frame it as an experimental technique applied to induction heads and GPT-2 behaviors, without any load-bearing self-citations, ansatz smuggling, or renaming of known results as derivations. The central claim rests on the validity of the intervention method itself rather than reducing to its own inputs by construction. This is a standard case of a methods paper with self-contained content against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that network behaviors can be meaningfully localized to paths and that patching operations can isolate those paths. No free parameters or invented physical entities are described in the abstract.

axioms (1)
  • domain assumption: Neural network behaviors can be localized to subsets of paths through components
    This is the core hypothesis class the method is designed to test.
invented entities (1)
  • path patching (no independent evidence)
    purpose: Quantitative test for path-localized behavior hypotheses
    New experimental technique introduced in the paper

pith-pipeline@v0.9.0 · 5387 in / 1026 out tokens · 72381 ms · 2026-05-16T19:34:59.863980+00:00 · methodology

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  3. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

  4. In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

    cs.LG 2026-05 conditional novelty 7.0

    In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.

  5. Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

    cs.CL 2026-05 unverdicted novelty 7.0

    Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

  6. How Language Models Process Negation

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

  7. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...

  8. Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

  9. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  10. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

    cs.AI 2026-05 unverdicted novelty 6.0

    CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

  11. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 6.0

    Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

  12. Patch-Effect Graph Kernels for LLM Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape desc...

  13. Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

    cs.AI 2026-04 conditional novelty 6.0

    Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

  14. Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

    cs.AI 2026-04 unverdicted novelty 6.0

    Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a ...

  15. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...

  16. Automated Attention Pattern Discovery at Scale in Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.

  17. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 5.0

    Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.

  18. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  19. Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

    cs.CL 2026-01 unverdicted novelty 5.0

    The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

  20. How to use and interpret activation patching

    cs.LG 2024-04 accept novelty 5.0

    Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

  21. High-Dimensional Statistics: Reflections on Progress and Open Problems

    math.ST 2026-05 unverdicted novelty 2.0

    A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.
