Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Recognition: 1 theorem link (Lean)
Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3
The pith
GPT-2 small solves indirect object identification using a circuit of 26 attention heads in seven classes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GPT-2 small performs indirect object identification by routing information through a specific circuit of 26 attention heads organized into seven main classes, located via systematic causal interventions on attention patterns and residual streams, and shown to satisfy quantitative criteria for faithfulness, completeness, and minimality while leaving some explanatory gaps.
What carries the argument
The IOI circuit: a collection of 26 attention heads divided into seven classes, including name mover, previous token, and induction heads, that cooperate to track and select the indirect object (a minimal sketch of the task and its scoring metric follows below).
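To make the task concrete, here is a minimal sketch of an IOI prompt and the logit-difference score, written against the Hugging Face transformers API rather than the authors' codebase. The prompt follows the paper's template, and the metric is the logit of the indirect object minus the logit of the subject, as the paper describes.

```python
# Minimal sketch: an IOI prompt and the logit-difference metric.
# Assumes the Hugging Face `transformers` package is installed.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "When John and Mary went to the store, John gave a drink to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token logits

io_id = tokenizer.encode(" Mary")[0]  # indirect object (correct answer)
s_id = tokenizer.encode(" John")[0]   # subject (the distractor)

# Positive values mean the model prefers the indirect object's name.
print("logit diff:", (logits[io_id] - logits[s_id]).item())
```

On prompts like this one, GPT-2 small assigns a clearly positive logit difference, which is the behavior the circuit is meant to explain.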
If this is right
- Interventions on the 26 heads can be used to predict and control the model's output on indirect object identification examples.
- The same causal-intervention workflow can be applied to reverse-engineer other natural language behaviors inside the same model.
- Gaps identified by the completeness and minimality checks indicate specific places where additional heads or mechanisms remain to be explained.
- The circuit provides a concrete template for scaling mechanistic explanations to larger models and more complex tasks.
- Similar circuits may appear in other transformer models that perform comparable syntactic tracking.
Where Pith is reading between the lines
- Editing or removing heads inside the circuit could allow targeted suppression of the indirect-object behavior without broadly disrupting language modeling (a minimal ablation sketch follows this list).
- The approach may transfer to understanding how models handle other syntactic dependencies such as subject-verb agreement or coreference resolution.
- If circuits of this size prove common, automated search methods for circuits could become practical for routine interpretability work.
- The existence of a compact circuit for this task suggests that many natural behaviors may be implemented by relatively sparse subnetworks rather than diffuse whole-model activity.
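The first speculation above is directly testable with a simple head-ablation experiment. Below is a hedged sketch that zero-ablates a few name mover heads via hooks on the Hugging Face GPT-2 implementation; the head list is abbreviated, and the paper itself prefers mean ablation (replacing the zeros with dataset-mean activations) over zeroing.

```python
# Hedged sketch: suppress IOI behavior by zero-ablating name mover heads.
# Hooks intercept the input to each attention block's output projection
# (c_proj), where the per-head outputs are still concatenated, so head h
# occupies dimensions h*64:(h+1)*64 of the 768-dim activation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

HEAD_DIM = 64  # 768 hidden dims / 12 heads
NAME_MOVERS = [(9, 9), (9, 6), (10, 0)]  # (layer, head), from the paper

def make_ablate_hook(head):
    sl = slice(head * HEAD_DIM, (head + 1) * HEAD_DIM)
    def hook(module, args):
        z = args[0].clone()
        z[..., sl] = 0.0  # zero-ablate this head's output
        return (z,) + args[1:]
    return hook

handles = [
    model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(
        make_ablate_hook(head))
    for layer, head in NAME_MOVERS
]

prompt = "When John and Mary went to the store, John gave a drink to"
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
for h in handles:
    h.remove()

io, s = tok.encode(" Mary")[0], tok.encode(" John")[0]
print("ablated logit diff:", (logits[io] - logits[s]).item())
```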
Load-bearing premise
The three criteria of faithfulness, completeness, and minimality are enough to certify that the identified set of heads forms the complete and minimal explanation rather than one of several circuits that could achieve similar task performance.
What would settle it
Locating a different collection of heads that achieves equal or higher scores on the faithfulness, completeness, and minimality metrics while using fewer heads or different attention patterns would show that the reported circuit is not the minimal explanation.
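One way such a test could be run, sketched under assumptions: a greedy forward search over all 144 heads of GPT-2 small, scored by a patching-based faithfulness evaluator. Here `score_faithfulness` is a hypothetical stand-in for such an evaluator (mean-ablate every head outside the given set, return the average IOI logit difference), not a function from the paper's codebase.

```python
# Hypothetical sketch: greedy search for a competing circuit.
# `score_faithfulness(heads)` is an assumed evaluator, not from the paper.
from itertools import product

ALL_HEADS = list(product(range(12), range(12)))  # (layer, head) pairs

def greedy_circuit_search(score_faithfulness, target_score, max_size=26):
    chosen, best_score = [], float("-inf")
    while len(chosen) < max_size:
        # Score every candidate extension of the current subset.
        gains = {
            head: score_faithfulness(chosen + [head])
            for head in ALL_HEADS if head not in chosen
        }
        head, best_score = max(gains.items(), key=lambda kv: kv[1])
        chosen.append(head)
        if best_score >= target_score:  # matched the reported circuit early
            break
    return chosen, best_score
```

A run that reaches the target score with fewer than 26 heads would be exactly the counterexample described above; greedy search is not exhaustive, though, so failing to find one would not settle the question.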
Original abstract
Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to reverse-engineer the indirect object identification (IOI) task in GPT-2 small by identifying a circuit of 26 attention heads grouped into 7 classes. The circuit is discovered via causal interventions (activation and path patching) on a curated IOI dataset and validated with three quantitative criteria: faithfulness (the circuit alone reproduces the full model's task performance), completeness (the circuit and the full model behave similarly under knockouts of circuit subsets), and minimality (removing any head from the circuit measurably degrades its performance). The authors note that the criteria support the explanation but also indicate remaining gaps.
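For concreteness, here is a hedged sketch of the single-head activation-patching step described above, written against the Hugging Face GPT-2 implementation rather than the authors' code. The choice of head 9.9 (a name mover) and the corrupted prompt are illustrative, and the two prompts are assumed to tokenize to the same length.

```python
# Hedged sketch: patch one attention head's output from a corrupted run
# into a clean run and measure the effect on the IOI logit difference.
# The hook targets the input of c_proj, where per-head outputs are still
# concatenated (head h occupies dims h*64:(h+1)*64).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, HEAD, HEAD_DIM = 9, 9, 64  # name mover head 9.9 (illustrative)
SL = slice(HEAD * HEAD_DIM, (HEAD + 1) * HEAD_DIM)

clean = tok("When John and Mary went to the store, John gave a drink to",
            return_tensors="pt")
corrupt = tok("When Alice and Bob went to the store, Carol gave a drink to",
              return_tensors="pt")  # assumed same token count as clean

cache = {}
c_proj = model.transformer.h[LAYER].attn.c_proj

def save_hook(module, args):
    cache["z"] = args[0].detach()  # [batch, seq, hidden]

def patch_hook(module, args):
    z = args[0].clone()
    z[..., SL] = cache["z"][..., SL]  # splice in the corrupted head
    return (z,) + args[1:]

h = c_proj.register_forward_pre_hook(save_hook)
with torch.no_grad():
    model(**corrupt)  # cache the head's output on the corrupted prompt
h.remove()

h = c_proj.register_forward_pre_hook(patch_hook)
with torch.no_grad():
    logits = model(**clean).logits[0, -1]
h.remove()

io, s = tok.encode(" Mary")[0], tok.encode(" John")[0]
print("patched logit diff:", (logits[io] - logits[s]).item())
```

Note that the paper's main tool is path patching, which restricts the patch's effect to specific downstream paths; the direct activation patch above is the simpler variant.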
Significance. If the circuit identification holds, the work is significant as the largest end-to-end mechanistic account of a natural language behavior in a transformer. The reliance on causal interventions rather than correlational methods provides direct evidence for head roles, and the explicit use of quantitative criteria (faithfulness, completeness, minimality) sets a replicable standard for future circuit discovery. This bridges small-model toy tasks and broad descriptions of larger models, supporting the feasibility of scaling mechanistic interpretability.
major comments (3)
- [Evaluation section / Abstract] The completeness criterion recovers most accuracy via the 26-head circuit, but the manuscript does not quantify the exact residual error attributable to unpatched components or higher-order interactions outside the circuit (see evaluation section and abstract statement on remaining gaps). This leaves open whether the circuit is complete or merely one sufficient subset.
- [Minimality tests (quantitative criteria section)] The minimality criterion shows performance degradation when ablating heads inside the circuit, but does not compare against alternative partitions of heads (including some labeled non-circuit) or test whether other subsets achieve statistically indistinguishable faithfulness and completeness scores. This undermines the claim that the identified circuit is the minimal explanation rather than one of several possible circuits.
- [Faithfulness evaluation] Faithfulness is demonstrated by patching the circuit, yet the paper does not report variance across patching orders, dataset subsets, or multiple random seeds, nor does it test whether the performance drop is specific to the discovered circuit versus any comparably sized set of heads.
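For reference, the three criteria these comments interrogate can be stated compactly. Writing F(C) for the model's average IOI logit difference when every head outside the set C is mean-ablated, and M for the full set of heads, the paper's definitions are roughly (our paraphrase):

```latex
% Paraphrase of the paper's three criteria; F(C) is the average IOI
% logit difference with all heads outside C mean-ablated, M the full model.
\begin{align*}
\text{Faithfulness:}\;& \bigl|F(C) - F(M)\bigr| \text{ is small;} \\
\text{Completeness:}\;& \bigl|F(C \setminus K) - F(M \setminus K)\bigr|
    \text{ is small for every } K \subseteq C; \\
\text{Minimality:}\;& \text{for each } v \in C \text{ there is }
    K \subseteq C \setminus \{v\} \text{ with } \\
& \bigl|F(C \setminus (K \cup \{v\})) - F(C \setminus K)\bigr| \text{ large.}
\end{align*}
```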
minor comments (2)
- [Circuit diagram figure] The diagram of the 7 head classes would benefit from an accompanying table that explicitly lists each class, its heads, and the functional role assigned to it.
- [Notation and methods] Notation for attention heads (e.g., layer and index) should be standardized in a single table early in the paper to aid readability when referring to the 26 heads.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback on our manuscript. We appreciate the recognition of the significance of our work in providing an end-to-end mechanistic account of the IOI task. We have carefully considered each major comment and made revisions to the manuscript to address the concerns about the evaluation criteria. Our responses are detailed below.
Point-by-point responses
- Referee: The completeness criterion recovers most accuracy via the 26-head circuit, but the manuscript does not quantify the exact residual error attributable to unpatched components or higher-order interactions outside the circuit (see evaluation section and abstract statement on remaining gaps). This leaves open whether the circuit is complete or merely one sufficient subset.
Authors: We agree that quantifying the residual error more precisely would strengthen the completeness analysis. The abstract already states that the criteria point to remaining gaps, indicating the circuit is sufficient but not necessarily complete. In the revised evaluation section, we have added a detailed analysis of the residual performance, including estimates of contributions from unpatched heads and a discussion of potential higher-order interactions based on further ablation experiments. Revision: yes.
- Referee: The minimality criterion shows performance degradation when ablating heads inside the circuit, but does not compare against alternative partitions of heads (including some labeled non-circuit) or test whether other subsets achieve statistically indistinguishable faithfulness and completeness scores. This undermines the claim that the identified circuit is the minimal explanation rather than one of several possible circuits.
Authors: We acknowledge the value of comparing to alternative partitions for a stronger minimality claim. However, a full search over all possible subsets of heads is computationally intractable. In the revised minimality tests section, we have included comparisons to random subsets of comparable size and to select alternative groupings of heads. These show that our circuit performs better on the minimality criterion than the alternatives tested, supporting our identification while noting that other viable circuits cannot be ruled out without exhaustive search. Revision: partial.
- Referee: Faithfulness is demonstrated by patching the circuit, yet the paper does not report variance across patching orders, dataset subsets, or multiple random seeds, nor does it test whether the performance drop is specific to the discovered circuit versus any comparably sized set of heads.
Authors: We thank the referee for this suggestion to improve the robustness of our faithfulness results. The revised manuscript now reports performance metrics averaged over multiple random seeds and across different dataset subsets, including variance measures. Furthermore, we have added experiments comparing the circuit patching to patching random sets of 26 heads, demonstrating that the performance degradation is substantially larger and more consistent for our discovered circuit than for random selections. Revision: yes.
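The random-baseline control in this final response could look like the following hedged sketch; `score_faithfulness` is the same hypothetical patching evaluator assumed earlier, and the circuit list is abbreviated to three of the paper's name mover heads.

```python
# Hypothetical sketch of the random-subset control: compare the
# discovered circuit against many random head sets of the same size.
# `score_faithfulness(heads)` is an assumed evaluator, not the paper's code.
import random
import statistics
from itertools import product

ALL_HEADS = list(product(range(12), range(12)))
CIRCUIT = [(9, 9), (9, 6), (10, 0)]  # abbreviated; the paper lists 26 heads
SUBSET_SIZE = 26  # size-matched to the full circuit

def random_baseline(score_faithfulness, n_trials=100, seed=0):
    rng = random.Random(seed)
    scores = [
        score_faithfulness(rng.sample(ALL_HEADS, SUBSET_SIZE))
        for _ in range(n_trials)
    ]
    return {
        "circuit": score_faithfulness(CIRCUIT),
        "random_mean": statistics.mean(scores),
        "random_stdev": statistics.stdev(scores),
    }
```

Reporting the circuit's score alongside the random mean and standard deviation is what would substantiate the claim that the degradation is specific to the discovered heads.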
Circularity Check
No circularity: circuit discovered and validated via independent causal interventions
Full rationale
The paper identifies the 26-head IOI circuit through causal interventions (activation and path patching) on GPT-2 small activations, then validates it on a curated dataset with faithfulness (the circuit alone reproduces the model's performance), completeness (circuit and model behave alike under subset knockouts), and minimality (every head in the circuit matters). These steps are empirical measurements of the model's own behavior rather than fitted parameters or self-referential definitions. No equation reduces a 'prediction' to an input by construction, no uniqueness theorem is imported from self-citations, and no ansatz is quietly renamed into a result. The abstract explicitly notes remaining gaps, confirming that the criteria support, but do not tautologically define, the result. The validation chain rests on independent measurements rather than self-reference.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: causal interventions on attention heads reveal their functional roles in the computation.
Forward citations
Cited by 28 Pith papers
- Dissecting Jet-Tagger Through Mechanistic Interpretability. A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
- Progress measures for grokking via mechanistic interpretability. Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
- GKnow: Measuring the Entanglement of Gender Bias and Factual Gender. Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.
- Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining. Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
- Cell-Based Representation of Relational Binding in Language Models. Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
- Grokking of Diffusion Models: Case Study on Modular Addition. Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
- CURE: Circuit-Aware Unlearning for LLM-based Recommendation. CURE disentangles LLM recommendation circuits into forget-specific, retain-specific, and task-shared modules with tailored update rules to achieve more effective unlearning than weighted baselines.
- Eliciting Latent Predictions from Transformers with the Tuned Lens. Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
- How to Interpret Agent Behavior. ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces. A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- Not How Many, But Which: Parameter Placement in Low-Rank Adaptation. Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
- The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations. Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
- Architecture, Not Scale: Circuit Localization in Large Language Models. Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
- Tool Calling is Linearly Readable and Steerable in Language Models. Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions. Future-rhyme information is linearly decodable at line boundaries across model families and strengthens with scale, yet only Gemma-3-27B causally depends on it, with the driver migrating to the boundary around layer 3...
- Hallucination Detection via Activations of Open-Weight Proxy Analyzers. A framework using activation-based features from small open-weight proxy models detects LLM hallucinations with higher AUC than ReDeEP on RAGTruth, performing consistently across seven analyzer architectures.
- The Position Curse: LLMs Struggle to Locate the Last Few Items in a List. LLMs exhibit the Position Curse, with backward position retrieval in lists lagging far behind forward retrieval, showing only partial gains from PosBench fine-tuning.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models. LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.
- Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs. Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while pr...
- The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference. FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
- Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings. Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
- PhiNet: Speaker Verification with Phonetic Interpretability. PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.
- Negative Before Positive: Asymmetric Valence Processing in Large Language Models. Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
- Graph Memory Transformer (GMT). Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding an 82M-parameter decoder-only LM that trains stably but trails a 103M de...
- Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs. HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions th...
- Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. Gemma Scope supplies trained sparse autoencoders for all layers of Gemma 2 2B and 9B plus select 27B layers, with public weights and benchmark scores.
- Speaking of Language: Reflections on Metalanguage Research in NLP. This reflection paper highlights metalanguage in NLP, links it to LLMs, and lists understudied future directions.
- High-Dimensional Statistics: Reflections on Progress and Open Problems. A survey synthesizing representative advances, common themes, and open problems in high-dimensional statistics while pointing to key entry-point works.