pith · machine review for the scientific record

arxiv: 2403.19647 · v3 · submitted 2024-03-28 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 3 theorem links

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 13:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sparse feature circuits · mechanistic interpretability · language models · causal graphs · feature ablation · unsupervised discovery · model editing

The pith

Sparse feature circuits map language model behaviors to causally implicated networks of human-interpretable features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces sparse feature circuits as subnetworks built from fine-grained interpretable features rather than polysemantic neurons or attention heads. These circuits are shown to be causally linked to specific model behaviors through interventions. The authors apply them in a method called SHIFT to improve a classifier by removing features judged irrelevant by a human. They also present an unsupervised pipeline that automatically finds thousands of such circuits across model behaviors.
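
To make the mechanic concrete, here is a minimal sketch of the kind of intervention the paper builds on: encode a model activation with a sparse autoencoder (SAE), zero a chosen set of features, and decode the result back into the residual stream. The architecture, shapes, and feature indices below are illustrative assumptions for this review, not the paper's implementation.

```python
# Minimal sketch of SAE feature ablation at one residual-stream site.
# Names and shapes are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x):
        return torch.relu(self.encoder(x))  # sparse, human-interpretable features

def ablate_features(sae, x, feature_ids):
    """Zero selected SAE features and reconstruct the activation."""
    f = sae.encode(x)
    f[..., feature_ids] = 0.0               # the causal intervention
    return sae.decoder(f)                   # patched activation re-enters the model

sae = SparseAutoencoder(d_model=512, n_features=16384)
x = torch.randn(1, 8, 512)                  # (batch, seq, d_model) activations
x_ablated = ablate_features(sae, x, feature_ids=[3, 41, 900])
```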

Core claim

Sparse feature circuits are causally implicated subnetworks of human-interpretable features that explain language model behaviors. Unlike earlier circuits built from polysemantic units, these circuits support detailed mechanistic understanding of unanticipated behaviors and enable direct editing through ablation.

What carries the argument

Sparse feature circuits, defined as causally implicated subnetworks composed of fine-grained human-interpretable features, replace polysemantic units to carry causal explanations and support interventions such as ablation.

If this is right

  • Model behaviors can be explained at the level of individual interpretable features instead of opaque units.
  • Ablating task-irrelevant features improves generalization of downstream classifiers.
  • Thousands of circuits can be discovered automatically without human supervision for many model behaviors (an attribution-style scoring sketch follows this list).
  • Causal editing becomes feasible for unanticipated mechanisms inside the model.
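
One standard way such discovery pipelines score candidate features is a linear approximation of each feature's indirect effect on a behavior metric, in the spirit of attribution patching. The sketch below is illustrative of that family of estimators, not the paper's exact method; `metric` and the feature tensors are assumptions.

```python
# Sketch of attribution-style importance scoring for SAE features:
# IE_hat(f) ≈ (f_patch - f_clean) · d metric / d f, a first-order
# approximation of each feature's indirect effect at the clean run.
import torch

def approx_indirect_effects(metric, f_clean, f_patch):
    """metric: callable from feature activations to a scalar (e.g., a logit difference)."""
    f = f_clean.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(metric(f), f)
    return (f_patch - f_clean) * grad  # per-feature importance scores
```

Features whose scores pass a threshold would be kept as circuit nodes; the rest are pruned.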

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the circuits prove stable across different prompts, they could support persistent model edits that survive retraining.
  • The same discovery process might be applied to detect and isolate circuits tied to undesirable outputs such as hallucinations.
  • Scaling the pipeline could produce a partial wiring diagram of the entire model for targeted capability control.
  • Combining these circuits with activation patching might reveal how features interact across layers (a patching sketch follows this list).
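
A minimal sketch of the activation-patching step imagined in the last bullet, assuming a HuggingFace-style model whose hooked module returns a plain tensor; the layer choice, token positions, and comparison metric are illustrative assumptions.

```python
# Sketch of activation patching: cache activations from a clean run, splice
# them into a corrupted run at chosen positions, and compare the logits.
import torch

def run_with_patch(model, layer, clean_ids, corrupt_ids, positions):
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        clean_logits = model(clean_ids).logits
    handle.remove()

    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, positions] = cache["clean"][:, positions]
        return patched                       # returned value replaces the output

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_ids).logits
    handle.remove()
    return clean_logits, patched_logits      # a large logit shift implicates the site
```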

Load-bearing premise

The extracted features are reliably human-interpretable and interventions on them produce the claimed behavioral changes without new unintended effects.

What would settle it

A controlled test in which human judges rate the features as uninterpretable or in which ablating the identified features fails to improve classifier generalization on held-out data would falsify the central claims.
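
The second arm of that test could be run roughly as below: fit a linear probe on SAE feature activations, zero the human-flagged features, and check whether held-out accuracy actually improves. The feature matrices and the flagged index list are hypothetical stand-ins, not the paper's data.

```python
# Sketch of the held-out falsification test, assuming the classifier is a
# linear probe over SAE features (F_* are (n_samples, n_features) arrays).
import numpy as np
from sklearn.linear_model import LogisticRegression

def shift_style_test(F_train, y_train, F_heldout, y_heldout, irrelevant):
    base = LogisticRegression(max_iter=1000).fit(F_train, y_train)
    base_acc = base.score(F_heldout, y_heldout)

    F_tr = F_train.copy()
    F_tr[:, irrelevant] = 0.0                # ablate human-judged irrelevant features
    F_ho = F_heldout.copy()
    F_ho[:, irrelevant] = 0.0
    edited = LogisticRegression(max_iter=1000).fit(F_tr, y_train)
    edited_acc = edited.score(F_ho, y_heldout)

    # The central claim predicts edited_acc > base_acc on the shifted
    # held-out set; the opposite outcome would count against it.
    return base_acc, edited_acc
```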

read the original abstract

We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces sparse feature circuits as causally implicated subnetworks of human-interpretable features extracted via sparse autoencoders, contrasting them with prior circuits based on polysemantic neurons or attention heads. It presents methods for their discovery and applies them in the SHIFT task to improve classifier generalization by ablating human-judged task-irrelevant features, while also demonstrating an unsupervised scalable pipeline that identifies thousands of such circuits for automatically discovered model behaviors.

Significance. If the causal claims and quantitative results hold, the work would advance mechanistic interpretability by shifting from coarse, polysemantic units to finer-grained interpretable features, enabling more precise causal analysis, editing, and scalable unsupervised pipelines for understanding LM behaviors.

major comments (2)
  1. [§4] SHIFT evaluation: The central claim that ablating human-judged irrelevant features improves generalization without unintended effects is load-bearing for the editing application, yet the manuscript provides insufficient controls for residual correlations or incomplete disentanglement in the underlying SAEs; ablation effects could propagate indirectly, confounding the reported gains. Include ablation specificity metrics (e.g., change in other feature activations) and comparison to random or correlated-feature baselines (a sketch of such a control follows this report).
  2. [§3] Circuit discovery and causality validation: The assertion that sparse feature circuits are 'causally implicated' relies on interventions whose isolation is not fully demonstrated; given residual polysemanticity in SAE features, provide explicit tests (e.g., do-no-harm checks on unrelated behaviors or mutual information between features) to rule out confounding before claiming detailed understanding of unanticipated mechanisms.
minor comments (2)
  1. [§2] Clarify notation for feature activation thresholds and circuit extraction criteria in the methods; inconsistent use of 'sparse' vs. 'interpretable' risks ambiguity.
  2. [§4, §5] Add error bars, statistical significance, and exact dataset sizes to all quantitative results in the SHIFT and unsupervised pipeline sections.
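
A sketch of the specificity control requested in major comment 1: measure how much ablating the target features perturbs the remaining features once the edited activation round-trips through the SAE, and draw an equally sized random ablation set as a baseline. The `encode`/`decode` callables and the activation matrix are assumptions for illustration.

```python
# Sketch of an ablation-specificity metric plus a random-feature baseline.
# `features` is an (n_samples, n_features) matrix of SAE activations.
import numpy as np

def ablation_specificity(features, target_ids, decode, encode):
    """Mean L2 change in non-target features after ablating the targets."""
    ablated = features.copy()
    ablated[:, target_ids] = 0.0
    reencoded = encode(decode(ablated))      # features after the edit re-enters the model
    others = np.setdiff1d(np.arange(features.shape[1]), target_ids)
    drift = reencoded[:, others] - features[:, others]
    return np.linalg.norm(drift, axis=1).mean()

def random_feature_baseline(n_features, n_targets, seed=0):
    """Equally sized random ablation set for the baseline comparison."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_features, size=n_targets, replace=False)
```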

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address the major concerns regarding the SHIFT evaluation and the causality validation in circuit discovery below. We agree that additional controls and tests will strengthen the manuscript and plan to incorporate them in the revised version.

read point-by-point responses
  1. Referee: [§4] SHIFT evaluation: The central claim that ablating human-judged irrelevant features improves generalization without unintended effects is load-bearing for the editing application, yet the manuscript provides insufficient controls for residual correlations or incomplete disentanglement in the underlying SAEs; ablation effects could propagate indirectly, confounding the reported gains. Include ablation specificity metrics (e.g., change in other feature activations) and comparison to random or correlated-feature baselines.

    Authors: We recognize the importance of demonstrating that the ablations in SHIFT are specific and do not lead to unintended effects through residual correlations in the SAEs. The original experiments used human judgment to select irrelevant features and showed generalization improvements, but we agree that more rigorous controls are needed. In the revised manuscript, we will include ablation specificity metrics, such as the change in activation of other features when ablating the selected ones, to show minimal interference. Additionally, we will add baselines comparing to random feature ablations and ablations of features that are correlated with the irrelevant ones. This will help confirm that the gains are due to the targeted ablations. revision: yes

  2. Referee: [§3] Circuit discovery and causality validation: The assertion that sparse feature circuits are 'causally implicated' relies on interventions whose isolation is not fully demonstrated; given residual polysemanticity in SAE features, provide explicit tests (e.g., do-no-harm checks on unrelated behaviors or mutual information between features) to rule out confounding before claiming detailed understanding of unanticipated mechanisms.

    Authors: We appreciate the referee pointing out the need for stronger evidence of intervention isolation, especially considering potential polysemanticity in SAEs. Our method identifies circuits by finding features that causally affect the behavior via patching experiments, and we show that these circuits explain unanticipated mechanisms. To address the concern, we will add explicit do-no-harm checks in the revised §3, where we test that ablating the discovered circuits does not harm performance on unrelated tasks or behaviors. We will also compute and report metrics such as mutual information between the features in the circuit to assess their independence. These additions will provide better support for the causal claims. revision: yes
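
A sketch of the independence check proposed in response 2: estimate pairwise mutual information between the activations of features in a discovered circuit. The quantile binning scheme and the activation matrix `F` are assumptions for illustration.

```python
# Sketch of a pairwise mutual-information check between circuit features.
# `F` is an (n_samples, n_circuit_features) matrix of feature activations.
import numpy as np
from sklearn.metrics import mutual_info_score

def pairwise_mi(F, n_bins=16):
    n = F.shape[1]
    # Discretize each feature's activations into quantile bins
    # (np.unique guards against duplicate edges from sparse features).
    edges = [np.unique(np.quantile(F[:, j], np.linspace(0, 1, n_bins + 1)[1:-1]))
             for j in range(n)]
    binned = [np.digitize(F[:, j], edges[j]) for j in range(n)]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_info_score(binned[i], binned[j])
    return mi  # small off-diagonal values support treating features as independent
```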

Circularity Check

0 steps flagged

No circularity: methodological pipeline is self-contained with no derivations reducing to inputs

full rationale

The paper presents an empirical methodology for discovering sparse feature circuits via SAEs and applying them in SHIFT ablations, without any equations, first-principles derivations, or predictions that reduce by construction to fitted parameters or self-citations. Claims rest on external validation through human interpretability judgments and measured generalization improvements, which are falsifiable outside the fitted values. No load-bearing self-citation chains or ansatz smuggling appear in the provided text; the unsupervised pipeline and causal editing steps are independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review uses abstract only; no explicit free parameters, axioms, or invented entities beyond the core concept are stated. Sparse feature circuits are treated as the primary new construct.

invented entities (1)
  • sparse feature circuits · no independent evidence
    purpose: Causally implicated subnetworks of human-interpretable features for explaining language model behaviors
    Introduced as the central new object in the abstract; no independent evidence provided within the abstract

pith-pipeline@v0.9.0 · 5431 in / 1147 out tokens · 76223 ms · 2026-05-13T13:09:39.264755+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  3. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  4. When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

  5. GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

    cs.CL 2026-05 unverdicted novelty 7.0

    Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

  6. fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

    cs.LG 2026-05 conditional novelty 7.0

    fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...

  7. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  8. A framework for analyzing concept representations in neural models

    cs.CL 2026-05 unverdicted novelty 7.0

    A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...

  9. Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

    cs.CV 2026-04 unverdicted novelty 7.0

    Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

  10. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  11. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  12. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  13. Why Retrieval-Augmented Generation Fails: A Graph Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

  14. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  15. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  16. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...

  17. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

  18. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  19. From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

    cs.AI 2026-05 conditional novelty 6.0

    Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

  20. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.

  21. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  22. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  23. What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

    cs.LG 2026-04 unverdicted novelty 6.0

    Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.

  24. Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

    cs.LG 2026-04 unverdicted novelty 6.0

    Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

  25. The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

    cs.CL 2026-03 unverdicted novelty 6.0

    Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.

  26. Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    cs.CL 2025-07 unverdicted novelty 6.0

    Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.

  27. Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

    cs.LG 2026-05 unverdicted novelty 5.0

    Feature rivalry in SAE representations strengthens with model uncertainty on high-entropy questions, enables output steering, and predicts answer correctness with AUROC 0.689 in Gemma-2-2B.

  28. Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

    cs.CL 2026-05 unverdicted novelty 4.0

    Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 26 Pith papers · 4 internal anchors
