pith · machine review for the scientific record

arxiv: 2403.19647 · v3 · submitted 2024-03-28 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 3 theorem links

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 13:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sparse feature circuits · mechanistic interpretability · language models · causal graphs · feature ablation · unsupervised discovery · model editing

The pith

Sparse feature circuits map language model behaviors to causally implicated networks of human-interpretable features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces sparse feature circuits as subnetworks built from fine-grained interpretable features rather than polysemantic neurons or attention heads. These circuits are shown to be causally linked to specific model behaviors through interventions. The authors apply them in a method called SHIFT to improve a classifier by removing features judged irrelevant by a human. They also present an unsupervised pipeline that automatically finds thousands of such circuits across model behaviors.
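
To make the mechanic concrete, here is a minimal sketch of the kind of intervention the paper builds on: encode a model activation with a sparse autoencoder (SAE), zero a chosen set of features, and decode the result back into the residual stream. The architecture, shapes, and feature indices below are illustrative assumptions for this review, not the paper's implementation.

```python
# Minimal sketch of SAE feature ablation at one residual-stream site.
# Names and shapes are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x):
        return torch.relu(self.encoder(x))  # sparse, human-interpretable features

def ablate_features(sae, x, feature_ids):
    """Zero selected SAE features and reconstruct the activation."""
    f = sae.encode(x)
    f[..., feature_ids] = 0.0               # the causal intervention
    return sae.decoder(f)                   # patched activation re-enters the model

sae = SparseAutoencoder(d_model=512, n_features=16384)
x = torch.randn(1, 8, 512)                  # (batch, seq, d_model) activations
x_ablated = ablate_features(sae, x, feature_ids=[3, 41, 900])
```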

Core claim

Sparse feature circuits are causally implicated subnetworks of human-interpretable features that explain language model behaviors. Unlike earlier circuits built from polysemantic units, these circuits support detailed mechanistic understanding of unanticipated behaviors and enable direct editing through ablation.

What carries the argument

Sparse feature circuits, defined as causally implicated subnetworks composed of fine-grained human-interpretable features, replace polysemantic units to carry causal explanations and support interventions such as ablation.

If this is right

  • Model behaviors can be explained at the level of individual interpretable features instead of opaque units.
  • Ablating task-irrelevant features improves generalization of downstream classifiers.
  • Thousands of circuits can be discovered automatically without human supervision for many model behaviors (an attribution-style scoring sketch follows this list).
  • Causal editing becomes feasible for unanticipated mechanisms inside the model.
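
One standard way such discovery pipelines score candidate features is a linear approximation of each feature's indirect effect on a behavior metric, in the spirit of attribution patching. The sketch below is illustrative of that family of estimators, not the paper's exact method; `metric` and the feature tensors are assumptions.

```python
# Sketch of attribution-style importance scoring for SAE features:
# IE_hat(f) ≈ (f_patch - f_clean) · d metric / d f, a first-order
# approximation of each feature's indirect effect at the clean run.
import torch

def approx_indirect_effects(metric, f_clean, f_patch):
    """metric: callable from feature activations to a scalar (e.g., a logit difference)."""
    f = f_clean.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(metric(f), f)
    return (f_patch - f_clean) * grad  # per-feature importance scores
```

Features whose scores pass a threshold would be kept as circuit nodes; the rest are pruned.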

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the circuits prove stable across different prompts, they could support persistent model edits that survive retraining.
  • The same discovery process might be applied to detect and isolate circuits tied to undesirable outputs such as hallucinations.
  • Scaling the pipeline could produce a partial wiring diagram of the entire model for targeted capability control.
  • Combining these circuits with activation patching might reveal how features interact across layers (a patching sketch follows this list).
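
A minimal sketch of the activation-patching step imagined in the last bullet, assuming a HuggingFace-style model whose hooked module returns a plain tensor; the layer choice, token positions, and comparison metric are illustrative assumptions.

```python
# Sketch of activation patching: cache activations from a clean run, splice
# them into a corrupted run at chosen positions, and compare the logits.
import torch

def run_with_patch(model, layer, clean_ids, corrupt_ids, positions):
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        clean_logits = model(clean_ids).logits
    handle.remove()

    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, positions] = cache["clean"][:, positions]
        return patched                       # returned value replaces the output

    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(corrupt_ids).logits
    handle.remove()
    return clean_logits, patched_logits      # a large logit shift implicates the site
```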

Load-bearing premise

The extracted features are reliably human-interpretable and interventions on them produce the claimed behavioral changes without new unintended effects.

What would settle it

A controlled test in which human judges rate the features as uninterpretable or in which ablating the identified features fails to improve classifier generalization on held-out data would falsify the central claims.
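
The second arm of that test could be run roughly as below: fit a linear probe on SAE feature activations, zero the human-flagged features, and check whether held-out accuracy actually improves. The feature matrices and the flagged index list are hypothetical stand-ins, not the paper's data.

```python
# Sketch of the held-out falsification test, assuming the classifier is a
# linear probe over SAE features (F_* are (n_samples, n_features) arrays).
import numpy as np
from sklearn.linear_model import LogisticRegression

def shift_style_test(F_train, y_train, F_heldout, y_heldout, irrelevant):
    base = LogisticRegression(max_iter=1000).fit(F_train, y_train)
    base_acc = base.score(F_heldout, y_heldout)

    F_tr = F_train.copy()
    F_tr[:, irrelevant] = 0.0                # ablate human-judged irrelevant features
    F_ho = F_heldout.copy()
    F_ho[:, irrelevant] = 0.0
    edited = LogisticRegression(max_iter=1000).fit(F_tr, y_train)
    edited_acc = edited.score(F_ho, y_heldout)

    # The central claim predicts edited_acc > base_acc on the shifted
    # held-out set; the opposite outcome would count against it.
    return base_acc, edited_acc
```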

read the original abstract

We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces sparse feature circuits as causally implicated subnetworks of human-interpretable features extracted via sparse autoencoders, contrasting them with prior circuits based on polysemantic neurons or attention heads. It presents methods for their discovery and applies them in the SHIFT task to improve classifier generalization by ablating human-judged task-irrelevant features, while also demonstrating an unsupervised scalable pipeline that identifies thousands of such circuits for automatically discovered model behaviors.

Significance. If the causal claims and quantitative results hold, the work would advance mechanistic interpretability by shifting from coarse, polysemantic units to finer-grained interpretable features, enabling more precise causal analysis, editing, and scalable unsupervised pipelines for understanding LM behaviors.

major comments (2)
  1. [§4] SHIFT evaluation: The central claim that ablating human-judged irrelevant features improves generalization without unintended effects is load-bearing for the editing application, yet the manuscript provides insufficient controls for residual correlations or incomplete disentanglement in the underlying SAEs; ablation effects could propagate indirectly, confounding the reported gains. Include ablation specificity metrics (e.g., change in other feature activations) and comparison to random or correlated-feature baselines (a sketch of such a control follows this report).
  2. [§3] Circuit discovery and causality validation: The assertion that sparse feature circuits are 'causally implicated' relies on interventions whose isolation is not fully demonstrated; given residual polysemanticity in SAE features, provide explicit tests (e.g., do-no-harm checks on unrelated behaviors or mutual information between features) to rule out confounding before claiming detailed understanding of unanticipated mechanisms.
minor comments (2)
  1. [§2] Clarify notation for feature activation thresholds and circuit extraction criteria in the methods; inconsistent use of 'sparse' vs. 'interpretable' risks ambiguity.
  2. [§4, §5] Add error bars, statistical significance, and exact dataset sizes to all quantitative results in the SHIFT and unsupervised pipeline sections.
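
A sketch of the specificity control requested in major comment 1: measure how much ablating the target features perturbs the remaining features once the edited activation round-trips through the SAE, and draw an equally sized random ablation set as a baseline. The `encode`/`decode` callables and the activation matrix are assumptions for illustration.

```python
# Sketch of an ablation-specificity metric plus a random-feature baseline.
# `features` is an (n_samples, n_features) matrix of SAE activations.
import numpy as np

def ablation_specificity(features, target_ids, decode, encode):
    """Mean L2 change in non-target features after ablating the targets."""
    ablated = features.copy()
    ablated[:, target_ids] = 0.0
    reencoded = encode(decode(ablated))      # features after the edit re-enters the model
    others = np.setdiff1d(np.arange(features.shape[1]), target_ids)
    drift = reencoded[:, others] - features[:, others]
    return np.linalg.norm(drift, axis=1).mean()

def random_feature_baseline(n_features, n_targets, seed=0):
    """Equally sized random ablation set for the baseline comparison."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_features, size=n_targets, replace=False)
```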

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address the major concerns regarding the SHIFT evaluation and the causality validation in circuit discovery below. We agree that additional controls and tests will strengthen the manuscript and plan to incorporate them in the revised version.

read point-by-point responses
  1. Referee: [§4] SHIFT evaluation: The central claim that ablating human-judged irrelevant features improves generalization without unintended effects is load-bearing for the editing application, yet the manuscript provides insufficient controls for residual correlations or incomplete disentanglement in the underlying SAEs; ablation effects could propagate indirectly, confounding the reported gains. Include ablation specificity metrics (e.g., change in other feature activations) and comparison to random or correlated-feature baselines.

    Authors: We recognize the importance of demonstrating that the ablations in SHIFT are specific and do not lead to unintended effects through residual correlations in the SAEs. The original experiments used human judgment to select irrelevant features and showed generalization improvements, but we agree that more rigorous controls are needed. In the revised manuscript, we will include ablation specificity metrics, such as the change in activation of other features when ablating the selected ones, to show minimal interference. Additionally, we will add baselines comparing to random feature ablations and ablations of features that are correlated with the irrelevant ones. This will help confirm that the gains are due to the targeted ablations. revision: yes

  2. Referee: [§3] Circuit discovery and causality validation: The assertion that sparse feature circuits are 'causally implicated' relies on interventions whose isolation is not fully demonstrated; given residual polysemanticity in SAE features, provide explicit tests (e.g., do-no-harm checks on unrelated behaviors or mutual information between features) to rule out confounding before claiming detailed understanding of unanticipated mechanisms.

    Authors: We appreciate the referee pointing out the need for stronger evidence of intervention isolation, especially considering potential polysemanticity in SAEs. Our method identifies circuits by finding features that causally affect the behavior via patching experiments, and we show that these circuits explain unanticipated mechanisms. To address the concern, we will add explicit do-no-harm checks in the revised §3, where we test that ablating the discovered circuits does not harm performance on unrelated tasks or behaviors. We will also compute and report metrics such as mutual information between the features in the circuit to assess their independence. These additions will provide better support for the causal claims. revision: yes
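
A sketch of the independence check proposed in response 2: estimate pairwise mutual information between the activations of features in a discovered circuit. The quantile binning scheme and the activation matrix `F` are assumptions for illustration.

```python
# Sketch of a pairwise mutual-information check between circuit features.
# `F` is an (n_samples, n_circuit_features) matrix of feature activations.
import numpy as np
from sklearn.metrics import mutual_info_score

def pairwise_mi(F, n_bins=16):
    n = F.shape[1]
    # Discretize each feature's activations into quantile bins
    # (np.unique guards against duplicate edges from sparse features).
    edges = [np.unique(np.quantile(F[:, j], np.linspace(0, 1, n_bins + 1)[1:-1]))
             for j in range(n)]
    binned = [np.digitize(F[:, j], edges[j]) for j in range(n)]
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            mi[i, j] = mi[j, i] = mutual_info_score(binned[i], binned[j])
    return mi  # small off-diagonal values support treating features as independent
```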

Circularity Check

0 steps flagged

No circularity: methodological pipeline is self-contained with no derivations reducing to inputs

full rationale

The paper presents an empirical methodology for discovering sparse feature circuits via SAEs and applying them in SHIFT ablations, without any equations, first-principles derivations, or predictions that reduce by construction to fitted parameters or self-citations. Claims rest on external validation through human interpretability judgments and measured generalization improvements, which are falsifiable outside the fitted values. No load-bearing self-citation chains or ansatz smuggling appear in the provided text; the unsupervised pipeline and causal editing steps are independent of the target results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review uses abstract only; no explicit free parameters, axioms, or invented entities beyond the core concept are stated. Sparse feature circuits are treated as the primary new construct.

invented entities (1)
  • sparse feature circuits · no independent evidence
    purpose: Causally implicated subnetworks of human-interpretable features for explaining language model behaviors
    Introduced as the central new object in the abstract; no independent evidence provided within the abstract

pith-pipeline@v0.9.0 · 5431 in / 1147 out tokens · 76223 ms · 2026-05-13T13:09:39.264755+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.LawOfExistence defect_zero_iff_one · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE decomposes recurrent model cache writes into substitutable atoms with a closed-form logit shift, achieving high substitution success and targeted behavioral installs on models like Qwen3.5 and Mamba-2.

  2. WriteSAE: Sparse Autoencoders for Recurrent State

    cs.LG 2026-05 unverdicted novelty 8.0

    WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.

  3. Crafting Reversible SFT Behaviors in Large Language Models

    cs.LG 2026-05 unverdicted novelty 8.0

    LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

  4. When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

  5. GKnow: Measuring the Entanglement of Gender Bias and Factual Gender

    cs.CL 2026-05 unverdicted novelty 7.0

    Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.

  6. fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

    cs.LG 2026-05 conditional novelty 7.0

    fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...

  7. What Cohort INRs Encode and Where to Freeze Them

    cs.LG 2026-05 unverdicted novelty 7.0

    Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

  8. A framework for analyzing concept representations in neural models

    cs.CL 2026-05 unverdicted novelty 7.0

    A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...

  9. Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

    cs.CV 2026-04 unverdicted novelty 7.0

    Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.

  10. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  11. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  12. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  13. Why Retrieval-Augmented Generation Fails: A Graph Perspective

    cs.CL 2026-05 unverdicted novelty 6.0

    Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.

  14. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  15. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  16. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...

  17. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

  18. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  19. From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features

    cs.AI 2026-05 conditional novelty 6.0

    Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.

  20. The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.

  21. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  22. Towards Understanding the Robustness of Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 6.0

    Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.

  23. What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

    cs.LG 2026-04 unverdicted novelty 6.0

    Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.

  24. Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

    cs.LG 2026-04 unverdicted novelty 6.0

    Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.

  25. The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

    cs.CL 2026-03 unverdicted novelty 6.0

    Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.

  26. Persona Vectors: Monitoring and Controlling Character Traits in Language Models

    cs.CL 2025-07 unverdicted novelty 6.0

    Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.

  27. Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

    cs.LG 2026-05 unverdicted novelty 5.0

    Feature rivalry in SAE representations strengthens with model uncertainty on high-entropy questions, enables output steering, and predicts answer correctness with AUROC 0.689 in Gemma-2-2B.

  28. Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

    cs.CL 2026-05 unverdicted novelty 4.0

    Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 26 Pith papers · 4 internal anchors
