Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Pith reviewed 2026-05-13 13:09 UTC · model grok-4.3
The pith
Sparse feature circuits map language model behaviors to causally implicated networks of human-interpretable features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sparse feature circuits are causally implicated subnetworks of human-interpretable features that explain language model behaviors. Unlike earlier circuits built from polysemantic units, these circuits support detailed mechanistic understanding of unanticipated behaviors and enable direct editing through ablation.
What carries the argument
Sparse feature circuits, defined as causally implicated subnetworks composed of fine-grained human-interpretable features, replace polysemantic units to carry causal explanations and support interventions such as ablation.
If this is right
- Model behaviors can be explained at the level of individual interpretable features instead of opaque units.
- Ablating task-irrelevant features improves generalization of downstream classifiers.
- Thousands of circuits can be discovered automatically without human supervision for many model behaviors.
- Causal editing becomes feasible for unanticipated mechanisms inside the model.
Where Pith is reading between the lines
- If the circuits prove stable across different prompts, they could support persistent model edits that survive retraining.
- The same discovery process might be applied to detect and isolate circuits tied to undesirable outputs such as hallucinations.
- Scaling the pipeline could produce a partial wiring diagram of the entire model for targeted capability control.
- Combining these circuits with activation patching might reveal how features interact across layers.
Load-bearing premise
The extracted features are reliably human-interpretable and interventions on them produce the claimed behavioral changes without new unintended effects.
What would settle it
A controlled test in which human judges rate the features as uninterpretable or in which ablating the identified features fails to improve classifier generalization on held-out data would falsify the central claims.
Original abstract
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.
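The abstract does not spell out how a feature is ablated. As a minimal sketch, assume an activation is written as a sparse combination of decoder directions plus a reconstruction error term. Under that assumption, zero-ablating a feature judged task-irrelevant could look like the toy example below. The decomposition, the retained error term, and the names `reconstruct`, `ablate`, `decoder`, and `error` are illustrative choices, not taken from the paper.

```python
from typing import List, Sequence, Set

Vector = List[float]


def reconstruct(acts: Sequence[float], decoder: Sequence[Vector], error: Vector) -> Vector:
    """Return sum_i acts[i] * decoder[i] + error (assumed decomposition)."""
    out = list(error)
    for a, d in zip(acts, decoder):
        for j, dj in enumerate(d):
            out[j] += a * dj
    return out


def ablate(acts: Sequence[float], drop: Set[int]) -> List[float]:
    """Zero the activations of the features whose indices are in `drop`."""
    return [0.0 if i in drop else a for i, a in enumerate(acts)]


# Toy data: 3 hypothetical features in a 2-dimensional activation space.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
acts = [2.0, 0.5, 1.0]
error = [0.1, -0.1]

original = reconstruct(acts, decoder, error)
# Suppose a human judged feature 2 to be task-irrelevant.
edited = reconstruct(ablate(acts, {2}), decoder, error)
print(original, edited)
```

In this sketch, a downstream classifier would read `edited` instead of `original`. Keeping the error term fixed means only the chosen feature's contribution is removed.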
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces sparse feature circuits as causally implicated subnetworks of human-interpretable features extracted via sparse autoencoders, contrasting them with prior circuits based on polysemantic neurons or attention heads. It presents methods for their discovery and applies them in the SHIFT task to improve classifier generalization by ablating human-judged task-irrelevant features, while also demonstrating an unsupervised scalable pipeline that identifies thousands of such circuits for automatically discovered model behaviors.
Significance. If the causal claims and quantitative results hold, the work would advance mechanistic interpretability by shifting from coarse, polysemantic units to finer-grained interpretable features, enabling more precise causal analysis, editing, and scalable unsupervised pipelines for understanding LM behaviors.
major comments (2)
- [§4] §4 (SHIFT evaluation): The central claim that ablating human-judged irrelevant features improves generalization without unintended effects is load-bearing for the editing application, yet the manuscript provides insufficient controls for residual correlations or incomplete disentanglement in the underlying SAEs; ablation effects could propagate indirectly, confounding the reported gains. Include ablation specificity metrics (e.g., change in other feature activations) and comparison to random or correlated-feature baselines.
- [§3] §3 (circuit discovery and causality validation): The assertion that sparse feature circuits are 'causally implicated' relies on interventions whose isolation is not fully demonstrated; given known SAE polysemanticity, provide explicit tests (e.g., do-no-harm checks on unrelated behaviors or mutual information between features) to rule out confounding before claiming detailed understanding of unanticipated mechanisms.
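The random-feature baseline requested in the first major comment could be set up along the following lines. This is a sketch under stated assumptions: `evaluate` stands in for whatever held-out metric the authors report, `toy_evaluate` is a made-up scoring rule used only to make the example runnable, and none of the names come from the paper.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def random_ablation_baseline(
    evaluate: Callable[[FrozenSet[int]], float],
    n_features: int,
    targeted: FrozenSet[int],
    n_trials: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Compare a targeted ablation with random ablations of the same size."""
    rng = random.Random(seed)
    # Draw random sets from features outside the targeted set.
    pool = [i for i in range(n_features) if i not in targeted]
    targeted_score = evaluate(targeted)
    random_scores: List[float] = [
        evaluate(frozenset(rng.sample(pool, len(targeted))))
        for _ in range(n_trials)
    ]
    at_least = sum(s >= targeted_score for s in random_scores)
    return {
        "targeted": targeted_score,
        "random_mean": sum(random_scores) / n_trials,
        "frac_random_at_least_as_good": at_least / n_trials,
    }


def toy_evaluate(ablated: FrozenSet[int]) -> float:
    """Hypothetical metric: each spurious feature removed adds 0.10."""
    spurious = {3, 7, 11}
    return 0.70 + 0.10 * len(spurious & ablated)


result = random_ablation_baseline(
    toy_evaluate, n_features=50, targeted=frozenset({3, 7})
)
print(result)
```

If the fraction of random ablations that match or beat the targeted one is small, the gain is less likely to be a generic side effect of removing any features at all.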
minor comments (2)
- [§2] Clarify notation for feature activation thresholds and circuit extraction criteria in the methods; inconsistent use of 'sparse' vs. 'interpretable' risks ambiguity.
- [§4, §5] Add error bars, statistical significance, and exact dataset sizes to all quantitative results in the SHIFT and unsupervised pipeline sections.
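One common way to attach error bars to an accuracy figure is a percentile bootstrap over per-example correctness. The sketch below illustrates that general technique. It is not a procedure described in the paper, and the function name, defaults, and toy data are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(correct)
    point = sum(correct) / n
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample examples with replacement and recompute accuracy.
        sample = rng.choices(correct, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper


# Toy data: 85 correct predictions out of 100 held-out examples.
outcomes = [1] * 85 + [0] * 15
acc, lo, hi = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Reporting the dataset size alongside the interval, as the comment asks, lets readers judge how much of the interval's width comes from the sample size.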
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We address the major concerns regarding the SHIFT evaluation and the causality validation in circuit discovery below. We agree that additional controls and tests will strengthen the manuscript and plan to incorporate them in the revised version.
Point-by-point responses
-
Referee: [§4] §4 (SHIFT evaluation): The central claim that ablating human-judged irrelevant features improves generalization without unintended effects is load-bearing for the editing application, yet the manuscript provides insufficient controls for residual correlations or incomplete disentanglement in the underlying SAEs; ablation effects could propagate indirectly, confounding the reported gains. Include ablation specificity metrics (e.g., change in other feature activations) and comparison to random or correlated-feature baselines.
Authors: We recognize the importance of demonstrating that the ablations in SHIFT are specific and do not lead to unintended effects through residual correlations in the SAEs. The original experiments used human judgment to select irrelevant features and showed generalization improvements, but we agree that more rigorous controls are needed. In the revised manuscript, we will include ablation specificity metrics, such as the change in activation of other features when ablating the selected ones, to show minimal interference. Additionally, we will add baselines comparing to random feature ablations and ablations of features that are correlated with the irrelevant ones. This will help confirm that the gains are due to the targeted ablations.
Revision: yes
-
Referee: [§3] §3 (circuit discovery and causality validation): The assertion that sparse feature circuits are 'causally implicated' relies on interventions whose isolation is not fully demonstrated; given known SAE polysemanticity, provide explicit tests (e.g., do-no-harm checks on unrelated behaviors or mutual information between features) to rule out confounding before claiming detailed understanding of unanticipated mechanisms.
Authors: We appreciate the referee pointing out the need for stronger evidence of intervention isolation, especially considering potential polysemanticity in SAEs. Our method identifies circuits by finding features that causally affect the behavior via patching experiments, and we show that these circuits explain unanticipated mechanisms. To address the concern, we will add explicit do-no-harm checks in the revised §3, where we test that ablating the discovered circuits does not harm performance on unrelated tasks or behaviors. We will also compute and report metrics such as mutual information between the features in the circuit to assess their independence. These additions will provide better support for the causal claims.
Revision: yes
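The mutual-information metric mentioned in the second response could be estimated, for example, from binarized feature activations. The sketch below uses a plug-in estimate on discrete values. Binarizing at a threshold, and the function name itself, are assumptions made for illustration and are not part of the rebuttal.

```python
import math
from collections import Counter
from typing import Sequence


def mutual_information_bits(x: Sequence[int], y: Sequence[int]) -> float:
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


# Toy on/off patterns for two hypothetical features across four inputs.
always_together = mutual_information_bits([0, 1, 0, 1], [0, 1, 0, 1])
unrelated = mutual_information_bits([0, 0, 1, 1], [0, 1, 0, 1])
print(always_together, unrelated)
```

A value near zero suggests the two features fire independently on the sampled inputs, while a value near the entropy of either feature suggests they carry largely redundant information. Plug-in estimates are biased upward on small samples, so any reported values would need enough data to be meaningful.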
Circularity Check
No circularity: methodological pipeline is self-contained with no derivations reducing to inputs
Full rationale
The paper presents an empirical methodology for discovering sparse feature circuits via SAEs and applying them in SHIFT ablations, without any equations, first-principles derivations, or predictions that reduce by construction to fitted parameters or self-citations. Claims rest on external validation through human interpretability judgments and measured generalization improvements, which are falsifiable outside the fitted values. No load-bearing self-citation chains or ansatz smuggling appear in the provided text; the unsupervised pipeline and causal editing steps are independent of the target results.
Axiom & Free-Parameter Ledger
invented entities (1)
- sparse feature circuits: no independent evidence
Lean theorems connected to this paper
- Foundation.LawOfExistence defect_zero_iff_one echoes: "We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant."
Forward citations
Cited by 24 Pith papers
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE is the first sparse autoencoder that factors decoder atoms into the native d_k x d_v cache write shape of recurrent models and supplies a closed-form per-token logit shift for atom substitution.
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
GKnow: Measuring the Entanglement of Gender Bias and Factual Gender
Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, making neuron ablation an unreliable method for debiasing.
-
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
-
What Cohort INRs Encode and Where to Freeze Them
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
-
A framework for analyzing concept representations in neural models
A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...
-
Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision
Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...
-
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...
-
Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer
Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.
-
From Token Lists to Graph Motifs: Weisfeiler-Lehman Analysis of Sparse Autoencoder Features
Graph-motif clustering of SAE features via a frequency-binned WL kernel recovers structural families not captured by decoder cosine similarity or token histograms.
-
The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Towards Understanding the Robustness of Sparse Autoencoders
Integrating pretrained sparse autoencoders into LLM residual streams reduces jailbreak success rates by up to 5x across multiple models and attacks.
-
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
-
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
-
Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.
-
Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs
Feature rivalry in SAE representations strengthens with model uncertainty on high-entropy questions, enables output steering, and predicts answer correctness with AUROC 0.689 in Gemma-2-2B.
-
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
Reference graph
Works this paper leans on
-
[1]
Probing classifiers: Promises, shortcomings, and advances
Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48 0 (1): 0 207--219, 2022
work page 2022
-
[2]
LEACE : Perfect linear concept erasure in closed form
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE : Perfect linear concept erasure in closed form. In Thirty-seventh Conference on Neural Information Processing Systems, 2023, https://openreview.net/forum?id=awIpKpwTwF LEACE : Perfect linear concept erasure in closed form
work page 2023
-
[3]
Pythia: A suite for analyzing large language models across training and scaling
Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.\ 2397--2430. PMLR, 2023
work page 2023
-
[4]
Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...
work page 2023
-
[5]
P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A
Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in - VAE , 2017, https://arxiv.org/abs/1804.03599 Understanding disentangling in - VAE
-
[6]
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision https://...
-
[7]
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, J \'e r \'e my Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphael Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J Mi...
work page 2023
-
[8]
Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLM s
Angelica Chen, Ravid Shwartz-Ziv, Kyunghyun Cho, Matthew L Leavitt, and Naomi Saphra. Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLM s. In The Twelfth International Conference on Learning Representations, 2024, https://openreview.net/forum?id=MO5PiKHELW Sudden Drops in the Loss: Syntax Acquisition, Phase Transi...
work page 2024
-
[9]
Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders, 2018, Isolating Sources of Disentanglement in Variational Autoencoders https://openreview.net/forum?id=BJdMRoCIf
work page 2018
-
[10]
Infogan: interpretable representation learning by information maximizing generative adversarial nets
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pp.\ 2180–2188, Red Hook, NY, USA, 2016. Curran Associates Inc. ISB...
work page 2016
-
[11]
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Computing Research Repository, arXiv:1706.03741, 2023
-
[12]
Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adri \`a Garriga-Alonso
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adri \`a Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Thirty-seventh Conference on Neural Information Processing Systems, 2023, Towards Automated Circuit Discovery for Mechanistic Interpretability https://openreview.net/pdf?id=89ia77nZ8u
work page 2023
-
[13]
Environment inference for invariant learning
Elliot Creager, Joern-Henrik Jacobsen, and Richard Zemel. Environment inference for invariant learning. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.\ 2189--2200. PMLR, 18--24 Jul 2021, Environment Inference for Invariant Learning htt...
work page 2021
-
[14]
Sparse autoencoders find highly interpretable features in language models
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024, Sparse Autoencoders Find Highly Interpretable Features in Language Models https://openreview.net/forum?id=F76bwRSLeK
work page 2024
-
[15]
Bias in bios: A case study of semantic representation bias in a high-stakes setting
Maria De-Arteaga, Alexey Romanov, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, and Adam Tauman Kalai. Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, pp.\ 120–128, Ne...
-
[16]
Disentangling factors of variation via generative entangling
Guillaume Desjardins, Aaron Courville, and Yoshua Bengio. Disentangling factors of variation via generative entangling. Computing Research Repository, arXiv:1210.5474, 2012
-
[17]
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/t...
work page 2022
-
[18]
Causal analysis of syntactic agreement mechanisms in neural language models
Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov. Causal analysis of syntactic agreement mechanisms in neural language models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Internat...
work page 2021
-
[19]
Yossi Gandelsman, Alexei A. Efros, and Jacob Steinhardt. Interpreting CLIP 's image representation via text-based decomposition. Computing Research Repository, arXiv:2310.05916, 2024
-
[20]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The P ile: An 800 GB dataset of diverse text for language modeling. Computing Research Repository, arXiv:2101.00027, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[21]
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. Computing Research Repository, arXiv:2406.04093, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Causal abstractions of neural networks
Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.\ 9574--9586. Curran Associates, Inc., 2021, Causal Abstractions of Neural Networks https://proceeding...
work page 2021
-
[23]
Inducing causal structure for interpretable neural networks
Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volu...
work page 2022
-
[24]
Atticus Geiger, Chris Potts, and Thomas Icard. Causal abstraction for faithful model interpretation. Computing Research Repository, arXiv:2301.04709, 2023
-
[25]
Dissecting recall of factual associations in auto-regressive language models
Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 12216--12235, Singapore, December 2023. Association for Computationa...
work page 2023
-
[26]
Successor heads: Recurring, interpretable attention heads in the wild
Rhys Gould, Euan Ong, George Ogden, and Arthur Conmy. Successor heads: Recurring, interpretable attention heads in the wild. Computing Research Repository, arXiv:2312.09230, 2023
-
[27]
Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT -2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023, https://openreview.net/forum?id=p4PckNQR8k How does GPT -2 compute greater-than?: Interpreting mathematical abilities in...
work page 2023
-
[28]
Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms
Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov. Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms. In ICML 2024 Workshop on Mechanistic Interpretability, 2024, Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms https://openreview.net/forum?id=grXgesr5dT
work page 2024
-
[29]
The unreasonable effectiveness of easy training data for hard tasks
Peter Hase, Mohit Bansal, Peter Clark, and Sarah Wiegreffe. The unreasonable effectiveness of easy training data for hard tasks. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 7002--7024, Bangkok, Thailand, August 2024. Associati...
work page 2024
-
[30]
T. He, Z. Li, Y. Gong, Y. Yao, X. Nie, and Y. Yin. Exploring linear feature disentanglement for neural networks. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pp.\ 1--6, Los Alamitos, CA, USA, jul 2022. IEEE Computer Society, Exploring Linear Feature Disentanglement for Neural Networks https://doi.ieeecomputersociety.org/10.1109/ICM...
-
[31]
beta- VAE : Learning basic visual concepts with a constrained variational framework
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta- VAE : Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017, https://openreview.net/forum?id=Sy2fzU9gl beta- VAE : Learning Basic Visual...
work page 2017
-
[32]
Simple data balancing achieves competitive worst-group-accuracy
Badr Youbi Idrissi, Martin Arjovsky, Mohammad Pezeshki, and David Lopez-Paz. Simple data balancing achieves competitive worst-group-accuracy. In Bernhard Schölkopf, Caroline Uhler, and Kun Zhang (eds.), Proceedings of the First Conference on Causal Learning and Reasoning, volume 177 of Proceedings of Machine Learning Research, pp.\ 336--351. PMLR, 11--13 ...
work page 2022
-
[33]
Shadi Iskander, Kira Radinsky, and Yonatan Belinkov. Shielded representations: Protecting sensitive attributes through iterative gradient-based projection. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.\ 5961--5977, Toronto, Canada, July 2023. Association for Computat...
work page 2023
-
[34]
Leveraging prototypical representations for mitigating social bias without demographic information
Shadi Iskander, Kira Radinsky, and Yonatan Belinkov. Leveraging prototypical representations for mitigating social bias without demographic information. Computing Research Repository, 2403.09516, 2024
-
[35]
Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, et al. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors ( TCAV ). In Proceedings of the 35th International Conference on Machine Learning, pp.\ 2668--2677. PMLR, 2018
work page 2018
-
[36]
Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.\ 2649--2658. PMLR, 10--15 Jul 2018, Disentangling by Factorising https://proceedings.mlr.press/v80/kim18b.html
work page 2018
-
[37]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, Adam: A Method for Stochastic Optimization https://api.semanticscholar.org/CorpusID:6628106. CoRR, abs/1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[38]
Last layer re-training is sufficient for robustness to spurious correlations
Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Last layer re-training is sufficient for robustness to spurious correlations. Computing Research Repository, arXiv:2204.02937, 2023
-
[39]
arXiv preprint arXiv:2403.00745 , year=
János Kramár, Tom Lieberum, Rohin Shah, and Neel Nanda. AtP *: An efficient and scalable method for localizing llm behaviour to components. Computing Research Repository, arXiv:2403.00745, 2024
-
[40]
David K. Lewis. Counterfactuals. Blackwell, Malden, Mass., 1973
work page 1973
-
[41]
Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024
Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2, 2024
work page 2024
-
[42]
Johnny Lin and Joseph Bloom. Neuronpedia: Interactive reference and tooling for analyzing neural networks, 2023, Neuronpedia: Interactive Reference and Tooling for Analyzing Neural Networks https://www.neuronpedia.org. Software available from neuronpedia.org
work page 2023
-
[43]
Just train twice: Improving group robustness without training group information
Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning ...
work page 2021
-
[44]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017, Decoupled Weight Decay Regularization https://api.semanticscholar.org/CorpusID:53592270
work page 2017
-
[45]
Alireza Makhzani and Brendan J. Frey. k-sparse autoencoders, k-Sparse Autoencoders https://api.semanticscholar.org/CorpusID:14850799. Computing Research Repository, abs/1312.5663, 2013
work page Pith review arXiv 2013
-
[46]
arXiv preprint arXiv:2202.05262 , year=
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT . Advances in Neural Information Processing Systems, 36, 2022. arXiv:2202.05262
-
[47]
The quantization model of neural scaling
Eric J Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=3tbTw2ga8K
-
[48]
Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, and Yonatan Belinkov. The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability, 2024, The Quest for the Right Mediator: A History, ...
-
[49]
Learning from failure: Training debiased classifier from biased classifier
Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: Training debiased classifier from biased classifier. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546
-
[50]
Spread spurious attribute: Improving worst-group accuracy with spurious attribute estimation
Junhyun Nam, Jaehyung Kim, Jaeho Lee, and Jinwoo Shin. Spread spurious attribute: Improving worst-group accuracy with spurious attribute estimation, 2022
-
[51]
Neel Nanda. Attribution patching: Activation patching at industrial scale, 2022. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching
-
[52]
Neel Nanda. Open source replication & commentary on Anthropic's dictionary learning paper, 2023. https://www.lesswrong.com/posts/fKuugaxt2XLTkASkk/open-source-replication-and-commentary-on-anthropic-s
-
[53]
Neel Nanda, Senthooran Rajamanoharan, János Kramár, and Rohin Shah. Fact finding: Attempting to reverse-engineer factual recall on the neuron level, 2023. https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
-
[54]
The alignment problem from a deep learning perspective
Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. Computing Research Repository, arXiv:2209.00626, 2024
-
[55]
Tuomas Oikarinen, Subhro Das, Lam M. Nguyen, and Tsui-Wei Weng. Label-free concept bottleneck models. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=FlCg47MNvBA
-
[56]
In-context learning and induction heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...
-
[57]
Yonatan Oren, Shiori Sagawa, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust language modeling. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp...
-
[58]
BLIND: Bias removal with no demographics
Hadas Orgad and Yonatan Belinkov. BLIND: Bias removal with no demographics. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8801–8821, Toronto, Canada, July 2023. Association for Computational Linguistics, https://aclantho...
-
[59]
Judea Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI'01, pp. 411–420, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001
-
[60]
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011
-
[61]
William Peebles, John Peebles, Jun-Yan Zhu, Alexei A. Efros, and Antonio Torralba. The Hessian penalty: A weak prior for unsupervised disentanglement. In Proceedings of European Conference on Computer Vision (ECCV), 2020
-
[62]
Fine-tuning enhances existing mechanisms: A case study on entity tracking
Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, and David Bau. Fine-tuning enhances existing mechanisms: A case study on entity tracking. In Proceedings of the 2024 International Conference on Learning Representations, 2024. arXiv:2402.14811
-
[63]
Improving dictionary learning with gated sparse autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders. Computing Research Repository, arXiv:2404.16014, 2024a
-
[64]
Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. https://arxiv.org/abs/2407.14435. Computing Research Repository, arXiv:24...
-
[65]
Null it out: Guarding protected attributes by iterative nullspace projection
Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, and Yoav Goldberg. Null it out: Guarding protected attributes by iterative nullspace projection. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7237–7256, Online, July 2020. As...
-
[66]
Linear adversarial concept erasure
Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D Cotterell. Linear adversarial concept erasure. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 18400–18421. PML...
-
[67]
Adversarial concept erasure in kernel space
Shauli Ravfogel, Francisco Vargas, Yoav Goldberg, and Ryan Cotterell. Adversarial concept erasure in kernel space. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 6034–6055, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computation...
-
[68]
James M. Robins and Sander Greenland. Identifiability and exchangeability for direct and indirect effects. Epidemiology, 3(2):143–155, 1992. ISSN 10443983. http://www.jstor.org/stable/3702894
-
[69]
Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. In International Conference on Learning Representations, 2020. https://openreview.net/forum?id=ryxGuJrFvS
-
[70]
Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, November 1992. ISSN 0899-7667. https://doi.org/10.1162/neco.1992.4.6.863
-
[71]
Explaining neural networks by decoding layer activations
Johannes Schneider and Michalis Vlachos. Explaining neural networks by decoding layer activations. In Advances in Intelligent Data Analysis XIX: 19th International Symposium on Intelligent Data Analysis, IDA 2021, Porto, Portugal, April 26–28, 2021, Proceedings, pp. 63–75, Berlin, Heidelberg, 2021. Springer-Verlag. ISBN 978-3-030-74250-8, Explaining Neur...
-
[72]
BARACK: Partially supervised group robustness with guarantees
Nimit Sharad Sohoni, Maziar Sanjabi, Nicolas Ballas, Aditya Grover, Shaoliang Nie, Hamed Firooz, and Christopher Re. BARACK: Partially supervised group robustness with guarantees. In ICML 2022: Workshop on Spurious Correlations, Invariance and Stability, 2022, https://openreview.net/forum?id=Rn9POk3wOiV BARACK: Partially Supervised Group Robustness With...
-
[73]
Axiomatic attribution for deep networks
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp. 3319–3328. JMLR.org, 2017
-
[74]
Attribution patching outperforms automated circuit discovery
Aaquib Syed, Can Rager, and Arthur Conmy. Attribution patching outperforms automated circuit discovery. In NeurIPS Workshop on Attributing Model Behavior at Scale, 2023. https://openreview.net/forum?id=tiLbFR4bJW
-
[75]
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, ...
-
[76]
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...
-
[77]
Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. Function vectors in large language models. In Proceedings of the 2024 International Conference on Learning Representations, 2024
-
[78]
Towards debiasing NLU models from unknown biases
Prasetya Ajie Utama, Nafise Sadat Moosavi, and Iryna Gurevych. Towards debiasing NLU models from unknown biases. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7597–7610, Online, November 2020. Association for Computational Linguistics, htt...
-
[79]
Investigating gender bias in language models using causal mediation analysis
Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. Investigating gender bias in language models using causal mediation analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 12388–12401. Curran Associat...
-
[80]
Interpretability in the wild: a circuit for indirect object identification in GPT-2 small
Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023, https://openreview.net/forum?id=NpsVSN6o4ul Interpretability in the Wild: a Circuit for Indirect Obj...