GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
hub Canonical reference
Distill , year =
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 5polarities
background 5representative citing papers
Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.
Re-derivation of activation patching NIE reveals it captures interaction effects in addition to direct causal effects, demonstrated via GPT-2 IOI circuit where INT explains component ranking issues and faithfulness instability.
Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.
Identifiable sparse autoencoders (iSAEs) are created from TopK SAEs via architecture and training tweaks, yielding improved stability and lower error by linking to dictionary learning where learned dictionaries satisfy an approximate restricted isometry condition.
Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.
Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.
LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion model that are clearer than those from entangled baselines.
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
Attribute retrieval in LLMs follows non-contiguous, redundant layer paths identified via iterative patching, implying highly distributed knowledge storage.
Proposes distribution-level unsupervised feature discovery for LLMs by clustering continuations on semantic content and mechanistic attributions without target outputs.
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
XWP and XWP_c are novel attribution methods for FCNNs that estimate feature importance by perturbing attached weights to avoid added bias and out-of-distribution issues in occlusion approaches.
SAEs exhibit a rate-distortion-polysemanticity tradeoff where monosemanticity increases rate and distortion, with optimal polysemanticity set by feature co-occurrence probabilities in the data.
Shapley value analysis identifies powerful adjectives that steer MMLU performance in model-family-specific patterns, with non-additive interactions emerging in larger models.
Composer Vector steers symbolic music generation models in latent space at inference time to control and blend composer styles without retraining.
Eigenanalysis of the empirical NTK surfaces feature directions that align with Fourier features in modular addition networks and grammatical features in Gemma-3-270M, outperforming PCA baselines on activations.
citing papers explorer
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
-
Toy Models of Superposition
Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.
-
The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching
Re-derivation of activation patching NIE reveals it captures interaction effects in addition to direct causal effects, demonstrated via GPT-2 IOI circuit where INT explains component ranking issues and faithfulness instability.
-
Toward Identifiable Sparse Autoencoders
Identifiable sparse autoencoders (iSAEs) are created from TopK SAEs via architecture and training tweaks, yielding improved stability and lower error by linking to dictionary learning where learned dictionaries satisfy an approximate restricted isometry condition.
-
Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability
Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.
-
Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol
Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.
-
Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
-
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
-
From Mechanistic to Compositional Interpretability
The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.
-
Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models
LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion model that are clearer than those from entangled baselines.
-
Improving Dictionary Learning with Gated Sparse Autoencoders
Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.
-
In-context Learning and Induction Heads
Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
-
From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks
XWP and XWP_c are novel attribution methods for FCNNs that estimate feature importance by perturbing attached weights to avoid added bias and out-of-distribution issues in occlusion approaches.
-
The Rate-Distortion-Polysemanticity Tradeoff in SAEs
SAEs exhibit a rate-distortion-polysemanticity tradeoff where monosemanticity increases rate and distortion, with optimal polysemanticity set by feature co-occurrence probabilities in the data.
-
Feature Identification via the Empirical NTK
Eigenanalysis of the empirical NTK surfaces feature directions that align with Fourier features in modular addition networks and grammatical features in Gemma-3-270M, outperforming PCA baselines on activations.
-
Linear Representations of Sentiment in Large Language Models
Sentiment is represented as a single linear direction in LLM activation space that is causally relevant across tasks and is summarized at punctuation and names in addition to charged words.
-
Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images
Sparse autoencoders resolve superposition in image-based neural representations of neurons, recovering metric geometry and enabling de novo cross-modal alignment to scRNA-seq via Gromov-Wasserstein transport.
-
Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics
Case study applies SAE probing with enstrophy triage to a continuum-dynamics foundation model and reports intermittent feature consistency that does not align with standard physics while linking some output discrepancies to specific feature changes.
-
Towards Effective Theory of LLMs: A Representation Learning Approach
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
-
"Faithful to What?" On the Limits of Fidelity-Based Explanations
High-fidelity surrogate explanations for neural networks often fail to recover the networks' predictive advantages over linear models in regression tasks.
-
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
Gemma Scope supplies trained sparse autoencoders for all layers of Gemma 2 2B and 9B plus select 27B layers, with public weights and benchmark scores.