Distill

URL https://aclanthology · 2020 · DOI 10.23915/distill.00024.001

Canonical reference. 100% of citing Pith papers cite this work as background.

36 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

Toy Models of Superposition

cs.LG · 2022-09-21 · accept · novelty 8.0

Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

cs.LG · 2026-06-25 · unverdicted · novelty 7.0

A re-derivation of the activation-patching NIE shows that it captures interaction effects in addition to direct causal effects, as demonstrated on the GPT-2 IOI circuit, where INT explains component-ranking issues and faithfulness instability.

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.

Toward Identifiable Sparse Autoencoders

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

Identifiable sparse autoencoders (iSAEs) are created from TopK SAEs via architecture and training tweaks, yielding improved stability and lower error by linking to dictionary learning where learned dictionaries satisfy an approximate restricted isometry condition.

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

cs.LG · 2026-05-24 · unverdicted · novelty 7.0

Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.

Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.

Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.

Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.

Data-driven Circuit Discovery for Interpretability of Language Models

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion model that are clearer than those from entangled baselines.

Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.

Improving Dictionary Learning with Gated Sparse Autoencoders

cs.LG · 2024-04-24 · unverdicted · novelty 7.0

Gated SAEs decouple which features to use from how large their activations should be, applying the L1 penalty only to selection and thereby eliminating shrinkage while halving the number of firing features needed for good fidelity.

In-context Learning and Induction Heads

cs.LG · 2022-09-24 · unverdicted · novelty 7.0

Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.

Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images

cs.LG · 2026-06-30 · unverdicted · novelty 6.0 · 2 refs

Sparse autoencoders resolve superposition in image-based neuron representations, recovering geometric fidelity and enabling scRNA-seq adaptation plus GW-map alignment to reconstruct pathology pathways without spatial transcriptomics.

Factual Retrieval in LLMs Is a Redundant, Distributed and Non-Contiguous Process

cs.CL · 2026-06-19 · unverdicted · novelty 6.0

Attribute retrieval in LLMs follows non-contiguous, redundant layer paths identified via iterative patching, implying highly distributed knowledge storage.

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

cs.CL · 2026-06-06 · unverdicted · novelty 6.0 · 2 refs

Introduces distribution-level unsupervised feature discovery for LLMs by clustering continuations using semantic embeddings and prefix-to-continuation attribution signatures via rate-distortion optimization.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.

From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

XWP and XWP_c are novel attribution methods for FCNNs that estimate feature importance by perturbing attached weights to avoid added bias and out-of-distribution issues in occlusion approaches.

The Rate-Distortion-Polysemanticity Tradeoff in SAEs

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

SAEs exhibit a rate-distortion-polysemanticity tradeoff where monosemanticity increases rate and distortion, with optimal polysemanticity set by feature co-occurrence probabilities in the data.

Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Shapley value analysis identifies powerful adjectives that steer MMLU performance in model-family-specific patterns, with non-additive interactions emerging in larger models.

Composer Vector: Style-steering Symbolic Music Generation in a Latent Space

cs.SD · 2026-04-03 · unverdicted · novelty 6.0

Composer Vector steers symbolic music generation models in latent space at inference time to control and blend composer styles without retraining.

citing papers explorer

Showing 27 of 27 citing papers after filters.

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching cs.LG · 2026-06-25 · unverdicted · none · ref 12
A re-derivation of the activation-patching NIE shows that it captures interaction effects in addition to direct causal effects, as demonstrated on the GPT-2 IOI circuit, where INT explains component-ranking issues and faithfulness instability.
Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery cs.CL · 2026-06-04 · unverdicted · none · ref 77
Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.
Toward Identifiable Sparse Autoencoders cs.LG · 2026-05-29 · unverdicted · none · ref 2
Identifiable sparse autoencoders (iSAEs) are created from TopK SAEs via architecture and training tweaks, yielding improved stability and lower error by linking to dictionary learning where learned dictionaries satisfy an approximate restricted isometry condition.
Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability cs.LG · 2026-05-24 · unverdicted · none · ref 2
Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.
Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol cs.LG · 2026-05-23 · unverdicted · none · ref 10
Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.
Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m cs.LG · 2026-05-23 · unverdicted · none · ref 11
Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2 cs.LG · 2026-05-13 · unverdicted · none · ref 13
Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
Data-driven Circuit Discovery for Interpretability of Language Models cs.AI · 2026-05-09 · unverdicted · none · ref 20
Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
From Mechanistic to Compositional Interpretability cs.LG · 2026-05-09 · unverdicted · none · ref 50 · 2 links
The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.
Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models cs.LG · 2026-05-06 · unverdicted · none · ref 5
LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion model that are clearer than those from entangled baselines.
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction cs.AI · 2026-05-04 · unverdicted · none · ref 17
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images cs.LG · 2026-06-30 · unverdicted · none · ref 32 · 2 links
Sparse autoencoders resolve superposition in image-based neuron representations, recovering geometric fidelity and enabling scRNA-seq adaptation plus GW-map alignment to reconstruct pathology pathways without spatial transcriptomics.
Factual Retrieval in LLMs Is a Redundant, Distributed and Non-Contiguous Process cs.CL · 2026-06-19 · unverdicted · none · ref 22
Attribute retrieval in LLMs follows non-contiguous, redundant layer paths identified via iterative patching, implying highly distributed knowledge storage.
Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms cs.CL · 2026-06-06 · unverdicted · none · ref 7 · 2 links
Introduces distribution-level unsupervised feature discovery for LLMs by clustering continuations using semantic embeddings and prefix-to-continuation attribution signatures via rate-distortion optimization.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet cs.AI · 2026-05-28 · unverdicted · none · ref 57
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute cs.AI · 2026-05-14 · unverdicted · none · ref 12
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks cs.LG · 2026-05-14 · unverdicted · none · ref 19
XWP and XWP_c are novel attribution methods for FCNNs that estimate feature importance by perturbing attached weights to avoid added bias and out-of-distribution issues in occlusion approaches.
The Rate-Distortion-Polysemanticity Tradeoff in SAEs cs.LG · 2026-05-14 · unverdicted · none · ref 13
SAEs exhibit a rate-distortion-polysemanticity tradeoff where monosemanticity increases rate and distortion, with optimal polysemanticity set by feature co-occurrence probabilities in the data.
Investigating Linguistic Steering: An Analysis of Adjectival Effects Across Large Language Model Architectures cs.CL · 2026-04-28 · unverdicted · none · ref 11
Shapley value analysis identifies powerful adjectives that steer MMLU performance in model-family-specific patterns, with non-additive interactions emerging in larger models.
Composer Vector: Style-steering Symbolic Music Generation in a Latent Space cs.SD · 2026-04-03 · unverdicted · none · ref 15
Composer Vector steers symbolic music generation models in latent space at inference time to control and blend composer styles without retraining.
Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics cs.LG · 2026-06-10 · unverdicted · none · ref 7
A case study applies SAE probing with enstrophy triage to a continuum-dynamics foundation model and reports intermittent feature consistency that does not align with standard physics, while linking some output discrepancies to specific feature changes.
Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces cs.CL · 2026-06-05 · unverdicted · none · ref 55
Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.
Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers cs.CV · 2026-06-04 · unverdicted · none · ref 55
ViSAE supplies a 64K-image probing suite with 16K concepts, top-down/bottom-up circuit algorithms, and editing methods that raise WaterBirds worst-group accuracy by 48.2% over baselines.
Rare Events, Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks cs.NE · 2026-05-21 · unverdicted · none · ref 32
In spiking ResNets, 1FC ensembles defined by pairwise correlations show ReLU-like cofiring-to-response mapping whose gain scales with ensemble size, with reliable class encoding restricted to infrequent high-cofiring events.
Towards Effective Theory of LLMs: A Representation Learning Approach cs.LG · 2026-05-10 · unverdicted · none · ref 42
RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
Engineering Resource-constrained Software Systems with DNN Components: a Concept-based Pruning Approach cs.SE · 2026-04-11 · unverdicted · none · ref 76
A concept-based pruning method for DNNs guided by interpretable concepts and system requirements produces smaller, computationally efficient models that maintain effectiveness on image classification tasks.
Interpreting "Interpretability" and Explaining "Explainability" in Machine Learning in Physics physics.data-an · 2026-06-24 · unverdicted · none · ref 119
The paper defines interpretability as model structural transparency and explainability as scientific content mapping, discusses their trade-offs, and frames both as deliberate modeling choices for ML in physics.
