hub: https://distill.pub/2020/circuits/zoom-in
11 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
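(hedged code sketches illustrating each paper's core technique follow the list)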
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
  GPT-2 small solves indirect object identification via a circuit of 26 attention heads, organized into seven functional classes and discovered through causal interventions.
- Toy Models of Superposition
  Toy models demonstrate that polysemanticity arises when neural networks store more sparse features than they have neurons via superposition, producing a phase transition tied to polytope geometry and increased adversarial vulnerability.
- Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
  Projecting LLM hidden states onto F2 algebra with 42 pairs yields 93% zero-shot accuracy on logical relations and identifies prompt-preventable late-layer collapse.
- Data-driven Circuit Discovery for Interpretability of Language Models
  Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple, more faithful circuits per dataset.
- From Mechanistic to Compositional Interpretability
  Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
- Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models
  LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion model that are clearer than those from entangled baselines.
- Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
  A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
- In-context Learning and Induction Heads
  Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning in transformers.
- Composer Vector: Style-steering Symbolic Music Generation in a Latent Space
  Composer Vector steers symbolic music generation models in latent space at inference time to control and blend composer styles without retraining.
- Towards Effective Theory of LLMs: A Representation Learning Approach
  RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.
- Engineering Resource-constrained Software Systems with DNN Components: a Concept-based Pruning Approach
  A concept-based pruning method for DNNs guided by interpretable concepts and system requirements produces smaller, computationally efficient models that maintain effectiveness on image classification tasks.
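technique sketches

Interpretability in the Wild. A minimal sketch of the activation-patching style of causal intervention behind the IOI circuit, on a tiny hand-built attention layer standing in for GPT-2 small; in the real analysis a patch's effect is read off as restored logit difference on the IOI task, so the numbers here are only illustrative.

```python
import torch

torch.manual_seed(0)
D, H, T = 16, 4, 8                      # model dim, heads, sequence length
Dh = D // H
Wq, Wk, Wv, Wo = (torch.randn(D, D) / D ** 0.5 for _ in range(4))

def attention(x, patch_head=None, patch_value=None):
    """One attention layer; optionally overwrite one head's output."""
    q = (x @ Wq).view(T, H, Dh).transpose(0, 1)          # [H, T, Dh]
    k = (x @ Wk).view(T, H, Dh).transpose(0, 1)
    v = (x @ Wv).view(T, H, Dh).transpose(0, 1)
    att = torch.softmax(q @ k.transpose(-1, -2) / Dh ** 0.5, dim=-1)
    z = att @ v                                          # per-head outputs
    if patch_head is not None:
        z = z.clone()
        z[patch_head] = patch_value                      # the intervention
    return z, z.transpose(0, 1).reshape(T, D) @ Wo

x_clean, x_corr = torch.randn(T, D), torch.randn(T, D)
z_clean, _ = attention(x_clean)
_, out_corr = attention(x_corr)

# Patch each head's clean output into the corrupted run; heads whose
# patch moves the output the most are candidate circuit components.
for h in range(H):
    _, out_patched = attention(x_corr, patch_head=h, patch_value=z_clean[h])
    print(f"head {h}: patch shifts output by {(out_patched - out_corr).norm():.3f}")
```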
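Toy Models of Superposition. A compressed version of the paper's ReLU toy model: n sparse features squeezed through m < n hidden dimensions, trained with an importance-weighted reconstruction loss. At high sparsity the learned directions interfere with one another, which is the superposition regime the paper studies.

```python
import torch

torch.manual_seed(0)
n, m, sparsity = 20, 5, 0.95
W = torch.nn.Parameter(torch.randn(m, n) * 0.1)
b = torch.nn.Parameter(torch.zeros(n))
opt = torch.optim.Adam([W, b], lr=1e-2)
importance = 0.9 ** torch.arange(n)           # earlier features matter more

for step in range(3000):
    # sparse features: each is nonzero with probability 1 - sparsity
    x = torch.rand(1024, n) * (torch.rand(1024, n) > sparsity)
    x_hat = torch.relu(x @ W.T @ W + b)       # h = Wx, x_hat = ReLU(W^T h + b)
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Diagonal of W^T W holds squared feature norms ("is this represented?");
# large off-diagonal entries are interference between superposed features.
gram = (W.T @ W).detach()
print("squared feature norms:", gram.diag().round(decimals=2))
```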
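Controlling Logical Collapse in LLMs. Only the one-line summary is available here, so this is a loose illustration of the generic move it describes: read hidden states as bits by projecting onto directions and thresholding, then do logic over GF(2), where XOR is addition mod 2. The directions and probe setup below are invented, and the paper's "42 pairs" construction is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 8
directions = rng.standard_normal((n_bits, d))      # one direction per bit

def to_bits(hidden):
    # hidden: [d] activation vector -> [n_bits] vector over GF(2)
    return (directions @ hidden > 0).astype(np.uint8)

# In GF(2), XOR is addition: a relation like "p differs from q" becomes
# a linear check on the projected bit vectors.
h1, h2 = rng.standard_normal(d), rng.standard_normal(d)
p, q = to_bits(h1), to_bits(h2)
print("bitwise XOR (addition in GF(2)):", (p + q) % 2)
```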
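Data-driven Circuit Discovery. A hedged sketch of the clustering idea: if each example yields its own attribution pattern over circuit components, clustering those patterns can expose several distinct circuits in one dataset instead of a single averaged one. The attribution scores below are random stand-ins, and k-means is an assumed choice of clustering method, not necessarily the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_examples, n_components = 200, 50        # e.g. attention heads + MLP blocks

# Pretend two sub-populations of examples rely on different components.
scores = np.vstack([
    rng.normal(0, 0.1, (100, n_components)) + (np.arange(n_components) < 10),
    rng.normal(0, 0.1, (100, n_components)) + (np.arange(n_components) >= 40),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
for c in range(2):
    circuit = scores[labels == c].mean(axis=0) > 0.5   # per-cluster circuit
    print(f"cluster {c}: components {np.flatnonzero(circuit)}")
```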
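From Mechanistic to Compositional Interpretability. The paper's formalism is richer than this, but the commuting-pair condition can be shown on a toy: translating a model-side object to the human side and then evaluating it there must agree with the model-side semantics. All maps below are invented for the example.

```python
def model_semantics(s):              # model side: parity of a bitstring
    return s.count("1") % 2 == 1

def translate(s):                    # syntactic map to the human side
    return ("odd", s.count("1"))

def human_semantics(expr):           # human-side evaluation
    op, n = expr
    assert op == "odd"
    return n % 2 == 1

# The explanation is valid when the square commutes on every input:
# human_semantics(translate(s)) == model_semantics(s).
for s in ["0110", "1110", "0000", "1"]:
    assert human_semantics(translate(s)) == model_semantics(s)
print("the syntactic-semantic square commutes on all tested inputs")
```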
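Deep Dreams Are Made of This. A hedged sketch of optimization-based feature visualization through a sparse autoencoder: optimize the input image so that a single SAE latent fires. The conv encoder and SAE weights are random stand-ins rather than a latent diffusion model, and LVO's actual objective and regularizers may differ.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(                    # stand-in for the model
    torch.nn.Conv2d(3, 8, 5, stride=2), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 16, 5, stride=2), torch.nn.ReLU(),
    torch.nn.Flatten(), torch.nn.LazyLinear(32),
)
sae_enc = torch.nn.Linear(32, 128)                # overcomplete SAE encoder
feature_idx = 7                                   # latent to visualize

img = torch.zeros(1, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
for step in range(200):
    act = torch.relu(sae_enc(encoder(img)))       # SAE latents
    # maximize one latent, with a small norm penalty to keep img tame
    loss = -act[0, feature_idx] + 1e-3 * img.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

final = torch.relu(sae_enc(encoder(img)))[0, feature_idx]
print("final feature activation:", float(final))
```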
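Bucketing the Good Apples. An interchange intervention on a toy logic task: run the network on a base input but splice in the intermediate value it would have computed on a source input, then bucket (base, source) pairs by whether the output matches the hypothesized high-level model. Since this hand-built network literally implements the abstraction, every pair lands in the agreeing bucket; a partial match would partition the input space instead.

```python
from itertools import product

def network(a, b, c, patched_mid=None):
    """Hand-built network computing (a AND b) OR c, with an explicit
    intermediate that an intervention can overwrite."""
    mid = (a and b) if patched_mid is None else patched_mid
    return mid or c

def abstraction(mid, c):             # hypothesized high-level model
    return mid or c

buckets = {True: [], False: []}
for base in product([0, 1], repeat=3):
    for source in product([0, 1], repeat=3):
        src_mid = source[0] and source[1]            # intermediate on source
        got = network(*base, patched_mid=src_mid)    # interchange intervention
        want = abstraction(src_mid, base[2])         # abstraction's prediction
        buckets[got == want].append((base, source))
print(f"agreeing pairs: {len(buckets[True])}, disagreeing: {len(buckets[False])}")
```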
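In-context Learning and Induction Heads. The standard prefix-matching diagnostic: on a random sequence repeated twice, an induction head at position t attends back to position t - T + 1, the token right after the previous occurrence of the current token. With the random causal attention used here the score sits near chance; a real induction head scores close to 1.

```python
import torch

torch.manual_seed(0)
T = 12                               # length of the repeated random block
L, H = 2 * T, 8                      # total length, number of heads

# Stand-in causal attention patterns; with a real model these would come
# from a forward pass on a random token sequence repeated twice.
scores = torch.randn(H, L, L)
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
att = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

def prefix_matching_score(att):
    # average attention from each position t in the second copy back to
    # the token after its earlier duplicate, position t - T + 1
    idx = torch.arange(T, L)
    return att[:, idx, idx - T + 1].mean(dim=-1)

print("per-head induction scores:", prefix_matching_score(att).round(decimals=3))
```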
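Composer Vector. A generic activation-steering sketch consistent with the summary: take the mean hidden-state difference between two style corpora and add a scaled copy of it at inference through a forward hook, with the scale acting as a blend knob. The tiny MLP and the "style" data are stand-ins for a symbolic music model; the actual Composer Vector construction may differ.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 16))
layer = model[0]                                   # steer after this layer

style_a = torch.randn(64, 16) + 1.0                # stand-in corpora
style_b = torch.randn(64, 16) - 1.0
with torch.no_grad():
    steer = layer(style_a).mean(0) - layer(style_b).mean(0)

alpha = 0.8                                        # blend strength
# returning a value from a forward hook replaces the layer's output
hook = layer.register_forward_hook(lambda m, inp, out: out + alpha * steer)
with torch.no_grad():
    steered = model(torch.randn(4, 16))
hook.remove()
print("steered output shape:", steered.shape)
```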
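Towards Effective Theory of LLMs. The summary names temporally consistent macrovariables but not the objective, so this sketch swaps in a generic slowness loss: nearby timesteps should map to nearby macrovariables, with a variance term to stop the encoder collapsing. This is not necessarily RET's actual training signal, and the activation time series is a random-walk stand-in.

```python
import torch

torch.manual_seed(0)
T, d, k = 500, 64, 4                 # timesteps, activation dim, macrovars
acts = torch.cumsum(torch.randn(T, d) * 0.1, dim=0)   # stand-in activations

enc = torch.nn.Linear(d, k)
opt = torch.optim.Adam(enc.parameters(), lr=1e-2)
for step in range(500):
    z = enc(acts)
    slow = (z[1:] - z[:-1]).pow(2).mean()          # temporal consistency
    var = torch.relu(1.0 - z.std(dim=0)).mean()    # anti-collapse penalty
    loss = slow + var
    opt.zero_grad(); loss.backward(); opt.step()

print("macrovariable std per dim:", z.std(dim=0).detach().round(decimals=2))
```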
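Engineering Resource-constrained Software Systems. A hedged sketch of concept-guided pruning: score each channel by how strongly it tracks a concept of interest and drop the channels carrying the least concept signal. The concept labels, the correlation-style relevance score, and the median threshold are all invented for the illustration; the paper additionally folds in system requirements.

```python
import torch

torch.manual_seed(0)
acts = torch.randn(256, 32)                        # [examples, channels]
concept = (acts[:, :8].mean(dim=1) > 0).float()    # concept lives in ch 0-7

# per-channel relevance: |correlation-like score| with the concept label
a = acts - acts.mean(0)
c = concept - concept.mean()
rel = (a * c[:, None]).mean(0).abs() / (a.std(0) * c.std() + 1e-8)

keep = rel > rel.median()                          # prune the weaker half
print(f"kept {int(keep.sum())}/32 channels; first 8 kept:", keep[:8].tolist())
```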