
arxiv: 2606.09607 · v1 · pith:ZYRBERTYnew · submitted 2026-06-08 · 💻 cs.LG · cs.AI

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

Pith reviewed 2026-06-27 17:18 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords attention heads · circuit discovery · causal ablation · co-activation clustering · interpretability · mixture of experts · closure validation

The pith

Co-activation clusters propose attention-head circuits while ablation closure confirms or rejects them as causal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether clustering attention heads by co-activation statistics identifies functional circuits rather than mere correlations. It adapts a clustering approach and subjects the resulting communities to a closure test that measures loss increase after ablation against matched random controls. In two 1B-scale dense models and two input distributions the discovered groups survive this test, indicating they operate as circuits. In an MoE model the same procedure yields statistically detectable clusters whose ablation instead reduces loss, showing the opposite outcome. The work concludes that co-activation supplies candidate circuits while causal validation is required to establish actual function.

Core claim

When sparse-autoencoder-style clustering is adapted to attention heads and validated by causal ablation rather than reconstruction, the discovered communities pass closure tests across two dense 1B-scale models and two input distributions. Route-conditional clusters in an MoE model, by contrast, recover a signal that fails closure because ablation improves loss.

What carries the argument

The closure test that ablates a co-activation community and compares per-example damage to matched-random controls.
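
The closure comparison can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `closure_test`, the add-one empirical p-value, and the 0.05 pass threshold are assumptions made here. The page specifies only that per-example ablation damage is compared against matched-random controls.

```python
from statistics import mean

def closure_test(base_loss, community_loss, control_losses, alpha=0.05):
    """Compare ablation damage for a candidate community against random controls.

    base_loss:       per-example loss of the intact model
    community_loss:  per-example loss with the candidate community ablated
    control_losses:  list of per-example loss lists, one per matched-random head set
    alpha:           pass threshold (an assumption, not taken from the paper)
    """
    # Damage = mean per-example loss increase caused by the ablation.
    damage = mean(c - b for c, b in zip(community_loss, base_loss))
    control_damage = [
        mean(c - b for c, b in zip(losses, base_loss)) for losses in control_losses
    ]
    # One-sided empirical p-value: share of controls at least as damaging,
    # with add-one smoothing so the p-value is never exactly zero.
    n_as_bad = sum(d >= damage for d in control_damage)
    p_value = (1 + n_as_bad) / (1 + len(control_damage))
    return {
        "damage": damage,
        "control_mean": mean(control_damage),
        "p_value": p_value,
        # Ablation must hurt (damage > 0) and hurt more than the controls do.
        "passes": damage > 0 and p_value < alpha,
    }
```

Under this sketch, an outcome like the MoE case described on this page (ablation lowers loss) gives negative damage and therefore fails regardless of the p-value.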

If this is right

  • In dense transformers, co-activation communities identified by clustering function as causal circuits under the closure criterion.
  • In MoE models, route-conditional clustering recovers a detectable signal whose ablation improves rather than harms loss.
  • Attention-target selectivity and participation ratio decouple from functional circuit membership both during and across training.
  • A cheap co-activation signal remains only a circuit proposal until closure validation is applied.
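
For readers unfamiliar with the participation ratio, one common definition for a non-negative weight vector w is (sum of w) squared divided by the sum of w squared, which counts roughly how many entries carry the mass. This page does not say which quantity the paper applies it to or which definition it uses, so the function below is an assumed, generic form with a hypothetical name.

```python
def participation_ratio(weights):
    """Effective number of entries carrying the mass of a non-negative vector.

    Generic textbook form (sum w)^2 / sum(w^2); whether the paper uses exactly
    this definition is an assumption.
    """
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    if sum_sq == 0:
        raise ValueError("participation ratio is undefined for an all-zero vector")
    return total * total / sum_sq

# Mass spread evenly over four entries versus concentrated on one.
uniform = participation_ratio([0.25, 0.25, 0.25, 0.25])
peaked = participation_ratio([1.0, 0.0, 0.0, 0.0])
```

The value ranges from 1 (all mass on a single entry) to the vector length (mass spread evenly).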

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same closure procedure could be applied to other component types such as MLP neurons or residual streams to test generality.
  • Failure of closure in MoE settings points to the value of incorporating routing information directly into the clustering objective.
  • Longitudinal application of closure across checkpoints could track when candidate communities become or cease to be causal.

Load-bearing premise

That ablation damage relative to matched-random controls is a valid measure of whether a co-activation community functions as a causal circuit.

What would settle it

A dense-model experiment in which a co-activation community identified by clustering produces less damage under ablation than its matched-random controls would falsify the claim that the communities pass closure.

Figures

Figures reproduced from arXiv: 2606.09607 by Yongzhong Xu.

Figure 1: The Pythia 1B natural-text redundancy signature. The candidate ablation produces ...
Figure 2: The MoE story arc. (a) On natural text, OLMoE’s marginal Ising ARI collapses to ...
Figure 3: Multi-metric verdict across the five closure tests. For each test, the colored bar shows ...

Original abstract

Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that co-activation clustering of attention heads yields circuit proposals rather than confirmed circuits, and that a closure test (ablating the community and comparing per-example loss increase to size-matched random controls) is required for validation. It reports that the discovered communities pass this test in two dense 1B-scale models (Pythia 1B, OLMo 1B) across input distributions, but fail in an MoE model (where ablation improves loss). It further shows that attention-target selectivity and participation ratio decouple from functional impact when closure is tracked across training checkpoints.

Significance. If the closure test is valid, the work supplies a concrete, falsifiable distinction between correlational proposals and causally effective circuits, with the MoE counter-example and training-time decoupling providing useful negative results. The choice to validate by ablation rather than reconstruction error is a methodological strength, as is the multi-model, multi-distribution design. The approach could inform future clustering-based interpretability pipelines by emphasizing the need for causal checks.

major comments (1)
  1. [Abstract] Abstract (closure test paragraph): The central claim that communities 'pass closure' rests on greater ablation damage relative to matched-random controls. This comparison supports the circuit interpretation only if the random sets adequately control for confounders such as per-head importance, layer distribution, and activation magnitude; otherwise the test can succeed whenever clustering simply aggregates individually salient heads. The manuscript must specify the exact matching procedure and report verification that the controls are balanced on these metrics.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for explicit controls in the closure test. The comment correctly identifies a point where additional methodological detail will strengthen the manuscript.

  1. Referee: [Abstract] Abstract (closure test paragraph): The central claim that communities 'pass closure' rests on greater ablation damage relative to matched-random controls. This comparison supports the circuit interpretation only if the random sets adequately control for confounders such as per-head importance, layer distribution, and activation magnitude; otherwise the test can succeed whenever clustering simply aggregates individually salient heads. The manuscript must specify the exact matching procedure and report verification that the controls are balanced on these metrics.

    Authors: We agree that the current description of the matched-random controls is insufficiently precise. The revised manuscript will add an explicit subsection detailing the matching procedure: random sets are sampled to match the discovered community on (i) layer distribution (exact layer counts), (ii) mean activation magnitude across the evaluation set, and (iii) a binned histogram of per-head importance scores (measured as mean ablation damage when heads are removed individually). We will also include supplementary figures verifying that the control distributions are statistically indistinguishable from the community on these three metrics (Kolmogorov-Smirnov tests, p > 0.1). These additions will be placed in the Methods section and referenced from the abstract.

    Revision: yes
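
Criterion (i) of the rebuttal, exact layer-count matching, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function name is hypothetical, every layer is assumed to have the same number of heads, and excluding the community's own heads from the sampling pool is a choice made here. Criteria (ii) and (iii) are not implemented.

```python
import random
from collections import Counter

def sample_layer_matched_control(community, n_heads_per_layer, rng):
    """Draw a random head set with the same per-layer head counts as `community`.

    community:          list of (layer, head) pairs
    n_heads_per_layer:  heads per layer (assumed identical across layers)
    rng:                a random.Random instance, for reproducibility
    """
    members = set(community)
    per_layer = Counter(layer for layer, _ in community)
    control = []
    for layer, k in sorted(per_layer.items()):
        # Assumption: controls are disjoint from the community itself.
        pool = [h for h in range(n_heads_per_layer) if (layer, h) not in members]
        # rng.sample raises ValueError if the layer has fewer than k free heads.
        control.extend((layer, h) for h in rng.sample(pool, k))
    return control
```

Repeated calls with the same `rng` give a family of control sets whose ablation damage can then be compared with the community's.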

Circularity Check

0 steps flagged

No significant circularity; validation uses independent causal ablation


The paper clusters heads via co-activation statistics then validates via a separate ablation-based closure test against matched-random controls. This test is an external causal intervention whose outcome is not algebraically or statistically forced by the clustering inputs. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the method or claims. The result that communities pass closure is therefore an empirical finding rather than a definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that ablation provides a causal test of circuit function; no free parameters or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption: Ablation of attention-head communities measures their causal contribution to model behavior.
    This premise underpins the closure test that separates proposals from confirmed circuits.



Reference graph

Works this paper leans on

13 extracted references · 7 canonical work pages · 4 internal anchors

  1. J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179–195, 1975.

  2. U. Bhalla, T. Fel, C. Rager, S. Feucht, T. Haklay, D. Wurgaft, S. Boppana, M. Kowal, V. Shyam, O. Lewis, T. McGrath, J. Merullo, A. Geiger, and E. S. Lubana. Do sparse autoencoders capture concept manifolds? arXiv preprint arXiv:2604.28119, 2026.

  3. A. Conmy, A. N. Mavor-Parker, A. Lynch, S. Heimersheim, and A. Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

  4. J. Engels, E. J. Michaud, I. Liao, W. Gurnee, and M. Tegmark. Not all language model features are one-dimensionally linear. arXiv preprint arXiv:2405.14860, 2024.

  5. Goodfire. Interpreting language model parameters. Goodfire research note, 2026. https://www.goodfire.ai/research/interpreting-lm-parameters

  6. S. Kantamneni and M. Tegmark. Language models use trigonometry to do addition. arXiv preprint arXiv:2502.00873, 2025.

  7. C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.

  8. K. Park, Y. J. Choe, and V. Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.

  9. K. Park, Y. J. Choe, Y. Jiang, and V. Veitch. The geometry of categorical and hierarchical concepts in large language models. arXiv preprint arXiv:2406.01506, 2024.

  10. P. Ravikumar, M. J. Wainwright, and J. D. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.

  11. K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.

  12. Y. Xu. Spectral probe-circuits: a three-step recipe for identifying attention-head circuits in pretrained transformers. arXiv preprint arXiv:2605.24059, 2026.

  13. B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.

A Route-cluster routing-entropy diagnostic

The four route clusters’ mean per-layer routing entropy on the OLMoE natural-text batch:

Cluster | n | Mean per-layer H (nats) | Fraction of ...