arxiv: 2606.02378 · v2 · pith:G4DMJWPXnew · submitted 2026-06-01 · 💻 cs.LG · cs.AI

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

Pith reviewed 2026-06-28 15:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords attention heads · induction circuits · attention sinks · BOS attractors · training dynamics · participation ratio · mechanistic interpretability · language model development

The pith

In 1B-class models trained on DCLM, induction-circuit formation precedes attention-sink formation by an order of magnitude in tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks attention-head circuit formation at multiple checkpoints in three 1B-scale models from two architecture families. It finds that induction and previous-token heads appear well before BOS-attractor heads, with the two groups showing separate timing and emergence shapes. The separation holds across models but the exact shape of BOS emergence varies by architecture. A reader would care because the results indicate that capability circuits and attention-sink behavior arise as distinct developmental stages rather than a single transition. The work also shows that the final induction circuit can be identified from early checkpoints alone.

Core claim

In the three 1B-class models, layers 0 and 1 produce zero BOS-classified heads at every revision. The whole-model BOS-attractor fraction follows a model-specific shape: a gradual ramp in Pythia 1B, a sharp phase transition in OLMo 1B, and a gradual ramp in OLMoE 1B-7B. In the DCLM-trained models the induction transition precedes the BOS-attractor transition by a factor of 10-20 in training tokens, and the two transitions have different shapes. The capability-specific screen reaches the final induction circuit within 0.3-2 percent of total training tokens, and per-head participation ratio is already elevated when a head first crosses its capability-selectivity threshold.

What carries the argument

The participation-ratio spectral signal combined with the all-head capability-specific selectivity screen that classifies induction, previous-token, and BOS-attractor heads at each checkpoint.
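As a rough illustration of the spectral quantity named here, the sketch below computes a participation ratio and the per-checkpoint signal max(PR − 1, 0) that appears in the figure captions. It assumes the standard definition PR = (Σ s_i)² / Σ s_i² over a non-negative spectrum; which matrix the paper takes the spectrum from, and whether any centering is applied, is not stated on this page, so the function names and the example spectra are hypothetical.

```python
def participation_ratio(spectrum):
    """Standard participation ratio of a non-negative spectrum.

    Assumption: PR = (sum s_i)^2 / sum s_i^2. The paper's exact
    per-head construction is not given on this page.
    """
    total = sum(spectrum)
    sum_sq = sum(s * s for s in spectrum)
    if sum_sq == 0:
        return 0.0
    return total * total / sum_sq


def spectral_signal(spectrum):
    """Per-checkpoint signal max(PR - 1, 0), as written in the captions."""
    return max(participation_ratio(spectrum) - 1.0, 0.0)


# Hypothetical spectra: one dominant direction vs. four equal directions.
rank_one = [4.0, 0.0, 0.0, 0.0]
flat = [1.0, 1.0, 1.0, 1.0]

print(participation_ratio(rank_one), spectral_signal(rank_one))
print(participation_ratio(flat), spectral_signal(flat))
```

Under this definition a spectrum concentrated in one direction gives PR = 1 and a zero signal, while a spectrum spread evenly over k directions gives PR = k, which is why subtracting 1 yields a signal that rises only when more than one direction carries weight.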

If this is right

  • Layers 0 and 1 never produce BOS-classified heads in any of the three models at any revision.
  • BOS-attractor emergence takes a model-specific shape: a gradual ramp in Pythia 1B and OLMoE 1B-7B, a sharp phase transition in OLMo 1B.
  • The capability-specific screen identifies the final induction circuit after only a small fraction of total training tokens.
  • Elevated participation ratio appears at or before the point where an induction head crosses its selectivity threshold.
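The last implication can be phrased as a simple ordering check over checkpoints. The sketch below is one way to express it, assuming that "elevated" means a strictly positive max(PR − 1, 0) signal and that the selectivity threshold is 50 (borrowed from the 50× screen in the Figure 3 caption). The token counts and per-head curves are hypothetical values for illustration, not data from the paper.

```python
def first_revision_where(tokens, values, predicate):
    """Return the token count of the first checkpoint whose value
    satisfies `predicate`, or None if no checkpoint does."""
    for t, v in zip(tokens, values):
        if predicate(v):
            return t
    return None


# Hypothetical per-head trajectories over four checkpoints.
tokens = [1e8, 3e8, 1e9, 3e9]
pr_signal = [0.0, 0.4, 1.2, 1.5]       # max(PR - 1, 0) at each checkpoint
selectivity = [1.0, 5.0, 80.0, 150.0]  # selectivity relative to a baseline

# Assumption: "elevated" PR means the signal is strictly positive.
pr_onset = first_revision_where(tokens, pr_signal, lambda v: v > 0.0)
# Assumption: formation is the first checkpoint with selectivity >= 50.
formation = first_revision_where(tokens, selectivity, lambda v: v >= 50.0)

pr_at_or_before = (
    pr_onset is not None and formation is not None and pr_onset <= formation
)
print(pr_onset, formation, pr_at_or_before)
```

With log-spaced checkpoints the comparison is only as fine as the checkpoint grid, so "at or before" here means "at the same or an earlier saved revision".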

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Circuit monitoring methods could be applied during training to observe capability emergence without completing a full run.
  • Basic induction capability does not require the presence of attention-sink heads.
  • The same staged timing pattern may appear in models larger than 1B or trained on different data mixtures.

Load-bearing premise

The participation-ratio spectral signal together with the capability-specific selectivity screen correctly labels head types at early revisions without false positives and without needing the final model state.

What would settle it

A DCLM training run on a 1B-class model in which the token count at which induction heads reach high selectivity is within a factor of two of the token count at which BOS-attractor heads emerge would falsify the claimed separation of transitions.
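The falsification criterion reduces to a ratio of two token counts. The sketch below encodes it, assuming each transition is summarized by a single token count; the function names, verdict strings, and example numbers are hypothetical.

```python
def within_factor(a, b, factor):
    """True if a and b differ by at most `factor` in either direction."""
    return max(a, b) / min(a, b) <= factor


def assess(induction_tokens, bos_tokens):
    """Compare the BOS-attractor and induction transition token counts."""
    ratio = bos_tokens / induction_tokens
    if within_factor(induction_tokens, bos_tokens, 2.0):
        return ratio, "falsifies separation"
    if 10.0 <= ratio <= 20.0:
        return ratio, "matches reported 10-20x offset"
    return ratio, "separated, outside reported range"


# Hypothetical transition points, in training tokens.
print(assess(1.0e9, 1.5e10))
print(assess(4.0e9, 6.0e9))
```

The middle ground between 2x and 10x is kept as its own verdict, since such a result would neither meet the falsification criterion nor reproduce the reported offset.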

Figures

Figures reproduced from arXiv: 2606.02378 by Yongzhong Xu.

Figure 1: Capability circuits form early in pretraining and the per-head spectral signal precedes formation. Reproduced from the companion methodology paper [Xu, 2026a]. (A) Per-head spectral signal max(PR_t − 1, 0) across training for three identified heads in Pythia 1B (an induction head L4·H4, a previous-token head L3·H5, a BOS-attractor head L4·H1). X markers indicate the formation event, the first revision at wh…

Figure 2: PR rises at or before capability-selectivity formation, across three 1B-class configurations. Per-head spectral signal at each checkpoint, max(PR_t − 1, 0) (the integrand of the PR-integral ranking statistic, plotted per-checkpoint to preserve temporal structure; same y-axis as Figure 1A). For each of three 1B-class configurations, namely Pythia 1B (Pile, dense), OLMo 1B (DCLM, dense), and OLMoE 1B-7B (DCLM, MoE)…

Figure 3: Induction-head turnover during pretraining. Induction selectivity (relative to a uniform-position baseline) of every head that passes the 50× screen at any revision, for each model. Heads that persist from formation to the final checkpoint (blue), join late (green), or decay out after passing early (red); the red category is invisible to a figure that plots only the final-circuit heads. Markers denote memb…
Original abstract

We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture families (dense transformer, mixture-of-experts) and two pretraining corpora (The Pile, DCLM): Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. At each of 10 log-spaced revisions per model -- 30 mechanistic-interpretability runs in total -- we apply a participation-ratio (PR) spectral signal and an all-head capability-specific selectivity screen to track induction, previous-token, and BOS-attractor heads as they emerge. Five findings. (F1) Layers 0 and 1 produce zero BOS-classified heads at every revision in every model: the L0/L1 zero-BOS floor is an architectural property, not a learned outcome. (F2) The whole-model BOS-attractor fraction follows three distinct emergence shapes -- a gradual ramp in Pythia 1B, a sharp phase transition in OLMo 1B (7% to 70% between adjacent checkpoints), and a gradual ramp in OLMoE 1B-7B. (F3) In DCLM models, induction-circuit formation precedes BOS-attractor formation by 10-20x in tokens; capability-circuit formation and attention-sink formation are two transitions, not one. (F4) The capability-specific screen converges to the final induction circuit within 0.3-2% of total training tokens -- circuit identification does not require the final model. (F5) For every final-checkpoint induction head sampled across all three models, per-head PR is elevated at or before the first revision at which that head crosses its capability-selectivity threshold. The results refine the induction-phase-transition framing: in 1B-class models trained on DCLM, the induction transition and the attention-sink transition are separated by an order of magnitude in tokens and have qualitatively different shapes.
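To make the sampling design concrete, the sketch below generates 10 log-spaced checkpoints for each of three models, giving the 30 runs the abstract mentions. The token range (1e8 to 1e12) is a hypothetical placeholder, since the actual revision token counts are not listed on this page.

```python
import math


def log_spaced(lo, hi, n):
    """Return n values from lo to hi with a constant ratio between neighbors."""
    ratio = (hi / lo) ** (1.0 / (n - 1))
    return [lo * ratio ** i for i in range(n)]


models = ["Pythia 1B", "OLMo 1B-0724-hf", "OLMoE 1B-7B-0924"]

# Hypothetical token range; the paper's actual revisions may differ per model.
schedule = {name: log_spaced(1e8, 1e12, 10) for name in models}

total_runs = sum(len(points) for points in schedule.values())
print(total_runs)
for name, points in schedule.items():
    print(name, [f"{p:.2e}" for p in points])
```

Log spacing places most checkpoints early in training, which matters here because the reported induction transition occurs within the first few percent of total tokens.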

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript tracks attention-head circuit formation in three 1B-class models (Pythia 1B, OLMo 1B, OLMoE 1B-7B) across 10 log-spaced revisions each, using a participation-ratio (PR) spectral signal and an all-head capability-specific selectivity screen. It reports five findings: L0/L1 produce zero BOS heads in all models; BOS-attractor fractions show model-specific emergence shapes; in DCLM models induction precedes BOS by 10-20x in tokens with distinct shapes; the selectivity screen converges to the final induction circuit within 0.3-2% of training tokens; and final induction heads show elevated PR at or before their selectivity threshold crossing. The central claim is that capability and attention-sink transitions are separate, not coincident.

Significance. If the PR/selectivity classification is robust at early checkpoints, the work supplies concrete empirical timelines separating induction-circuit and BOS-attractor emergence across architectures and corpora. The 30-run design and observation that circuit identification does not require the final model are strengths that could inform future mechanistic studies of pretraining dynamics.

major comments (3)
  1. [F3] F3: The headline separation claim (induction precedes BOS by 10-20x tokens with qualitatively different shapes) is measured by the first revision at which heads cross the capability-selectivity threshold and exhibit elevated PR. The manuscript provides no external validation of this joint screen against functional behavior at those early revisions, which is load-bearing for the temporal-offset result.
  2. [F5] F5 and abstract: The internal consistency check that per-head PR is elevated at or before the selectivity threshold crossing does not rule out false positives or delayed detections caused by training noise or a moving capability baseline at early checkpoints.
  3. [Abstract] Abstract and methods description: The capability-selectivity threshold is listed as a free parameter; no sensitivity analysis or justification for its value is reported, directly affecting which heads are classified as induction versus BOS at each revision.
minor comments (2)
  1. The abstract states that 30 runs were performed but reports no error bars, exclusion criteria, or how variability across runs is aggregated in the emergence curves.
  2. Clarify whether the PR spectral signal is computed on the full attention matrix or per-head and whether any preprocessing (e.g., centering) is applied before the participation-ratio calculation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on validation and parameter choices. We respond point-by-point below and indicate where revisions will be made.

  1. Referee: The headline separation claim (induction precedes BOS by 10-20x tokens with qualitatively different shapes) is measured by the first revision at which heads cross the capability-selectivity threshold and exhibit elevated PR. The manuscript provides no external validation of this joint screen against functional behavior at those early revisions, which is load-bearing for the temporal-offset result.

    Authors: We agree that functional validation (e.g., via patching or ablation) at early checkpoints would strengthen the temporal-offset claim. The current evidence rests on the uniform application of the selectivity screen and the independent PR spectral measure across 30 runs. The offset appears consistently in both DCLM models. In revision we will add an explicit limitations paragraph acknowledging the lack of early functional tests and their implications for interpreting the separation. revision: partial

  2. Referee: F5 and abstract: The internal consistency check that per-head PR is elevated at or before the selectivity threshold crossing does not rule out false positives or delayed detections caused by training noise or a moving capability baseline at early checkpoints.

    Authors: F5 is presented strictly as an internal consistency observation, not a full defense against noise or baseline drift. We accept that checkpoint variability could affect early detections. We will revise the wording in F5 and the abstract to clarify its correlational nature and briefly note potential confounds from training dynamics. revision: yes

  3. Referee: Abstract and methods description: The capability-selectivity threshold is listed as a free parameter; no sensitivity analysis or justification for its value is reported, directly affecting which heads are classified as induction versus BOS at each revision.

    Authors: The threshold was chosen to recover the induction heads known to exist in the final checkpoint of each model. We will add a methods paragraph justifying this choice and include a sensitivity analysis (varying the threshold by ±10%) as a new appendix figure in the revised manuscript. revision: yes
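The sensitivity analysis proposed in this response could take roughly the following form: re-run the first-crossing detection at the base threshold scaled by −10%, 0%, and +10%, then compare the resulting formation points. The base value of 50 is taken from the 50× screen in the Figure 3 caption; the selectivity curve and token counts are hypothetical.

```python
def first_crossing(tokens, values, threshold):
    """Token count of the first checkpoint with value >= threshold, else None."""
    for t, v in zip(tokens, values):
        if v >= threshold:
            return t
    return None


def threshold_sweep(tokens, values, base, rel=0.10):
    """Formation point at base*(1-rel), base, and base*(1+rel)."""
    results = []
    for scale in (1.0 - rel, 1.0, 1.0 + rel):
        threshold = base * scale
        results.append((threshold, first_crossing(tokens, values, threshold)))
    return results


# Hypothetical selectivity trajectory for one head.
tokens = [1e8, 3e8, 1e9, 3e9, 1e10]
selectivity = [2.0, 20.0, 48.0, 120.0, 300.0]

sweep = threshold_sweep(tokens, selectivity, base=50.0)
for threshold, formed_at in sweep:
    print(f"threshold {threshold:.1f}: first crossing at {formed_at}")
```

In this made-up trajectory the head sits just under the base threshold at one checkpoint, so lowering the threshold by 10% moves its formation point one checkpoint earlier. Heads like that are the cases a sensitivity figure would need to surface.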

Circularity Check

0 steps flagged

No circularity: purely observational tracking of circuit emergence via fixed screens


The paper applies a participation-ratio spectral signal and capability-specific selectivity screen to log-spaced model checkpoints across three architectures. All five findings (F1-F5) are direct measurements of when heads cross thresholds or exhibit PR elevation at specific revisions. No equations derive a target quantity from fitted parameters, no predictions reduce to inputs by construction, and no self-citations are invoked as load-bearing uniqueness theorems. The separation claim (induction precedes BOS by 10-20x tokens) is an empirical timing observation, not a self-referential derivation. The method is applied uniformly without the final model state being required for classification (F4), confirming the analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the PR spectral signal and selectivity screen validly identify circuit types at intermediate checkpoints. The one free parameter is the capability-selectivity threshold, whose value the abstract does not state; no invented entities are described.

free parameters (1)
  • capability-selectivity threshold
    Threshold used to mark when a head crosses into the capability-specific category; value not stated but required for emergence timing claims.
axioms (1)
  • domain assumption: The participation-ratio spectral signal reliably detects attention-head types across training revisions.
    Invoked when applying the signal to classify heads at each of the 10 revisions per model.

Reference graph

Works this paper leans on

19 extracted references · 11 canonical work pages · 9 internal anchors

  1. In-context Learning and Induction Heads. Transformer Circuits Thread.
  2. Xiao, Guangxuan; Tian, Yuandong; Chen, Beidi; Han, Song; Lewis, Mike. Efficient Streaming Language Models with Attention Sinks. International Conference on Learning Representations. arXiv:2309.17453.
  3. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. International Conference on Machine Learning (ICML).
  4. OLMo: Accelerating the Science of Language Models. arXiv preprint arXiv:2402.00838.
  5. OLMoE: Open Mixture-of-Experts Language Models. arXiv preprint arXiv:2409.02060.
  6. DataComp-LM: In Search of the Next Generation of Training Sets for Language Models. arXiv preprint arXiv:2406.11794.
  7. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027.
  8. Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. International Conference on Learning Representations (ICLR).
  9. Towards Automated Circuit Discovery for Mechanistic Interpretability. Advances in Neural Information Processing Systems (NeurIPS).
  10. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
  11. Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. arXiv preprint arXiv:2307.09458, 2023.
  12. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts. arXiv preprint arXiv:2604.11962.
  13. Spectral Edge Dynamics of Training Trajectories: Signal-Noise Geometry Across Scales. arXiv preprint arXiv:2603.15678.
  14. Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories. arXiv preprint arXiv:2604.25143.
  15. Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers. arXiv preprint arXiv:2605.24059.
  16. Xu, Yongzhong. Pattern Selectivity vs Task-Causal Structure: Composed-Task Circuits across Three 1
  17. Attention Is All You Need. Advances in Neural Information Processing Systems (NeurIPS).
  18. Progress Measures for Grokking via Mechanistic Interpretability. International Conference on Learning Representations (ICLR).
  19. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. arXiv preprint arXiv:2403.19647.