
arxiv: 2509.05291 · v2 · submitted 2025-09-05 · 💻 cs.CL · cs.AI · cs.LG

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Pith reviewed 2026-05-18 18:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords crosscoders · LLM pretraining · feature emergence · linguistic representations · sparse alignment · causal importance · model checkpoints · mechanistic interpretability

The pith

Sparse crosscoders trained on LLM checkpoint triplets can align features to track when specific linguistic concepts emerge, persist, or drop out during pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to reveal the training stages at which LLMs acquire particular linguistic abilities, something standard benchmarks do not show. The authors train sparse crosscoders on triplets of open checkpoints chosen because they mark significant shifts in task performance and in internal representations. These crosscoders produce features that are aligned across training time, and a new metric called Relative Indirect Effects (RelIE) quantifies the stage at which each feature starts to causally drive task behavior. The results show that features can be observed emerging, staying active, or disappearing as pretraining continues. Because the method requires no architecture-specific changes, it offers a practical route to finer-grained study of representation learning.
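The checkpoint-selection step can be illustrated with a small sketch. The Figure 2 caption mentions middle-layer activation cosine similarity as one of the two selection signals; the code below assumes activations have already been extracted for the same inputs at every checkpoint, and it flags the consecutive checkpoint pairs whose similarity is lowest. The function names and the "lowest consecutive similarity" rule are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def mean_cosine(a, b, eps=1e-12):
    """Mean per-example cosine similarity between two (N, D) activation arrays."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / np.maximum(den, eps)))

def similarity_matrix(acts):
    """acts: (T, N, D) activations for T checkpoints on the same N inputs."""
    T = len(acts)
    sim = np.eye(T)
    for i in range(T):
        for j in range(i + 1, T):
            sim[i, j] = sim[j, i] = mean_cosine(acts[i], acts[j])
    return sim

def largest_shifts(sim, k=2):
    """Indices t where checkpoints t and t+1 are least similar (hypothetical rule)."""
    consecutive = np.array([sim[t, t + 1] for t in range(len(sim) - 1)])
    return sorted(np.argsort(consecutive)[:k].tolist())

# Toy data: random activations stand in for real middle-layer activations.
rng = np.random.default_rng(0)
toy_acts = rng.normal(size=(5, 32, 16))
print(largest_shifts(similarity_matrix(toy_acts), k=2))
```

In practice this signal would be read alongside the task-performance curve, since the caption indicates both are used to identify critical checkpoints.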

Core claim

By training sparse crosscoders between triplets of model checkpoints drawn from LLM pretraining runs, the same underlying linguistic features can be recovered and aligned across training time; the Relative Indirect Effects metric then identifies the stages at which each aligned feature becomes causally relevant to downstream task performance, thereby exposing patterns of feature emergence, maintenance, and discontinuation.

What carries the argument

Sparse crosscoders trained on checkpoint triplets, which discover and align features across pretraining stages so that the same linguistic concept can be followed as it changes in importance.
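A minimal sketch of what such a crosscoder computes, assuming the commonly used formulation: each checkpoint has its own encoder and decoder weights, the encoder outputs are summed into one shared sparse latent, and the loss combines per-checkpoint reconstruction error with a sparsity penalty weighted by decoder norms. The shapes, names, ReLU activation, and L1 penalty are assumptions for illustration; the paper may use a different sparsity mechanism.

```python
import numpy as np

def crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec):
    """acts: (C, B, D) activations from C checkpoints for the same B tokens.
    W_enc: (C, D, F), b_enc: (F,), W_dec: (C, F, D), b_dec: (C, D)."""
    # Sum encoder contributions from all checkpoints into one shared latent.
    pre = np.einsum("cbd,cdf->bf", acts, W_enc) + b_enc
    latents = np.maximum(pre, 0.0)
    # Decode the shared latent separately for each checkpoint.
    recon = np.einsum("bf,cfd->cbd", latents, W_dec) + b_dec[:, None, :]
    return latents, recon

def crosscoder_loss(acts, latents, recon, W_dec, l1_coeff=1e-3):
    """Reconstruction error summed over checkpoints plus a norm-weighted L1 term."""
    mse = ((recon - acts) ** 2).sum(axis=-1).sum(axis=0).mean()
    dec_norms = np.linalg.norm(W_dec, axis=-1).sum(axis=0)  # shape (F,)
    sparsity = (latents * dec_norms).sum(axis=-1).mean()
    return mse + l1_coeff * sparsity

# Toy shapes: 3 checkpoints, 4 tokens, width 8, 16 dictionary features.
rng = np.random.default_rng(0)
C, B, D, F = 3, 4, 8, 16
acts = rng.normal(size=(C, B, D))
W_enc = rng.normal(size=(C, D, F)) * 0.1
W_dec = rng.normal(size=(C, F, D)) * 0.1
latents, recon = crosscoder_forward(acts, W_enc, np.zeros(F), W_dec, np.zeros((C, D)))
print(latents.shape, recon.shape, crosscoder_loss(acts, latents, recon, W_dec))
```

Under this formulation, a feature whose decoder norm is large for one checkpoint and near zero for the others is a candidate checkpoint-specific feature, which is what makes the shared latent usable for tracking features over time.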

If this is right

  • Individual linguistic features can be monitored for the pretraining stage, at the resolution of the selected checkpoints, at which they first appear and begin to affect behavior.
  • Once a feature emerges it can be observed to remain active or to lose relevance at later stages.
  • Relative Indirect Effects supplies a numeric trace of when each feature's causal contribution rises or falls.
  • The same alignment procedure applies without modification to models of different sizes and architectures.
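The numeric trace mentioned in the list above can be sketched as follows. This page does not reproduce the paper's definition of RelIE, so the code assumes one plausible reading: each feature's per-checkpoint indirect effect (IE) magnitudes are normalized to sum to one, giving a vector such as [1.00, 0.00, 0.00] for a feature that matters at only one checkpoint. The normalization, the thresholds, and the function names are all assumptions.

```python
import numpy as np

def relative_indirect_effects(ie, eps=1e-12):
    """ie: (num_features, num_checkpoints) indirect effects.
    Returns each feature's share of total |IE| per checkpoint (assumed definition)."""
    mag = np.abs(np.asarray(ie, dtype=float))
    total = mag.sum(axis=1, keepdims=True)
    return mag / np.maximum(total, eps)

def describe_feature(relie_row, names, specific_at=0.9, present_at=0.1):
    """Label a feature as checkpoint-specific or shared (thresholds are hypothetical)."""
    top = int(np.argmax(relie_row))
    if relie_row[top] >= specific_at:
        return f"{names[top]} specific"
    present = [n for n, r in zip(names, relie_row) if r >= present_at]
    return "-".join(present) + " shared" if present else "inactive"

names = ["1B", "4B", "286B"]
ie = np.array([[0.8, 0.0, 0.0],    # only matters early
               [0.0, 0.3, 0.5],    # emerges later and persists
               [0.2, 0.2, 0.2]])   # maintained throughout
for row in relative_indirect_effects(ie):
    print(np.round(row, 2), describe_feature(row, names))
```

Read this way, emergence, maintenance, and discontinuation correspond to where the mass of the vector sits: late checkpoints only, spread across checkpoints, or early checkpoints only.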

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The temporal tracking could be applied to non-linguistic capabilities such as reasoning or safety-related behaviors to locate their acquisition points.
  • Knowing when a feature stabilizes might let practitioners intervene at the right checkpoint to strengthen or suppress that feature in later training.
  • The method supplies a concrete way to test whether representation changes are gradual or abrupt for any given concept.

Load-bearing premise

The feature alignments produced by crosscoders on different checkpoints correspond to identical underlying linguistic concepts rather than to correlations created by the crosscoder training procedure itself.

What would settle it

If crosscoders trained on the same checkpoint triplets produce alignments that fail to predict changes in causal importance on held-out tasks or produce inconsistent alignments when the triplets are reordered, the claim that the method tracks genuine feature evolution would not hold.
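The reordering check could be sketched like this, assuming access to the decoder weights of two crosscoder runs trained on the same triplet with the checkpoint slots permuted. Features are matched by cosine similarity of their concatenated per-checkpoint decoder directions, and the score is the fraction of features in the first run that have a close match in the second. The matching rule, the threshold, and all names are assumptions, not a procedure taken from the paper.

```python
import numpy as np

def unit_rows(x, eps=1e-12):
    """Normalize each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def alignment_consistency(dec_a, dec_b, order_b, threshold=0.9):
    """dec_a, dec_b: (C, F, D) decoder weights from two runs.
    order_b[i] is the run-A checkpoint index that occupies slot i in run B."""
    # Put run B's checkpoint slots back into run A's order.
    dec_b_sorted = np.empty_like(dec_b)
    dec_b_sorted[list(order_b)] = dec_b
    # One row per feature: its decoder directions for all checkpoints, concatenated.
    a = unit_rows(dec_a.transpose(1, 0, 2).reshape(dec_a.shape[1], -1))
    b = unit_rows(dec_b_sorted.transpose(1, 0, 2).reshape(dec_b.shape[1], -1))
    best = (a @ b.T).max(axis=1)  # best match in run B for each run-A feature
    return float((best >= threshold).mean()), best

# Toy check: run B is run A with checkpoint slots and feature indices shuffled.
rng = np.random.default_rng(0)
dec_a = rng.normal(size=(3, 20, 16))
order_b = [2, 0, 1]
dec_b = dec_a[order_b][:, rng.permutation(20), :]
score, _ = alignment_consistency(dec_a, dec_b, order_b)
print(score)
```

A score near one would indicate that the recovered dictionary does not depend on slot order; a low score would suggest the alignments are partly artifacts of training.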

Figures

Figures reproduced from arXiv: 2509.05291 by Aaron Mueller, Antoine Bosselut, Deniz Bayazit.

Figure 1: Capturing the evolution of features. Given a task, our pipeline selects the relevant checkpoints during pretraining, learns a joint feature space with crosscoders, and then analyzes feature differences across checkpoints. This allows us to analyze how models learn, maintain, or unlearn particular representations over time. In particular, we lack a clear understanding of when and how specific linguistic abilit…
Figure 2: Checkpoint selection with task performance (top) and middle-layer activation cosine similarity (bottom). The Pythia-1B (left) and OLMo-1B (middle) performance and activation patterns are calculated over BLiMP, whereas BLOOM-1B (right) uses MultiBLiMP. All columns share the same x-axis (number of training tokens). We highlight checkpoints identified as critical with purple vertical lines. While some activation…
Figure 3: IE evolution of Top-5 & Bottom-5 IE Features for Pythia checkpoints 1B & 286B. IEs are calculated using BLiMP subject–verb agreement tasks. Missing annotation means the feature was not interpretable. We observe that low-level or uninterpretable features fade over time, while high-level grammar detectors emerge and strengthen by 286B. 6.2 Crosscoder Learnability We train crosscoders on pairs and triplets…
Figure 4: RelIE of Top-100 IE features on BLiMP for Pythia-1B checkpoints 1B, 4B, 286B. Distinct clusters at each corner show attribution of checkpoint-specific features. 4B–286B share more abstract features, while 1B–286B overlaps are sparse and lower-level features. RelIE FeatID Activated Languages Interpreted Function Top Activating Sequence 6B specific [1.00, 0.00, 0.00] 3672 arb,eng,fra,hin,por,spa Detects elli…
Figure 5: Top-10 MultiBLiMP Number Agreement Task Feature Overlap per Checkpoint for Languages with 3-way comparisons. Cross-lingual feature overlap increases by 341B, especially for Latin-script languages, reflecting shared syntactic patterns, while greater morphological complexity in Arabic and Hindi limits alignment. prominence of that noun in our IE dataset. In the final stages we find crosslingual features for …
Figure 6: IE evolution of Top-5 & Bottom-5 Features for OLMo-1B checkpoints 4B & 3T. IEs are calculated using BLiMP subject–verb agreement tasks. “–” means the feature was not interpretable. In some cases, a feature can belong to multiple categories at once. Some low-level features, such as newline detectors, persist across training, whereas the usage of simpler lexical detectors fades as more abstract grammatical pa…
Figure 7: RelIE of Top-10 and Top-100 IE features on BLiMP for Pythia-1B checkpoints {1B, 4B, 286B}. Distinct clusters near each corner indicate checkpoint-specific features. In the Top-10 row, the 4B–286B pair and features shared across all three checkpoints dominate, whereas the 1B–286B pair has relatively fewer shared features. Additionally, the checkpoint-specific regions for 4B and 286B are noticeably denser, sugg…
Figure 8: Top-10 IE feature overlap in BLOOM-1B across languages and subtasks with the 3-way comparison. Across BLOOM-1B checkpoints, feature overlap is generally higher among script-sharing languages (e.g., English, French, Spanish, Portuguese) and increases across pretraining, while languages like Arabic and Hindi, which are less frequent in the training data and use different scripts, show relatively less cross-l…
Figure 9: Top-10 IE feature overlap (per checkpoint) in BLOOM-1B across languages and subtasks with the 2-way comparison. The 2-way comparison shows a similar pattern to the 3-way analysis, with high feature overlap among related languages (e.g., English, French, Spanish, Portuguese). Notably, in the 55B vs. 341B comparison, Arabic, despite not being Indo-European, shares more features than Hindi, suggesting better cr…
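The cross-lingual overlap statistic behind Figures 5, 8, and 9 can be illustrated with a short sketch. It assumes that overlap means the fraction of top-k features, ranked by signed IE, that two languages have in common at a given checkpoint; the exact statistic in the paper may differ, and the names and toy scores below are hypothetical.

```python
import numpy as np

def top_k_features(ie_scores, k=10):
    """Indices of the k features with the highest indirect effect."""
    ie_scores = np.asarray(ie_scores, dtype=float)
    return set(np.argsort(-ie_scores)[:k].tolist())

def overlap_matrix(ie_by_language, k=10):
    """Pairwise fraction of shared top-k features between languages."""
    langs = sorted(ie_by_language)
    tops = {lang: top_k_features(ie_by_language[lang], k) for lang in langs}
    mat = np.array([[len(tops[a] & tops[b]) / k for b in langs] for a in langs])
    return langs, mat

# Toy IE scores over a 50-feature dictionary for three languages.
rng = np.random.default_rng(0)
shared = rng.normal(size=50)
toy = {
    "eng": shared + 0.1 * rng.normal(size=50),
    "fra": shared + 0.1 * rng.normal(size=50),
    "hin": rng.normal(size=50),
}
langs, mat = overlap_matrix(toy, k=10)
print(langs)
print(mat)
```

Computing one such matrix per checkpoint gives the kind of per-checkpoint comparison the captions describe.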
Original abstract

Large language models (LLMs) learn non-trivial abstractions during pretraining, such as detecting irregular plural noun subjects. However, because traditional evaluation methods (e.g., benchmarking) fail to reveal how models acquire these concepts and capabilities, it is not well understood when and how these specific linguistic abilities emerge. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that sparse crosscoders trained on activations from LLM pretraining checkpoint triplets can be used to discover and align linguistic features across training stages. By introducing a Relative Indirect Effects (RelIE) metric, the authors track when individual features become causally important for task performance and report that the method detects feature emergence, maintenance, and discontinuation during pretraining.

Significance. If the feature alignments prove faithful rather than training artifacts, the approach would offer a scalable and architecture-agnostic way to analyze concept-level dynamics in LLM pretraining, addressing a gap left by standard benchmarking. The introduction of RelIE as a causal tracing tool is a constructive addition if its dependence on prior alignment quality is validated.

major comments (2)
  1. Abstract: the claim that crosscoders detect feature emergence, maintenance, and discontinuation is presented without any quantitative results, ablation studies, or direct validation that recovered features correspond to stable, human-interpretable linguistic concepts rather than correlations induced by the joint training objective.
  2. Methods (crosscoder training on checkpoint triplets): the central claim requires that jointly optimized crosscoders recover the same underlying linguistic features across checkpoints. The shared dictionary and reconstruction loss can induce spurious alignments; because RelIE is computed downstream of these alignments, any misalignment directly undermines the reported emergence/maintenance/discontinuation conclusions. A concrete test (e.g., intervention or human-interpretability check on aligned features) is needed to rule out this artifact.
minor comments (1)
  1. The definition and computation of RelIE should be given explicitly (ideally as an equation) rather than described only at a high level, to allow readers to assess its dependence on the crosscoder dictionary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying the evidence already present in the manuscript while agreeing to strengthen validation where appropriate.

Point-by-point responses
  1. Referee: Abstract: the claim that crosscoders detect feature emergence, maintenance, and discontinuation is presented without any quantitative results, ablation studies, or direct validation that recovered features correspond to stable, human-interpretable linguistic concepts rather than correlations induced by the joint training objective.

    Authors: The abstract is a high-level summary; the quantitative results demonstrating emergence, maintenance, and discontinuation via RelIE scores across checkpoint triplets and tasks are reported in Sections 3 and 4, along with ablation studies on the crosscoder objective in the appendix. Examples of aligned features mapping to interpretable linguistic concepts (e.g., subject-verb agreement) are provided in Appendix B. We will revise the abstract to reference these quantitative findings more explicitly and expand the main text with additional interpretability checks. revision: yes

  2. Referee: Methods (crosscoder training on checkpoint triplets): the central claim requires that jointly optimized crosscoders recover the same underlying linguistic features across checkpoints. The shared dictionary and reconstruction loss can induce spurious alignments; because RelIE is computed downstream of these alignments, any misalignment directly undermines the reported emergence/maintenance/discontinuation conclusions. A concrete test (e.g., intervention or human-interpretability check on aligned features) is needed to rule out this artifact.

    Authors: We agree that demonstrating the faithfulness of alignments is essential. The manuscript uses checkpoint-specific encoders/decoders with a shared dictionary and balanced sparsity to encourage recovery of corresponding features, and RelIE provides downstream causal validation. To further rule out artifacts, we will add intervention experiments that ablate aligned features and measure consistent task-performance effects across stages, plus a comparison of joint vs. independent SAE alignments, in the revised version. revision: yes
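The proposed intervention experiment can be sketched on a toy model. The code assumes an exact zero-ablation: one shared feature is set to zero, the activations are reconstructed through a single checkpoint's decoder, and the change in a task metric is recorded. Here `metric_fn` is a hypothetical stand-in for running the rest of the model and scoring a minimal pair (for example, a logit difference); the paper may instead use an approximate attribution method.

```python
import numpy as np

def ablation_effect(latents, W_dec_c, metric_fn, feature):
    """Mean drop in the metric when one feature is zero-ablated.
    latents: (B, F) shared feature activations; W_dec_c: (F, D) one checkpoint's decoder."""
    clean = latents @ W_dec_c
    ablated = latents.copy()
    ablated[:, feature] = 0.0
    patched = ablated @ W_dec_c
    return float(np.mean(metric_fn(clean) - metric_fn(patched)))

# Toy setup: a linear readout plays the role of the downstream model.
rng = np.random.default_rng(0)
B, F, D, C = 6, 10, 8, 3
latents = np.abs(rng.normal(size=(B, F)))
W_dec = rng.normal(size=(C, F, D))
readout = rng.normal(size=D)

def metric_fn(h):
    return h @ readout

# Effect of ablating feature 4 at each of the three checkpoints.
effects = [ablation_effect(latents, W_dec[c], metric_fn, feature=4) for c in range(C)]
print(np.round(effects, 3))
```

If an aligned feature really represents the same concept at every stage, ablating it should shift the metric in a consistent direction wherever its RelIE share is non-negligible.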

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents an empirical measurement pipeline that trains sparse crosscoders on checkpoint triplets from LLM pretraining and applies a novel RelIE metric to trace causal importance of aligned features. No equations, predictions, or first-principles derivations are claimed that reduce the reported emergence/maintenance/discontinuation findings to quantities defined by the crosscoder's fitted parameters or by self-referential construction. The central claims rest on observable outputs of the joint training and evaluation process, which remain externally falsifiable via replication of the described architecture-agnostic training and benchmarking steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that crosscoder features correspond to stable linguistic concepts.

pith-pipeline@v0.9.0 · 5710 in / 1076 out tokens · 35539 ms · 2026-05-18T18:38:48.181211+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

    cs.LG 2026-05 conditional novelty 7.0

    fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...

  2. Features have life history. And we should care

    q-bio.NC 2026-05 unverdicted novelty 5.0

    Language model features form an early stable carrier scaffold of about 50 sparse features that is load-bearing, predictable from onset firing, and recruits most later features.
