Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
Pith reviewed 2026-05-18 18:38 UTC · model grok-4.3
The pith
Sparse crosscoders trained on LLM checkpoint triplets can align features to track when specific linguistic concepts emerge, persist, or drop out during pretraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Training sparse crosscoders on triplets of checkpoints drawn from LLM pretraining runs recovers the same underlying linguistic features and aligns them across training time. The Relative Indirect Effects (RelIE) metric then identifies the stages at which each aligned feature becomes causally relevant to downstream task performance, exposing patterns of feature emergence, maintenance, and discontinuation.
What carries the argument
Sparse crosscoders trained on checkpoint triplets, which discover and align features across pretraining stages so that the same linguistic concept can be followed as it changes in importance.
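To make the mechanism concrete, here is a minimal NumPy sketch of a crosscoder forward pass over a checkpoint triplet. It is an illustration under stated assumptions, not the paper's implementation: the sizes, the ReLU activation, the decoder-norm-weighted L1 penalty, and all names (`W_enc`, `W_dec`, `crosscoder_loss`) are hypothetical choices made for this sketch. The point it shows is that one shared sparse code is computed from all three checkpoints' activations and then decoded separately per checkpoint, which is what lets a single latent index be followed across training time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not taken from the paper).
N_CKPT, D_MODEL, N_FEAT = 3, 16, 64

# One encoder and one decoder per checkpoint; the latent code is shared.
W_enc = rng.normal(0, 0.1, size=(N_CKPT, D_MODEL, N_FEAT))
W_dec = rng.normal(0, 0.1, size=(N_CKPT, N_FEAT, D_MODEL))
b_enc = np.zeros(N_FEAT)
b_dec = np.zeros((N_CKPT, D_MODEL))


def encode(acts):
    """acts: (batch, N_CKPT, D_MODEL) -> shared code (batch, N_FEAT)."""
    # Sum the per-checkpoint encoder contributions into one pre-activation.
    pre = np.einsum("bcd,cdf->bf", acts, W_enc) + b_enc
    return np.maximum(pre, 0.0)  # ReLU is an assumed activation


def decode(code):
    """code: (batch, N_FEAT) -> reconstructions (batch, N_CKPT, D_MODEL)."""
    return np.einsum("bf,cfd->bcd", code, W_dec) + b_dec


def crosscoder_loss(acts, l1_coeff=1e-3):
    code = encode(acts)
    recon = decode(code)
    # Reconstruction error summed over checkpoints and dimensions.
    mse = np.mean(np.sum((recon - acts) ** 2, axis=(1, 2)))
    # Assumed sparsity term: activations weighted by summed decoder norms.
    dec_norms = np.linalg.norm(W_dec, axis=2).sum(axis=0)  # (N_FEAT,)
    sparsity = np.mean(np.sum(code * dec_norms, axis=1))
    return mse + l1_coeff * sparsity, code, recon


# Random stand-in for residual-stream activations from three checkpoints.
acts = rng.normal(size=(8, N_CKPT, D_MODEL))
loss, code, recon = crosscoder_loss(acts)
```

Because the code is shared while the decoders are not, the same latent can carry a large decoder vector at one checkpoint and a near-zero one at another. This is also where the alignment concern raised in this review originates: the sharing is imposed by construction.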
If this is right
- Individual linguistic features can be monitored for the pretraining stage at which they first appear and begin to affect behavior, to the resolution of the checkpoints sampled.
- Once a feature emerges it can be observed to remain active or to lose relevance at later stages.
- Relative Indirect Effects supplies a numeric trace of when each feature's causal contribution rises or falls.
- The same alignment procedure applies without modification to models of different sizes and architectures.
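The page does not give the formula for Relative Indirect Effects, so the following is only a sketch of what such a numeric trace could look like. It assumes, as a labeled guess, that each feature's per-checkpoint indirect effect is normalized by the sum of its absolute effects across the checkpoints in the triplet. The function names and the thresholds `hi` and `lo` are hypothetical.

```python
import numpy as np


def relative_indirect_effects(ie, eps=1e-12):
    """ie: (n_features, n_checkpoints) -> share of |IE| held by each checkpoint.

    ASSUMPTION: normalizing by the per-feature sum of absolute effects is a
    guess at the metric's form, not the paper's stated definition.
    """
    mag = np.abs(np.asarray(ie, dtype=float))
    total = mag.sum(axis=1, keepdims=True)
    return mag / np.maximum(total, eps)


def label_trajectory(share, hi=0.5, lo=0.1):
    """Coarse label from the first and last checkpoint shares (hypothetical thresholds)."""
    if share.sum() == 0:
        return "inactive"
    first, last = share[0], share[-1]
    if first < lo and last >= hi:
        return "emergence"
    if first >= hi and last < lo:
        return "discontinuation"
    return "maintenance"


# Toy indirect effects for three features across an (early, mid, late) triplet.
ie = np.array([
    [0.0, 0.2, 0.8],   # effect concentrated late
    [0.3, 0.3, 0.3],   # effect spread evenly
    [0.9, 0.1, 0.0],   # effect concentrated early
])
shares = relative_indirect_effects(ie)
labels = [label_trajectory(row) for row in shares]
```

Under this normalization the shares of an active feature sum to one across the triplet, so a feature that matters equally at all three checkpoints sits near one third at each.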
Where Pith is reading between the lines
- The temporal tracking could be applied to non-linguistic capabilities such as reasoning or safety-related behaviors to locate their acquisition points.
- Knowing when a feature stabilizes might let practitioners intervene at the right checkpoint to strengthen or suppress that feature in later training.
- The method supplies a concrete way to test whether representation changes are gradual or abrupt for any given concept.
Load-bearing premise
The feature alignments produced by crosscoders on different checkpoints correspond to identical underlying linguistic concepts rather than to correlations created by the crosscoder training procedure itself.
What would settle it
If crosscoders trained on the same checkpoint triplets produce alignments that fail to predict changes in causal importance on held-out tasks or produce inconsistent alignments when the triplets are reordered, the claim that the method tracks genuine feature evolution would not hold.
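The reordering test can be sketched as a comparison of decoder directions from two crosscoder runs on the same checkpoint. The example below is a hedged illustration: it assumes decoder matrices of shape `(n_features, d_model)` are available from each run, and it uses mutual best cosine match with a hypothetical threshold `min_cos` as the consistency criterion. Neither the criterion nor the function names come from the paper.

```python
import numpy as np


def unit_rows(m):
    """Normalize each row to unit length, guarding against zero rows."""
    return m / np.maximum(np.linalg.norm(m, axis=1, keepdims=True), 1e-12)


def alignment_consistency(dec_a, dec_b, min_cos=0.9):
    """Fraction of run-A features with a mutual, high-cosine match in run B."""
    sim = unit_rows(dec_a) @ unit_rows(dec_b).T
    best_b = sim.argmax(axis=1)  # best B feature for each A feature
    best_a = sim.argmax(axis=0)  # best A feature for each B feature
    idx = np.arange(sim.shape[0])
    mutual = best_a[best_b] == idx
    strong = sim[idx, best_b] >= min_cos
    return float(np.mean(mutual & strong))


rng = np.random.default_rng(0)
dec_a = rng.normal(size=(32, 64))

# Run B recovers the same directions in a different order, plus small noise.
perm = rng.permutation(32)
dec_b = dec_a[perm] + 0.01 * rng.normal(size=(32, 64))

# Run C is unrelated, standing in for alignments that do not replicate.
dec_c = rng.normal(size=(32, 64))

same = alignment_consistency(dec_a, dec_b)
unrelated = alignment_consistency(dec_a, dec_c)
```

A score near one for reordered triplets would support the claim that the alignment reflects the checkpoints rather than the training setup; a score near the unrelated baseline would count against it.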
Original abstract
Large language models (LLMs) learn non-trivial abstractions during pretraining, such as detecting irregular plural noun subjects. However, because traditional evaluation methods (e.g., benchmarking) fail to reveal how models acquire these concepts and capabilities, it is not well understood when and how these specific linguistic abilities emerge. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that sparse crosscoders trained on activations from LLM pretraining checkpoint triplets can be used to discover and align linguistic features across training stages. By introducing a Relative Indirect Effects (RelIE) metric, the authors track when individual features become causally important for task performance and report that the method detects feature emergence, maintenance, and discontinuation during pretraining.
Significance. If the feature alignments prove faithful rather than training artifacts, the approach would offer a scalable and architecture-agnostic way to analyze concept-level dynamics in LLM pretraining, addressing a gap left by standard benchmarking. The introduction of RelIE as a causal tracing tool is a constructive addition if its dependence on prior alignment quality is validated.
major comments (2)
- Abstract: the claim that crosscoders detect feature emergence, maintenance, and discontinuation is presented without any quantitative results, ablation studies, or direct validation that recovered features correspond to stable, human-interpretable linguistic concepts rather than correlations induced by the joint training objective.
- Methods (crosscoder training on checkpoint triplets): the central claim requires that jointly optimized crosscoders recover the same underlying linguistic features across checkpoints. The shared dictionary and reconstruction loss can induce spurious alignments; because RelIE is computed downstream of these alignments, any misalignment directly undermines the reported emergence/maintenance/discontinuation conclusions. A concrete test (e.g., intervention or human-interpretability check on aligned features) is needed to rule out this artifact.
minor comments (1)
- The definition and computation of RelIE should be given explicitly (ideally as an equation) rather than described only at a high level, to allow readers to assess its dependence on the crosscoder dictionary.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying the evidence already present in the manuscript while agreeing to strengthen validation where appropriate.
- Referee: Abstract: the claim that crosscoders detect feature emergence, maintenance, and discontinuation is presented without any quantitative results, ablation studies, or direct validation that recovered features correspond to stable, human-interpretable linguistic concepts rather than correlations induced by the joint training objective.
  Authors: The abstract is a high-level summary; the quantitative results demonstrating emergence, maintenance, and discontinuation via RelIE scores across checkpoint triplets and tasks are reported in Sections 3 and 4, along with ablation studies on the crosscoder objective in the appendix. Examples of aligned features mapping to interpretable linguistic concepts (e.g., subject-verb agreement) are provided in Appendix B. We will revise the abstract to reference these quantitative findings more explicitly and expand the main text with additional interpretability checks.
  Revision: yes
- Referee: Methods (crosscoder training on checkpoint triplets): the central claim requires that jointly optimized crosscoders recover the same underlying linguistic features across checkpoints. The shared dictionary and reconstruction loss can induce spurious alignments; because RelIE is computed downstream of these alignments, any misalignment directly undermines the reported emergence/maintenance/discontinuation conclusions. A concrete test (e.g., intervention or human-interpretability check on aligned features) is needed to rule out this artifact.
  Authors: We agree that demonstrating the faithfulness of alignments is essential. The manuscript uses checkpoint-specific encoders/decoders with a shared dictionary and balanced sparsity to encourage recovery of corresponding features, and RelIE provides downstream causal validation. To further rule out artifacts, we will add intervention experiments that ablate aligned features and measure consistent task-performance effects across stages, plus a comparison of joint vs. independent SAE alignments, in the revised version.
  Revision: yes
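The ablation experiment promised in the rebuttal can be pictured with a toy example. This sketch makes strong simplifying assumptions: it zeroes one shared latent, decodes through per-checkpoint decoders, and scores the result with a linear readout that stands in for a task metric. A real study would patch the reconstruction back into the model and measure task performance; the arrays, shapes, and the name `ablation_effect` are all hypothetical.

```python
import numpy as np


def ablation_effect(code, dec, readout, feature):
    """Mean drop in a scalar metric per checkpoint when one latent is zeroed.

    code:    (batch, n_feat)            shared latent activations
    dec:     (n_ckpt, n_feat, d_model)  per-checkpoint decoders
    readout: (n_ckpt, d_model)          linear stand-in for a task metric
    """
    ablated = code.copy()
    ablated[:, feature] = 0.0
    clean = np.einsum("bf,cfd,cd->bc", code, dec, readout)
    patched = np.einsum("bf,cfd,cd->bc", ablated, dec, readout)
    return (clean - patched).mean(axis=0)


rng = np.random.default_rng(0)
code = np.abs(rng.normal(size=(16, 4)))
dec = rng.normal(size=(3, 4, 8))
readout = rng.normal(size=(3, 8))

# Feature 0 has no decoder direction at the first checkpoint, so ablating it
# there should change nothing; any effect can only appear later.
dec[0, 0, :] = 0.0

effects = ablation_effect(code, dec, readout, feature=0)

# With a linear readout the effect has a closed form, used here as a check.
expected = code[:, 0].mean() * (dec[:, 0, :] * readout).sum(axis=1)
```

Per-checkpoint effects of this kind are the raw quantities that a relative measure such as RelIE would then compare across stages, which is why the referee's concern about upstream alignment carries through to the final conclusions.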
Circularity Check
No significant circularity in derivation chain
The paper presents an empirical measurement pipeline that trains sparse crosscoders on checkpoint triplets from LLM pretraining and applies a novel RelIE metric to trace causal importance of aligned features. No equations, predictions, or first-principles derivations are claimed that reduce the reported emergence/maintenance/discontinuation findings to quantities defined by the crosscoder's fitted parameters or by self-referential construction. The central claims rest on observable outputs of the joint training and evaluation process, which remain externally falsifiable via replication of the described architecture-agnostic training and benchmarking steps.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We train crosscoders between open-sourced checkpoint triplets... introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: sparse crosscoders to discover and align features across model checkpoints... detect feature emergence, maintenance, and discontinuation
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
  fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
- Features have life history. And we should care
  Language model features form an early stable carrier scaffold of about 50 sparse features that is load-bearing, predictable from onset firing, and recruits most later features.