Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Aaron Mueller; Antoine Bosselut; Deniz Bayazit

arXiv: 2509.05291 · v2 · submitted 2025-09-05 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-18 18:38 UTC · model grok-4.3

keywords: crosscoders · LLM pretraining · feature emergence · linguistic representations · sparse alignment · causal importance · model checkpoints · mechanistic interpretability

The pith

Sparse crosscoders trained on LLM checkpoint triplets can align features to track when specific linguistic concepts emerge, persist, or drop out during pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to reveal the precise training stages at which LLMs acquire particular linguistic abilities, something standard benchmarks do not show. Researchers train sparse crosscoders on triplets of open checkpoints that mark big shifts in performance and representations. These crosscoders produce aligned features across time, and a new metric called Relative Indirect Effects quantifies when each feature starts to causally drive task behavior. The results demonstrate that features can be observed emerging, staying active, or disappearing as pretraining continues. Because the method requires no architecture-specific changes, it offers a practical route to finer-grained study of representation learning.
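The checkpoint-selection step can be made concrete with a small sketch. The paper's Figure 2 describes selection from task performance and middle-layer activation cosine similarity; everything else here is an assumption for illustration. The function names, the thresholds `acc_jump` and `sim_floor`, and the toy data are hypothetical, and the authors' actual selection rule may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def flag_shifts(checkpoints, acc_jump=0.05, sim_floor=0.9):
    """Return names of checkpoints that differ sharply from their predecessor.

    checkpoints: list of (name, task_accuracy, mean_activation_vector),
    ordered by number of training tokens. Both thresholds are hypothetical.
    """
    flagged = []
    for prev, cur in zip(checkpoints, checkpoints[1:]):
        jump = abs(cur[1] - prev[1])          # change in task accuracy
        sim = cosine(prev[2], cur[2])         # similarity of activations
        if jump >= acc_jump or sim <= sim_floor:
            flagged.append(cur[0])
    return flagged

# Toy data, not taken from the paper.
toy = [
    ("step_a", 0.50, [1.0, 0.0, 0.0]),
    ("step_b", 0.52, [0.9, 0.1, 0.0]),
    ("step_c", 0.71, [0.2, 0.9, 0.1]),
    ("step_d", 0.72, [0.2, 0.9, 0.2]),
]
selected = flag_shifts(toy)
print(selected)
```

On the toy data only `step_c` is flagged: its accuracy jumps relative to `step_b`, while the other consecutive pairs stay close in both accuracy and activation direction.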

Core claim

By training sparse crosscoders between triplets of model checkpoints drawn from LLM pretraining runs, the same underlying linguistic features can be recovered and aligned across training time; the Relative Indirect Effects metric then identifies the stages at which each aligned feature becomes causally relevant to downstream task performance, thereby exposing patterns of feature emergence, maintenance, and discontinuation.

What carries the argument

Sparse crosscoders trained on checkpoint triplets, which discover and align features across pretraining stages so that the same linguistic concept can be followed as it changes in importance.
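A rough picture of what such a crosscoder computes, assuming the standard formulation with one encoder and one decoder per checkpoint feeding a single shared latent vector. The sparsity term is the decoder-norm-weighted L1 penalty from the original crosscoder formulation, used here as an assumption; the paper may use a different sparsity mechanism. All names, shapes, and numbers are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec):
    """Encode activations from all checkpoints into one shared latent vector,
    then decode that vector separately for each checkpoint."""
    pre = list(b_enc)
    for c, x in enumerate(acts):
        pre = [p + q for p, q in zip(pre, matvec(W_enc[c], x))]
    latent = [max(0.0, p) for p in pre]  # ReLU keeps activations non-negative
    recon = [
        [r + b for r, b in zip(matvec(W_dec[c], latent), b_dec[c])]
        for c in range(len(acts))
    ]
    return latent, recon

def crosscoder_loss(acts, latent, recon, W_dec, l1_coef=0.01):
    """Squared reconstruction error plus decoder-norm-weighted L1 sparsity."""
    mse = sum((x - y) ** 2 for a, r in zip(acts, recon) for x, y in zip(a, r))
    penalty = 0.0
    for j, f_j in enumerate(latent):
        # Norm of latent j's decoder column, one value per checkpoint.
        norms = [math.sqrt(sum(row[j] ** 2 for row in W_dec[c]))
                 for c in range(len(W_dec))]
        penalty += f_j * sum(norms)
    return mse + l1_coef * penalty

# Toy example: 3 checkpoints, 2-dim activations, 2 latents (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
acts = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_enc = [identity, identity, identity]
b_enc = [0.0, 0.0]
W_dec = [
    [[0.5, 0.0], [0.0, 0.0]],  # checkpoint 0 decodes only latent 0
    [[0.5, 0.0], [0.0, 0.0]],  # checkpoint 1 decodes only latent 0
    [[0.0, 0.0], [0.0, 1.0]],  # checkpoint 2 decodes only latent 1
]
b_dec = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

latent, recon = crosscoder_forward(acts, W_enc, b_enc, W_dec, b_dec)
loss = crosscoder_loss(acts, latent, recon, W_dec)
```

In this toy setup latent 0 has nonzero decoder weights only for the first two checkpoints and latent 1 only for the third, which is the kind of pattern one would read as a feature fading or emerging over training.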

If this is right

Individual linguistic features can be monitored for the exact pretraining step at which they first appear and begin to affect behavior.
Once a feature emerges it can be observed to remain active or to lose relevance at later stages.
Relative Indirect Effects supplies a numeric trace of when each feature's causal contribution rises or falls.
The same alignment procedure applies without modification to models of different sizes and architectures.
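The numeric trace can be illustrated with a sketch of how a relative score might be formed. This page does not give the RelIE formula, so the code assumes one plausible reading: each checkpoint's share of a feature's total absolute indirect effect, which is consistent with a vector such as [1.00, 0.00, 0.00] quoted near Figure 4. The function names, the 0.2 threshold, and the checkpoint labels are hypothetical.

```python
def relie(ie_per_checkpoint):
    """Assumed definition: normalise absolute indirect effects to sum to 1."""
    mags = [abs(v) for v in ie_per_checkpoint]
    total = sum(mags)
    if total == 0:
        return [0.0 for _ in mags]  # feature has no effect at any checkpoint
    return [m / total for m in mags]

def label(shares, names, threshold=0.2):
    """Describe which checkpoints carry a non-trivial share (hypothetical rule)."""
    active = [n for n, s in zip(names, shares) if s > threshold]
    if not active:
        return "inactive"
    if len(active) == 1:
        return active[0] + "-specific"
    return "-".join(active) + " shared"

names = ["early", "mid", "late"]

# Toy indirect effects for two features, not taken from the paper.
fading = relie([2.0, 0.0, 0.0])
emerging = relie([0.0, -1.0, 3.0])

print(fading, label(fading, names))
print(emerging, label(emerging, names))
```

Taking absolute values treats a feature that hurts the task as just as "important" as one that helps; whether the paper does the same is not stated on this page.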

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The temporal tracking could be applied to non-linguistic capabilities such as reasoning or safety-related behaviors to locate their acquisition points.
Knowing when a feature stabilizes might let practitioners intervene at the right checkpoint to strengthen or suppress that feature in later training.
The method supplies a concrete way to test whether representation changes are gradual or abrupt for any given concept.

Load-bearing premise

The feature alignments produced by crosscoders on different checkpoints correspond to identical underlying linguistic concepts rather than to correlations created by the crosscoder training procedure itself.

What would settle it

If crosscoders trained on the same checkpoint triplets produce alignments that fail to predict changes in causal importance on held-out tasks or produce inconsistent alignments when the triplets are reordered, the claim that the method tracks genuine feature evolution would not hold.
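One way the reordering check could be run, sketched under assumptions: train two crosscoders on the same triplet with the checkpoints presented in different orders, then ask whether each feature direction from the first run has a near-identical counterpart in the second. The matching rule (best cosine similarity), the function names, and the toy vectors are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_match_similarity(run_a, run_b):
    """For each feature direction in run_a, the highest cosine similarity
    to any feature direction in run_b."""
    return [max(cosine(a, b) for b in run_b) for a in run_a]

# Toy decoder directions for one checkpoint (hypothetical values).
run_a = [[1.0, 0.0], [0.0, 1.0]]
run_permuted = [[0.0, 1.0], [1.0, 0.0]]   # same features, different order
run_rotated = [[1.0, 1.0], [1.0, -1.0]]   # different features

stable = best_match_similarity(run_a, run_permuted)
unstable = best_match_similarity(run_a, run_rotated)
```

A run that only permutes feature indices scores 1.0 for every feature, while a run that learns a rotated dictionary scores about 0.71 here. Scores well below 1.0 across many features would suggest the alignment depends on the training setup rather than on the model.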

Figures

Figures reproduced from arXiv: 2509.05291 by Aaron Mueller, Antoine Bosselut, Deniz Bayazit.

**Figure 1.** Capturing the evolution of features. Given a task, our pipeline selects the relevant checkpoints during pretraining, learns a joint feature space with crosscoders, and then analyzes feature differences across checkpoints. This allows to analyze how models learn, maintain, or unlearn particular representations over time.

**Figure 2.** Checkpoint selection with task performance (top) and middle-layer activation cosine similarity (bottom). The Pythia-1B (left) and OLMo-1B (middle) performance and activations patterns are calculated over BLiMP whereas BLOOM-1B (right) uses MultiBLiMP. All columns have the same x-axis (number of training tokens). We highlight checkpoints identified as critical in purple vertical lines. While some activation…

**Figure 3.** IE evolution of Top-5 & Bottom-5 IE Features for Pythia checkpoints 1B & 286B. IEs are calculated using BLiMP subject–verb agreement tasks. Missing annotation means the feature was not interpretable. We observe that low-level or uninterpretable features fade over time, while high-level grammar detectors emerge and strengthen by 286B.

**Figure 4.** RelIE of Top-100 IE features on BLiMP for Pythia-1B checkpoints 1B, 4B, 286B. Distinct clusters at each corner show attribution of checkpoint-specific features. 4B–286B share more abstract features, while 1B–286B overlaps are sparse and lower-level features.

Table fragment captured alongside Figure 4 (truncated). Columns: RelIE, FeatID, Activated Languages, Interpreted Function, Top Activating Sequence. First row: 6B specific, [1.00, 0.00, 0.00], 3672, arb,eng,fra,hin,por,spa, Detects elli…

**Figure 5.** Top-10 MultiBLiMP Number Agreement Task Feature Overlap per Checkpoint for Languages with 3-way comparisons. Cross-lingual feature overlap increases by 341B, especially for Latin-script languages, reflecting shared syntactic patterns, while greater morphological complexity in Arabic and Hindi limits alignment.

**Figure 6.** IE evolution of Top-5 & Bottom-5 Features for OLMo-1B checkpoints 4B & 3T. IEs are calculated using BLiMP subject–verb agreement tasks. “–” means the feature was not interpretable. In some cases, a feature can belong to multiple categories at once. Some low-level features, such as newline detectors, persist across training, whereas the usage of simpler lexical detectors fade as more abstract grammatical pa…

**Figure 7.** RelIE of Top-10 and 100 IE features on BLiMP for Pythia-1B checkpoints {1B, 4B, 286B}. Distinct clusters near each corner indicate checkpoint-specific features. In the Top-10 row, the 4B-286B pair and features shared across all three checkpoints dominate, whereas the 1B-286B pair have relatively fewer shared features. Additionally, the checkpoint-specific regions for 4B and 286B are noticeably denser, sugg…

**Figure 8.** Top-10 IE feature overlap in BLOOM-1B across languages and subtasks with the 3-way comparison. Across BLOOM-1B checkpoints, feature overlap is generally higher among script-sharing languages (e.g., English, French, Spanish, Portuguese) and increases across pretraining, while languages like Arabic and Hindi, which are less frequent in the training data and use different scripts, show relatively less cross-l…

**Figure 9.** Top-10 IE feature overlap (per checkpoint) in BLOOM-1B across languages and subtasks with the 2-way comparison. The 2-way comparison shows a similar pattern to the 3-way analysis, with high feature overlap among related languages (e.g., English, French, Spanish, Portuguese). Notably, in the 55B vs. 341B comparison, Arabic, despite not being Indo-European, shares more features than Hindi, suggesting better cr…

Original abstract

Large language models (LLMs) learn non-trivial abstractions during pretraining, such as detecting irregular plural noun subjects. However, because traditional evaluation methods (e.g., benchmarking) fail to reveal how models acquire these concepts and capabilities, it is not well understood when and how these specific linguistic abilities emerge. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Crosscoders on checkpoint triplets plus the RelIE metric give a workable pipeline for tracking feature changes over pretraining, but joint training makes it hard to trust that the alignments reflect stable linguistic concepts rather than optimization artifacts.

The main point is that the authors train sparse crosscoders on triplets of open-sourced checkpoints and add a Relative Indirect Effects metric to mark when individual features become causally relevant to task performance. They report detecting emergence, maintenance, and discontinuation of linguistic features across pretraining stages. The temporal use of crosscoders and the RelIE score are the clearest additions to prior sparse autoencoder work. The setup is presented as architecture-agnostic and scalable, which is a practical framing for people who want to monitor representation shifts without full retraining.

Credit is due for focusing on real checkpoint data where performance jumps occur and for trying to move beyond static post-training analysis. The approach could help interpretability researchers look at concept-level dynamics instead of just final benchmarks.

The soft spots sit mainly in the validation gap. No quantitative results, feature examples, or ablation checks appear in the available description, so it is unclear whether the crosscoder alignments recover the same underlying linguistic roles or simply reflect correlations induced by the joint reconstruction objective. The shared dictionary across checkpoints can create forced alignments that do not correspond to stable semantics, and any such misalignment would carry straight into the emergence and discontinuation claims. RelIE still depends on those alignments being faithful first.

This work is aimed at mechanistic interpretability groups that already use sparse autoencoders and want to extend them to training timelines. Readers interested in tools for fine-grained training monitoring might extract useful ideas, but they will need the full experiments and controls to judge reliability. It deserves a serious referee because the question is worth asking and the method is straightforward to test further, even if the current evidence is preliminary.

Referee Report

2 major / 1 minor

Summary. The paper claims that sparse crosscoders trained on activations from LLM pretraining checkpoint triplets can be used to discover and align linguistic features across training stages. By introducing a Relative Indirect Effects (RelIE) metric, the authors track when individual features become causally important for task performance and report that the method detects feature emergence, maintenance, and discontinuation during pretraining.

Significance. If the feature alignments prove faithful rather than training artifacts, the approach would offer a scalable and architecture-agnostic way to analyze concept-level dynamics in LLM pretraining, addressing a gap left by standard benchmarking. The introduction of RelIE as a causal tracing tool is a constructive addition if its dependence on prior alignment quality is validated.

major comments (2)

Abstract: the claim that crosscoders detect feature emergence, maintenance, and discontinuation is presented without any quantitative results, ablation studies, or direct validation that recovered features correspond to stable, human-interpretable linguistic concepts rather than correlations induced by the joint training objective.
Methods (crosscoder training on checkpoint triplets): the central claim requires that jointly optimized crosscoders recover the same underlying linguistic features across checkpoints. The shared dictionary and reconstruction loss can induce spurious alignments; because RelIE is computed downstream of these alignments, any misalignment directly undermines the reported emergence/maintenance/discontinuation conclusions. A concrete test (e.g., intervention or human-interpretability check on aligned features) is needed to rule out this artifact.

minor comments (1)

The definition and computation of RelIE should be given explicitly (ideally as an equation) rather than described only at a high level, to allow readers to assess its dependence on the crosscoder dictionary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying the evidence already present in the manuscript while agreeing to strengthen validation where appropriate.

Referee: Abstract: the claim that crosscoders detect feature emergence, maintenance, and discontinuation is presented without any quantitative results, ablation studies, or direct validation that recovered features correspond to stable, human-interpretable linguistic concepts rather than correlations induced by the joint training objective.

Authors: The abstract is a high-level summary; the quantitative results demonstrating emergence, maintenance, and discontinuation via RelIE scores across checkpoint triplets and tasks are reported in Sections 3 and 4, along with ablation studies on the crosscoder objective in the appendix. Examples of aligned features mapping to interpretable linguistic concepts (e.g., subject-verb agreement) are provided in Appendix B. We will revise the abstract to reference these quantitative findings more explicitly and expand the main text with additional interpretability checks. revision: yes
Referee: Methods (crosscoder training on checkpoint triplets): the central claim requires that jointly optimized crosscoders recover the same underlying linguistic features across checkpoints. The shared dictionary and reconstruction loss can induce spurious alignments; because RelIE is computed downstream of these alignments, any misalignment directly undermines the reported emergence/maintenance/discontinuation conclusions. A concrete test (e.g., intervention or human-interpretability check on aligned features) is needed to rule out this artifact.

Authors: We agree that demonstrating the faithfulness of alignments is essential. The manuscript uses checkpoint-specific encoders/decoders with a shared dictionary and balanced sparsity to encourage recovery of corresponding features, and RelIE provides downstream causal validation. To further rule out artifacts, we will add intervention experiments that ablate aligned features and measure consistent task-performance effects across stages, plus a comparison of joint vs. independent SAE alignments, in the revised version. revision: yes
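
The proposed intervention experiment can be sketched as a zero-ablation indirect effect: set one feature to zero, recompute the task metric, and take the difference. This is a toy under stated assumptions, not the authors' estimator. The linear `readout` per checkpoint, the metric (standing in for a logit difference between the grammatical and ungrammatical continuation), and all values are hypothetical.

```python
def logit_diff(features, readout):
    """Toy task metric: a linear readout of feature activations."""
    return sum(f * w for f, w in zip(features, readout))

def indirect_effect(features, readout, j, patch_value=0.0):
    """Change in the metric when feature j is replaced by patch_value."""
    patched = list(features)
    patched[j] = patch_value
    return logit_diff(patched, readout) - logit_diff(features, readout)

# Shared feature activations and one hypothetical readout per checkpoint.
features = [2.0, 1.0, 0.5]
readouts = {
    "early": [0.0, -0.5, 0.0],
    "late": [1.5, -0.5, 0.0],
}

effects = {
    name: [indirect_effect(features, w, j) for j in range(len(features))]
    for name, w in readouts.items()
}
print(effects)
```

Feature 0 has no effect under the early readout and a large one under the late readout, the pattern one would describe as emergence. Feature 1 has the same effect at both stages, and feature 2 has none at either.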

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents an empirical measurement pipeline that trains sparse crosscoders on checkpoint triplets from LLM pretraining and applies a novel RelIE metric to trace causal importance of aligned features. No equations, predictions, or first-principles derivations are claimed that reduce the reported emergence/maintenance/discontinuation findings to quantities defined by the crosscoder's fitted parameters or by self-referential construction. The central claims rest on observable outputs of the joint training and evaluation process, which remain externally falsifiable via replication of the described architecture-agnostic training and benchmarking steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that crosscoder features correspond to stable linguistic concepts.

pith-pipeline@v0.9.0 · 5710 in / 1076 out tokens · 35539 ms · 2026-05-18T18:38:48.181211+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We train crosscoders between open-sourced checkpoint triplets... introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "sparse crosscoders to discover and align features across model checkpoints... detect feature emergence, maintenance, and discontinuation"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
cs.LG 2026-05 conditional novelty 7.0

fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...

Features have life history. And we should care
q-bio.NC 2026-05 unverdicted novelty 5.0

Language model features form an early stable carrier scaffold of about 50 sparse features that is load-bearing, predictable from onset firing, and recruits most later features.

