Understanding the Staged Dynamics of Transformers in Learning Latent Structure

Alona Fyshe; Farzane Aminmansour; Rohan Saha

arxiv: 2511.19328 · v2 · submitted 2025-11-24 · 💻 cs.LG


Pith reviewed 2026-05-17 05:54 UTC · model grok-4.3


keywords: transformers · latent structure · staged learning · causal interventions · plasticity windows · composition · decomposition · training dynamics


The pith

Transformers acquire different components of latent structure in discrete stages during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how transformer models pick up separate elements of hidden structure from context across several controlled tasks. It breaks each task into clear parts and measures when the model masters each part. The results indicate that these parts are learned one after another rather than all at once. The model puts basic pieces together reliably but has difficulty breaking complex cases down to recover the basic pieces. Interventions on the model also show that certain layers are especially changeable at particular moments in training.

Core claim

When each task is factorized into interpretable components, the model is seen to learn the different latent structure components in discrete stages. The model composes fundamental transitions robustly but struggles to decompose complex examples to discover the atomic transitions. Causal interventions identify layer-specific plasticity windows during which freezing a layer substantially delays or prevents stage completion.

What carries the argument

Factorization of tasks into separate interpretable components, tracked across training steps and tested with targeted layer freezing to expose plasticity windows.
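As a rough illustration of what tracking factored components can look like, the sketch below estimates the three quantities named in the figure captions, P[A] (in-support), P[B | A] (correct half given in-support), and P[C | A ∩ B] (exact match given correct half), from per-example outcomes at a single checkpoint. The record field names and the toy data are assumptions made for this example, not the paper's data format.

```python
from typing import Dict, List


def factored_accuracies(records: List[Dict[str, bool]]) -> Dict[str, float]:
    """Estimate P[A], P[B | A] and P[C | A and B] from per-example outcomes.

    Each record holds three booleans (hypothetical field names):
      in_support   -> event A
      correct_half -> event B
      exact_match  -> event C
    """
    in_a = [r for r in records if r["in_support"]]
    in_ab = [r for r in in_a if r["correct_half"]]
    in_abc = [r for r in in_ab if r["exact_match"]]

    def ratio(numerator: int, denominator: int) -> float:
        # An empty conditioning set is reported as 0.0 rather than raising.
        return numerator / denominator if denominator else 0.0

    return {
        "P[A]": ratio(len(in_a), len(records)),
        "P[B|A]": ratio(len(in_ab), len(in_a)),
        "P[C|A,B]": ratio(len(in_abc), len(in_ab)),
    }


# Made-up validation outcomes at one checkpoint.
checkpoint = [
    {"in_support": True, "correct_half": True, "exact_match": True},
    {"in_support": True, "correct_half": True, "exact_match": False},
    {"in_support": True, "correct_half": False, "exact_match": False},
    {"in_support": False, "correct_half": False, "exact_match": False},
]
stats = factored_accuracies(checkpoint)
print(stats)
```

Computing these three numbers at every checkpoint gives one curve per component, which is the kind of view in which separate stages could become visible.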

If this is right

The model builds up its abilities through a predictable sequence of phases rather than acquiring all skills simultaneously.
Composition of simple transitions succeeds more consistently than decomposition of complex sequences.
Freezing specific layers during their identified plasticity window blocks or delays completion of the corresponding learning stage.
Capabilities appear in a staged order that can be observed by monitoring performance on each factored component separately.
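The freezing intervention can be pictured as a schedule that says which layers may update at each training step. The sketch below is a minimal, framework-free illustration. The layer index, the window bounds, and the helper name `trainable_layers` are hypothetical, and the sketch assumes a frozen layer resumes training once its window ends, which this summary does not specify.

```python
from typing import Dict, List, Tuple

# Hypothetical freeze windows: layer index -> (first_step, last_step), inclusive.
FreezeWindows = Dict[int, Tuple[int, int]]


def trainable_layers(step: int, num_layers: int, windows: FreezeWindows) -> List[int]:
    """Return the indices of layers that still receive updates at `step`."""
    frozen = {
        layer
        for layer, (start, end) in windows.items()
        if start <= step <= end
    }
    return [layer for layer in range(num_layers) if layer not in frozen]


# Freeze layer 1 of a 4-layer model during steps 100..199 (illustrative values).
windows = {1: (100, 199)}
schedule = {step: trainable_layers(step, 4, windows) for step in (50, 150, 250)}
print(schedule)
```

In a PyTorch training loop, the returned list could be used to set `requires_grad` on each layer's parameters before the optimizer step; that wiring is left out here to keep the example dependency-free.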

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the staged pattern holds beyond the tested tasks, training procedures could be adjusted to emphasize one component at a time for greater efficiency.
The asymmetry between composition and decomposition points to a possible need for targeted data or objectives to strengthen decomposition skills.
Layer-specific windows may appear in other model sizes or architectures and could inform selective updating or modular training approaches.

Load-bearing premise

The chosen tasks and their breakdown into distinct parts accurately reflect the mechanisms of latent structure learning that occur in broader settings.

What would settle it

Repeated training runs in which mastery of each component improves gradually and continuously rather than showing abrupt jumps at separate points.
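One simple way to tell an abrupt jump from gradual improvement is to ask how much of the total accuracy gain falls inside the steepest short window. The sketch below is an illustrative heuristic, not a procedure from the paper; the function name, window length, and both curves are assumptions.

```python
from typing import List


def steepest_window_share(curve: List[float], window: int) -> float:
    """Fraction of the total gain that occurs in the steepest `window` steps."""
    if window < 1 or window >= len(curve):
        raise ValueError("window must be between 1 and len(curve) - 1")
    total_gain = curve[-1] - curve[0]
    if total_gain <= 0:
        raise ValueError("curve must end higher than it starts")
    best_gain = max(
        curve[i + window] - curve[i] for i in range(len(curve) - window)
    )
    return best_gain / total_gain


# Made-up curves: a steady climb and a single jump between two plateaus.
gradual = [i / 10 for i in range(11)]
abrupt = [0.1] * 5 + [0.9] * 6

gradual_share = steepest_window_share(gradual, window=1)
abrupt_share = steepest_window_share(abrupt, window=1)
print(gradual_share, abrupt_share)
```

A share near 1 means almost all of the improvement happened in one short window, while a share near `window / (len(curve) - 1)` is what a perfectly even climb would give.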

Figures

Figures reproduced from arXiv: 2511.19328 by Alona Fyshe, Farzane Aminmansour, Rohan Saha.

**Figure 1.** Overview of Alchemy chemistry structure and experimental tasks. Middle: Example chemistry with vertices representing stone states connected with bidirectional edges (potions). A chemistry consists of eight stones, and application of potions changes the stone features. Left: Experiment to investigate the staged dynamics of latent structure learning: given a chemistry, all samples for a randomly selected pot…

**Figure 2.** (a) Validation accuracy for latent structure learning (withheld potion pair with hlsupport = hlquery = 1). Perfect performance demonstrates the model’s ability to learn and generalize latent structures, but via distinct plateaus and jumps, indicating the acquisition of the latent structure through various stages. (b) Validation accuracy after factorizing the task into different events / components. Blue: P…

**Figure 3.** Model performance for composing 1-hop transitions to solve multi-hop queries. There is no noticeable difference in model performance as the value of hlquery ∈ {2, 3, 4, 5} increases. The x-axis denotes epochs, and the y-axis denotes the validation accuracy. Performance is averaged over three seeds. The error bars denote the standard error of the mean. We only show the first 500 epochs as all hops rea…

**Figure 4.** Staged learning dynamics for the composition task (first 500 epochs) with hlquery ∈ {2, 3, 4, 5}. (a) Staged dynamics for 2-hop composition. (b) Staged dynamics for 3-hop composition. (c) Staged dynamics for 4-hop composition. (d) Staged dynamics for 5-hop composition. In each subfigure, “orange” plots P[A] (in-support), “purple” plots P[R | A] (within reachable stones given in-support), “colored” plots (f…

**Figure 5.** Decomposition results for various hlsupport ∈ {2, 3, 4, 5}. We see delayed convergence with increasing task complexity (hlsupport), where the 2-hop converges the earliest and the 5-hop converges the latest. The x-axis denotes epochs, and the y-axis denotes the validation accuracy. Performance is averaged over three seeds, and the error bars show the standard error of the mean.

**Figure 6.** Staged learning dynamics for the decomposition experiments with hlsupport ∈ {2, 3, 4, 5}. (a) Staged learning dynamics for 2-hop decomposition. (b) Staged learning dynamics for 3-hop decomposition. (c) Staged learning dynamics for 4-hop decomposition. (d) Staged learning dynamics for 5-hop decomposition. In each subfigure, “orange” plots Pk[A] (in-support), “purple” plots P[B | A] (correct half given in-support…

**Figure 7.** Prediction accuracy grouped by the reward value of the query. (a) Exact match accuracy given correct half. (b) Exact match accuracy given in-support. We observe that exact match accuracy improves faster for queries with -3 and +15 reward features than for those with -1 and +1 reward features. We also show how the model learns the correct set of Tr (within reward adjacent metric in …

**Figure 8.** Adjacency analysis for each individual reward feature value: i) Dark blue shows the learning curve of the adjacent candidate vertices based on only the reward-related latent structure of the 1-hop transitions (correct set of Tr); ii) orange demonstrates the geometrical adjacency (Nr) subset of neighboring vertices, which overlaps the reward adjacency subset, when the start state has reward values of {−3, +…

**Figure 9.** Examples of hyperparameter sensitivity for the 2-hop and 3-hop decomposition experiments. (a) 2-hop decomposition, where varying the weight decay value can delay convergence. (b) 3-hop decomposition experiment, where suboptimal weight decay values might cause the model to get stuck in a plateau for extended periods during training, delaying convergence.

**Figure 10.** Example of hyperparameter sensitivity on the 3-hop decomposition learning stages. (a) Stages with weight decay = 0.1. (b) Stages with weight decay = 0.01. (c) Stages with weight decay = 0.001. In each subfigure, ‘orange’ denotes in-support prediction P[A], ‘purple’ denotes correct half prediction P[B | A], and ‘green’ denotes exact match given correct half P[C | A ∩ B]. Suboptimal hyperparameters can caus…

**Figure 11.** We do not observe any noticeable lag between these events, because the cyan and pink curves rise in immediate succession, indicating a sharp phase transition towards predicting the correct half. This indicates that the multi-hop nature of the support prevents the model from successfully extracting the structure of the chemistry, shown by the chance performance of 12.5% in …
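The component probabilities that appear in the captions above relate to each other through the chain rule of probability. The identity below is standard. Reading its left-hand side as the overall accuracy on the task is an interpretation, not something the captions state outright. The second line notes that the quoted chance level of 12.5% is consistent with a uniform guess over the eight stones of a chemistry.

```latex
\[
P[A \cap B \cap C] \;=\; P[A]\,\cdot\,P[B \mid A]\,\cdot\,P[C \mid A \cap B]
\]
\[
\text{chance accuracy (uniform over 8 stones)} \;=\; \frac{1}{8} \;=\; 12.5\%
\]
```

Under this reading, a plateau in overall accuracy can be traced to whichever factor on the right-hand side has not yet risen.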

Abstract

Language modeling has shown us that transformers can discover latent structure from context, but the dynamics of how they acquire different components of that structure remain poorly understood, leading to assertions that models just remix training data. In this work, we use the Alchemy benchmark in a controlled setting (Wang et al., 2021) to investigate latent structure learning. We train a small decoder-only transformer on three task variants: 1) inferring missing transitions from partial contextual information, 2) composing simple rules to solve multi-transition sequences, and 3) decomposing complex multi-step examples to infer intermediate transitions. By factorizing each task into interpretable components, we show that the model learns the different latent structure components in discrete stages. We also observe an asymmetry: the model composes fundamental transitions robustly, but struggles to decompose complex examples to discover the atomic transitions. Finally, using causal interventions, we identify layer-specific plasticity windows during which freezing substantially delays or prevents stage completion. These findings provide insight into how a transformer model acquires latent structure, offering a detailed view of how capabilities evolve during training.
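To make the difference between a 1-hop transition and a multi-hop query concrete, here is a toy model of a chemistry as the eight corners of a cube. It is a deliberate simplification and an assumption of this example: each stone is a 3-bit tuple, every cube edge exists, and a "potion" is identified with the axis it flips. The real benchmark's encoding and potion semantics may differ.

```python
from itertools import product
from typing import Iterable, Tuple

State = Tuple[int, ...]  # toy encoding: one bit per stone feature


def apply_potion(state: State, axis: int) -> State:
    """1-hop transition: flip one feature, i.e. move along one cube edge."""
    flipped = list(state)
    flipped[axis] ^= 1
    return tuple(flipped)


def compose(state: State, potions: Iterable[int]) -> State:
    """Multi-hop query: apply a sequence of 1-hop transitions in order."""
    for axis in potions:
        state = apply_potion(state, axis)
    return state


all_states = list(product((0, 1), repeat=3))  # the eight stones
start = (0, 0, 0)
two_hop = compose(start, [0, 2])     # two different potions
round_trip = compose(start, [1, 1])  # same edge traversed twice
print(len(all_states), two_hop, round_trip)
```

Composition in this picture means predicting `compose(start, potions)` when only the individual 1-hop moves were seen; decomposition is the reverse problem of recovering the 1-hop moves from multi-hop examples.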

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports new observations on discrete stages, composition-decomposition asymmetry, and layer plasticity in transformer training on Alchemy tasks.


The punchline is that this paper gives some solid empirical observations on discrete stages of learning in a transformer on the Alchemy benchmark, including an asymmetry between composition and decomposition and layer-specific plasticity windows found through interventions.

What stands out as new is the tracking of when the model picks up different factorized components during training, plus the causal evidence from freezing layers at different points. It builds directly on the earlier Alchemy work by shifting focus to the dynamics of acquisition rather than just end performance. The setup with the three task variants helps isolate the effects. The paper does well in using a controlled environment to make the stages visible through sub-task performance and in applying interventions to pin down timing. That makes the claims more convincing than pure observational curves would be.

The main soft spot is the narrow scope. Everything is on a small decoder-only model and synthetic tasks, so it's not clear how much carries over to bigger models or messier data like natural language. The abstract was thin on methods details like model size and stats, but the full text apparently includes the necessary controls and procedures to support the measurements without obvious post-hoc issues.

This kind of work is for people studying training dynamics and interpretability in transformers. A reader looking for concrete examples of how capabilities build up in stages would find it useful, even if it doesn't change the big picture. I'd recommend sending it for peer review. The observations are worth having referees check the details on, and the questions about staged learning and layer roles deserve that level of scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates how a small decoder-only transformer acquires latent structure on the Alchemy benchmark. It trains models on three task variants—inferring missing transitions from partial context, composing simple rules for multi-transition sequences, and decomposing complex examples to recover atomic transitions—then factorizes each into interpretable components. The central claims are that learning proceeds in discrete stages, that composition of fundamental transitions is robust while decomposition is difficult, and that layer-specific plasticity windows exist, identified via causal freezing interventions that delay or prevent stage completion.

Significance. If the results hold, the work supplies a granular, intervention-based account of staged capability acquisition that moves beyond black-box training narratives. The controlled benchmark factorization and layer-freezing experiments provide concrete evidence for timing effects in structure learning, which is valuable for mechanistic interpretability and could inform curriculum design or targeted training interventions.

major comments (2)

[§4] The discreteness of stages rests on performance curves over sub-task components; without reported change-point detection, multiple random seeds, or statistical controls for variance, it remains possible that apparent stages reflect noise or post-hoc binning rather than robust transitions (§4, performance plots).
[Causal interventions section] The plasticity-window claim requires showing that freezing a layer at a given step specifically blocks the targeted component rather than globally slowing learning; the current intervention results would be strengthened by a control that matches total compute or capacity reduction across timings.
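The first major comment asks for change-point detection on the per-component curves. As a hedged illustration of the idea only, the sketch below finds the single split that best separates a curve into two flat segments by least squares. It is not PELT and not a procedure from the paper; the function names and the example curve are assumptions.

```python
from typing import List, Tuple


def sum_squared_error(values: List[float]) -> float:
    """Squared deviation of `values` from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_single_change_point(curve: List[float]) -> Tuple[int, float]:
    """Return (index, cost_reduction) for the best two-segment split.

    `index` is the first position of the second segment. `cost_reduction`
    is how much the squared error drops compared with one flat segment.
    """
    if len(curve) < 2:
        raise ValueError("need at least two points")
    baseline_cost = sum_squared_error(curve)
    best_index, best_cost = 1, float("inf")
    for k in range(1, len(curve)):
        cost = sum_squared_error(curve[:k]) + sum_squared_error(curve[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, baseline_cost - best_cost


# Made-up accuracy curve: a plateau near chance, then a jump to about 0.5.
curve = [0.12, 0.13, 0.12, 0.13, 0.50, 0.51, 0.50, 0.51]
change_index, cost_reduction = best_single_change_point(curve)
print(change_index, cost_reduction)
```

A full analysis would need to handle several change points per curve and to judge whether a detected split is larger than seed-to-seed noise, which this sketch does not attempt.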

minor comments (2)

[Abstract] The abstract omits model size, layer count, training procedure, and number of runs; adding one sentence would let readers assess robustness without immediately consulting the methods.
[Figures] Figure legends for the freezing experiments should explicitly state the number of seeds and whether error bars represent standard deviation or standard error.
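The distinction raised in the last minor comment matters because the two quantities differ by a factor of the square root of the number of seeds. The sketch below computes both with the Python standard library; the three seed accuracies are made-up values.

```python
import math
import statistics
from typing import List, Tuple


def sd_and_sem(values: List[float]) -> Tuple[float, float]:
    """Return (sample standard deviation, standard error of the mean)."""
    sd = statistics.stdev(values)       # uses the n - 1 denominator
    sem = sd / math.sqrt(len(values))   # SEM = SD / sqrt(n)
    return sd, sem


# Hypothetical final accuracies from three seeds.
seed_accuracies = [0.90, 0.94, 0.98]
sd, sem = sd_and_sem(seed_accuracies)
print(f"SD = {sd:.4f}, SEM = {sem:.4f}")
```

With only three seeds the standard error is a little under 60% of the standard deviation, so the same data can look noticeably tighter or looser depending on which one the error bars show.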

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation for minor revision. The comments identify valuable opportunities to strengthen the evidence for discrete stages and the specificity of plasticity windows. We respond to each major comment below.


Referee: [§4] The discreteness of stages rests on performance curves over sub-task components; without reported change-point detection, multiple random seeds, or statistical controls for variance, it remains possible that apparent stages reflect noise or post-hoc binning rather than robust transitions (§4, performance plots).

Authors: We agree that objective statistical validation would make the stage transitions more robust. In the revised version we will report performance curves aggregated over at least five random seeds with standard-error shading. We will also apply a standard change-point detection procedure (PELT) to the per-component accuracy trajectories and report the detected transition steps together with their statistical significance. These additions will replace reliance on visual inspection alone.
Revision: yes

Referee: [Causal interventions section] The plasticity-window claim requires showing that freezing a layer at a given step specifically blocks the targeted component rather than globally slowing learning; the current intervention results would be strengthened by a control that matches total compute or capacity reduction across timings.

Authors: We concur that matched controls are necessary to isolate timing-specific effects. In the revision we will add experiments that apply equivalent capacity reductions (freezing a randomly chosen layer or reducing hidden dimension by the same fraction) at the same training checkpoints while keeping total compute identical. These controls will be compared directly against the original layer-specific freezes to demonstrate that the observed delays are attributable to the plasticity windows rather than nonspecific slowdown.
Revision: yes
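The matched-control comparison described in this exchange can be summarized as a difference of delays: how much later a stage completes under the targeted freeze than under a control freeze applied at the same checkpoint. The sketch below is one possible way to compute that. The 0.9 threshold, the function names, and all three curves are assumptions for illustration.

```python
from typing import Dict, List, Optional


def first_epoch_at(curve: List[float], threshold: float) -> Optional[int]:
    """First epoch whose accuracy reaches `threshold`, or None if never."""
    for epoch, accuracy in enumerate(curve):
        if accuracy >= threshold:
            return epoch
    return None


def freeze_delays(
    baseline: List[float],
    targeted: List[float],
    control: List[float],
    threshold: float = 0.9,
) -> Dict[str, Optional[int]]:
    """Delay of each frozen run relative to the unfrozen baseline.

    A delay of None means the run never completed the stage.
    """
    base_epoch = first_epoch_at(baseline, threshold)
    if base_epoch is None:
        raise ValueError("baseline never reaches the threshold")

    def delay(curve: List[float]) -> Optional[int]:
        epoch = first_epoch_at(curve, threshold)
        return None if epoch is None else epoch - base_epoch

    targeted_delay, control_delay = delay(targeted), delay(control)
    specific = (
        None
        if targeted_delay is None or control_delay is None
        else targeted_delay - control_delay
    )
    return {
        "targeted": targeted_delay,
        "control": control_delay,
        "window_specific": specific,
    }


# Made-up per-epoch accuracies for one component.
baseline = [0.10, 0.50, 0.95, 0.97, 0.98, 0.99]
control = [0.10, 0.40, 0.70, 0.93, 0.97, 0.99]   # random-layer freeze
targeted = [0.10, 0.20, 0.30, 0.50, 0.80, 0.92]  # layer-specific freeze
result = freeze_delays(baseline, targeted, control)
print(result)
```

A `window_specific` value well above zero would suggest the targeted freeze costs more than a nonspecific capacity reduction, while a value near zero would point to a general slowdown.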

Circularity Check

0 steps flagged

No significant circularity


The paper's central claims rest on empirical observations from training decoder-only transformers on controlled Alchemy benchmark variants, including performance curves on sub-tasks, composition/decomposition asymmetry, and layer-freezing interventions to identify plasticity windows. These are direct measurements from training runs rather than derivations that reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The factorization into interpretable components serves as an analysis lens for staging, not a circular redefinition of the target quantities. No load-bearing steps invoke uniqueness theorems or ansatzes from prior self-work that would force the results by construction. The work is self-contained against its experimental controls and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all claims rest on empirical training and interventions whose details are unavailable.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: By factorizing each task into interpretable components, we show that the model learns the different latent structure components in discrete stages.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Alchemy benchmark... eight vertices arranged in a cubic structure

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
