Learning Dynamics of Chain-of-Thought State Tracking in a Solvable Transformer Model

Bernd Rosenow; Marcel Kühn; Matthias Thamm; Niklas Forner

arxiv: 2606.18164 · v1 · pith:DLQCX4ED · submitted 2026-06-16 · cond-mat.dis-nn · physics.data-an

Niklas Forner, Marcel Kühn, Matthias Thamm, Bernd Rosenow

Pith reviewed 2026-06-26 21:47 UTC · model grok-4.3

classification: cond-mat.dis-nn · physics.data-an

keywords: chain-of-thought · transformer dynamics · mean-field theory · attention retrieval · permutation composition · order parameters · staged learning · state tracking

The pith

Mean-field dynamics for three order parameters track how attention retrieval and MLP logic co-develop during chain-of-thought training on permutation states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives closed dynamical equations for attention retrieval accuracy, teacher-matrix alignment, and off-target logic overlap in a one-block transformer that learns to track states generated by composing permutations. These equations reproduce the simulated trajectories of the order parameters and, together with a logit-distribution approximation, account for the observed sharp transition to high rollout accuracy. The resulting picture shows staged learning in which the logic module first acquires a mixed heuristic before attention focuses on the relevant action and permits efficient alignment of the MLP weights.

Core claim

In the solvable architecture that cleanly separates fixed-lag action retrieval (via RoPE attention) from a specialized MLP that applies the retrieved permutation, statistical-physics mean-field theory yields deterministic dynamics for three order parameters. The equations match numerical simulations quantitatively for the order parameters themselves and qualitatively predict the abrupt rise in final accuracy once retrieval and alignment cross a threshold.
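The task described above can be sketched as a toy data generator. This is a minimal illustration, not the paper's data pipeline: it assumes that a state is a single index in {0, ..., n-1} and that each action is a random permutation applied to that index. The function name `make_episode`, the state encoding, and the sequence layout are all hypothetical.

```python
import numpy as np

def make_episode(n, n_steps, rng):
    """Generate one toy episode of permutation state tracking.

    Assumed encoding (not taken from the paper): the state is one index
    in {0, ..., n-1}; each action is a permutation of those indices.
    """
    state = int(rng.integers(n))                       # initial state s_0
    actions = [rng.permutation(n) for _ in range(n_steps)]
    states = [state]
    for action in actions:
        state = int(action[state])                     # s_{i+1} = a_i(s_i)
        states.append(state)
    return np.array(actions), np.array(states)

rng = np.random.default_rng(0)
actions, states = make_episode(n=6, n_steps=5, rng=rng)
print(states)  # the state sequence a model would be trained to continue
```

Each state depends only on the previous state and one action, which is what makes every chain-of-thought step locally checkable.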

What carries the argument

The mean-field closure for the three order parameters (attention retrieval, teacher-matrix alignment, off-target logic overlap) obtained by exploiting the architectural separation between attention-based retrieval and MLP-based logic application.

If this is right

The three order parameters obey deterministic dynamics whose solutions reproduce the simulated time courses.
Logic-module alignment occurs in two stages: an early mixed-heuristic phase followed by a later phase enabled by sharpened attention retrieval.
A simple logit-distribution approximation derived from the order parameters predicts where the sharp accuracy transition occurs.
Quantitative agreement holds for the order parameters while the accuracy prediction remains qualitative.
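How a logit-distribution approximation can produce a sharp accuracy transition is sketched below. The sketch assumes independent Gaussian logits, one correct logit and `n_classes - 1` identically distributed other logits, and it assumes that rollout steps succeed independently. The paper's actual approximation may differ. The function names, the class count of 32, and the rollout length of 16 are illustrative choices; the variance 0.044 is the value quoted in the Figure 3 caption.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def next_token_accuracy(mu_c, mu_o, var_c, var_o, n_classes, deg=80):
    """P(correct logit exceeds all others) under independent Gaussian logits.

    Evaluates E_z[ Phi((z - mu_o) / sigma_o) ** (n_classes - 1) ] with
    z ~ N(mu_c, var_c) by Gauss-Hermite quadrature.
    """
    x, w = np.polynomial.hermite.hermgauss(deg)
    z = mu_c + math.sqrt(2.0 * var_c) * x
    cdf_other = 0.5 * (1.0 + _erf((z - mu_o) / math.sqrt(2.0 * var_o)))
    return float(np.sum(w * cdf_other ** (n_classes - 1)) / math.sqrt(math.pi))

def rollout_accuracy(p_step, n_steps):
    """Probability that all steps are correct, assuming independent steps."""
    return p_step ** n_steps

# Illustrative sweep over the gap between correct and other logit means.
for gap in (0.0, 0.5, 1.0, 1.5):
    p = next_token_accuracy(gap, 0.0, 0.044, 0.044, n_classes=32)
    print(gap, round(p, 4), round(rollout_accuracy(p, n_steps=16), 4))
```

Because the rollout accuracy is a power of the per-step accuracy, a smooth rise in the mean gap turns into a much steeper rise in the final accuracy.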

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same staged-learning sequence may appear in other chain-of-thought tasks whose architecture maintains a modular separation between retrieval and computation.
If the separation assumption is relaxed, the mean-field closure would require additional order parameters that track cross-module interference.
The sharp transition in rollout accuracy suggests the existence of a critical surface in hyperparameter space separating regimes of successful and unsuccessful multi-step tracking.

Load-bearing premise

The architecture cleanly separates fixed-lag action retrieval learned by attention from the MLP module that applies the retrieved permutation, allowing the mean-field equations to close.
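Why rotary position embeddings suit fixed-lag retrieval can be illustrated with a small numerical check. The sketch below uses the standard RoPE construction (rotating consecutive coordinate pairs by position-dependent angles) and shows that the query-key score depends only on the relative offset. The dimension, base frequency, and helper names are assumptions for illustration and are not taken from the paper's model definition.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per 2D pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def score(q, k, m, n):
    """Attention score between a query at position m and a key at position n."""
    return float(rope(q, m) @ rope(k, n))

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same lag (two positions back) at different absolute positions.
same_lag = [score(q, k, m, m - 2) for m in (2, 5, 11)]
print(same_lag)  # the three values agree up to floating-point error
```

Since the score is a function of the offset alone, a single set of query and key weights can in principle favor one fixed lag at every position in the sequence.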

What would settle it

A numerical simulation in which the measured trajectories of attention retrieval accuracy, teacher alignment, or off-target overlap deviate persistently from the derived mean-field ODEs would falsify the description.
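One way to make "deviate persistently" operational is sketched below. It compares the seed-averaged simulated trajectory with a theory curve in units of the standard error and reports how often the gap exceeds a threshold. The theory curve here is a placeholder (`tanh`), not the paper's mean-field ODE solution, and the synthetic simulations, the three-standard-error threshold, and the function name `deviation_report` are all assumptions.

```python
import numpy as np

def deviation_report(sim, theory, n_sigma=3.0):
    """Compare simulated trajectories (seeds x time) with a theory curve.

    Returns the largest deviation in units of the standard error of the
    seed average, and the fraction of time points beyond n_sigma.
    """
    mean = sim.mean(axis=0)
    sem = sim.std(axis=0, ddof=1) / np.sqrt(sim.shape[0])
    z = np.abs(mean - theory) / sem
    return float(z.max()), float(np.mean(z > n_sigma))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 50)
theory = np.tanh(0.5 * t)                                # placeholder curve
sim = theory + 0.05 * rng.normal(size=(100, t.size))     # 100 synthetic seeds

max_z, frac = deviation_report(sim, theory)
_, frac_shifted = deviation_report(sim, theory + 0.05)   # deliberately wrong theory
print(frac, frac_shifted)
```

A fraction near zero is consistent with agreement, while a fraction near one over an extended time window would indicate the kind of persistent deviation described above.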

Figures

Figures reproduced from arXiv: 2606.18164 by Bernd Rosenow, Marcel Kühn, Matthias Thamm, Niklas Forner.

**Figure 2.** Dynamics of the order parameters A, R, and S in panels (a), (b), and (c), respectively. Simulations are averaged over 100 seeds, and σ denotes the standard deviation. The theoretical predictions agree very well with the simulations (note the different scale in panel (c)). where the correct logit is denoted by z_{i+1,k} = p⋆_{i+1} · z_{i+1}. By symmetry, the other logits behave identically on average so that z_{i+1,k} descri…

**Figure 3.** Final rollout accuracy, averaged over 100 model seeds; σ denotes the standard deviation. The theoretical curve uses the correct-logit and other-logit means µ from the order-parameter solutions of Eqs. (15) and constant variances σ² ≈ 0.044 from initialization. The predicted rise occurs slightly too early because the empirical variances increase during training; see Appendix I. Since the initial token is given, we have P(ŝ_0 = …

**Figure 4.** (a) Mean correct and other logits. (b) Mean-field loss ( …

**Figure 5.** Entries of the attention block as a function of the learning time.

**Figure 6.** (a) Variances for the mean correct and other logits during training. (b) Simulated rollout …

**Figure 7.** Order parameters from theory and simulations.

**Figure 8.** Next-token accuracy and final rollout accuracy for logit variances held fixed at their initial …

**Figure 9.** Next-token accuracy and final rollout accuracy with the empirical logit variances.

**Figure 10.** Mean correct/other logits and loss function.

Abstract

Chain-of-thought generation can turn a multi-step computation into a sequence of locally checkable state updates, but the training dynamics by which transformers acquire such updates remain poorly understood. We study this question in a solvable setting: a simplified one-block transformer trained by supervised next-token prediction on state sequences generated by composing permutations. The architecture separates fixed-lag action retrieval, learned by RoPE attention, from a specialized MLP logic module that applies the retrieved permutation to the current state. Using a statistical-physics mean-field description, we derive dynamics for three order parameters measuring attention retrieval, teacher-matrix alignment, and off-target logic overlap. These equations quantitatively match simulations for the order parameters and, combined with a logit-distribution approximation, qualitatively predict the sharp transition in final rollout accuracy. The analysis reveals staged learning: the logic module first learns a mixed heuristic; attention then locks onto the relevant action, enabling efficient MLP alignment. Together, these results provide a controlled mechanistic account of how attention-based retrieval and MLP-based logic co-develop during chain-of-thought state tracking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Derives closed mean-field dynamics for three order parameters in a toy CoT transformer that match simulations and support a staged-learning account.

This paper derives closed mean-field dynamics for three order parameters in a deliberately solvable transformer for chain-of-thought state tracking. Those dynamics match the simulations and support a staged-learning account.

The architecture splits fixed-lag action retrieval via RoPE attention from an MLP that applies the permutation. From this model the authors derive equations for attention retrieval, teacher-matrix alignment, and off-target overlap. The match to numerical trajectories is the main result, and combining the equations with a logit approximation gives a qualitative prediction for the accuracy transition.

What is new is the explicit co-development picture: the logic module learns a mixed heuristic early, attention then focuses on the relevant action, and that allows efficient alignment in the MLP. This is a controlled mechanistic story that goes beyond the usual high-level descriptions.

The main limitation is that the separation between attention and MLP is an explicit design choice to enable the mean-field closure. It is not an emergent feature, so the staged account may not generalize to standard transformers where those functions are mixed. The paper is upfront about working in a solvable setting, which keeps the claim proportionate.

The evidence is the direct comparison to simulations, which appears independent of the derivation. No sign of circularity.

This work is for people who study training dynamics of transformers with statistical-physics tools. A reader looking for quantitative models of how retrieval and logic modules learn together will find a clear example here. It deserves a serious referee because the quantitative match on the order parameters is a concrete, testable claim.

I would recommend sending it to peer review.

Referee Report

0 major / 3 minor

Summary. The manuscript studies chain-of-thought state tracking in a simplified one-block transformer trained by next-token prediction on sequences generated by composing permutations. The architecture explicitly separates fixed-lag action retrieval (via RoPE attention) from an MLP logic module that applies the retrieved permutation. A statistical-physics mean-field theory is used to derive closed dynamics for three order parameters (attention retrieval, teacher-matrix alignment, off-target logic overlap). These equations are reported to match simulations quantitatively; combined with a logit-distribution approximation they qualitatively predict the sharp transition in final rollout accuracy. The analysis identifies a staged learning process in which the logic module first acquires a mixed heuristic before attention locks onto the relevant action.

Significance. If the reported quantitative agreement between the derived mean-field equations and independent simulations holds, the work supplies a rare controlled mechanistic account of how attention-based retrieval and MLP-based logic co-develop during training. The explicit architectural separation enables closure of the mean-field equations without hidden correlations, and the staged-learning prediction is falsifiable against the simulations. Credit is due for the direct numerical validation of the order-parameter trajectories and for the logit approximation that links the microscopic dynamics to the macroscopic accuracy transition.

minor comments (3)

§2 (model definition): the precise form of the RoPE attention kernel and the MLP weight initialization are not stated explicitly; adding these would allow readers to reproduce the mean-field closure without ambiguity.
Figure 2 caption: the shaded regions around the simulated order-parameter curves are described only as 'standard deviation'; clarifying whether they represent one or two standard errors and over how many independent runs would improve interpretability of the quantitative match.
Eq. (12) (logit-distribution approximation): the Gaussian assumption for the logit distribution is introduced without a supporting derivation or reference to prior work on similar approximations in attention models; a brief justification would strengthen the qualitative prediction of the accuracy transition.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and significance assessment of our manuscript on the learning dynamics of chain-of-thought state tracking. The recommendation for minor revision is noted. No specific major comments were provided in the report, so we have no points requiring point-by-point response or manuscript changes at this stage.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

The paper presents a mean-field derivation of order-parameter dynamics directly from the explicit architectural separation (RoPE attention for retrieval, MLP for logic) and statistical-physics assumptions in a deliberately simplified solvable model. These equations are then compared to independent numerical simulations for quantitative match on the order parameters and qualitative prediction of the accuracy transition. No self-citations, fitted inputs renamed as predictions, or ansatzes smuggled via prior work appear in the abstract or description; the closure relies on the model's built-in design rather than reducing the target result to its own inputs by construction. This is the standard case of an internally consistent controlled analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the mean-field closure for the transformer dynamics and on the architectural separation between attention retrieval and MLP logic; no free parameters or new physical entities are introduced in the abstract.

axioms (2)

domain assumption: Mean-field approximation closes the dynamics of the three order parameters without higher-order correlations.
Invoked to obtain the differential equations that are then compared to simulations.
domain assumption: The transformer architecture cleanly separates fixed-lag RoPE attention retrieval from the MLP logic module.
Stated in the model description and used to define the order parameters.

pith-pipeline@v0.9.1-grok · 5724 in / 1386 out tokens · 24213 ms · 2026-06-26T21:47:30.865771+00:00 · methodology


[1] [1]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017

[2] [2]

Xlnet: Generalized autoregressive pretraining for language understanding.Advances in neural information processing systems, 32, 2019

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding.Advances in neural information processing systems, 32, 2019

2019

[3] [3]

Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

Pith/arXiv arXiv 2023

[4] [4]

Bloom: A 176b-parameter open-access multilingual language model

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ili ´c, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. 2023

2023

[5] [5]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Technical report, July 2019. arXiv:1907.11692 [cs] type: article

Pith/arXiv arXiv 2019

[6] [6]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022

[7] [7]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020

1901

[8] [8]

Transformers learn in-context by gradient descent

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR, 2023

2023

[9] [9]

Transformers generalize differently from information stored in context vs in weights. arXiv preprint arXiv:2210.05675, 2022

Stephanie CY Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K Lampinen, and Felix Hill. Transformers generalize differently from information stored in context vs in weights. arXiv preprint arXiv:2210.05675, 2022

arXiv 2022

[10] [10]

Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. Advances in neural information processing systems, 36:14228–14246, 2023

Allan Raventós, Mansheej Paul, Feng Chen, and Surya Ganguli. Pretraining task diversity and the emergence of non-bayesian in-context learning for regression. Advances in neural information processing systems, 36:14228–14246, 2023

2023

[11] [11]

Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

Pith/arXiv arXiv 2022

[12] [12]

A mathematical framework for transformer circuits. Transformer Circuits Thread,

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

[13] [13]

https://transformer-circuits.pub/2021/framework/index.html

2021

[14] [14]

Interpreting context look-ups in transformers: Investigating attention-mlp interactions

Clement Neo, Shay B Cohen, and Fazl Barez. Interpreting context look-ups in transformers: Investigating attention-mlp interactions. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16681–16697, 2024

2024

[15] [15]

Are transformers universal approximators of sequence-to-sequence functions?

Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020

2020

[16] [16]

Transformer feed-forward layers are key-value memories

Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021

2021

[17] [17]

Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022

Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593, 2022

Pith/arXiv arXiv 2022

[18] [18]

In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

Pith/arXiv arXiv 2022

[19] [19]

Mechanistic interpretability for ai safety–a review

Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082, 2024

Pith/arXiv arXiv 2024

[20] [20]

Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025

Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025

Pith/arXiv arXiv 2025

[21] [21]

A toy model of universality: Reverse engineering how networks learn group operations

Bilal Chughtai, Lawrence Chan, and Neel Nanda. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning, pages 6243–6267. PMLR, 2023

2023

[22] [22]

Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372, 2022

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372, 2022

2022

[23] [23]

Transformers provably learn chain-of-thought reasoning with length generalization. arXiv preprint arXiv:2511.07378, 2025

Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, and Yuxin Chen. Transformers provably learn chain-of-thought reasoning with length generalization. arXiv preprint arXiv:2511.07378, 2025

arXiv 2025

[24] [24]

Sequential group composition: A window into the mechanics of deep learning. arXiv preprint arXiv:2602.03655, 2026

Giovanni Luca Marchetti, Daniel Kunin, Adele Myers, Francisco Acosta, and Nina Miolane. Sequential group composition: A window into the mechanics of deep learning. arXiv preprint arXiv:2602.03655, 2026

Pith/arXiv arXiv 2026

[25] [25]

Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470, 2019

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. Augmenting self-attention with persistent memory. arXiv preprint arXiv:1907.01470, 2019

Pith/arXiv arXiv 2019

[26] [26]

Dynamical mean-field theory of self-attention neural networks. arXiv preprint arXiv:2406.07247, 2024

Ángel Poc-López and Miguel Aguilera. Dynamical mean-field theory of self-attention neural networks. arXiv preprint arXiv:2406.07247, 2024

arXiv 2024

[27] [27]

Metastable states in asymmetrically diluted hopfield networks. Journal of Physics A: Mathematical and General, 21(14):3155–3169, 1988

Alessandro Treves and Daniel J Amit. Metastable states in asymmetrically diluted hopfield networks. Journal of Physics A: Mathematical and General, 21(14):3155–3169, 1988

1988

[28] [28]

A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022

Alexander Maloney, Daniel A Roberts, and James Sully. A solvable model of neural scaling laws. arXiv preprint arXiv:2210.16859, 2022

arXiv 2022

[29] [29]

Asymptotic theory of in-context learning by linear attention. Proceedings of the National Academy of Sciences, 122(28):e2502599122, 2025

Yue M Lu, Mary Letey, Jacob A Zavatone-Veth, Anindita Maiti, and Cengiz Pehlevan. Asymptotic theory of in-context learning by linear attention. Proceedings of the National Academy of Sciences, 122(28):e2502599122, 2025

2025

[30] [30]

Training dynamics of transformers to recognize word co-occurrence via gradient flow analysis

Hongru Yang, Bhavya Kailkhura, Zhangyang Wang, and Yingbin Liang. Training dynamics of transformers to recognize word co-occurrence via gradient flow analysis. 2024. URL https://openreview.net/forum?id=w6q46IslSR

2024

[31] [31]

From condensation to rank collapse: A two-stage analysis of transformer training dynamics

Zheng-An Chen and Tao Luo. From condensation to rank collapse: A two-stage analysis of transformer training dynamics. 2026. URL https://openreview.net/forum?id=gm5mkiTGOy

2026

[32] [32]

How transformers get rich: Approximation and dynamics analysis. arXiv preprint arXiv:2410.11474, 2025

Mingze Wang, Ruoxi Yu, Weinan E, and Lei Wu. How transformers get rich: Approximation and dynamics analysis. arXiv preprint arXiv:2410.11474, 2025

arXiv 2025

[33] [33]

JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention

Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Shaolei Du. JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LbJqRGNYCf

2024

[34] [34]

Distributional associations vs in-context reasoning: A study of feed-forward and attention layers

Lei Chen, Joan Bruna, and Alberto Bietti. Distributional associations vs in-context reasoning: A study of feed-forward and attention layers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=WCVMqRHWW5

2025

[35] [35]

Time course MechInterp: Analyzing the evolution of components and knowledge in large language models

Ahmad Dawar Hakimi, Ali Modarressi, Philipp Wicke, and Hinrich Schuetze. Time course MechInterp: Analyzing the evolution of components and knowledge in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025. Association for Computa-...

2025

[36] [36]

Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021

Pith/arXiv arXiv 2021

[37] [37]

Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563, 2022

Andrew Lampinen, Ishita Dasgupta, Stephanie Chan, Kory Mathewson, Mh Tessler, Antonia Creswell, James McClelland, Jane Wang, and Felix Hill. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 537–563, 2022

2022

[38] [38]

Star: Bootstrapping reasoning with reasoning

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 15476–15488. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/63...

2022

[39] [39]

Why think step by step? reasoning emerges from the locality of experience. Advances in Neural Information Processing Systems, 36:70926–70947, 2023

Ben Prystawski, Michael Li, and Noah Goodman. Why think step by step? reasoning emerges from the locality of experience. Advances in Neural Information Processing Systems, 36:70926–70947, 2023

2023

[40] [40]

Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

arXiv 2022

[41] [41]

Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36:70757–70798, 2023

2023

[42] [42]

How does chain of thought decompose complex tasks? arXiv preprint arXiv:2604.08872, 2026

Amrut Nadgir, Vijay Balasubramanian, and Pratik Chaudhari. How does chain of thought decompose complex tasks? arXiv preprint arXiv:2604.08872, 2026

Pith/arXiv arXiv 2026

[43] [43]

Chain of thought empowers transformers to solve inherently serial problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. In International Conference on Learning Representations, 2024

2024

[44] [44]

Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

2024

[45] [45]

Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673, 2025

Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 1, learning hierarchical language structures. arXiv preprint arXiv:2305.13673, 2025

arXiv 2025

[46] [46]

Solving Ordinary Differential Equations I: Nonstiff Problems

Ernst Hairer, Syvert Paul Nørsett, and Gerhard Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems, volume 8 of Springer Series in Computational Mathematics. Springer-Verlag, Berlin / New York, 2nd edition, 1993

1993

A Hyperparameters

Table 1: Hyperparameters for the training simulations.

Variable Value Description
N 32 Magnitude of permutation...