Assign and Add: A Mechanistic Study of Compositional Arithmetic

Alberto Bietti; Brady Exoo; John Sous

arxiv: 2605.31497 · v1 · pith:3NFLYB27new · submitted 2026-05-29 · 💻 cs.LG · stat.ML

Pith reviewed 2026-06-28 22:55 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords compositional generalization · transformers · mechanistic interpretability · variable assignment · modular addition · training dynamics · arithmetic composition

The pith

Transformers reuse the same modular addition module for both direct numbers and those reached through variable assignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how small transformers achieve compositional generalization on a task that requires first assigning numbers to variables and then performing modular addition on those values. Training data is split so that certain variable-number pairings never appear together during learning, yet the models still succeed on unseen combinations. Mechanistic inspection finds that the identical addition MLP is invoked whether the operands arrive directly or have first passed through the assignment pathway. Training unfolds in three phases: the addition operation is acquired first, the assignment routing structure appears next, and a final refinement stage extends generalization to harder sequences. A theoretical account links this reuse of internal mechanisms to the emergence of compositionality as a direct consequence of how the circuits are assembled during optimization.
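The task structure described above can be sketched in a few lines. This is an illustrative sketch, not the authors' data pipeline: the modulus `P`, the token layout produced by `to_tokens`, and the helper names `add_mod`, `resolve`, and `evaluate` are all assumptions made here. The only grounded detail is the prefix form `+ 3 4 =` quoted in the Figure 5 caption. The point it illustrates is that a single addition routine can serve both constants and variables once the lookup has happened.

```python
P = 11  # hypothetical modulus; this page does not state the one used


def add_mod(a: int, b: int, p: int = P) -> int:
    """The single addition routine shared by both input paths."""
    return (a + b) % p


def resolve(token, env):
    """Map an operand to a number: variables (str) are looked up, constants pass through."""
    return env[token] if isinstance(token, str) else token


def evaluate(assignments, left, right, p: int = P) -> int:
    """Bind variables, resolve both operands, then call the shared adder."""
    env = dict(assignments)
    return add_mod(resolve(left, env), resolve(right, env), p)


def to_tokens(assignments, left, right):
    """Hypothetical token layout: variable/value pairs, then '+ left right ='."""
    tokens = []
    for name, value in assignments:
        tokens += [name, value]
    return tokens + ["+", left, right, "="]
```

For example, `evaluate([("a", 3), ("b", 4)], "a", "b")` and `evaluate([], 3, 4)` take different routes to their operands but end in the same `add_mod` call, which is the behavioral analogue of the reuse the paper reports inside the network.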

Core claim

The central claim is that the same modular addition MLP module is invoked whether the inputs are supplied directly as numbers or are obtained indirectly after a separate variable assignment step has occurred. This shared circuit is what permits the model to generalize to novel pairings of variables and numbers that were withheld from the training distribution.

What carries the argument

The modular addition MLP module, which computes the arithmetic result and is shared between the direct-input path and the variable-assignment path.

If this is right

Compositional generalization follows when internal circuits are reused rather than duplicated for each new combination of skills.
Training proceeds through separable phases that first install the arithmetic operation, then the routing for assignment, and finally refine the integrated behavior.
Generalization to sequences withheld from training emerges only after the refinement phase has aligned the shared module with the assignment pathway.
Compositional generalization in behavior is a natural outcome of the compositional structure already present in the model's learned mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reuse pattern could be searched for in other tasks that combine lookup or binding with subsequent computation, such as function application or simple logical inference.
If the three-phase dynamic is robust, curricula that deliberately separate skill acquisition from integration might accelerate compositional learning in larger models.
The theoretical framework could be tested by ablating the refinement phase and checking whether generalization to hard sequences collapses while basic addition remains intact.
Scaling the setting to deeper networks might reveal whether additional circuits are recruited or whether the same modular addition module continues to be reused.

Load-bearing premise

The controlled toy task of variable assignment followed by modular addition in small transformers captures the mechanisms that produce compositional generalization in large models trained on natural data.

What would settle it

Finding two functionally distinct MLP modules, one used only for direct addition and another used only after variable lookup, in a replication of the same architecture and data split would falsify the reuse claim.
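The logic of this falsification test can be made concrete with a toy comparison. The sketch below does not touch a real transformer: the modulus `P`, the stand-in `ablated` module, and the function `ablation_profile` are hypothetical constructs introduced here. It only shows what the two competing hypotheses predict when the module on the direct path is knocked out.

```python
from itertools import product
from typing import Tuple

P = 7  # hypothetical modulus, kept small so every pair can be enumerated


def add_mod(a: int, b: int) -> int:
    """Intact modular-addition module."""
    return (a + b) % P


def ablated(a: int, b: int) -> int:
    """Stand-in for an ablated module: it ignores its inputs."""
    return 0


def accuracy(adder) -> float:
    """Fraction of all operand pairs on which `adder` gives the right sum."""
    pairs = list(product(range(P), repeat=2))
    return sum(adder(a, b) == (a + b) % P for a, b in pairs) / len(pairs)


def ablation_profile(shared: bool) -> Tuple[float, float]:
    """Ablate the module on the direct path; return (direct, variable) accuracy.

    With shared=True the variable path runs through the same module and is
    ablated with it. With shared=False it keeps its own intact module.
    """
    direct_module = ablated
    variable_module = ablated if shared else add_mod
    return accuracy(direct_module), accuracy(variable_module)
```

Under the reuse hypothesis (`shared=True`) both accuracies fall to chance together; under the two-module alternative (`shared=False`) only the direct path degrades. A real replication would replace these stand-ins with interventions on the trained model's MLP.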

Figures

Figures reproduced from arXiv: 2605.31497 by Alberto Bietti, Brady Exoo, John Sous.

**Figure 2.** Accuracies during training partitioned by evaluation set. Add-restricted sets contain sequences with held-out addition pairs as discussed in Section 3, and var-restricted sets contain sequences with held-out variable positions. The (1) and (2) for the 2-var var-restricted sets denote how many of the variables are in “bad” positions. Train sets are those with both valid addition pairs and variable positions…

**Figure 3.** Data requirements for generalization. (a) Test accuracy on 2-variable sequences as a function of their relative frequency in the training set. The model requires a relative frequency of r ≈ 0.2 to successfully generalize. (b) Test accuracy on 0-variable sequences as a function of the fraction of all possible addition pairs seen during training. Generalization to unseen constant pairs requires training on a…

**Figure 4.** Attention patterns for an example sequence. Left (Layer 1): The = token attends to the two immediately preceding positions representing the operands (orange boxes). Simultaneously, constant tokens in positions 1–11 act as previous-token heads, attending to their assigned variables (red boxes). Right (Layer 2): The = token attends directly to the constants required for the addition (yellow boxes). These beh…

**Figure 5.** Residual stream similarities. (a) Cosine similarities between the pre-MLP residual stream vectors of different sequences. “Matched” pairs contain the same underlying addition operation (e.g., b 3 + b 4 = versus + 3 4 =), whereas “Mismatched” pairs do not. The high similarity suggests that a shared representation handles both variable and constant formats. (b) Cosine similarity between the pre-MLP residual …

**Figure 6.** [Image: figures/full_fig_p006_6.png]

**Figure 7.** Figure 7a analyzes this positional behavior in greater detail. Given the established preference for variables, we isolate the valid positions where a variable token can logically appear relative to a constant at position i. Specifically, we exclude positions greater than i (future tokens) due to causal masking and position i − 2, since the structural syntax dictates that a token two steps behind a constant must be…

**Figure 8.** Early emergence of variable assignment. (a,b) Layer-1 attention patterns on a fixed 1-variable sequence immediately before and after the first accuracy spike. Red boxes mark the two operand positions that are queried by the = token in the final model. (c,d) The corresponding layer-2 QK scores on variables, $(\mathrm{OV}_1(e_{\mathrm{var}}))\,\mathrm{QK}_2\,(\mathrm{OV}_1(e_{\mathrm{var}}))^{\top}$. Across this transition, the variable-identity block becomes strongly diag…

**Figure 9.** Late correction of var-restricted routing. (a) Variable-token contributions to the layer-1 attention from the = token, shown for selected variables. The b contribution begins substantially below the others and rises by the end of the window. (b) Accuracy on two-variable examples with b as the right operand, together with the fully var-restricted two-variable accuracy. The model completely fails to handle s…

**Figure 10.** Figure 10 shows the accuracies by evaluation set for another training run with the same parameters. The same “spikes” in non-var-restricted and var-restricted accuracies seen in… [Image: figures/full_fig_p012_10.png]

**Figure 11.** Figure 11 displays the preactivations for neuron 70 (left) and the preactivations for the theoretical construction of an MLP trained to perform modular addition in Gromov [12]. The figure shows a near-perfect match, suggesting our model indeed implements the same MLP circuit that has been previously discovered in the literature. [Panels: “Neuron 70” and “Theoretical k=11”; axes m and n.]

**Figure 12.** (a) End of the first phase of training, characterized by generalization on 0-variable addition and the emergence of structure for variable assignment. (b) End of the second phase of training, characterized by the rapid development of the variable assignment circuit. (c) End of the last phase of training, characterized by the model “cleaning up” the variable assignment module and generalizing on all evaluat…
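Figure 11 compares a neuron's preactivations with a periodic theoretical construction. The sketch below is not that construction and not the paper's circuit. Under assumptions made here (the function names and the modulus `p = 13` are hypothetical), it only illustrates why cosine features suit modular addition: summing cos(2πk(m + n − q)/p) over all frequencies k gives p when q = (m + n) mod p and zero otherwise, so the largest score marks the correct answer.

```python
import numpy as np


def interference_logits(m: int, n: int, p: int) -> np.ndarray:
    """Score every candidate answer q by summing cosines over all frequencies k."""
    k = np.arange(p).reshape(-1, 1)  # frequencies 0..p-1, shape (p, 1)
    q = np.arange(p).reshape(1, -1)  # candidate answers 0..p-1, shape (1, p)
    return np.cos(2 * np.pi * k * (m + n - q) / p).sum(axis=0)


def predict(m: int, n: int, p: int) -> int:
    """Return the candidate with the largest score."""
    return int(np.argmax(interference_logits(m, n, p)))


if __name__ == "__main__":
    p = 13  # hypothetical modulus
    print(predict(9, 8, p), (9 + 8) % p)
```

A trained network would have to approximate such a sum with a limited number of neurons and learned phases; this identity is only the idealized limit.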

Original abstract

Large language models are able to compose skills in order to perform complex tasks, many of which might not have been seen during training. The details of how exactly this composition occurs remain elusive. In this paper, we study a mechanism for compositional generalization in transformers by considering a simple controlled setting involving variable assignment and modular addition. By partitioning our training data into disjoint sets, we observe that small transformers are able to generalize to previously unseen combinations of variables and numbers. Our mechanistic analysis shows that the same “modular addition” MLP module is used whether the inputs are given directly or indirectly through a separate variable assignment mechanism. We also analyze the training dynamics from an empirical lens, which reveals three phases of learning: first, modular addition is learned, then the structure required for variable assignment, and finally a refinement phase where the model generalizes to some hard sequences not seen in training. Finally, we provide a theoretical framework to explain how compositionality emerges from training dynamics. These results suggest that compositional generalization can be a natural consequence of the compositionality of internal mechanisms in transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper finds that a small transformer reuses the same modular-addition MLP for both direct inputs and variable-assigned ones in a partitioned arithmetic task, with three observed training phases.

The core observation is that the model reuses one MLP module for modular addition whether the operands arrive directly or via a learned variable-assignment path. This is shown through mechanistic inspection after training on data splits that separate variable-number combinations.

The work extends prior circuit studies on addition by adding the compositional layer of variable assignment and by documenting a three-phase trajectory: addition first, then assignment structure, then a refinement stage that improves generalization on held-out sequences. The data partitioning is a straightforward way to create the need for composition, and the reuse finding is a direct empirical result in this controlled setting.

The main limitation is the narrow scope. Everything is done with small transformers on synthetic modular arithmetic, so it remains open whether the same reuse pattern appears in larger models or on natural data. The theoretical framework is described at a high level in the abstract and would need more detail to evaluate. No load-bearing circularity or fitting issues are apparent from the description.

This is useful for researchers already working on mechanistic interpretability of compositionality. A reader who wants concrete examples of how transformers build reusable internal modules will find the observations worth examining. It is not positioned as a broad claim about large language models.

The paper deserves peer review. The central claim is internally consistent and addresses a live question in the subfield with targeted experiments.

Referee Report

1 major / 2 minor

Summary. The paper examines compositional generalization in small transformers trained on a controlled task of variable assignment followed by modular addition. By partitioning the training data into disjoint sets, the models generalize to unseen combinations of variables and numbers. Mechanistic analysis indicates that the same modular addition MLP module is reused for both direct numeric inputs and inputs routed through a learned variable assignment circuit. Training dynamics reveal three phases—learning modular addition, acquiring variable assignment structure, and a refinement phase enabling generalization to hard sequences—with a theoretical framework proposed to explain how compositionality emerges from these dynamics.

Significance. If the central claims hold, the work offers a concrete mechanistic example of module reuse enabling compositional generalization in transformers within a simplified arithmetic setting. The empirical identification of three training phases and the accompanying theoretical framework provide useful insights into how compositionality can arise naturally from training dynamics. The controlled experimental design and focus on mechanistic inspection are strengths that allow clear observation of the reuse phenomenon.

major comments (1)

[Mechanistic Analysis] The claim that the identical modular addition MLP is reused for direct and assigned inputs (abstract and mechanistic analysis section) requires explicit quantification of the evidence, such as activation similarity metrics, weight cosine similarities, or results from causal interventions like activation patching; without these details the reuse conclusion rests on qualitative inspection alone.

minor comments (2)

Provide the precise definition of the data partitioning scheme and the criteria used to identify 'hard sequences' in the refinement phase.
The theoretical framework would be strengthened by including explicit equations or a pseudocode description of the proposed dynamics.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment and recommendation of minor revision. We address the major comment point by point below.

Referee: [Mechanistic Analysis] The claim that the identical modular addition MLP is reused for direct and assigned inputs (abstract and mechanistic analysis section) requires explicit quantification of the evidence, such as activation similarity metrics, weight cosine similarities, or results from causal interventions like activation patching; without these details the reuse conclusion rests on qualitative inspection alone.

Authors: We agree that the current mechanistic analysis relies primarily on qualitative inspection of circuit components and would benefit from quantitative support. In the revised manuscript we will add cosine similarity between the relevant MLP weight matrices, Pearson correlation of activations on matched inputs, and activation patching results showing that ablating the modular addition MLP produces comparable performance drops on both direct and variable-assigned test cases.

Revision: yes
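The two similarity metrics named in the rebuttal are standard and can be sketched as follows. This is a generic NumPy sketch, not the authors' analysis code: the function names are chosen here, and any vectors passed in would stand in for flattened weight matrices or activation vectors extracted from the model.

```python
import numpy as np


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pearson_correlation(u, v) -> float:
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return cosine_similarity(u - u.mean(), v - v.mean())
```

The two differ only in the centering step: Pearson correlation ignores a constant offset between the vectors, while cosine similarity does not, so reporting both guards against a shared bias term inflating the apparent match.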

Circularity Check

0 steps flagged

No significant circularity identified

The paper is an empirical mechanistic interpretability study on a controlled toy task involving variable assignment and modular addition in small transformers. Claims rest on observations from partitioned training data, direct inspection of MLP modules, and training dynamics across phases, with no mathematical derivations, first-principles predictions, or equations that reduce to fitted inputs by construction. The theoretical framework is presented as an explanation of observed dynamics rather than a self-referential definition or load-bearing self-citation chain. No steps match the enumerated circularity patterns, making the derivation chain self-contained within the experimental setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only. No free parameters or invented entities are mentioned. The central assumption is that findings from this narrow task transfer to broader compositional behavior in transformers.

axioms (1)

Domain assumption: The simple controlled task of variable assignment plus modular addition reflects the essential mechanisms of compositional generalization in transformers.
The paper uses this assumption to draw conclusions about general compositional abilities from the toy setting.

Reference graph

Works this paper leans on

35 extracted references · 2 canonical work pages

[1] Brenden M. Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning (ICML), 2018.

[2] Najoung Kim and Tal Linzen. COGS: A compositional generalization challenge based on semantic interpretation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.

[3] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics (EMNLP), 2023.

[4] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaïd Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. In Advances in Neural Information Processing Systems (N... 2023.

[5] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.

[6] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, ... In-context learning and induction heads. Transformer Circuits Thread, 2022.

[7] Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Hervé Jégou, and Léon Bottou. Birth of a transformer: A memory viewpoint. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[8] Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. In International Conference on Learning Representations (ICLR), 2024.

[9] Eshaan Nichani, Alex Damian, and Jason D. Lee. How transformers learn causal structure with gradient descent. In International Conference on Machine Learning (ICML), 2024.

[10] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022.

[11] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023.

[12] Andrey Gromov. Grokking modular arithmetic, 2023.

[13] Jianliang He, Leda Wang, Siyu Chen, and Zhuoran Yang. On the mechanism and dynamics of modular addition: Fourier features, lottery ticket, and grokking, 2026.

[14] Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.

[15] Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[16] Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, et al. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread, 6:16318–16352, 2025.

[17] Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. Unveiling transformers with lego: a synthetic reasoning task. arXiv preprint arXiv:2206.04301, 2022.

[18] Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. In International Conference on Learning Representations (ICLR), 2023.

[19] Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, and David Bau. Discovering variable binding circuitry with desiderata. ICML 2023 Workshop on Deployable Generative AI, 2023.

[20] Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, and Hidenori Tanaka. Compositional capabilities of autoregressive transformers: A study on synthetic, interpretable tasks. In International Conference on Machine Learning (ICML), 2024.

[21] Tianyu He, Darshil Doshi, Aritra Das, and Andrey Gromov. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[22] Yiwei Wu, Atticus Geiger, and Raphaël Millière. How do transformers learn variable binding in symbolic programs? arXiv preprint arXiv:2505.20896, 2025.

[23] Xingyu Zhao, Darsh Sharma, Rheeya Uppaal, and Yiqiao Zhong. Shattered compositionality: Counter-intuitive learning dynamics of transformers for arithmetic, 2026.

[24] Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning (ICML), 2023.

[25] Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Xingyu Yang, François Charton, and Julia Kempe. Iteration head: A mechanistic study of chain-of-thought. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[26] Yuandong Tian. Composing global solutions to reasoning tasks via algebraic objects in neural nets, 2025.

[27] Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B Simon, Michael R DeWeese, Surya Ganguli, and Nina Miolane. Alternating gradient flows: A theory of feature learning in two-layer neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2026.

[28] Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D. Lee, and Denny Wu. Learning compositional functions with transformers from easy-to-hard data. In Annual Conference on Learning Theory (COLT), 2025.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[30] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. In Findings of the Association for Computational Linguistics (ACL), 2023.

[31] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.

[32] Neel Nanda and Joseph Bloom. Transformerlens. https://github.com/TransformerLensOrg/TransformerLens, 2022.

[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

[34] Depen Morwani, Benjamin L. Edelman, Costin-Andrei Oncescu, Rosie Zhao, and Sham M. Kakade. Feature emergence via margin maximization: case studies in algebraic tasks. In International Conference on Learning Representations (ICLR), 2024.

[35] Dan Friedman, Alexander Wettig, and Danqi Chen. Learning transformer programs. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[1] [1]

Lake and Marco Baroni

Brenden M. Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. InInternational Conference on Machine Learning (ICML), 2018

2018

[2] [2]

COGS: A compositional generalization challenge based on semantic interpretation

Najoung Kim and Tal Linzen. COGS: A compositional generalization challenge based on semantic interpretation. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2020

2020

[3] [3]

Smith, and Mike Lewis

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics (EMNLP), 2023

2023

[4] [4]

Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Za¨ ıd Harchaoui, and Yejin Choi

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Za¨ ıd Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. InAdvances in Neural Information Processing Systems (N...

2023

[5] [5]

A mathematical framework for transformer circuits.Transformer Circuits Thread, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

2021

[6] [6]

In-context learning and induction heads.Transformer Circuits Thread, 2022

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield- Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, ...

2022

[7] [7]

Birth of a transformer: A memory viewpoint

Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herv´ e J´ egou, and L´ eon Bottou. Birth of a transformer: A memory viewpoint. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[8] [8]

The mechanistic basis of data dependence and abrupt learning in an in-context classification task

Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. InInternational Conference on Learning Representations (ICLR), 2024. 10

2024

[9] [9]

Eshaan Nichani, Alex Damian, and Jason D. Lee. How transformers learn causal structure with gradient descent. InInternational Conference on Machine Learning (ICML), 2024

2024

[10] [10]

Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022

2022

[11] [11]

Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. InInternational Conference on Learning Representations (ICLR), 2023

2023

[12] [12]

Grokking modular arithmetic, 2023

Andrey Gromov. Grokking modular arithmetic, 2023

2023

[13] [13]

On the mechanism and dynamics of modular addition: Fourier features, lottery ticket, and grokking, 2026

Jianliang He, Leda Wang, Siyu Chen, and Zhuoran Yang. On the mechanism and dynamics of modular addition: Fourier features, lottery ticket, and grokking, 2026

2026

[14] [14]

Inter- pretability in the wild: a circuit for indirect object identification in GPT-2 small

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Inter- pretability in the wild: a circuit for indirect object identification in GPT-2 small. InInternational Conference on Learning Representations (ICLR), 2023

2023

[15] [15]

How does GPT-2 compute greater-than?: Inter- preting mathematical abilities in a pre-trained language model

Michael Hanna, Ollie Liu, and Alexandre Variengien. How does GPT-2 compute greater-than?: Inter- preting mathematical abilities in a pre-trained language model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[16] [16]

Circuit tracing: Revealing computational graphs in language models.Transformer Circuits Thread, 6:16318–16352, 2025

Emmanuel Ameisen, Jack Lindsey, Adam Pearce, Wes Gurnee, Nicholas L Turner, Brian Chen, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, et al. Circuit tracing: Revealing computational graphs in language models.Transformer Circuits Thread, 6:16318–16352, 2025

2025

[17] Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, and Tal Wagner. Unveiling transformers with lego: a synthetic reasoning task. arXiv preprint arXiv:2206.04301, 2022.

[18] Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. In International Conference on Learning Representations (ICLR), 2023.

[19] Xander Davies, Max Nadeau, Nikhil Prakash, Tamar Rott Shaham, and David Bau. Discovering variable binding circuitry with desiderata. ICML 2023 Workshop on Deployable Generative AI, 2023.

[20] Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick, and Hidenori Tanaka. Compositional capabilities of autoregressive transformers: A study on synthetic, interpretable tasks. In International Conference on Machine Learning (ICML), 2024.

[21] Tianyu He, Darshil Doshi, Aritra Das, and Andrey Gromov. Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[22] Yiwei Wu, Atticus Geiger, and Raphaël Millière. How do transformers learn variable binding in symbolic programs? arXiv preprint arXiv:2505.20896, 2025.

[23] Xingyu Zhao, Darsh Sharma, Rheeya Uppaal, and Yiqiao Zhong. Shattered compositionality: Counterintuitive learning dynamics of transformers for arithmetic, 2026.

[24] Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning (ICML), 2023.

[25] Vivien Cabannes, Charles Arnal, Wassim Bouaziz, Xingyu Yang, François Charton, and Julia Kempe. Iteration head: A mechanistic study of chain-of-thought. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

[26] Yuandong Tian. Composing global solutions to reasoning tasks via algebraic objects in neural nets, 2025.

[27] Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B Simon, Michael R DeWeese, Surya Ganguli, and Nina Miolane. Alternating gradient flows: A theory of feature learning in two-layer neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2026.

[28] Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D. Lee, and Denny Wu. Learning compositional functions with transformers from easy-to-hard data. In Annual Conference on Learning Theory (COLT), 2025.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

[30] Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. In Findings of the Association for Computational Linguistics (ACL), 2023.

[31] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.

[32] Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022.

[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

[34] Depen Morwani, Benjamin L. Edelman, Costin-Andrei Oncescu, Rosie Zhao, and Sham M. Kakade. Feature emergence via margin maximization: case studies in algebraic tasks. In International Conference on Learning Representations (ICLR), 2024.

[35] Dan Friedman, Alexander Wettig, and Danqi Chen. Learning transformer programs. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

A Additional Experimental Results

Figure 10 shows the accuracies by evaluation set for another training run with the same parameters. The same “spikes” in non-var-restricted and var-restricted accuracies seen ...