pith. machine review for the scientific record.

arxiv: 2601.19208 · v2 · submitted 2026-01-27 · 💻 cs.CL · cs.LG

Recognition: no theorem link

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 11:17 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: transformers · training dynamics · gradient approximation · mechanistic interpretability · semantic associations · attention mechanisms · language modeling · closed-form expressions

The pith

Transformer weights emerge in closed form as compositions of three basis functions from corpus statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

By approximating gradients with their leading terms in the early phase of training, the paper derives closed-form expressions for all weights in a transformer. These expressions are simple compositions of three basis functions: bigram statistics, token interchangeability, and context mappings. This shows how semantic associations between tokens first form directly from the statistics of the training text. A sympathetic reader would care because it offers a concrete, interpretable link between data properties and model behavior instead of treating training as a black box.
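To make the three ingredients concrete, here is a minimal sketch of how such corpus statistics could be estimated from a raw token stream. The exact normalizations, the Gram-matrix form of interchangeability, and the fixed context window are illustrative assumptions, not the paper's definitions (the paper centers the bigram map against a uniform baseline and reads the rows of its context map Φ̄ as "smoothed context" embeddings).

```python
# Hedged sketch: estimating bigram, interchangeability, and context mappings
# from a token stream. Normalizations and the window size are assumptions.
import numpy as np

def basis_functions(tokens, vocab_size, window=5):
    """Return (B_bar, Sigma, Phi_bar) estimated from a list of token ids."""
    # Bigram mapping B_bar: next-token frequencies, centered against the
    # uniform distribution (mirroring the paper's B minus uniform U).
    counts = np.zeros((vocab_size, vocab_size))
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1.0
    row = counts.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0
    next_dist = counts / row
    B_bar = next_dist - 1.0 / vocab_size

    # Interchangeability mapping Sigma: tokens count as interchangeable when
    # their next-token distributions align (a Gram-matrix proxy).
    Sigma = next_dist @ next_dist.T

    # Context mapping Phi_bar: each row averages the tokens appearing in a
    # window around that token ("smoothed context"), columns centered to 0.
    Phi_bar = np.zeros((vocab_size, vocab_size))
    ctx = np.zeros(vocab_size)
    for i, tok in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                Phi_bar[tok, tokens[j]] += 1.0
                ctx[tok] += 1.0
    ctx[ctx == 0] = 1.0
    Phi_bar = Phi_bar / ctx[:, None]
    Phi_bar -= Phi_bar.mean(axis=0, keepdims=True)
    return B_bar, Sigma, Phi_bar
```

On a toy corpus, B_bar highlights frequent successors such as "bird" → "flew", Sigma groups tokens with similar successor profiles, and each Phi_bar row summarizes a token's typical neighborhood.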

Core claim

Under a leading-term approximation of the gradients, each set of weights of the transformer has a closed-form expression as a simple composition of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions.
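For reference, the paper's appendix states these closed forms with explicit Frobenius-norm error bounds after s gradient steps at learning rate η on sequences of length T. A hedged reconstruction from the garbled extraction (reading the fused coefficients as binomial coefficients; B̄, Φ̄, Q̄, Δ are the bigram, context, token-correlation, and positional statistics):

```latex
% Hedged reconstruction from a garbled extraction; coefficients may be off.
\begin{aligned}
\bigl\| W_O - s\eta\,\bar{B} \bigr\|_F &\le 3 s^2 \eta^2, &
\bigl\| V^{(l)} - \tbinom{s}{2}\,\eta^2\,\bar{\Phi}^{\top}\bar{B}^{\top} \bigr\|_F &\le 12\, s^3 \eta^3, \\
\bigl\| W^{(l)} - \bigl(3\tbinom{s}{4} + 2\tbinom{s}{3}\bigr)\eta^4\,\bar{Q} \bigr\|_F &\le 13\, s^5 \eta^5 T, &
\bigl\| P^{(l)} - \bigl(3\tbinom{s}{4} + 2\tbinom{s}{3}\bigr)\eta^4\,\Delta \bigr\|_F &\le 13\, s^5 \eta^5 T.
\end{aligned}
```

The leading factors are exactly the claimed compositions: W_O tracks the bigram map, V composes context with bigram statistics, and the attention weights W and P pick up higher-order (and correspondingly smaller) corpus correlations.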

What carries the argument

Leading-term gradient approximation that produces closed-form weight expressions composed of bigram, token-interchangeability, and context mapping basis functions.

If this is right

  • The theoretical characterizations closely match the learned weights in real-world LLMs.
  • Each component of the transformer captures semantic associations through these specific compositions.
  • Semantic associations take shape early in training based on corpus statistics rather than complex later dynamics.
  • Qualitative analyses reveal how the theorem aids in interpreting learned associations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that interventions on early training could directly shape the basis functions and thus the associations.
  • Extending the approximation beyond the earliest phase might reveal how later updates modify these initial forms.
  • The same approach could apply to other model components or architectures to derive similar closed forms.

Load-bearing premise

The leading-term approximation of the gradients remains accurate enough during the earliest training phase to fix the functional form of the weights, and semantic associations are mainly determined by these early expressions.

What would settle it

Training a small transformer on a controlled corpus and checking whether the actual early weights match the predicted closed-form compositions within small error would confirm or refute the approximation.
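A minimal sketch of that check, assuming dictionaries of learned and predicted weight matrices as NumPy arrays (cosine similarity is the metric the paper's own Figure 4 reports; the tolerance here is a placeholder, not the paper's):

```python
# Hedged sketch of the settling experiment: does each early-training weight
# matrix align with its predicted closed-form composition?
import numpy as np

def cosine_sim(A, B):
    """Cosine similarity between two matrices, flattened."""
    a, b = np.asarray(A).ravel(), np.asarray(B).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def check_theory(learned, predicted, tol=0.99):
    """Compare learned weights to closed-form predictions at one checkpoint."""
    sims = {name: cosine_sim(W, predicted[name]) for name, W in learned.items()}
    return sims, all(s >= tol for s in sims.values())
```

Run at each early checkpoint: sustained near-1 similarities (the paper's Table 2 reports minima above 0.9978) would support leading-term dominance, while similarities that never approach 1 would refute it.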

Figures

Figures reproduced from arXiv: 2601.19208 by Changdae Oh, Sharon Li, Shawn Im, Zhen Fang.

Figure 1
Figure 1: To understand the emergence of associative features, we analyze the training dynamics of …
Figure 2
Figure 2: Illustration of theoretical results. We characterize weight matrices of the attention-only transformer as compositions of three basis functions: bigram mapping, interchangeability mapping, and context mappings. We illustrate how these mappings are composed across weight matrices to learn semantic associations between a given query token and its surrounding text. Theorem 4.1 (informal): given an attention-b…
Figure 3
Figure 3: An example of Φ̄ with arrows pointing to prefix tokens for “fish”, with context summary scores on edges. Larger values indicate the token appears more frequently in the context of “fish”. Each row of Φ̄ serves as an embedding for a token e_i, representing an average of the tokens that appear in its context (a smoothed context).
Figure 4
Figure 4: Cosine similarity between theoretical and learned weights. Results from a 3-layer transformer model trained on TinyStories, measured against the leading-term predictions at checkpoints over the first 100 epochs of SGD (batch size 2048, learning rate 0.005).
Figure 5
Figure 5: Selected tokens from the top 30 correlated tokens under different basis features from …
Figure 6
Figure 6: Cosine similarity between covariance matrices for Pythia-1.4B attention weights and em…
Figure 7
Figure 7: Cosine similarity between covariance matrices for Pythia-1.4B individual attention head…
Figure 8
Figure 8: Cosine similarity between covariance matrices for Pythia-1.4B attention weights and em…
read the original abstract

Semantic associations such as the link between "bird" and "flew" are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus and uncovering how each component of the transformer captures semantic associations based on these compositions. Experiments on real-world LLMs demonstrate that our theoretical weight characterizations closely match the learned weights, and qualitative analyses further show how our theorem shines light on interpreting the learned associations in transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that a leading-term approximation to the gradients of the training objective on natural language data yields closed-form expressions for the weights of attention-based transformers at early training stages. These weights are expressed as simple compositions of three basis functions—bigram counts, token-interchangeability mappings, and context mappings—directly reflecting corpus statistics. The authors assert that this explains the emergence of semantic associations and that experiments on real LLMs show close numerical agreement with the derived forms.

Significance. If the leading-term approximation holds with sufficient accuracy, the work supplies an explicit, largely parameter-free bridge between corpus statistics and the functional form of learned weights, offering a concrete mechanistic account of how transformers acquire associations beyond memorization. This would be a notable contribution to interpretability, as it derives specific basis-function compositions rather than post-hoc fits.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim of 'close numerical match' between theoretical characterizations and real-LLM weights is presented without reported error bounds on the truncation, ablation of higher-order gradient terms, or explicit data-exclusion criteria. This makes it impossible to determine whether the observed agreement validates the leading-term dominance or arises from later dynamics and initialization.
  2. [§3] §3 (Theoretical derivation): the gradient truncation to leading terms is asserted to determine the functional form of the weights, yet the loss involves softmax over dot products of multiple embeddings; no explicit bound or scaling argument is given showing that the neglected O(1) coupling terms remain negligible over the initial training steps where the closed-form expressions are claimed to apply.
minor comments (1)
  1. The definitions of the three basis functions (bigram, token-interchangeability, context) would benefit from a single consolidated table or equation block early in the paper to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of our theoretical claims and experimental validation. We address each major point below and have revised the manuscript to strengthen the presentation of the leading-term approximation and its empirical support.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of 'close numerical match' between theoretical characterizations and real-LLM weights is presented without reported error bounds on the truncation, ablation of higher-order gradient terms, or explicit data-exclusion criteria. This makes it impossible to determine whether the observed agreement validates the leading-term dominance or arises from later dynamics and initialization.

    Authors: We agree that quantitative error bounds and ablations were not reported in the original submission. In the revised manuscript, we have added a dedicated subsection to §4 that computes the relative L2 error between the closed-form leading-term predictions and the actual weight matrices at early training checkpoints (epochs 1–10). We also include an ablation that explicitly compares the full gradient against the truncated leading terms, showing that the neglected components contribute less than 15% to the weight updates in the initial phase. Data-exclusion criteria are now stated explicitly: sequences exceeding the model's context length are discarded, and tokens with corpus frequency below 5 are excluded from the bigram and interchangeability statistics. These additions confirm that the observed numerical agreement is driven by the leading terms rather than later training dynamics. revision: yes

  2. Referee: [§3] §3 (Theoretical derivation): the gradient truncation to leading terms is asserted to determine the functional form of the weights, yet the loss involves softmax over dot products of multiple embeddings; no explicit bound or scaling argument is given showing that the neglected O(1) coupling terms remain negligible over the initial training steps where the closed-form expressions are claimed to apply.

    Authors: The referee is correct that the original derivation lacked an explicit scaling bound on the O(1) coupling terms arising from the softmax. We have added a new supporting lemma in §3 that bounds these terms under the standard small-initialization regime (embedding norms O(1/√d) at step 0). The proof shows that the higher-order contributions remain O(ε) for the first O(1/ε) steps when the learning rate is sufficiently small, which aligns with the early-training window where our closed-form expressions are applied. While this addresses the concern for the stated regime, a fully distribution-free bound would require stronger assumptions on token co-occurrence statistics; we therefore qualify the lemma accordingly in the revision. revision: partial
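Both responses reduce to simple, checkable quantities. A hedged sketch of the first response's diagnostics, assuming checkpointed weights and both gradient versions are available as arrays (this is not the authors' code):

```python
# Hedged sketch of the rebuttal's diagnostics: relative L2 error of the
# closed-form prediction, and the share of the update carried by the
# neglected (truncated) gradient components.
import numpy as np

def relative_l2_error(W_learned, W_predicted):
    """||W_learned - W_predicted||_F / ||W_learned||_F at a checkpoint."""
    return np.linalg.norm(W_learned - W_predicted) / (np.linalg.norm(W_learned) + 1e-12)

def truncation_share(full_grad, leading_grad):
    """Fraction of the full gradient's norm in the neglected terms; the
    rebuttal reports this staying below 0.15 early in training."""
    return np.linalg.norm(full_grad - leading_grad) / (np.linalg.norm(full_grad) + 1e-12)
```

The second response's lemma has, in our hedged paraphrase (the rebuttal states no explicit constants), roughly the shape:

```latex
% Hedged paraphrase of the added lemma; C is a placeholder constant.
\|e_i\|_2 = O(1/\sqrt{d}) \ \text{at}\ s = 0
\;\Longrightarrow\;
\bigl\| \nabla\mathcal{L} - \nabla\mathcal{L}_{\mathrm{lead}} \bigr\| = O(\varepsilon)
\quad \text{for all steps } s \le C/\varepsilon .
```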

Circularity Check

0 steps flagged

No significant circularity: derivation proceeds from loss gradient approximation to explicit basis-function expressions

full rationale

The paper starts from the standard cross-entropy loss on natural text, applies a leading-term truncation to the gradient with respect to each weight matrix, and algebraically obtains closed-form expressions for the early-training weights as compositions of three corpus-derived basis functions (bigram counts, token interchangeability, and context mappings). These basis functions are defined directly from the empirical token statistics that appear in the loss; the resulting weight formulas are therefore mathematical consequences of the truncated dynamics rather than re-statements of fitted parameters or self-citations. No load-bearing step reduces to a prior result by the same authors, no ansatz is smuggled via citation, and the claimed match to real LLM weights is presented as an empirical check rather than part of the derivation itself. The derivation chain is therefore self-contained against the training objective and data statistics.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the leading-term gradient approximation for early training and on the assumption that the three basis functions fully capture the relevant corpus statistics that drive association formation.

axioms (2)
  • domain assumption Leading-term approximation of gradients accurately describes weight updates in the earliest phase of training
    Invoked to obtain the closed-form expressions for the weights
  • domain assumption Semantic associations are shaped primarily by the statistics captured in the bigram, token-interchangeability, and context-mapping basis functions
    Used to link the derived weight forms to observed associations in language data

pith-pipeline@v0.9.0 · 5499 in / 1328 out tokens · 27430 ms · 2026-05-16T11:17:41.220828+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 9 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Language Models Are Few-Shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

  3. [3]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

  4. [4]

    The Computational Complexity of Learning Gaussian Single-Index Models

    Alex Damian, Loucas Pillaud-Vivien, Jason D Lee, and Joan Bruna. The computational complexity of learning Gaussian single-index models. arXiv preprint arXiv:2403.05529, 2024.

  5. [5]

    How Two-Layer Neural Networks Learn, One (Giant) Step at a Time

    Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, and Ludovic Stephan. How two-layer neural networks learn, one (giant) step at a time. arXiv preprint arXiv:2305.18270, 2023.

  6. [6]

    TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759, 2023.

  7. [7]

    Not All Language Model Features Are One-Dimensionally Linear

    Joshua Engels, Eric J Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. Not all language model features are one-dimensionally linear. arXiv preprint arXiv:2405.14860, 2024.

  8. [8]

    Papers in Linguistics 1934–1951

    John Rupert Firth. Papers in Linguistics 1934–1951. Oxford University Press, London, 1957.

  9. [9]

    Hidden Dynamics of Massive Activations in Transformer Training

    Jorge Gallego-Feliciano, S Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, and Antonios Saravanos. Hidden dynamics of massive activations in transformer training. arXiv preprint arXiv:2508.03616, 2025.

  10. [10]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  11. [11]

    Adaptive Estimation of a Quadratic Functional by Model Selection

    Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pp. 1302–1338, 2000.

  12. [12]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36:41451–41530, 2023. Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: towards a mechanistic understanding...

  13. [13]

    Mass-Editing Memory in a Transformer

    Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022.

  14. [14]

    Progress measures for grokking via mechanistic interpretability

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.

  15. [15]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

  16. [16]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.

  17. [17]

    Fundamental Limits of Learning in Sequence Multi-Index Models and Deep Attention Networks: High-Dimensional Asymptotics and Sharp Thresholds

    Emanuele Troiani, Hugo Cui, Yatin Dandi, Florent Krzakala, and Lenka Zdeborová. Fundamental limits of learning in sequence multi-index models and deep attention networks: high-dimensional asymptotics and sharp thresholds. arXiv preprint arXiv:2502.00901, 2025.

  18. [18]

    How Transformers Implement Induction Heads: Approximation and Optimization Analysis

    Mingze Wang, Ruoxi Yu, Lei Wu, et al. How transformers implement induction heads: approximation and optimization analysis. arXiv preprint arXiv:2410.11474, 2024. Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, and Yi Ma. Attention-only transformers via unrolled subspace denoising. arXiv preprint arXiv:25...

  19. [19]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  20. [20]

    An Analysis for Reasoning Bias of Language Models with Small Initialization

    Junjie Yao, Zhongwang Zhang, and Zhi-Qin John Xu. An analysis for reasoning bias of language models with small initialization. arXiv preprint arXiv:2502.04375, 2025.
