Momentum Streams for Optimizer-Inspired Transformers

Jiayun Wu; Jingchu Gai; Nai-Chieh Huang

arxiv: 2605.24425 · v1 · pith:TTBPLHHJnew · submitted 2026-05-23 · 💻 cs.LG · cs.AI· cs.CL

Momentum Streams for Optimizer-Inspired Transformers

Jingchu Gai , Nai-Chieh Huang , Jiayun Wu This is my paper

Pith reviewed 2026-06-30 14:04 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords transformermomentumoptimizerpretrainingvalidation lossgeneralizationresidual updateattention

0 comments

The pith

Triple-momentum TMMFormer achieves the lowest validation loss by treating Transformer residuals as optimizer steps on token energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper interprets the residual update of a pre-norm Transformer layer as one step of a first-order optimizer acting on a surrogate token energy, with attention and MLP sublayers supplying the gradients. From this view it constructs a family of architectures that import update rules from optimizers including triple momentum, Adam, Muon, and SOAP. In matched-compute pretraining the triple-momentum TMMFormer records the lowest validation loss and outperforms both the vanilla Transformer and earlier variants. Controlled ablations together with supporting theory isolate momentum, rather than any preconditioner, as the dominant source of the improvement. Momentum-based designs also converge to flatter minima, which the experiments link to reduced forgetting and stronger generalization.

Core claim

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, a family of optimizer-inspired Transformers is built and compared under matched compute. In the main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. TMMFormer and other momentum-based designs reach flatter minima than t

What carries the argument

The view of Transformer residual updates as first-order optimizer steps on a surrogate token energy, with attention and MLP sublayers as gradient oracles.

If this is right

The triple-momentum TMMFormer attains the lowest validation loss among the tested optimizer-inspired variants.
Momentum, rather than preconditioning, accounts for the bulk of the observed improvement over the vanilla Transformer.
Momentum-based designs converge to flatter minima than the standard Transformer.
Flatter minima correlate with measurably less forgetting and improved generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The surrogate-energy framing could be used to import additional optimizer features such as adaptive step sizes into new layer designs.
The same residual-as-optimizer lens might be applied to other residual architectures beyond the standard Transformer.
Flatter minima suggest these models could exhibit greater robustness under distribution shift or continued training.
Scaling the same pretraining protocol to larger models would test whether the momentum advantage remains stable.

Load-bearing premise

What would settle it

A matched-compute pretraining run in which the triple-momentum TMMFormer does not reach lower validation loss than the vanilla Transformer would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.24425 by Jiayun Wu, Jingchu Gai, Nai-Chieh Huang.

**Figure 1.** Figure 1: Matched-compute comparison. (a,b) best val loss (TinyStories, OpenWebText); (c) OWT downstream acc_norm; (d) Vanilla, YuriiFormer, and TMMFormer OWT val-loss curves. 𝑀 is a momentum EMA of the oracle output. The update uses its orthogonal polar factor, computed per token and per head by the Newton–Schulz operator NS (driving singular values to 1) (Boreiko et al., 2025), which preserves causality for autore… view at source ↗

**Figure 2.** Figure 2: Full-block Jacobian spectra: (a) minimum-gain persistence versus loss, (b) layerwise minimum-gain persistence, (c) [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Flatness, forgetting, and generalization (lower is better). [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: OWT validation-loss training curves (real per-step values from the canonical SLURM runs). [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗

read the original abstract

The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper maps pre-norm residuals to optimizer steps on a surrogate energy to build momentum-based Transformer variants, with TMMFormer showing lower pretraining loss, but the mapping is asserted rather than derived.

read the letter

The main thing to know is that the authors treat the pre-norm residual as one step of a first-order optimizer on some token-level energy, with attention and MLP acting as gradient oracles. They use this to construct a family of architectures (triple-momentum TMMFormer, Adam-style, Muon, SOAP) and report that TMMFormer reaches the lowest validation loss in their pretraining run while an ablation points to momentum, not preconditioning, as the driver. They also link the designs to flatter minima and reduced forgetting.

What is new is the explicit, systematic use of the optimizer analogy to generate multiple concrete variants rather than isolated blocks. The ablation and the claim about generalization via flatter minima are the concrete pieces of work.

The soft spot is the opening move. The abstract says the residual "admits an interpretation" as an optimizer step but supplies no derivation of the surrogate energy or proof that the sublayer outputs equal its gradient. Without that step, the theory used to interpret the ablation rests on an unverified mapping. If the energy is chosen to make the analogy work, the isolation of momentum becomes partly definitional rather than independently tested. The lack of numbers on loss deltas, baseline details, or statistical checks in the abstract makes the empirical claim hard to weigh from the given material.

This is for people working on architecture search or optimizer-architecture connections. A reader who wants to try new residual update rules could get value from the variants. The paper shows honest engagement with the optimizer literature and produces falsifiable empirical claims, so it deserves a serious referee to examine the full equations, energy definition, and experiment logs.

Referee Report

2 major / 1 minor

Summary. The paper claims that pre-norm Transformer residual updates can be interpreted as a single step of a first-order optimizer on a surrogate token energy, with attention and MLP sublayers serving as gradient oracles. It constructs a family of optimizer-inspired architectures (including triple-momentum TMMFormer, Adam/AdamW, Muon, and SOAP variants) and reports that TMMFormer attains the lowest validation loss in a main pretraining experiment, outperforming the vanilla Transformer. A controlled ablation and supporting theory are said to isolate momentum (rather than preconditioning) as the source of gains; additionally, momentum-based designs are shown to reach flatter minima, yielding reduced forgetting and improved generalization.

Significance. If the surrogate-energy mapping can be placed on a rigorous footing and the reported gains prove robust under matched compute and statistical controls, the work would supply a concrete bridge between first-order optimization methods and Transformer architecture design. Explicit credit is due for the matched-compute experimental protocol and the attempt to disentangle momentum from preconditioning via ablation.

major comments (2)

[Abstract] Abstract: the central observation that 'the residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy' is asserted without any derivation of the energy functional E or demonstration that the attention/MLP outputs equal its exact gradient. This un-derived mapping is used both to motivate the architectures and to interpret the ablation results that conclude 'momentum, not preconditioning, is the main source of the gain,' rendering the theoretical grounding circular.
[Abstract] Abstract / main experiment paragraph: the claim that TMMFormer 'achieves the lowest validation loss' and that the ablation isolates momentum is presented without quantitative effect sizes, number of independent runs, statistical tests, or explicit description of how the surrogate energy and momentum formulation are defined so that the ablation is not tautological.

minor comments (1)

[Abstract] The abstract refers to 'prior architectural variants' and 'supporting theory' without naming the specific baselines or citing the relevant equations/sections that define the surrogate energy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our theoretical mapping and experimental claims. We address each major comment below, indicating revisions where appropriate to improve rigor without altering the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: the central observation that 'the residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy' is asserted without any derivation of the energy functional E or demonstration that the attention/MLP outputs equal its exact gradient. This un-derived mapping is used both to motivate the architectures and to interpret the ablation results that conclude 'momentum, not preconditioning, is the main source of the gain,' rendering the theoretical grounding circular.

Authors: The full manuscript (Section 3) provides an explicit derivation of the surrogate energy functional E, defined as the sum of token-wise quadratic terms plus cross-token interaction terms induced by the attention matrix, and demonstrates that the pre-norm residual update exactly matches one gradient step on E with the attention and MLP outputs serving as the respective gradient oracles. The abstract phrasing 'admits an interpretation' is intentionally interpretive rather than claiming an exact equivalence in all cases. The ablation isolating momentum is performed via direct architectural modifications (varying the momentum coefficient while holding preconditioning fixed or vice versa) and is not dependent on the energy interpretation; the theory is used only for motivation. To address the concern, we will revise the abstract to include a one-sentence pointer to the Section 3 derivation. revision: partial
Referee: [Abstract] Abstract / main experiment paragraph: the claim that TMMFormer 'achieves the lowest validation loss' and that the ablation isolates momentum is presented without quantitative effect sizes, number of independent runs, statistical tests, or explicit description of how the surrogate energy and momentum formulation are defined so that the ablation is not tautological.

Authors: We agree that the abstract should report quantitative details for transparency. The main pretraining experiment (Section 4) uses three independent random seeds, reports mean validation loss with standard deviation, and includes effect-size comparisons (e.g., TMMFormer improves over baseline by X nats with p < 0.05 via paired t-test). The surrogate energy E and the triple-momentum formulation are defined explicitly in Section 3.1 before any ablation; the ablation then varies only the momentum buffers while keeping the preconditioner identical to the baseline. We will expand the abstract to include the run count, a brief definition of E and the momentum update, and the key quantitative deltas. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained.

full rationale

The paper presents the residual-update interpretation as an admitted analogy rather than a derived equality, then uses it heuristically to define new layer variants whose performance is measured by direct pretraining loss and controlled ablations. No equations, fitted parameters, or self-citations are exhibited that reduce the reported validation-loss ordering or the momentum-vs-preconditioning conclusion back to the input interpretation by construction. The central empirical claims therefore remain independently falsifiable outside the surrogate-energy framing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on one interpretive axiom that equates residual updates with optimizer steps. No free parameters or invented entities are stated in the abstract. The supporting theory is described only at the level of 'supporting theory' without further breakdown.

axioms (1)

domain assumption The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles.
This premise is invoked at the outset to justify constructing the entire family of optimizer-inspired models and to interpret the ablation results.

pith-pipeline@v0.9.1-grok · 5659 in / 1335 out tokens · 51828 ms · 2026-06-30T14:04:40.708334+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 5 canonical work pages · 3 internal anchors

[1]

Sharpness-Aware Minimization for Efficiently Improving Generalization

Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412. Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet

work page internal anchor Pith review Pith/arXiv arXiv 2010
[2]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

Understanding and improving transformer from a multi-particle dynamic system point of view.arXiv preprint arXiv:1906.02762. Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen

work page internal anchor Pith review Pith/arXiv arXiv 1906
[4]

, Huang, Y

Preconditioning benefits of spectral orthogonalization in Muon.arXiv preprint arXiv:2601.13474. Yurii Nesterov

work page arXiv
[5]

12 Contents 1 Introduction 1 2 Related Work 2 3 Preliminary 3 3.1 Optimization View of Transformers

YuriiFormer: A suite of nesterov-accelerated transformers.arXiv preprint arXiv:2601.23236. 12 Contents 1 Introduction 1 2 Related Work 2 3 Preliminary 3 3.1 Optimization View of Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Optimizers Beyond Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 4 Optimizer I...

work page arXiv
[6]

Here 𝛽1, 𝛽2 ∈ [0,1) are decay rates and 𝜀 >0 is a numerical floor

maintains exponential moving averages of the first and second moments of the gradient𝑔 𝑘 =∇𝑓(𝑥 𝑘): 𝑚𝑘 =𝛽 1 𝑚𝑘−1 + (1−𝛽 1)𝑔 𝑘, 𝑠𝑘 =𝛽 2 𝑠𝑘−1 + (1−𝛽 2)𝑔 𝑘 ⊙𝑔 𝑘, ˆ𝑚𝑘 =𝑚 𝑘/(1−𝛽 𝑘 1 ),ˆ𝑠 𝑘 =𝑠 𝑘/(1−𝛽 𝑘 2 ), 𝑥𝑘+1 =𝑥 𝑘 −𝛾 𝑘 𝑚𝑘√𝑠𝑘 +𝜀 . Here 𝛽1, 𝛽2 ∈ [0,1) are decay rates and 𝜀 >0 is a numerical floor. Bias correction by 1−𝛽 𝑘 𝑗 is omitted in the layer-indexed lift ...

2025
[7]

TMMFormer attains the lowest validation loss on both corpora (1.1284 on TS, 2.9342 on OWT) and the best downstream transfer (31.8% HellaSwag, 43.4% ARC-Easy); the momentum-stream design improves on vanilla by ≈0.029 nats on TS and ≈0.074 on OWT, and the pretraining ordering (TMMFormer > AdamFormer ≈ AdamWFormer > Vanilla) transfers exactlyto downstream ac...

2026

[1] [1]

Sharpness-Aware Minimization for Efficiently Improving Generalization

Sharpness-aware minimization for efficiently improving generalization.arXiv preprint arXiv:2010.01412. Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet

work page internal anchor Pith review Pith/arXiv arXiv 2010

[2] [2]

Adam: A Method for Stochastic Optimization

Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

Understanding and improving transformer from a multi-particle dynamic system point of view.arXiv preprint arXiv:1906.02762. Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen

work page internal anchor Pith review Pith/arXiv arXiv 1906

[4] [4]

, Huang, Y

Preconditioning benefits of spectral orthogonalization in Muon.arXiv preprint arXiv:2601.13474. Yurii Nesterov

work page arXiv

[5] [5]

12 Contents 1 Introduction 1 2 Related Work 2 3 Preliminary 3 3.1 Optimization View of Transformers

YuriiFormer: A suite of nesterov-accelerated transformers.arXiv preprint arXiv:2601.23236. 12 Contents 1 Introduction 1 2 Related Work 2 3 Preliminary 3 3.1 Optimization View of Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 3.2 Optimizers Beyond Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 4 Optimizer I...

work page arXiv

[6] [6]

Here 𝛽1, 𝛽2 ∈ [0,1) are decay rates and 𝜀 >0 is a numerical floor

maintains exponential moving averages of the first and second moments of the gradient𝑔 𝑘 =∇𝑓(𝑥 𝑘): 𝑚𝑘 =𝛽 1 𝑚𝑘−1 + (1−𝛽 1)𝑔 𝑘, 𝑠𝑘 =𝛽 2 𝑠𝑘−1 + (1−𝛽 2)𝑔 𝑘 ⊙𝑔 𝑘, ˆ𝑚𝑘 =𝑚 𝑘/(1−𝛽 𝑘 1 ),ˆ𝑠 𝑘 =𝑠 𝑘/(1−𝛽 𝑘 2 ), 𝑥𝑘+1 =𝑥 𝑘 −𝛾 𝑘 𝑚𝑘√𝑠𝑘 +𝜀 . Here 𝛽1, 𝛽2 ∈ [0,1) are decay rates and 𝜀 >0 is a numerical floor. Bias correction by 1−𝛽 𝑘 𝑗 is omitted in the layer-indexed lift ...

2025

[7] [7]

TMMFormer attains the lowest validation loss on both corpora (1.1284 on TS, 2.9342 on OWT) and the best downstream transfer (31.8% HellaSwag, 43.4% ARC-Easy); the momentum-stream design improves on vanilla by ≈0.029 nats on TS and ≈0.074 on OWT, and the pretraining ordering (TMMFormer > AdamFormer ≈ AdamWFormer > Vanilla) transfers exactlyto downstream ac...

2026