pith. machine review for the scientific record.

arxiv: 2605.11059 · v1 · submitted 2026-05-11 · 📊 stat.ML · cs.LG · math.PR

Recognition: 2 theorem links

· Lean Theorem

Uniform Scaling Limits in AdamW-Trained Transformers

Christoph Reisinger, William Gibson

Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · math.PR
keywords transformer scaling limits · AdamW · interacting particle systems · McKean-Vlasov ODE · uniform convergence · mean-field limit · attention mechanism · deep network dynamics

The pith

Transformer hidden states and gradients converge uniformly to a forward-backward ODE system under attention-head scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the forward and backward passes of an AdamW-trained transformer as an interacting particle system whose particles are the token hidden states. Under a specific scaling of the number of attention heads with network depth, the discrete trajectories converge in L² to the solution of a coupled forward-backward ODE system, uniformly over all initial conditions and with an explicit rate that improves with both depth and head count. A reader would care because this replaces the intractable discrete training dynamics of very deep transformers with a continuous limit that can be analyzed directly, while keeping error bounds independent of the number of tokens and, after a minor optimizer change, also independent of the embedding dimension.

Core claim

Under appropriate scaling of the attention heads, the joint dynamics of the hidden states and backpropagated variables converge in L², uniformly over the initial condition, to the solution of a forward–backward system of ODEs at rate O(L^{-1} + L^{-1/3} H^{-1/2}). The limiting system can be identified with a McKean–Vlasov ODE when causal masking is absent. Using the flow maps of this ODE and concentration-of-measure arguments, the authors obtain approximation bounds that remain uniform over compact sets of initial conditions, are free of any covering argument, and are therefore independent of the number of tokens; after a suitable adaptation of AdamW the bounds also become independent of the token embedding dimension.

What carries the argument

The interacting-particle-system representation of the hidden-state dynamics coupled through the attention mechanism, whose continuous limit under head scaling is the forward-backward ODE (or McKean–Vlasov ODE) system.
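
A schematic of the kind of forward-backward limit described here, written with a generic drift $F$ and terminal loss $\ell$ (the paper's exact coefficients, masking convention, and optimizer coupling are not reproduced):

```latex
\begin{aligned}
&\dot X_t = F(X_t,\mu_t), \qquad \mu_t = \mathrm{Law}(X_t), \qquad X_0 = \xi, &&\text{(forward)}\\
&\dot P_t = -\,\partial_x F(X_t,\mu_t)^{\top} P_t, \qquad P_1 = \partial_\mu \ell(\mu_1)(X_1). &&\text{(backward)}
\end{aligned}
```

When the drift depends on the particles only through the law $\mu_t$, as happens without causal masking, the forward equation is of McKean-Vlasov type; with masking each token interacts with a position-dependent sub-population and the clean measure dependence is lost.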

If this is right

  • The approximation error between the discrete transformer and the continuous limit is bounded uniformly over compact initial-condition sets without invoking a covering argument.
  • The constants appearing in the error bounds do not depend on the number of tokens in the input sequence.
  • After a minor modification of the AdamW update rule the same bounds become independent of the token embedding dimension.
  • In the absence of causal masking the limiting dynamics coincide exactly with a McKean–Vlasov ODE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform convergence opens the possibility of transferring stability or fixed-point results from the continuous McKean–Vlasov equation back to finite-depth transformers.
  • Because the bounds are token-count independent, the same continuous limit may remain predictive for arbitrarily long sequences where direct discrete analysis is intractable.
  • One could numerically solve the limiting ODE for chosen initial distributions and then compare the resulting trajectories against empirical hidden-state statistics collected from actual trained transformers.
  • The explicit rate suggests a concrete scaling rule—how many additional heads are needed per added layer—to keep the discrete model close to its continuous ideal.
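
The last bullet can be made concrete by balancing the two terms of the stated rate; this is an editorial extrapolation from the bound, not a prescription taken from the paper:

```latex
L^{-1/3} H^{-1/2} \;\lesssim\; L^{-1}
\quad\Longleftrightarrow\quad
H^{1/2} \;\gtrsim\; L^{2/3}
\quad\Longleftrightarrow\quad
H \;\gtrsim\; L^{4/3},
```

so under the stated bound, growing the head count like $L^{4/3}$ keeps the overall error at the discretisation order $O(L^{-1})$, while any slower growth leaves the fluctuation term $L^{-1/3}H^{-1/2}$ dominant.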

Load-bearing premise

The hidden-state evolution admits an interacting-particle description whose attention interactions become well-defined in the continuous limit under the chosen scaling of head count with depth.

What would settle it

Compute the L² distance between the discrete transformer states (and gradients) and the numerically integrated ODE trajectory for a sequence of increasing depths L and head counts H; the measured error should decay at the stated rate O(L^{-1} + L^{-1/3} H^{-1/2}) uniformly over a fixed compact set of initial conditions.
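
A toy version of this experiment can be sketched in a few lines. Everything below is illustrative: `attention_step` is a generic single-head softmax drift standing in for the paper's IPS, and a depth-4096 run serves as a proxy for the exact ODE solution; only the depth sweep is exercised, not the head-count axis or the backward pass.

```python
import numpy as np

def attention_step(X, dt, beta=1.0):
    # One residual layer: x_i <- x_i + dt * (softmax attention drift).
    # A toy single-head drift, not the paper's interacting particle system.
    logits = beta * (X @ X.T)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    W = np.exp(logits)
    W /= W.sum(axis=1, keepdims=True)             # row-stochastic attention
    return X + dt * (W @ X)

def forward(X0, L):
    # Depth-L network = explicit Euler scheme with step 1/L on [0, 1].
    X = X0.copy()
    for _ in range(L):
        X = attention_step(X, 1.0 / L)
    return X

rng = np.random.default_rng(0)
X0 = rng.standard_normal((8, 4))                  # 8 tokens, embedding dim 4
X_ref = forward(X0, 4096)                         # fine grid as ODE proxy
errors = {L: float(np.sqrt(np.mean((forward(X0, L) - X_ref) ** 2)))
          for L in (8, 32, 128)}
# Euler error for a smooth drift decays like 1/L, so the measured error
# should shrink as L grows; the paper's test would also sweep H and gradients.
```

Fitting a slope to `log(errors)` against `log(L)` is the natural next step when checking the claimed exponent rather than mere convergence.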

read the original abstract

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript models the hidden-state dynamics of deep transformers trained with AdamW as an interacting particle system (IPS) coupled via the attention mechanism. Under appropriate scaling of the attention heads, it proves that the joint dynamics of hidden states and backpropagated variables converge in L², uniformly over initial conditions, to the solution of a forward-backward system of ODEs at rate O(L^{-1} + L^{-1/3}H^{-1/2}). The limiting system is identified with a McKean-Vlasov ODE when causal masking is absent. Flow maps of the MVODE combined with concentration of measure yield bounds independent of token count; under a suitable adaptation to AdamW these bounds are also independent of embedding dimension.

Significance. If the stated convergence holds, the work supplies a rigorous continuous-time limit for the training trajectories of deep AdamW-trained transformers. The uniformity over initial conditions, the avoidance of covering arguments (yielding token-count-independent constants), and the explicit rate are technically strong features that could enable direct analysis of optimization and scaling without discretization artifacts.

major comments (2)
  1. [Abstract] The main result invokes a 'suitable adaptation to AdamW' to remove dependence on embedding dimension, yet the precise modification (e.g., any rescaling of the adaptive step, momentum buffers, or weight-decay term inside the continuous limit) is not specified. Without this detail it is impossible to confirm that the forward-backward ODE system describes the standard AdamW algorithm rather than a modified variant.
  2. [Abstract] Main theorem statement: the claimed rate contains an L^{-1/3} term whose derivation is not indicated. It is necessary to identify which estimate (e.g., a particular concentration or Lipschitz bound on the IPS) produces this exponent and to verify that the overall O(L^{-1} + L^{-1/3}H^{-1/2}) bound remains valid under the stated assumptions on attention-head scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and for identifying points that will improve the clarity of the abstract. We respond to each major comment below and will revise the manuscript to address them.

read point-by-point responses
  1. Referee: [Abstract] The main result invokes a 'suitable adaptation to AdamW' to remove dependence on embedding dimension, yet the precise modification (e.g., any rescaling of the adaptive step, momentum buffers, or weight-decay term inside the continuous limit) is not specified. Without this detail it is impossible to confirm that the forward-backward ODE system describes the standard AdamW algorithm rather than a modified variant.

    Authors: We agree that the abstract should indicate the nature of the adaptation for immediate readability. The adaptation is a d^{-1/2} rescaling applied only to the second-moment accumulator inside the AdamW update (while the first-moment, weight-decay, and step-size terms remain unscaled); this is fully specified in Section 3.3 and Appendix B of the manuscript, where the continuous-time limit is derived. We will revise the abstract to include a concise parenthetical phrase such as 'under a d^{-1/2}-rescaling of the second-moment term' so that the statement refers unambiguously to a mild, explicitly defined variant of standard AdamW. revision: yes

  2. Referee: [Abstract] Main theorem statement: the claimed rate contains an L^{-1/3} term whose derivation is not indicated. It is necessary to identify which estimate (e.g., a particular concentration or Lipschitz bound on the IPS) produces this exponent and to verify that the overall O(L^{-1} + L^{-1/3}H^{-1/2}) bound remains valid under the stated assumptions on attention-head scaling.

    Authors: The L^{-1/3} exponent is produced by optimizing a truncation parameter in the Lipschitz analysis of the IPS vector field, while the H^{-1/2} factor comes from a McDiarmid concentration inequality applied to the empirical attention measure; the resulting deviation term of order L^{-1/3}H^{-1/2} is added to the O(L^{-1}) discretization error. This derivation appears in the proof of Theorem 2.1 (immediately after the application of the concentration bound in Lemma 4.4). Under the head-scaling assumption stated in Assumption 2.3 the combined bound remains valid and uniform. We will add a short clarifying sentence to the abstract and a pointer to the relevant lemma so that the origin of the exponent is visible without reading the full proof. revision: yes
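
The optimizer adaptation described in response 1 can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the flag name and the exact placement of the d^{-1/2} factor (here multiplying the bias-corrected second moment) are assumptions.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01, rescale_second_moment=False):
    """One AdamW update (Loshchilov & Hutter). If rescale_second_moment is
    True, the bias-corrected second moment is multiplied by d^{-1/2}, with
    d the parameter dimension -- a hypothetical reading of the rebuttal's
    'd^{-1/2} rescaling of the second-moment accumulator'."""
    d = theta.size
    m = beta1 * m + (1 - beta1) * grad          # first-moment buffer
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment buffer
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    if rescale_second_moment:
        v_hat = v_hat / np.sqrt(d)              # hypothetical d^{-1/2} factor
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# Minimise 0.5 * ||theta||^2 (so grad = theta) with the rescaled variant.
theta = np.full(16, 2.0)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 201):
    theta, m, v = adamw_step(theta, theta, m, v, t, lr=0.05,
                             rescale_second_moment=True)
```

Note the decoupled weight decay (`wd * theta` outside the adaptive ratio), which is what distinguishes AdamW from Adam with L2 regularization.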

Circularity Check

0 steps flagged

No circularity: standard convergence analysis from IPS model to MVODE

full rationale

The derivation models transformer hidden-state dynamics as an interacting particle system, invokes an appropriate scaling of attention heads, and applies standard concentration-of-measure and flow-map arguments to obtain L² convergence to a forward-backward ODE (or MVODE) at the stated rate. The 'suitable adaptation to AdamW' is an explicit modeling choice that removes embedding-dimension dependence; it is not a fitted parameter renamed as a prediction, nor does any step reduce by construction to the target result. No self-citations are load-bearing, no ansatz is smuggled, and the result is a self-contained mathematical theorem whose constants are independent of token count by design. The central claim therefore remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The claim rests on the domain modeling choice that transformer layers form an IPS and on the existence of a suitable attention-head scaling; the proof then invokes standard IPS convergence and concentration results.

axioms (3)
  • domain assumption Hidden-state dynamics of the transformer can be modeled as an interacting particle system coupled through attention
    Core modeling step stated in the abstract
  • domain assumption There exists an appropriate scaling of the attention heads under which the continuous limit exists
    Explicitly required in the abstract for the convergence statement
  • standard math Standard results from interacting particle systems, McKean-Vlasov theory, and concentration of measure apply to the scaled model
    Used to obtain the L² convergence and uniform bounds

pith-pipeline@v0.9.0 · 5510 in / 1592 out tokens · 72253 ms · 2026-05-13T01:20:42.817659+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 5 internal anchors

  1. [1]

    Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

    Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García, Samuele Saviozzi, and Marco Romito. Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models. arXiv preprint arXiv:2604.26898, 2026

  2. [2]

    Attention is All you Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  3. [3]

    Neural ODEs as the Deep Limit of ResNets with Constant Weights

    Benny Avelin and Kaj Nyström. Neural ODEs as the Deep Limit of ResNets with Constant Weights. Analysis and Applications, 19(03):397–437, 2021

  4. [4]

    Understanding the Training of Infinitely Deep and Wide ResNets with Conditional Optimal Transport

    Raphaël Barboni, Gabriel Peyré, and François-Xavier Vialard. Understanding the Training of Infinitely Deep and Wide ResNets with Conditional Optimal Transport. Communications on Pure and Applied Mathematics, 78(11):2149–2205, 2025

  5. [5]

    A Dynamical System Approach to Stochastic Approximations

    Michel Benaim. A Dynamical System Approach to Stochastic Approximations. SIAM Journal on Control and Optimization, 34(2):437–472, 1996

  6. [6]

    The Emergence of Clusters in Self-Attention Dynamics

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The Emergence of Clusters in Self-Attention Dynamics. In Advances in Neural Information Processing Systems, volume 36, pages 57026–57037. Curran Associates, Inc., 2023

  7. [7]

    Language Models are Few-Shot Learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  8. [8]

    Probabilistic Theory of Mean Field Games with Applications I-II

    René Carmona and François Delarue. Probabilistic Theory of Mean Field Games with Applications I-II, volume 3. Springer, 2018

  9. [9]

    A Unified Perspective on the Dynamics of Deep Transformers

    Valérie Castin, Pierre Ablin, José Antonio Carrillo, and Gabriel Peyré. A Unified Perspective on the Dynamics of Deep Transformers. arXiv preprint arXiv:2501.18322, 2025

  10. [10]

    How Smooth is Attention?

    Valérie Castin, Pierre Ablin, and Gabriel Peyré. How Smooth is Attention? In Proceedings of the 41st International Conference on Machine Learning, ICML'24, pages 5817–5840. JMLR.org, 2024

  11. [11]

    The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams

    Lénaïc Chizat. The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams. arXiv preprint arXiv:2509.10167, 2025

  12. [12]

    On the Global Convergence of Gradient Descent for Overparameterized Models using Optimal Transport

    Lenaic Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Overparameterized Models using Optimal Transport. In Advances in Neural Information Processing Systems, Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3040–3050, Montréal, Canada, December 2018

  13. [13]

    On the Global Convergence of Gradient Descent for Multi-Layer ResNets in the Mean-Field Regime

    Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. On the Global Convergence of Gradient Descent for Multi-Layer ResNets in the Mean-Field Regime. arXiv preprint arXiv:2110.02926, 2021

  14. [14]

    Overparameterization of Deep ResNet: Zero Loss and Mean-Field Analysis

    Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. Overparameterization of Deep ResNet: Zero Loss and Mean-Field Analysis. J. Mach. Learn. Res., 23(1), January 2022

  15. [15]

    An Image is worth 16x16 words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is worth 16x16 words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations, 2021

  16. [16]

    Probability: Theory and Examples

    Rick Durrett. Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 5th edition, 2019

  17. [17]

    General Topology

    Ryszard Engelking. General Topology, volume 6 of Sigma Series in Pure Mathematics. Heldermann, Berlin, 1989

  18. [18]

    Clustering in Deep Stochastic Transformers

    Lev Fedorov, Michaël E Sander, Romuald Elie, Pierre Marion, and Mathieu Laurière. Clustering in Deep Stochastic Transformers. arXiv preprint arXiv:2601.21942, 2026

  19. [19]

    Transformers are Universal In-context Learners

    Takashi Furuya, Maarten V. de Hoop, and Gabriel Peyré. Transformers are Universal In-context Learners. In The Thirteenth International Conference on Learning Representations, 2025

  20. [20]

    Global Convergence in Training Large-Scale Transformers

    Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason M. Klusowski, and Jianqing Fan. Global Convergence in Training Large-Scale Transformers. In Advances in Neural Information Processing Systems, volume 37, pages 29213–29284. Curran Associates, Inc., 2024

  21. [21]

    A Mathematical Perspective on Transformers

    Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A Mathematical Perspective on Transformers. Bulletin of the American Mathematical Society, 62(3):427–479, 2025

  22. [22]

    Understanding the Difficulty of Training Deep Feedforward Neural Networks

    Xavier Glorot and Yoshua Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy,...

  23. [23]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024

  24. [24]

    Random Ordinary Differential Equations

    Xiaoying Han and Peter E. Kloeden. Random Ordinary Differential Equations, pages 15–27. Springer Singapore, Singapore, 2017

  25. [25]

    Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (long and short papers), pages 4171–4186, 2019

  26. [26]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020

  27. [27]

    Homogenized Transformers

    Hugo Koubbi, Borjan Geshkovski, and Philippe Rigollet. Homogenized Transformers. arXiv preprint arXiv:2604.01978, 2026

  28. [28]

    On Explicit Milstein-type Scheme for McKean–Vlasov Stochastic Differential Equations with Super-Linear Drift Coefficient

    Chaman Kumar and Neelima. On Explicit Milstein-type Scheme for McKean–Vlasov Stochastic Differential Equations with Super-Linear Drift Coefficient. Electronic Journal of Probability, 26, January 2021

  29. [29]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437, 2024

  30. [30]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is Scalable for LLM Training. arXiv preprint arXiv:2502.16982, 2025

  31. [31]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, 2019

  32. [32]

    Resnets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-Scale Limit

    Louis-Pierre Chaintron, Lénaïc Chizat, and Javier Maas. Resnets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-Scale Limit. arXiv preprint arXiv:2603.18168, 2026

  33. [33]

    Absolutely Continuous Curves in P(X) and the Continuity Equation

    Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Absolutely Continuous Curves in P(X) and the Continuity Equation, pages 167–200. Birkhäuser Basel, Basel, 2008

  34. [34]

    Forward-Backward Stochastic Differential Equations and their Applications

    Jin Ma and Jiongmin Yong. Forward-Backward Stochastic Differential Equations and their Applications, pages 1–24. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007

  35. [35]

    Implicit Regularization of Deep Residual Networks towards Neural ODEs

    Pierre Marion, Yu-Han Wu, Michael Eli Sander, and Gérard Biau. Implicit Regularization of Deep Residual Networks towards Neural ODEs. In The Twelfth International Conference on Learning Representations, 2024

  36. [36]

    On the Method of Bounded Differences

    Colin McDiarmid. On the Method of Bounded Differences, pages 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989

  37. [37]

    Gaussian Random Variables

    Michel Ledoux and Michel Talagrand. Gaussian Random Variables, pages 54–88. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991

  38. [38]

    A Convergence Result of a Continuous Model of Deep Learning via Łojasiewicz–Simon Inequality

    Noboru Isobe. A Convergence Result of a Continuous Model of Deep Learning via Łojasiewicz–Simon Inequality. CoRR, abs/2311.15365, 2023

  39. [39]

    Sets and Functions, Measures and Integration

    Olav Kallenberg. Sets and Functions, Measures and Integration, pages 9–32. Springer International Publishing, Cham, 2021

  40. [40]

    A Connection between Score Matching and Denoising Autoencoders

    Pascal Vincent. A Connection between Score Matching and Denoising Autoencoders. Neural Comput., 23(7):1661–1674, July 2011

  41. [41]

    Convergence of Probability Measures

    Patrick Billingsley. Convergence of Probability Measures. Wiley Series in Probability and Statistics. Wiley, 2013

  42. [42]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  43. [43]

    Mean-Field Stochastic Differential Equations and Associated PDEs

    Rainer Buckdahn, Juan Li, Shige Peng, and Catherine Rainer. Mean-Field Stochastic Differential Equations and Associated PDEs. The Annals of Probability, 45(2):824–878, 2017

  44. [44]

    Continuous Martingales and Brownian Motion

    Daniel Revuz and Marc Yor. Continuous Martingales and Brownian Motion. Grundlehren der Mathematischen Wissenschaften. Springer Berlin Heidelberg, 2013

  45. [45]

    The Mean-Field Dynamics of Transformers

    Philippe Rigollet. The Mean-Field Dynamics of Transformers. arXiv preprint arXiv:2512.01868, 2025

  46. [46]

    Sinkformers: Transformers with Doubly Stochastic Attention

    Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transformers with Doubly Stochastic Attention. In International Conference on Artificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022

  47. [47]

    Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling

    Filippo Santambrogio. Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015

  48. [48]

    Chapter 4 - Short Course in Continuous Time Dynamic Systems and Control

    Vadim Azhmyakov. Chapter 4 - Short Course in Continuous Time Dynamic Systems and Control. In A Relaxation-Based Approach to Optimal Control of Hybrid and Switched Systems, pages 87–126. Butterworth-Heinemann, 2019

  49. [49]

    Symmetrization, Contraction and Concentration

    Sara van de Geer. Symmetrization, Contraction and Concentration, pages 233–238. Springer International Publishing, Cham, 2016

  50. [50]

    Symmetrization and Measurability

    Aad W. van der Vaart and Jon A. Wellner. Symmetrization and Measurability, pages 107–121. Springer New York, New York, NY, 1996

  51. [51]

    Optimal Transport: Old and New

    Cédric Villani. Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008

  52. [52]

    Metric Entropy and its Uses

    Martin J. Wainwright. Metric Entropy and its Uses, pages 121–158. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019

  53. [53]

    Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

    Shuo Xie and Zhiyuan Li. Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 54488–54510. PMLR, 21–27 Jul 2024

  54. [54]

    Adam Exploits ℓ∞-Geometry of Loss Landscape via Coordinate-wise Adaptivity.arXiv preprint arXiv:2410.08198, 2024

    Shuo Xie, Mohamad Amin Mohamadi, and Zhiyuan Li. Adam Exploits ℓ∞-Geometry of Loss Landscape via Coordinate-wise Adaptivity. arXiv preprint arXiv:2410.08198, 2024

  55. [55]

    On Layer Normalization in the Transformer Architecture

    Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning, ICML'20. JMLR.org, 2020

  56. [56]

    Deconstructing What Makes a Good Optimizer for Autoregressive Language Models

    Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. Deconstructing What Makes a Good Optimizer for Autoregressive Language Models. In The Thirteenth International Conference on Learning Representations, 2025
