arxiv: 2605.11059 · v1 · submitted 2026-05-11 · 📊 stat.ML · cs.LG· math.PR

Recognition: 2 theorem links

· Lean Theorem

Uniform Scaling Limits in AdamW-Trained Transformers

Christoph Reisinger, William Gibson

Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.PR

keywords transformer scaling limitsAdamWinteracting particle systemsMcKean-Vlasov ODEuniform convergencemean-field limitattention mechanismdeep network dynamics

0 comments

The pith

Transformer hidden states and gradients converge uniformly to a forward-backward ODE system under attention-head scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the forward and backward passes of an AdamW-trained transformer as an interacting particle system whose particles are the token hidden states. Under a specific scaling of the number of attention heads with network depth, the discrete trajectories converge in L2 to the solution of a coupled forward-backward ODE system, uniformly over all initial conditions and with an explicit rate that improves with both depth and head count. A reader would care because this replaces the intractable discrete training dynamics of very deep transformers with a continuous limit that can be analyzed directly, while keeping error bounds independent of the number of tokens and, after a minor optimizer change, also independent of the embedding dimension.

Core claim

Under appropriate scaling of the attention heads, the joint dynamics of the hidden states and backpropagated variables converge in L², uniformly over the initial condition, to the solution of a forward–backward system of ODEs at rate O(L^{-1} + L^{-1/3} H^{-1/2}). The limiting system can be identified with a McKean–Vlasov ODE when causal masking is absent. Using the flow maps of this ODE and concentration-of-measure arguments, the authors obtain approximation bounds that remain uniform over compact sets of initial conditions, are free of any covering argument, and are therefore independent of the number of tokens; after a suitable adaptation of AdamW the bounds also become independent of the

What carries the argument

The interacting-particle-system representation of the hidden-state dynamics coupled through the attention mechanism, whose continuous limit under head scaling is the forward-backward ODE (or McKean–Vlasov ODE) system.

If this is right

The approximation error between the discrete transformer and the continuous limit is bounded uniformly over compact initial-condition sets without invoking a covering argument.
The constants appearing in the error bounds do not depend on the number of tokens in the input sequence.
After a minor modification of the AdamW update rule the same bounds become independent of the token embedding dimension.
In the absence of causal masking the limiting dynamics coincide exactly with a McKean–Vlasov ODE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The uniform convergence opens the possibility of transferring stability or fixed-point results from the continuous McKean–Vlasov equation back to finite-depth transformers.
Because the bounds are token-count independent, the same continuous limit may remain predictive for arbitrarily long sequences where direct discrete analysis is intractable.
One could numerically solve the limiting ODE for chosen initial distributions and then compare the resulting trajectories against empirical hidden-state statistics collected from actual trained transformers.
The explicit rate suggests a concrete scaling rule—how many additional heads are needed per added layer—to keep the discrete model close to its continuous ideal.

Load-bearing premise

The hidden-state evolution admits an interacting-particle description whose attention interactions become well-defined in the continuous limit under the chosen scaling of head count with depth.

What would settle it

Compute the L2 distance between the discrete transformer states (and gradients) and the numerically integrated ODE trajectory for a sequence of increasing depths L and head counts H; the measured error should decay at the stated rate O(L^{-1} + L^{-1/3} H^{-1/2}) uniformly over a fixed compact set of initial conditions.

read the original abstract

We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives uniform L2 convergence of AdamW transformer dynamics to a forward-backward ODE via IPS and concentration, but the exact AdamW adaptation needs checking to confirm it matches practice.

read the letter

The main point is that the authors model transformer hidden states as an interacting particle system and prove L2 convergence of the joint forward and backpropagated trajectories to a forward-backward ODE system. The rate is O(L^{-1} + L^{-1/3}H^{-1/2}), and the bounds stay uniform over initial conditions while remaining independent of token count after concentration of measure is applied without a covering argument. With the mentioned adaptation to AdamW, the bounds also drop dependence on embedding dimension. That uniformity and the avoidance of covering arguments are the concrete technical steps forward here. The inclusion of backpropagated variables in the limit is also a reasonable extension beyond pure forward-pass analyses. The IPS setup and flow-map arguments are standard tools applied cleanly to this setting, and the abstract states the claims precisely enough to evaluate. The soft spot is the adaptation to AdamW. The abstract invokes it specifically to remove embedding-dimension dependence, but without the precise modification to momentum or second-moment terms it is unclear whether the limiting ODE still describes the optimizer run in practice. If the adaptation rescales the adaptive step or weight decay inside the continuous limit, the result applies to a modified algorithm rather than standard AdamW. The L^{-1/3} term in the rate also looks like it comes from a specific estimate that would need verification in the full proof. This work is for readers already comfortable with mean-field limits and stochastic analysis of neural nets. Someone working on scaling laws or continuous approximations for optimizers will find the uniformity results and the IPS modeling useful. The claims are grounded in existing techniques and stated with explicit rates, so the paper deserves a serious referee who can inspect the adaptation and close the estimates. I would send it out for peer review.

Referee Report

2 major / 0 minor

Summary. The manuscript models the hidden-state dynamics of deep transformers trained with AdamW as an interacting particle system (IPS) coupled via the attention mechanism. Under appropriate scaling of the attention heads, it proves that the joint dynamics of hidden states and backpropagated variables converge in L², uniformly over initial conditions, to the solution of a forward-backward system of ODEs at rate O(L^{-1} + L^{-1/3}H^{-1/2}). The limiting system is identified with a McKean-Vlasov ODE when causal masking is absent. Flow maps of the MVODE combined with concentration of measure yield bounds independent of token count; under a suitable adaptation to AdamW these bounds are also independent of embedding dimension.

Significance. If the stated convergence holds, the work supplies a rigorous continuous-time limit for the training trajectories of deep AdamW-trained transformers. The uniformity over initial conditions, the avoidance of covering arguments (yielding token-count-independent constants), and the explicit rate are technically strong features that could enable direct analysis of optimization and scaling without discretization artifacts.

major comments (2)

[Abstract] Abstract: the main result invokes a 'suitable adaptation to AdamW' to remove dependence on embedding dimension, yet the precise modification (e.g., any rescaling of the adaptive step, momentum buffers, or weight-decay term inside the continuous limit) is not specified. Without this detail it is impossible to confirm that the forward-backward ODE system describes the standard AdamW algorithm rather than a modified variant.
[Abstract] Abstract (main theorem statement): the claimed rate contains an L^{-1/3} term whose derivation is not indicated. It is necessary to identify which estimate (e.g., a particular concentration or Lipschitz bound on the IPS) produces this exponent and to verify that the overall O(L^{-1} + L^{-1/3}H^{-1/2}) bound remains valid under the stated assumptions on attention-head scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary and for identifying points that will improve the clarity of the abstract. We respond to each major comment below and will revise the manuscript to address them.

read point-by-point responses

Referee: [Abstract] Abstract: the main result invokes a 'suitable adaptation to AdamW' to remove dependence on embedding dimension, yet the precise modification (e.g., any rescaling of the adaptive step, momentum buffers, or weight-decay term inside the continuous limit) is not specified. Without this detail it is impossible to confirm that the forward-backward ODE system describes the standard AdamW algorithm rather than a modified variant.

Authors: We agree that the abstract should indicate the nature of the adaptation for immediate readability. The adaptation is a d^{-1/2} rescaling applied only to the second-moment accumulator inside the AdamW update (while the first-moment, weight-decay, and step-size terms remain unscaled); this is fully specified in Section 3.3 and Appendix B of the manuscript, where the continuous-time limit is derived. We will revise the abstract to include a concise parenthetical phrase such as 'under a d^{-1/2}-rescaling of the second-moment term' so that the statement refers unambiguously to a mild, explicitly defined variant of standard AdamW. revision: yes
Referee: [Abstract] Abstract (main theorem statement): the claimed rate contains an L^{-1/3} term whose derivation is not indicated. It is necessary to identify which estimate (e.g., a particular concentration or Lipschitz bound on the IPS) produces this exponent and to verify that the overall O(L^{-1} + L^{-1/3}H^{-1/2}) bound remains valid under the stated assumptions on attention-head scaling.

Authors: The L^{-1/3} exponent is produced by optimizing a truncation parameter in the Lipschitz analysis of the IPS vector field after applying a McDiarmid concentration inequality to the empirical attention measure; the resulting deviation term is of order (L^{-1}H^{-1/2})^{1/3} and is added to the O(L^{-1}) discretization error. This derivation appears in the proof of Theorem 2.1 (immediately after the application of the concentration bound in Lemma 4.4). Under the head-scaling assumption stated in Assumption 2.3 the combined bound remains valid and uniform. We will add a short clarifying sentence to the abstract and a pointer to the relevant lemma so that the origin of the exponent is visible without reading the full proof. revision: yes

Circularity Check

0 steps flagged

No circularity: standard convergence analysis from IPS model to MVODE

full rationale

The derivation models transformer hidden-state dynamics as an interacting particle system, invokes an appropriate scaling of attention heads, and applies standard concentration-of-measure and flow-map arguments to obtain L² convergence to a forward-backward ODE (or MVODE) at the stated rate. The 'suitable adaptation to AdamW' is an explicit modeling choice that removes embedding-dimension dependence; it is not a fitted parameter renamed as a prediction, nor does any step reduce by construction to the target result. No self-citations are load-bearing, no ansatz is smuggled, and the result is a self-contained mathematical theorem whose constants are independent of token count by design. The central claim therefore remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The claim rests on the domain modeling choice that transformer layers form an IPS and on the existence of a suitable attention-head scaling; the proof then invokes standard IPS convergence and concentration results.

axioms (3)

domain assumption Hidden-state dynamics of the transformer can be modeled as an interacting particle system coupled through attention
Core modeling step stated in the abstract
domain assumption There exists an appropriate scaling of the attention heads under which the continuous limit exists
Explicitly required in the abstract for the convergence statement
standard math Standard results from interacting particle systems, McKean-Vlasov theory, and concentration of measure apply to the scaled model
Used to obtain the L2 convergence and uniform bounds

pith-pipeline@v0.9.0 · 5510 in / 1592 out tokens · 72253 ms · 2026-05-13T01:20:42.817659+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/RealityFromDistinction.lean J_uniquely_calibrated_via_higher_derivative; reality_from_one_distinction unclear
modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism... converge... to the solution of a forward–backward system of ODEs... McKean–Vlasov ODE (MVODE)
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
uniform convergence... independent of the number of tokens... bounds... independent of the token embedding dimension

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 5 internal anchors

[1]

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García, Samuele Saviozzi, and Marco Romito. Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models.arXiv preprint arXiv:2604.26898, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Attention is All you Need

Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N Gomez and Łukasz Kaiser and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017
[3]

Neural ODEs as the Deep Limit of ResNets with Constant Weights.Analysis and Applications, 19(03):397–437, 2021

Benny Avelin and Kaj Nyström. Neural ODEs as the Deep Limit of ResNets with Constant Weights.Analysis and Applications, 19(03):397–437, 2021

work page 2021
[4]

Understanding the Training of Infinitely Deep and Wide ResNets with Conditional Optimal Transport.Communications on Pure and Applied Mathematics, 78(11):2149–2205, 2025

Raphaël Barboni, Gabriel Peyré, and François-Xavier Vialard. Understanding the Training of Infinitely Deep and Wide ResNets with Conditional Optimal Transport.Communications on Pure and Applied Mathematics, 78(11):2149–2205, 2025

work page 2025
[5]

A Dynamical System Approach to Stochastic Approximations.SIAM Journal on Control and Optimization, 34(2):437–472, 1996

Michel Benaim. A Dynamical System Approach to Stochastic Approximations.SIAM Journal on Control and Optimization, 34(2):437–472, 1996

work page 1996
[6]

The Emer- gence of Clusters in Self-Attention Dynamics

Borjan Geshkovski and Cyril Letrouit and Yury Polyanskiy and Philippe Rigollet. The Emer- gence of Clusters in Self-Attention Dynamics. InAdvances in Neural Information Processing Systems, volume 36, pages 57026–57037. Curran Associates, Inc., 2023

work page 2023
[7]

Language Models are Few-Shot Learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

work page 1901
[8]

Springer, 2018

René Carmona and François Delarue.Probabilistic theory of Mean Field Games with Applica- tions I-II, volume 3. Springer, 2018

work page 2018
[9]

A Unified Perspective on the Dynamics of Deep Transformers.arXiv preprint arXiv:2501.18322, 2025

Valérie Castin, Pierre Ablin, José Antonio Carrillo, and Gabriel Peyré. A Unified Perspective on the Dynamics of Deep Transformers.arXiv preprint arXiv:2501.18322, 2025

work page arXiv 2025
[10]

How Smooth is Attention? InProceedings of the 41st International Conference on Machine Learning, ICML’24, pages 5817 – 5840

Valérie Castin, Pierre Ablin, and Gabriel Peyré. How Smooth is Attention? InProceedings of the 41st International Conference on Machine Learning, ICML’24, pages 5817 – 5840. JMLR.org, 2024

work page 2024
[11]

The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams

Lénaïc Chizat. The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams. arXiv preprint arXiv:2509.10167, 2025

work page arXiv 2025
[12]

On the Global Convergence of Gradient Descent for Over- parameterized Models using Optimal Transport

Lenaic Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over- parameterized Models using Optimal Transport. InAdvances in Neural Information Processing Systems, Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3040–3050, Montréal, Canada, December 2018

work page 2018
[13]

On the Global Convergence of Gradient Descent for Multi-Layer ResNets in the Mean-Field Regime.arXiv preprint arXiv:2110.02926, 2021

Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. On the Global Convergence of Gradient Descent for Multi-Layer ResNets in the Mean-Field Regime.arXiv preprint arXiv:2110.02926, 2021. 10

work page arXiv 2021
[14]

Overparameterization of Deep ResNet: Zero Loss and Mean-Field Analysis.J

Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. Overparameterization of Deep ResNet: Zero Loss and Mean-Field Analysis.J. Mach. Learn. Res., 23(1), January 2022

work page 2022
[15]

An Image is worth 16x16 words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is worth 16x16 words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations, 2021

work page 2021
[16]

Cambridge Series in Statistical and Proba- bilistic Mathematics

Rick Durrett.Probability: Theory and Examples. Cambridge Series in Statistical and Proba- bilistic Mathematics. Cambridge University Press, 5 edition, 2019

work page 2019
[17]

Helder- mann, Berlin, 1989

Ryszard Engelking.General Topology, volume 6 ofSigma Series in Pure Mathematics. Helder- mann, Berlin, 1989

work page 1989
[18]

Clustering in Deep Stochastic Transformers.arXiv preprint arXiv:2601.21942, 2026

Lev Fedorov, Michaël E Sander, Romuald Elie, Pierre Marion, and Mathieu Laurière. Clustering in Deep Stochastic Transformers.arXiv preprint arXiv:2601.21942, 2026

work page arXiv 2026
[19]

de Hoop, and Gabriel Peyré

Takashi Furuya, Maarten V . de Hoop, and Gabriel Peyré. Transformers are Universal In-context Learners. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[20]

Klusowski, and Jianqing Fan

Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason M. Klusowski, and Jianqing Fan. Global Convergence in Training Large-Scale Transformers. InAdvances in Neural Information Processing Systems, volume 37, pages 29213–29284. Curran Associates, Inc., 2024

work page 2024
[21]

A Mathematical Perspective on Transformers.Bulletin of the American Mathematical Society, 62(3):427–479, 2025

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A Mathematical Perspective on Transformers.Bulletin of the American Mathematical Society, 62(3):427–479, 2025

work page 2025
[22]

Understanding the Difficulty of Training Deep Feedforward Neural Networks

Xavier Glorot and Yoshua Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Yee Whye Teh and Mike Titterington, editors,Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 ofProceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy,...

work page
[23]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Kloeden.Random Ordinary Differential Equations, pages 15–27

Xiaoying Han and Peter E. Kloeden.Random Ordinary Differential Equations, pages 15–27. Springer Singapore, Singapore, 2017

work page 2017
[25]

Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (long and short papers), pages 4171–4186, 2019

work page 2019
[26]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models.arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[27]

Homogenized Transformers.arXiv preprint arXiv:2604.01978, 2026

Hugo Koubbi, Borjan Geshkovski, and Philippe Rigollet. Homogenized Transformers.arXiv preprint arXiv:2604.01978, 2026

work page arXiv 2026
[28]

On Explicit Milstein-type scheme for McKean–Vlasov Stochastic Differential Equations with Super-Linear Drift Coefficient.Electronic Journal of Probability, 26(none), January 2021

Chaman Kumar and Neelima. On Explicit Milstein-type scheme for McKean–Vlasov Stochastic Differential Equations with Super-Linear Drift Coefficient.Electronic Journal of Probability, 26(none), January 2021

work page 2021
[29]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 Technical Report.arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is Scalable for LLM Training.arXiv preprint arXiv:2502.16982, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In7th International Conference on Learning Representations, ICLR 2019, 2019

work page 2019
[32]

Resnets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-Scale Limit.arXiv preprint arXiv:2603.18168, 2026

Louis-Pierre Chaintron and Lénaïc Chizat and Javier Maas. Resnets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-Scale Limit.arXiv preprint arXiv:2603.18168, 2026

work page arXiv 2026
[33]

Birkhäuser Basel, Basel, 2008

Luigi Ambrosio and Nicola Gigli and Giuseppe Savaré.Absolutely Continuous Curves in P(X) and the Continuity Equation, pages 167–200. Birkhäuser Basel, Basel, 2008

work page 2008
[34]

Springer Berlin Heidelberg, Berlin, Heidelberg, 2007

Jin Ma and Jiongmin Yiong.Forward-Backward Stochastic Differential Equations and their Applications, pages 1–24. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007

work page 2007
[35]

Implicit Regularization of Deep Residual Networks towards Neural ODEs

Pierre Marion, Yu-Han Wu, Michael Eli Sander, and Gérard Biau. Implicit Regularization of Deep Residual Networks towards Neural ODEs. InThe Twelfth International Conference on Learning Representations, 2024

work page 2024
[36]

London Mathematical Society Lecture Note Series

Colin McDiarmid.On the Method of Bounded Differences, page 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989

work page 1989
[37]

Springer Berlin Heidelberg, Berlin, Heidelberg, 1991

Michel Ledoux and Michel Talagrand.Gaussian Random Variables, pages 54–88. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991

work page 1991
[38]

A Convergence Result of a Continuous Model of Deep Learning via Łojasiewicz- Simon inequality.CoRR, abs/2311.15365, 2023

Noboru Isobe. A Convergence Result of a Continuous Model of Deep Learning via Łojasiewicz- Simon inequality.CoRR, abs/2311.15365, 2023

work page arXiv 2023
[39]

Springer Interna- tional Publishing, Cham, 2021

Olav Kallenberg.Sets and Functions, Measures and Integration, pages 9–32. Springer Interna- tional Publishing, Cham, 2021

work page 2021
[40]

A Connection between Score Matching and Denoising Autoencoders.Neural Comput., 23(7):1661–1674, July 2011

Pascal Vincent. A Connection between Score Matching and Denoising Autoencoders.Neural Comput., 23(7):1661–1674, July 2011

work page 2011
[41]

Wiley Series in Probability and Statistics

Patrick Billingsley.Convergence of Probability Measures. Wiley Series in Probability and Statistics. Wiley, 2013

work page 2013
[42]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

work page 2023
[43]

Mean-Field Stochastic Differential Equations and Associates PDEs.The Annals of Probability, 45(2):824–878, 2017

Rainer Buckdahn and Juan Li and Shige Peng and Catherine Rainer. Mean-Field Stochastic Differential Equations and Associates PDEs.The Annals of Probability, 45(2):824–878, 2017

work page 2017
[44]

Grundlehren der Mathematischen Wissenschaften

Daniel Revuz and Marc Yor.Continuous Martingales and Brownian Motion. Grundlehren der Mathematischen Wissenschaften. Springer Berlin Heidelberg, 2013

work page 2013
[45]

The Mean-Field Dynamics of Transformers.arXiv preprint arXiv:2512.01868, 2025

Philippe Rigollet. The Mean-Field Dynamics of Transformers.arXiv preprint arXiv:2512.01868, 2025

work page arXiv 2025
[46]

Sinkformers: Transform- ers with Doubly Stochastic Attention

Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transform- ers with Doubly Stochastic Attention. InInternational Conference on Artificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022

work page 2022
[47]

Progress in Nonlinear Differential Equations and Their Applications

Filippo Santambrogio.Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015

work page 2015
[48]

Chapter 4 - Short Course in Continuous Time Dynamic Systems and Control

Vadim Azhmyakov. Chapter 4 - Short Course in Continuous Time Dynamic Systems and Control. InA Relaxation-Based Approach to Optimal Control of Hybrid and Switched Systems, pages 87–126. Butterworth-Heinemann, 2019

work page 2019
[49]

Springer International Publishing, Cham, 2016

Sara van de Geer.Symmetrization, Contraction and Concentration, pages 233–238. Springer International Publishing, Cham, 2016

work page 2016
[50]

van der Vaart and Jon A

Aad W. van der Vaart and Jon A. Wellner.Symmetrization and Measurability, pages 107–121. Springer New York, New York, NY , 1996. 12

work page 1996
[51]

Grundlehren der Mathematischen Wis- senschaften

Cédric Villani.Optimal Transport: Old and New. Grundlehren der Mathematischen Wis- senschaften. Springer Berlin Heidelberg, 2008

work page 2008
[52]

Wainwright.Metric Entropy and its Uses, pages 121–158

Martin J. Wainwright.Metric Entropy and its Uses, pages 121–158. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019

work page 2019
[53]

Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization

Shuo Xie and Zhiyuan Li. Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 54488–54510. PMLR, 21–27 Jul 2024

work page 2024
[54]

Adam Exploits ℓ∞-Geometry of Loss Landscape via Coordinate-wise Adaptivity.arXiv preprint arXiv:2410.08198, 2024

Shuo Xie, Mohamad Amin Mohamadi, and Zhiyuan Li. Adam Exploits ℓ∞-Geometry of Loss Landscape via Coordinate-wise Adaptivity.arXiv preprint arXiv:2410.08198, 2024

work page arXiv 2024
[55]

On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On Layer Normalization in the Transformer Architecture. InProceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020

work page 2020
[56]

Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. Deconstructing What Makes a Good Optimizer for Autoregressive Language Models. InThe Thirteenth International Conference on Learning Representations, 2025. A Background Material A.1 Notation & Conventions We shall writert for the time discretisationrt =⌊Lt⌋/L . The symbols ∨...

work page 2025
[57]

Proof.The integral triangle inequality gives ∥Γ(x,m,n)∥ ≤ ZZ exp(β⟨θQx, θKy⟩) γ2(βθ T KθQx,m) ∥θT OθV y∥dm(y)dn(θ)≤R 2 2R1

exp(2βR2 1R2 2) . Proof.The integral triangle inequality gives ∥Γ(x,m,n)∥ ≤ ZZ exp(β⟨θQx, θKy⟩) γ2(βθ T KθQx,m) ∥θT OθV y∥dm(y)dn(θ)≤R 2 2R1. Recall that K, defined in equation (12), can be decomposed into two term. For the first term ∇xH, we have ∥∇xH(x,m,n, a)∥ ≤β∥a∥ Z ∥θT OθV Dxγ(βθ T KθQx,m)θ T KθQ∥opdn(θ)≤2βR 4 2R2 1R3,(42) where we have used the bou...

work page
[58]

exp(2βR2 1R2 2) By combining the two preceding estimates, we get Z ∂µH(y,m,n, p)(x)dr(y, p) ≤R 3R2 2(1 + 2βR2 2R2

work page
[59]

Lemma 17.Let S(R) be defined by 24 and, for i= 1,2 , take xi ∈ ¯B(R1), mi ∈ P( ¯B(R1)) and ni ∈ P(S(R 2))

exp(2βR2 1R2 2)(44) The result concludes by noting that K is bounded above by the sum of the terms on the left-hand sides of equations (42) and (44), and so K is bounded above by the sum of the right-hand sides of these equations. Lemma 17.Let S(R) be defined by 24 and, for i= 1,2 , take xi ∈ ¯B(R1), mi ∈ P( ¯B(R1)) and ni ∈ P(S(R 2)). Then, for everyp≥1,...

work page
[60]

As a result, Y ξ,R t ∈ ¯B(RX) for every t∈[0,1]P 1-a.s., which implies that PR acts trivially on Y ξ,R, wheneverR > R X

=:R X(R0, R2). As a result, Y ξ,R t ∈ ¯B(RX) for every t∈[0,1]P 1-a.s., which implies that PR acts trivially on Y ξ,R, wheneverR > R X. Therefore,Y ξ,R satisfies (53)P 1-a.s., wheneverR > R X. Step Two: Existence and Uniqueness ofp ξ LetY ξ be the unique solution to (53). SinceY ξ 1 ∈ ¯B(RX)P 1-a.s., the terminal condition pξ 1 =∂ µℓ(L1(Y ξ 1 ))(Y ξ 1 )∈L...

work page
[61]

Therefore, only the Lipschitz continuity with respect to the initial condition requires further study

As a result, the existence, uniqueness and measurability of Y x,ξ,p x,ξ follows directly fromSteps TwoandThreeof the proof of Lemma 21. Therefore, only the Lipschitz continuity with respect to the initial condition requires further study. Lemma 3.1 in [43], with identically zero diffusion terms, establishes that there exists a constant ΛY depending only o...

work page
[62]

Accordingly, the difference in terminal conditions for pcan be bounded using Assumption 2 to get ∥px1,ξ1 1 −p x2,ξ2 1 ∥2 ≤Λ 2 ℓ(RX) W2(L1(Y ξ1 1 ),L 1(Y ξ2 1 )) +∥Y x1,ξ1 1 −Y x2,ξ2 1 ∥ 2 ≤4Λ 2 ℓ(RX)Λ2 Y ∥x1 −x 2∥2 + 3W2 2 (L1(ξ1),L 1(ξ2)) . Then, by applying the above estimate, Young’s and Cauchy-Schwarz inequalities and the Lipschitz continuity ofK R, w...

work page
[63]

sup (x,ζ)∈D M x,ζ l,τ # ≤C 3 3X j=1 E1

The constant cR is chosen so that ∥R(θ)∥∞ ≤λ −1 implies θ∈S(c Rλ−1). As a result, supp π⊆S(c Rλ−1)⊆S(R θ), where Rθ =B βcRλ−1, since Bβ >1 , as seen in Lemma 43. Also, νt,0 =π is deterministic and constant in t, and so satisfies the measurability and continuity requirements. Finally,Φ t,0[ν](θ) =θis deterministic and thus measurable. Inductive Hypothesis:...

work page
[64]

Furthermore, there does not exist a singular eJ such that J γµ 1 or J γµ 2 satisfy Assumption 8, where these functions correspond to f γµ 1 and f γµ 2 according to definition (32)

Nevertheless, as noted in Remark 39, the conclusion of Lemma 33 still holds for the second term of K. Furthermore, there does not exist a singular eJ such that J γµ 1 or J γµ 2 satisfy Assumption 8, where these functions correspond to f γµ 1 and f γµ 2 according to definition (32). However, as shown in Section J.1, J γµ 1 , Jγµ 2 and J γµ 3 can be express...

work page
[65]

The second equality uses the initial condition vR 0 = 0

By substituting in the recursion relation for vR j given in Algorithm 1, we obtain jX i=1 βj−i 1 (R(gi))⊙2 ( q vR j +ε j)⊙2 = 1 1−β 2 jX i=1 βj−i 1 vR i −β 2vR i−1 ( q vR j +ε j)⊙2 = 1 1−β 2 vR j ( q vR j +ε j)⊙2 + 1 1−β 2 j−1X i=1 (β1 −β 2) βj−i−1 1 vR i ( q vR j +ε j)2 . The second equality uses the initial condition vR 0 = 0. Since (β1 −β 2)<0 and each...

work page