Recognition: 2 theorem links
· Lean TheoremUniform Scaling Limits in AdamW-Trained Transformers
Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3
The pith
Transformer hidden states and gradients converge uniformly to a forward-backward ODE system under attention-head scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under appropriate scaling of the attention heads, the joint dynamics of the hidden states and backpropagated variables converge in L², uniformly over the initial condition, to the solution of a forward–backward system of ODEs at rate O(L^{-1} + L^{-1/3} H^{-1/2}). The limiting system can be identified with a McKean–Vlasov ODE when causal masking is absent. Using the flow maps of this ODE and concentration-of-measure arguments, the authors obtain approximation bounds that remain uniform over compact sets of initial conditions, are free of any covering argument, and are therefore independent of the number of tokens; after a suitable adaptation of AdamW the bounds also become independent of the
What carries the argument
The interacting-particle-system representation of the hidden-state dynamics coupled through the attention mechanism, whose continuous limit under head scaling is the forward-backward ODE (or McKean–Vlasov ODE) system.
If this is right
- The approximation error between the discrete transformer and the continuous limit is bounded uniformly over compact initial-condition sets without invoking a covering argument.
- The constants appearing in the error bounds do not depend on the number of tokens in the input sequence.
- After a minor modification of the AdamW update rule the same bounds become independent of the token embedding dimension.
- In the absence of causal masking the limiting dynamics coincide exactly with a McKean–Vlasov ODE.
Where Pith is reading between the lines
- The uniform convergence opens the possibility of transferring stability or fixed-point results from the continuous McKean–Vlasov equation back to finite-depth transformers.
- Because the bounds are token-count independent, the same continuous limit may remain predictive for arbitrarily long sequences where direct discrete analysis is intractable.
- One could numerically solve the limiting ODE for chosen initial distributions and then compare the resulting trajectories against empirical hidden-state statistics collected from actual trained transformers.
- The explicit rate suggests a concrete scaling rule—how many additional heads are needed per added layer—to keep the discrete model close to its continuous ideal.
Load-bearing premise
The hidden-state evolution admits an interacting-particle description whose attention interactions become well-defined in the continuous limit under the chosen scaling of head count with depth.
What would settle it
Compute the L2 distance between the discrete transformer states (and gradients) and the numerically integrated ODE trajectory for a sequence of increasing depths L and head counts H; the measured error should decay at the stated rate O(L^{-1} + L^{-1/3} H^{-1/2}) uniformly over a fixed compact set of initial conditions.
read the original abstract
We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript models the hidden-state dynamics of deep transformers trained with AdamW as an interacting particle system (IPS) coupled via the attention mechanism. Under appropriate scaling of the attention heads, it proves that the joint dynamics of hidden states and backpropagated variables converge in L², uniformly over initial conditions, to the solution of a forward-backward system of ODEs at rate O(L^{-1} + L^{-1/3}H^{-1/2}). The limiting system is identified with a McKean-Vlasov ODE when causal masking is absent. Flow maps of the MVODE combined with concentration of measure yield bounds independent of token count; under a suitable adaptation to AdamW these bounds are also independent of embedding dimension.
Significance. If the stated convergence holds, the work supplies a rigorous continuous-time limit for the training trajectories of deep AdamW-trained transformers. The uniformity over initial conditions, the avoidance of covering arguments (yielding token-count-independent constants), and the explicit rate are technically strong features that could enable direct analysis of optimization and scaling without discretization artifacts.
major comments (2)
- [Abstract] Abstract: the main result invokes a 'suitable adaptation to AdamW' to remove dependence on embedding dimension, yet the precise modification (e.g., any rescaling of the adaptive step, momentum buffers, or weight-decay term inside the continuous limit) is not specified. Without this detail it is impossible to confirm that the forward-backward ODE system describes the standard AdamW algorithm rather than a modified variant.
- [Abstract] Abstract (main theorem statement): the claimed rate contains an L^{-1/3} term whose derivation is not indicated. It is necessary to identify which estimate (e.g., a particular concentration or Lipschitz bound on the IPS) produces this exponent and to verify that the overall O(L^{-1} + L^{-1/3}H^{-1/2}) bound remains valid under the stated assumptions on attention-head scaling.
Simulated Author's Rebuttal
We thank the referee for the positive summary and for identifying points that will improve the clarity of the abstract. We respond to each major comment below and will revise the manuscript to address them.
read point-by-point responses
-
Referee: [Abstract] Abstract: the main result invokes a 'suitable adaptation to AdamW' to remove dependence on embedding dimension, yet the precise modification (e.g., any rescaling of the adaptive step, momentum buffers, or weight-decay term inside the continuous limit) is not specified. Without this detail it is impossible to confirm that the forward-backward ODE system describes the standard AdamW algorithm rather than a modified variant.
Authors: We agree that the abstract should indicate the nature of the adaptation for immediate readability. The adaptation is a d^{-1/2} rescaling applied only to the second-moment accumulator inside the AdamW update (while the first-moment, weight-decay, and step-size terms remain unscaled); this is fully specified in Section 3.3 and Appendix B of the manuscript, where the continuous-time limit is derived. We will revise the abstract to include a concise parenthetical phrase such as 'under a d^{-1/2}-rescaling of the second-moment term' so that the statement refers unambiguously to a mild, explicitly defined variant of standard AdamW. revision: yes
-
Referee: [Abstract] Abstract (main theorem statement): the claimed rate contains an L^{-1/3} term whose derivation is not indicated. It is necessary to identify which estimate (e.g., a particular concentration or Lipschitz bound on the IPS) produces this exponent and to verify that the overall O(L^{-1} + L^{-1/3}H^{-1/2}) bound remains valid under the stated assumptions on attention-head scaling.
Authors: The L^{-1/3} exponent is produced by optimizing a truncation parameter in the Lipschitz analysis of the IPS vector field after applying a McDiarmid concentration inequality to the empirical attention measure; the resulting deviation term is of order (L^{-1}H^{-1/2})^{1/3} and is added to the O(L^{-1}) discretization error. This derivation appears in the proof of Theorem 2.1 (immediately after the application of the concentration bound in Lemma 4.4). Under the head-scaling assumption stated in Assumption 2.3 the combined bound remains valid and uniform. We will add a short clarifying sentence to the abstract and a pointer to the relevant lemma so that the origin of the exponent is visible without reading the full proof. revision: yes
Circularity Check
No circularity: standard convergence analysis from IPS model to MVODE
full rationale
The derivation models transformer hidden-state dynamics as an interacting particle system, invokes an appropriate scaling of attention heads, and applies standard concentration-of-measure and flow-map arguments to obtain L² convergence to a forward-backward ODE (or MVODE) at the stated rate. The 'suitable adaptation to AdamW' is an explicit modeling choice that removes embedding-dimension dependence; it is not a fitted parameter renamed as a prediction, nor does any step reduce by construction to the target result. No self-citations are load-bearing, no ansatz is smuggled, and the result is a self-contained mathematical theorem whose constants are independent of token count by design. The central claim therefore remains independent of its inputs.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption Hidden-state dynamics of the transformer can be modeled as an interacting particle system coupled through attention
- domain assumption There exists an appropriate scaling of the attention heads under which the continuous limit exists
- standard math Standard results from interacting particle systems, McKean-Vlasov theory, and concentration of measure apply to the scaled model
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean; IndisputableMonolith/Foundation/RealityFromDistinction.leanJ_uniquely_calibrated_via_higher_derivative; reality_from_one_distinction unclearmodelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism... converge... to the solution of a forward–backward system of ODEs... McKean–Vlasov ODE (MVODE)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclearuniform convergence... independent of the number of tokens... bounds... independent of the token embedding dimension
Reference graph
Works this paper leans on
-
[1]
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Andrea Agazzi, Giuseppe Bruno, Eloy Mosig García, Samuele Saviozzi, and Marco Romito. Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models.arXiv preprint arXiv:2604.26898, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N Gomez and Łukasz Kaiser and Illia Polosukhin. Attention is All you Need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017
work page 2017
-
[3]
Benny Avelin and Kaj Nyström. Neural ODEs as the Deep Limit of ResNets with Constant Weights.Analysis and Applications, 19(03):397–437, 2021
work page 2021
-
[4]
Raphaël Barboni, Gabriel Peyré, and François-Xavier Vialard. Understanding the Training of Infinitely Deep and Wide ResNets with Conditional Optimal Transport.Communications on Pure and Applied Mathematics, 78(11):2149–2205, 2025
work page 2025
-
[5]
Michel Benaim. A Dynamical System Approach to Stochastic Approximations.SIAM Journal on Control and Optimization, 34(2):437–472, 1996
work page 1996
-
[6]
The Emer- gence of Clusters in Self-Attention Dynamics
Borjan Geshkovski and Cyril Letrouit and Yury Polyanskiy and Philippe Rigollet. The Emer- gence of Clusters in Self-Attention Dynamics. InAdvances in Neural Information Processing Systems, volume 36, pages 57026–57037. Curran Associates, Inc., 2023
work page 2023
-
[7]
Language Models are Few-Shot Learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...
work page 1901
-
[8]
René Carmona and François Delarue.Probabilistic theory of Mean Field Games with Applica- tions I-II, volume 3. Springer, 2018
work page 2018
-
[9]
A Unified Perspective on the Dynamics of Deep Transformers.arXiv preprint arXiv:2501.18322, 2025
Valérie Castin, Pierre Ablin, José Antonio Carrillo, and Gabriel Peyré. A Unified Perspective on the Dynamics of Deep Transformers.arXiv preprint arXiv:2501.18322, 2025
-
[10]
Valérie Castin, Pierre Ablin, and Gabriel Peyré. How Smooth is Attention? InProceedings of the 41st International Conference on Machine Learning, ICML’24, pages 5817 – 5840. JMLR.org, 2024
work page 2024
-
[11]
The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
Lénaïc Chizat. The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams. arXiv preprint arXiv:2509.10167, 2025
-
[12]
On the Global Convergence of Gradient Descent for Over- parameterized Models using Optimal Transport
Lenaic Chizat and Francis Bach. On the Global Convergence of Gradient Descent for Over- parameterized Models using Optimal Transport. InAdvances in Neural Information Processing Systems, Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 3040–3050, Montréal, Canada, December 2018
work page 2018
-
[13]
Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. On the Global Convergence of Gradient Descent for Multi-Layer ResNets in the Mean-Field Regime.arXiv preprint arXiv:2110.02926, 2021. 10
-
[14]
Overparameterization of Deep ResNet: Zero Loss and Mean-Field Analysis.J
Zhiyan Ding, Shi Chen, Qin Li, and Stephen Wright. Overparameterization of Deep ResNet: Zero Loss and Mean-Field Analysis.J. Mach. Learn. Res., 23(1), January 2022
work page 2022
-
[15]
An Image is worth 16x16 words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is worth 16x16 words: Transformers for Image Recognition at Scale. InInternational Conference on Learning Representations, 2021
work page 2021
-
[16]
Cambridge Series in Statistical and Proba- bilistic Mathematics
Rick Durrett.Probability: Theory and Examples. Cambridge Series in Statistical and Proba- bilistic Mathematics. Cambridge University Press, 5 edition, 2019
work page 2019
-
[17]
Ryszard Engelking.General Topology, volume 6 ofSigma Series in Pure Mathematics. Helder- mann, Berlin, 1989
work page 1989
-
[18]
Clustering in Deep Stochastic Transformers.arXiv preprint arXiv:2601.21942, 2026
Lev Fedorov, Michaël E Sander, Romuald Elie, Pierre Marion, and Mathieu Laurière. Clustering in Deep Stochastic Transformers.arXiv preprint arXiv:2601.21942, 2026
-
[19]
Takashi Furuya, Maarten V . de Hoop, and Gabriel Peyré. Transformers are Universal In-context Learners. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[20]
Cheng Gao, Yuan Cao, Zihao Li, Yihan He, Mengdi Wang, Han Liu, Jason M. Klusowski, and Jianqing Fan. Global Convergence in Training Large-Scale Transformers. InAdvances in Neural Information Processing Systems, volume 37, pages 29213–29284. Curran Associates, Inc., 2024
work page 2024
-
[21]
Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A Mathematical Perspective on Transformers.Bulletin of the American Mathematical Society, 62(3):427–479, 2025
work page 2025
-
[22]
Understanding the Difficulty of Training Deep Feedforward Neural Networks
Xavier Glorot and Yoshua Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Yee Whye Teh and Mike Titterington, editors,Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 ofProceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy,...
-
[23]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Kloeden.Random Ordinary Differential Equations, pages 15–27
Xiaoying Han and Peter E. Kloeden.Random Ordinary Differential Equations, pages 15–27. Springer Singapore, Singapore, 2017
work page 2017
-
[25]
Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin and Ming-Wei Chang and Kenton Lee and Kristina Toutanova. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (long and short papers), pages 4171–4186, 2019
work page 2019
-
[26]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models.arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[27]
Homogenized Transformers.arXiv preprint arXiv:2604.01978, 2026
Hugo Koubbi, Borjan Geshkovski, and Philippe Rigollet. Homogenized Transformers.arXiv preprint arXiv:2604.01978, 2026
-
[28]
Chaman Kumar and Neelima. On Explicit Milstein-type scheme for McKean–Vlasov Stochastic Differential Equations with Super-Linear Drift Coefficient.Electronic Journal of Probability, 26(none), January 2021
work page 2021
-
[29]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 Technical Report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Muon is Scalable for LLM Training
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is Scalable for LLM Training.arXiv preprint arXiv:2502.16982, 2025. 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In7th International Conference on Learning Representations, ICLR 2019, 2019
work page 2019
-
[32]
Louis-Pierre Chaintron and Lénaïc Chizat and Javier Maas. Resnets of All Shapes and Sizes: Convergence of Training Dynamics in the Large-Scale Limit.arXiv preprint arXiv:2603.18168, 2026
-
[33]
Luigi Ambrosio and Nicola Gigli and Giuseppe Savaré.Absolutely Continuous Curves in P(X) and the Continuity Equation, pages 167–200. Birkhäuser Basel, Basel, 2008
work page 2008
-
[34]
Springer Berlin Heidelberg, Berlin, Heidelberg, 2007
Jin Ma and Jiongmin Yiong.Forward-Backward Stochastic Differential Equations and their Applications, pages 1–24. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007
work page 2007
-
[35]
Implicit Regularization of Deep Residual Networks towards Neural ODEs
Pierre Marion, Yu-Han Wu, Michael Eli Sander, and Gérard Biau. Implicit Regularization of Deep Residual Networks towards Neural ODEs. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[36]
London Mathematical Society Lecture Note Series
Colin McDiarmid.On the Method of Bounded Differences, page 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989
work page 1989
-
[37]
Springer Berlin Heidelberg, Berlin, Heidelberg, 1991
Michel Ledoux and Michel Talagrand.Gaussian Random Variables, pages 54–88. Springer Berlin Heidelberg, Berlin, Heidelberg, 1991
work page 1991
-
[38]
Noboru Isobe. A Convergence Result of a Continuous Model of Deep Learning via Łojasiewicz- Simon inequality.CoRR, abs/2311.15365, 2023
-
[39]
Springer Interna- tional Publishing, Cham, 2021
Olav Kallenberg.Sets and Functions, Measures and Integration, pages 9–32. Springer Interna- tional Publishing, Cham, 2021
work page 2021
-
[40]
Pascal Vincent. A Connection between Score Matching and Denoising Autoencoders.Neural Comput., 23(7):1661–1674, July 2011
work page 2011
-
[41]
Wiley Series in Probability and Statistics
Patrick Billingsley.Convergence of Probability Measures. Wiley Series in Probability and Statistics. Wiley, 2013
work page 2013
-
[42]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023
work page 2023
-
[43]
Rainer Buckdahn and Juan Li and Shige Peng and Catherine Rainer. Mean-Field Stochastic Differential Equations and Associates PDEs.The Annals of Probability, 45(2):824–878, 2017
work page 2017
-
[44]
Grundlehren der Mathematischen Wissenschaften
Daniel Revuz and Marc Yor.Continuous Martingales and Brownian Motion. Grundlehren der Mathematischen Wissenschaften. Springer Berlin Heidelberg, 2013
work page 2013
-
[45]
The Mean-Field Dynamics of Transformers.arXiv preprint arXiv:2512.01868, 2025
Philippe Rigollet. The Mean-Field Dynamics of Transformers.arXiv preprint arXiv:2512.01868, 2025
-
[46]
Sinkformers: Transform- ers with Doubly Stochastic Attention
Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers: Transform- ers with Doubly Stochastic Attention. InInternational Conference on Artificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022
work page 2022
-
[47]
Progress in Nonlinear Differential Equations and Their Applications
Filippo Santambrogio.Optimal Transport for Applied Mathematicians: Calculus of Variations, PDEs, and Modeling. Progress in Nonlinear Differential Equations and Their Applications. Springer International Publishing, 2015
work page 2015
-
[48]
Chapter 4 - Short Course in Continuous Time Dynamic Systems and Control
Vadim Azhmyakov. Chapter 4 - Short Course in Continuous Time Dynamic Systems and Control. InA Relaxation-Based Approach to Optimal Control of Hybrid and Switched Systems, pages 87–126. Butterworth-Heinemann, 2019
work page 2019
-
[49]
Springer International Publishing, Cham, 2016
Sara van de Geer.Symmetrization, Contraction and Concentration, pages 233–238. Springer International Publishing, Cham, 2016
work page 2016
-
[50]
Aad W. van der Vaart and Jon A. Wellner.Symmetrization and Measurability, pages 107–121. Springer New York, New York, NY , 1996. 12
work page 1996
-
[51]
Grundlehren der Mathematischen Wis- senschaften
Cédric Villani.Optimal Transport: Old and New. Grundlehren der Mathematischen Wis- senschaften. Springer Berlin Heidelberg, 2008
work page 2008
-
[52]
Wainwright.Metric Entropy and its Uses, pages 121–158
Martin J. Wainwright.Metric Entropy and its Uses, pages 121–158. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019
work page 2019
-
[53]
Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization
Shuo Xie and Zhiyuan Li. Implicit Bias of AdamW: ℓ∞-Norm Constrained Optimization. InProceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 54488–54510. PMLR, 21–27 Jul 2024
work page 2024
-
[54]
Shuo Xie, Mohamad Amin Mohamadi, and Zhiyuan Li. Adam Exploits ℓ∞-Geometry of Loss Landscape via Coordinate-wise Adaptivity.arXiv preprint arXiv:2410.08198, 2024
-
[55]
On Layer Normalization in the Transformer Architecture
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On Layer Normalization in the Transformer Architecture. InProceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020
work page 2020
-
[56]
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. Deconstructing What Makes a Good Optimizer for Autoregressive Language Models. InThe Thirteenth International Conference on Learning Representations, 2025. A Background Material A.1 Notation & Conventions We shall writert for the time discretisationrt =⌊Lt⌋/L . The symbols ∨...
work page 2025
-
[57]
exp(2βR2 1R2 2) . Proof.The integral triangle inequality gives ∥Γ(x,m,n)∥ ≤ ZZ exp(β⟨θQx, θKy⟩) γ2(βθ T KθQx,m) ∥θT OθV y∥dm(y)dn(θ)≤R 2 2R1. Recall that K, defined in equation (12), can be decomposed into two term. For the first term ∇xH, we have ∥∇xH(x,m,n, a)∥ ≤β∥a∥ Z ∥θT OθV Dxγ(βθ T KθQx,m)θ T KθQ∥opdn(θ)≤2βR 4 2R2 1R3,(42) where we have used the bou...
-
[58]
exp(2βR2 1R2 2) By combining the two preceding estimates, we get Z ∂µH(y,m,n, p)(x)dr(y, p) ≤R 3R2 2(1 + 2βR2 2R2
-
[59]
exp(2βR2 1R2 2)(44) The result concludes by noting that K is bounded above by the sum of the terms on the left-hand sides of equations (42) and (44), and so K is bounded above by the sum of the right-hand sides of these equations. Lemma 17.Let S(R) be defined by 24 and, for i= 1,2 , take xi ∈ ¯B(R1), mi ∈ P( ¯B(R1)) and ni ∈ P(S(R 2)). Then, for everyp≥1,...
-
[60]
=:R X(R0, R2). As a result, Y ξ,R t ∈ ¯B(RX) for every t∈[0,1]P 1-a.s., which implies that PR acts trivially on Y ξ,R, wheneverR > R X. Therefore,Y ξ,R satisfies (53)P 1-a.s., wheneverR > R X. Step Two: Existence and Uniqueness ofp ξ LetY ξ be the unique solution to (53). SinceY ξ 1 ∈ ¯B(RX)P 1-a.s., the terminal condition pξ 1 =∂ µℓ(L1(Y ξ 1 ))(Y ξ 1 )∈L...
-
[61]
As a result, the existence, uniqueness and measurability of Y x,ξ,p x,ξ follows directly fromSteps TwoandThreeof the proof of Lemma 21. Therefore, only the Lipschitz continuity with respect to the initial condition requires further study. Lemma 3.1 in [43], with identically zero diffusion terms, establishes that there exists a constant ΛY depending only o...
-
[62]
Accordingly, the difference in terminal conditions for pcan be bounded using Assumption 2 to get ∥px1,ξ1 1 −p x2,ξ2 1 ∥2 ≤Λ 2 ℓ(RX) W2(L1(Y ξ1 1 ),L 1(Y ξ2 1 )) +∥Y x1,ξ1 1 −Y x2,ξ2 1 ∥ 2 ≤4Λ 2 ℓ(RX)Λ2 Y ∥x1 −x 2∥2 + 3W2 2 (L1(ξ1),L 1(ξ2)) . Then, by applying the above estimate, Young’s and Cauchy-Schwarz inequalities and the Lipschitz continuity ofK R, w...
-
[63]
sup (x,ζ)∈D M x,ζ l,τ # ≤C 3 3X j=1 E1
The constant cR is chosen so that ∥R(θ)∥∞ ≤λ −1 implies θ∈S(c Rλ−1). As a result, supp π⊆S(c Rλ−1)⊆S(R θ), where Rθ =B βcRλ−1, since Bβ >1 , as seen in Lemma 43. Also, νt,0 =π is deterministic and constant in t, and so satisfies the measurability and continuity requirements. Finally,Φ t,0[ν](θ) =θis deterministic and thus measurable. Inductive Hypothesis:...
-
[64]
Nevertheless, as noted in Remark 39, the conclusion of Lemma 33 still holds for the second term of K. Furthermore, there does not exist a singular eJ such that J γµ 1 or J γµ 2 satisfy Assumption 8, where these functions correspond to f γµ 1 and f γµ 2 according to definition (32). However, as shown in Section J.1, J γµ 1 , Jγµ 2 and J γµ 3 can be express...
-
[65]
The second equality uses the initial condition vR 0 = 0
By substituting in the recursion relation for vR j given in Algorithm 1, we obtain jX i=1 βj−i 1 (R(gi))⊙2 ( q vR j +ε j)⊙2 = 1 1−β 2 jX i=1 βj−i 1 vR i −β 2vR i−1 ( q vR j +ε j)⊙2 = 1 1−β 2 vR j ( q vR j +ε j)⊙2 + 1 1−β 2 j−1X i=1 (β1 −β 2) βj−i−1 1 vR i ( q vR j +ε j)2 . The second equality uses the initial condition vR 0 = 0. Since (β1 −β 2)<0 and each...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.