Transformer as an Euler Discretization of Score-based Variational Flow

Huadong Liao

arxiv: 2604.23740 · v1 · submitted 2026-04-26 · 💻 cs.LG

Pith reviewed 2026-05-08 06:17 UTC · model grok-4.3

classification 💻 cs.LG

keywords Transformer · Score-based Variational Flow · SVFlow · Euler discretization · attention mechanism · variational inference · spherical geometry · unification

The pith

The Transformer architecture is exactly the forward Euler discretization of spherical Score-based Variational Flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Score-based Variational Flow (SVFlow) as a continuous dynamical system in which representations evolve according to a weighted average of conditional log-likelihood scores, with weights given by a variational posterior. It then shows that a forward Euler step on the spherical version of this flow produces the sequence of operations found in a Transformer block. Multi-head attention arises as a kernel-smoothed approximation to the underlying vector field, the feed-forward network supplies a relaxed, network-based approximation to the same field, and the residual-plus-layer-norm block implements a retraction that keeps the state on the sphere. This view supplies a single dynamical principle that explains both the architecture and why attention trains stably while mixture-of-experts variants need extra balancing losses.

Core claim

Forward Euler discretization of spherical SVFlow exactly recovers the Transformer. The state update at each layer is the discrete step that integrates the SVFlow vector field, where the vector field itself is the variational-posterior-weighted average of score functions. Multi-head attention realizes this vector field via a von Mises-Fisher (vMF) kernel that smooths the posterior over tokens, the MoE or FFN block approximates the same vector field in a relaxed, network-based form, and the residual-normalization step performs the spherical retraction required to keep the trajectory on the manifold. The resulting discrete dynamics therefore inherit the variational consistency of the continuous flow.
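
As a schematic reading of this claim, the flow and its Euler step might be written as follows. The symbols $x_\ell$ (state at layer $\ell$), $h$ (step size), $q$, $p$, $\kappa_z$, $\mu_z$, and the retraction $\mathcal{R}$ are notation assumed here for illustration and are not copied from the paper.

```latex
\begin{aligned}
\frac{dx}{dt} &= v(x) = \sum_{z} q(z \mid x)\, \nabla_x \log p(x \mid z)
  && \text{(posterior-weighted average of scores)} \\
\nabla_x \log p(x \mid z) &= \kappa_z \mu_z
  && \text{(Euclidean gradient when } p(x \mid z) \propto \exp(\kappa_z \mu_z^{\top} x)\text{, i.e. vMF)} \\
x_{\ell+1} &= \mathcal{R}\bigl(x_\ell + h\, v(x_\ell)\bigr),
  \qquad \mathcal{R}(y) = \frac{y}{\lVert y \rVert}
  && \text{(forward Euler step, then retraction to the sphere)}
\end{aligned}
```

Under this reading the residual connection is the term $x_\ell + h\, v(x_\ell)$ and normalization plays the role of $\mathcal{R}$. Whether the paper uses the Euclidean score or its tangent-space projection is not stated in this summary.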

What carries the argument

Spherical Score-based Variational Flow (SVFlow), a continuous-time dynamical system whose vector field is the variational posterior-weighted average of conditional log-likelihood scores; its forward Euler discretization supplies the exact layer update rule.
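
A minimal NumPy sketch of one such update, under simplifying assumptions that are not the paper's: a single head, no learned projections, token directions serving as both keys and values, and plain L2 normalization in place of layer norm. The names `normalize`, `vmf_attention_field`, `euler_step`, `kappa`, and `h` are hypothetical.

```python
import numpy as np

def normalize(x):
    """Project each row onto the unit sphere (rows are assumed non-zero)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def vmf_attention_field(x, kappa):
    """Toy vector field: vMF-kernel-weighted average of token directions.

    x     : (n_tokens, d) array of unit vectors (assumed already normalized)
    kappa : concentration of the kernel (a free parameter in this sketch)
    Returns the tangent-space vector field and the weight matrix.
    """
    logits = kappa * (x @ x.T)                   # vMF log-kernel between unit vectors
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)        # softmax rows = normalized kernel weights
    v = w @ x                                    # weighted average of directions
    # Remove the radial component so the field is tangent to the sphere at each x_i.
    v_tan = v - np.sum(v * x, axis=-1, keepdims=True) * x
    return v_tan, w

def euler_step(x, h, kappa):
    """One forward Euler step followed by retraction onto the sphere."""
    v, _ = vmf_attention_field(x, kappa)
    # x + h*v has norm >= 1 for tangent v, so the division in normalize is safe.
    return normalize(x + h * v)
```

Calling `euler_step` repeatedly on a matrix of unit-norm rows plays the role of stacking layers in this toy picture; each output row stays on the unit sphere.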

If this is right

Multi-head attention computes a kernel-smoothed approximation to the SVFlow vector field at each layer.
The residual-normalization block maintains the spherical geometry required by the continuous flow.
MoE and FFN layers serve as relaxed, network-based approximations to the same vector field.
Variational consistency supplies an implicit regularization that explains stable training of attention without explicit penalties.
SVFlow-derived metrics on prefix-shuffled inputs correlate with downstream task performance and exhibit depth-dependent sensitivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures that replace attention or normalization with other operators can be evaluated by checking whether they still correspond to an Euler step on a comparable flow.
The continuous-time perspective suggests that training Transformers at different depths or with different step sizes may be interpretable as numerical integration of the same underlying ODE.
Changing the underlying manifold or the form of the score function could generate new discrete architectures that inherit the same stability properties.

Load-bearing premise

The variational posterior over next tokens can be faithfully approximated by the von Mises-Fisher kernel inside multi-head attention, and the layer-norm-plus-residual block exactly matches the spherical retraction.

What would settle it

Measure whether the attention weights deviate systematically from the posterior weights that would be produced by the vMF kernel on the same token embeddings; a large, consistent mismatch would falsify the claimed equivalence.
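
One way such a check could be set up is sketched below. It is an illustration only: the use of row-wise KL divergence as the mismatch measure, the single-head setting, and the names `vmf_weights`, `row_kl`, and `attention_vmf_mismatch` are assumptions made here, not the paper's protocol.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vmf_weights(q, k, kappa):
    """Normalized vMF-kernel weights between unit-normalized queries and keys."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    return softmax(kappa * (qn @ kn.T))

def row_kl(p, q, eps=1e-12):
    """KL(p || q) for each row of two row-stochastic matrices."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def attention_vmf_mismatch(q, k, kappa):
    """Per-query KL between scaled dot-product attention and vMF weights."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))         # standard attention weights
    ref = vmf_weights(q, k, kappa)               # reference vMF-kernel weights
    return row_kl(attn, ref)
```

When queries and keys already have unit norm and `kappa` equals `1 / sqrt(d)`, the two weight matrices coincide and the mismatch is zero up to floating-point error; rescaling the queries makes it positive. A real test would use queries and keys extracted from a trained model and would need a principled choice of `kappa`.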

Figures

Figures reproduced from arXiv: 2604.23740 by Huadong Liao.

**Figure 1.** Effect of regularization strength β on a Gaussian SVFlow. (a): Synthetic 2D dataset, colored by class. (b): Training with only the variational consistency objective (β → ∞). The vector field (arrows) is orthogonal to the ELBO contours (background), following the steepest descent direction. All points converge to a single ELBO optimum (red star), indicating posterior collapse onto a single Gaussian componen…

**Figure 2.** Layer-wise ∆(−log p_t) for attention layers under different prefix shuffling rates. For both models, deep layers exhibit positive ∆ while shallow layers remain near zero, indicating deeper attention representations are more sensitive to context disruption.

Value and Output Projection as Conditional Score. Similarly, let the conditional distribution p(x|z) be vMF with parameters κ′_z μ′_z = W_{o,h} W^⊤_v…

**Figure 3.** Layer-wise evolution of divergence KL(q∥p) and concentrations KL(q∥U) and KL(p∥U) under baseline (solid) and highest shuffle rate (dashed) conditions. The y-axis is logarithmic. While KL(q∥U) remains in a similar moderate range across all three models, KL(p∥U) differs dramatically: it stays low and flat for Llama3.2, rises gradually for Qwen2.5, and surges at early layers for Qwen3. Consequently, KL(q∥p) r…

**Figure 4.** Layer-wise vMF concentration κ (log scale) for Llama3.2, Qwen2.5, and Qwen3 under baseline (solid) and perturbed (dashed) conditions. κ determines KL(p∥U) and differs by orders of magnitude across models; perturbation leaves these orders unchanged, indicating that the sharpness of p is unaffected by context shuffling.

Original abstract

Despite the Transformer's dominance across machine learning, its architecture remains largely heuristic and lacks a unified theoretical foundation. We introduce Score-based Variational Flow (SVFlow), a continuous-time dynamical system for representation learning in which the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores, and provide a principled basis for regularization through variational consistency. We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry. This unification explains why attention trains stably without explicit regularization while MoE requires auxiliary balancing losses. Experiments on pre-trained language models with prefix shuffling show that SVFlow-induced metrics correlate with task performance, reveal depth-dependent sensitivity, and reflect the intrinsic dynamics of attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper claims the Transformer is exactly the forward Euler step of a spherical score-based variational flow, but that equivalence rests on treating the vMF kernel and residual-norm as precise matches rather than modeling choices.

The main point is that this work frames the Transformer as the result of one forward Euler discretization on a continuous spherical SVFlow, where the state evolves by a variational posterior-weighted average of conditional log-likelihood scores. If the math checks out without extra assumptions, it could give a cleaner way to think about stability and regularization in large models. The authors map multi-head attention to a vMF kernel smoothing of the vector field, MoE or FFN to a relaxed network approximation, and the residual plus layer-norm to a retraction that keeps things on the sphere. They also use this to explain why attention needs no extra balancing loss while MoE does. The experiments shuffle prefixes on pre-trained language models and show that SVFlow-derived metrics track task performance and depth sensitivity, which is a concrete check even if not the strongest possible test of the derivation.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Score-based Variational Flow (SVFlow) as a continuous-time dynamical system for representation learning, where the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores. It claims that the forward Euler discretization of the spherical SVFlow exactly recovers the Transformer architecture, with multi-head attention approximating the SVFlow vector field via a vMF kernel-smoothed posterior, MoE/FFN providing a relaxed approximation, and the residual-normalization block implementing a relaxed retraction to maintain spherical geometry. Experiments on pre-trained language models using prefix shuffling demonstrate that SVFlow-induced metrics correlate with task performance and reveal depth-dependent sensitivities.

Significance. If the central claim of exact recovery holds, the paper would offer a significant theoretical unification of the Transformer architecture with continuous dynamical systems, providing a principled explanation for its training stability without explicit regularization and the need for auxiliary losses in MoE. The experimental validation suggests that SVFlow metrics could serve as diagnostic tools for model analysis. The work builds on ideas from score-based models and variational inference, potentially opening avenues for deriving new architectures from continuous flows.

major comments (3)

[Abstract] The abstract asserts that 'forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture' yet immediately qualifies multi-head attention as approximating 'via a vMF kernel-smoothed posterior' and the residual-normalization as a 'relaxed retraction'. This tension indicates that the recovery may depend on additional modeling assumptions rather than being exact, which is central to the paper's unification claim and requires explicit resolution.
[Sections 3 and 4] The mappings in Sections 3 and 4 equate the vMF kernel in attention to the variational posterior over next-token scores and the layer-norm plus residual to the spherical retraction operator. However, these are presented as direct correspondences without deriving the necessity of the vMF form or the exact geometric equivalence from the SVFlow ODE definition alone. If these are choices rather than consequences, the discretization yields an analogous but not identical architecture.
[SVFlow Definition] The SVFlow vector field is constructed using the same conditional log-likelihood scores that the Transformer is trained to predict. This setup risks circularity, where the Euler discretization recovers the Transformer by construction through the shared score definition, rather than providing an independent derivation of the architecture from the continuous system.

minor comments (2)

[Experiments] The description of the prefix shuffling experiments could be expanded to include precise definitions of how the SVFlow-induced metrics are calculated from the model outputs.
[Notation] Ensure consistent use of notation for the spherical geometry and retraction operators throughout the manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope of our unification claim. We address each major point below, revising the manuscript to resolve ambiguities in the abstract and strengthen the derivations in Sections 3 and 4. The core claim remains that the Transformer block structure arises exactly from the forward Euler discretization under the specified SVFlow components.

Referee: [Abstract] The abstract asserts that 'forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture' yet immediately qualifies multi-head attention as approximating 'via a vMF kernel-smoothed posterior' and the residual-normalization as a 'relaxed retraction'. This tension indicates that the recovery may depend on additional modeling assumptions rather than being exact, which is central to the paper's unification claim and requires explicit resolution.

Authors: We agree that the original abstract wording created ambiguity between 'exact recovery' and the qualifiers 'approximating' and 'relaxed'. The exactness refers to the mathematical equivalence of the discretized ODE step to the Transformer block when the vector field uses the vMF-smoothed posterior and the retraction is implemented via residual-plus-normalization. The qualifiers describe the practical realization of these SVFlow elements. We have revised the abstract to read: 'We show that the forward Euler discretization of spherical SVFlow recovers the Transformer architecture exactly, with multi-head attention implementing the vector field via a vMF kernel-smoothed posterior and the residual-normalization block implementing a relaxed retraction to the sphere.' This removes the tension while preserving the claim. revision: yes
Referee: [Sections 3 and 4] The mappings in Sections 3 and 4 equate the vMF kernel in attention to the variational posterior over next-token scores and the layer-norm plus residual to the spherical retraction operator. However, these are presented as direct correspondences without deriving the necessity of the vMF form or the exact geometric equivalence from the SVFlow ODE definition alone. If these are choices rather than consequences, the discretization yields an analogous but not identical architecture.

Authors: We have expanded Sections 3 and 4 with explicit derivations. In Section 3, we show that the vMF kernel arises directly from the exponential-family form of the variational posterior over score directions on the sphere, which is the maximum-entropy distribution consistent with the SVFlow variational objective; this is not an arbitrary choice but the canonical one for the spherical geometry. In Section 4, we derive that the residual connection followed by normalization is the first-order Taylor approximation to the spherical retraction operator needed to enforce the manifold constraint after the Euler update. While other kernels are mathematically possible, the vMF form is the one that yields the attention mechanism exactly, making the discretized architecture identical to the Transformer rather than merely analogous. These additions clarify the derivations from the ODE. revision: yes
Referee: [SVFlow Definition] The SVFlow vector field is constructed using the same conditional log-likelihood scores that the Transformer is trained to predict. This setup risks circularity, where the Euler discretization recovers the Transformer by construction through the shared score definition, rather than providing an independent derivation of the architecture from the continuous system.

Authors: The SVFlow is introduced as a standalone continuous-time dynamical system whose vector field is the posterior-weighted average of conditional scores (gradients of log-likelihoods), defined without reference to any discrete model. The Transformer is shown to emerge when this continuous system is discretized and the scores are supplied by a network trained to predict them. This is not circularity but the intended unification: the continuous flow provides the independent dynamical principle, and the architecture is the discretization that realizes it. We have added a clarifying paragraph in the introduction emphasizing that the ODE is primary and the Transformer is its exact discrete counterpart under the stated approximations. revision: partial
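
The retraction discussed in the second response can be made concrete with a standard expansion. This is a textbook computation, not the paper's own derivation, and $x$, $v$, and $h$ are notation assumed here. For a unit vector $x$, a tangent vector $v$ with $\langle x, v \rangle = 0$, and step size $h$:

```latex
\begin{aligned}
\lVert x + h v \rVert^{2} &= 1 + h^{2}\lVert v \rVert^{2}, \\
\frac{x + h v}{\lVert x + h v \rVert}
  &= (x + h v)\Bigl(1 - \tfrac{1}{2} h^{2}\lVert v \rVert^{2} + O(h^{4})\Bigr)
   = x + h v - \tfrac{1}{2} h^{2}\lVert v \rVert^{2}\, x + O(h^{3}), \\
\exp_{x}(h v) &= \cos(h\lVert v \rVert)\, x + \sin(h\lVert v \rVert)\, \frac{v}{\lVert v \rVert}
   = x + h v - \tfrac{1}{2} h^{2}\lVert v \rVert^{2}\, x + O(h^{3}).
\end{aligned}
```

Adding an update and renormalizing therefore agrees with the exact geodesic step through second order in $h$. Standard layer norm also subtracts the mean and applies a learned gain, which may be why the abstract describes the retraction as relaxed.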

Circularity Check

2 steps flagged

SVFlow vector field defined via vMF-smoothed conditional scores; Euler step recovers Transformer by construction once vMF and retraction are posited as exact matches

specific steps

self-definitional [Abstract and Sections 3-4]
"We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry."

The SVFlow vector field is defined (Section 3) as evolving according to a variational posterior-weighted average of conditional log-likelihood scores. The paper then asserts that attention computes precisely this average once a vMF kernel is chosen and that layer-norm+residual is the retraction. With those identifications granted, the Euler discretization step reproduces the Transformer equations by algebraic substitution; the 'recovery' is therefore definitional rather than an independent derivation from a prior continuous dynamics.
ansatz smuggled in via citation [Section 3 (definition of spherical SVFlow)]
"the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores"

The continuous-time vector field is constructed so that its discretization will match attention once the vMF kernel and spherical retraction are inserted. No derivation from first principles shows why the variational posterior must be smoothed by vMF or why the geometry must be spherical; both are ansatzes chosen to produce the Transformer block.

full rationale

The central claim is that one forward Euler step on spherical SVFlow yields the standard Transformer block. This holds only because the paper defines the SVFlow vector field itself as a variational-posterior-weighted average of conditional log-likelihood scores, then states that multi-head attention implements exactly that average via a vMF kernel while residual+layer-norm implements the spherical retraction. Both correspondences are introduced as modeling choices rather than derived from an independent continuous-time definition; once they are granted, the discretization step is tautological. No external benchmark or parameter-free derivation is supplied to show the mapping is forced rather than fitted. The remainder of the paper (experiments on pre-trained models) is downstream of this equivalence and does not break the circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on introducing SVFlow as a new dynamical system whose discretization yields the Transformer; the paper therefore adds one invented continuous-time object and several modeling approximations whose independent justification is not visible from the abstract.

free parameters (2)

Euler step size
The discretization step length must be chosen to match the residual scaling used in Transformers.
vMF concentration parameter
Controls how sharply the attention kernel approximates the posterior; appears fitted or chosen to recover standard attention.
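
To illustrate how the concentration parameter controls sharpness, the sketch below computes normalized vMF-kernel weights for one query and reports their entropy at several values of κ. The function names and the random unit-norm keys are hypothetical choices made for this example; nothing here is taken from the paper's experiments.

```python
import numpy as np

def vmf_kernel_weights(query, keys, kappa):
    """Normalized vMF-kernel weights of one unit-norm query over unit-norm keys."""
    logits = kappa * (keys @ query)              # kappa * cosine similarity
    logits = logits - logits.max()               # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def entropy(w, eps=1e-12):
    """Shannon entropy (in nats) of a weight vector."""
    return float(-np.sum(w * np.log(w + eps)))

def entropy_profile(query, keys, kappas):
    """Entropy of the kernel weights for each concentration value."""
    return [entropy(vmf_kernel_weights(query, keys, kap)) for kap in kappas]
```

At κ = 0 the weights are uniform and the entropy equals log n for n keys; as κ grows the weights concentrate on the keys closest to the query and the entropy falls. For n tokens, KL(w∥U) = log n − H(w), so lower entropy corresponds to higher concentration relative to the uniform distribution.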

axioms (2)

domain assumption: The state lives on the unit sphere, and the retraction after each step preserves this geometry.
Invoked to justify the residual-plus-layer-norm block as a spherical retraction.
ad hoc to paper: The variational posterior can be represented by a vMF kernel-smoothed distribution over tokens.
Required for the multi-head attention interpretation; not derived from first principles in the abstract.

invented entities (1)

Score-based Variational Flow (SVFlow) · no independent evidence
purpose: Continuous-time dynamical system whose vector field is a posterior-weighted average of conditional log-likelihood scores
New object introduced to unify the Transformer; no independent empirical handle provided in the abstract.

pith-pipeline@v0.9.0 · 5449 in / 1569 out tokens · 35729 ms · 2026-05-08T06:17:59.343179+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1]

Dalal, and Vishal Misra

Naman Agarwal, Siddhartha R. Dalal, and Vishal Misra. Gradient dynamics of attention: How cross-entropy sculpts bayesian manifolds, 2026

work page 2026
[2]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016

work page 2016
[3]

Dhillon, Joydeep Ghosh, and Suvrit Sra

Arindam Banerjee, Inderjit S. Dhillon, Joydeep Ghosh, and Suvrit Sra. Clustering on the unit hypersphere using von mises-fisher distributions.Journal of Machine Learning Research, 6:1345–1382, 2005

work page 2005
[4]

Blei, Alp Kucukelbir, and Jon D

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians.Journal of the American Statistical Association, 112(518):859–877, April 2017

work page 2017
[5]

A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, page 1–20, 2025

Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models.IEEE Transactions on Knowledge and Data Engineering, page 1–20, 2025

work page 2025
[6]

Neural ordinary differential equations

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. InAdvances in neural information processing systems, pages 6571–6583, 2018

work page 2018
[7]

Switchhead: Accelerating transformers with mixture-of-experts attention, 2024

Robert Csordas, Piotr Piekos, Kazuki Irie, and Jurgen Schmidhuber. Switchhead: Accelerating transformers with mixture-of-experts attention, 2024

work page 2024
[8]

Calibration of Pre-trained Transformers

Shrey Desai and Greg Durrett. Calibration of Pre-trained Transformers. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020
[9]

Naesseth, Max Welling, and Jan-Willem van de Meent

Floor Eijkelboom, Grigory Bartosh, Christian A. Naesseth, Max Welling, and Jan-Willem van de Meent. Variational flow matching for graph generation. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[10]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022

work page 2022
[11]

The emergence of clusters in self-attention dynamics

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 57026–57037. Curran Associates, Inc., 2023

work page 2023
[12]

A mathematical perspective on transformers, 2025

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers, 2025

work page 2025
[13]

Advancing expert specialization for better moe

Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Xinye Cao, Sicong Leng, Qimei Cui, and Xudong Jiang. Advancing expert specialization for better moe. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[14]

Layernorm vs RMSNorm: A geometric perspective and the case against mean subtraction

Akshat Gupta, Atahan Ozdemir, Caoqinwei Gong, and Gopala Anumanchipalli. Layernorm vs RMSNorm: A geometric perspective and the case against mean subtraction. InACL 2025 Student Research Workshop, 2025

work page 2025
[15]

Fuse it more deeply! a variational transformer with layer-wise latent variable inference for text generation, 2022

Jinyi Hu, Xiaoyuan Yi, Wenhao Li, Maosong Sun, and Xing Xie. Fuse it more deeply! a variational transformer with layer-wise latent variable inference for text generation, 2022

work page 2022
[16]

Convergence analysis of probability flow ode for score-based generative models.IEEE Transactions on Information Theory, 71(6):4581–4601, June 2025

Daniel Zhengyu Huang, Jiaoyang Huang, and Zhengjiang Lin. Convergence analysis of probability flow ode for score-based generative models.IEEE Transactions on Information Theory, 71(6):4581–4601, June 2025

work page 2025
[17]

Jacobs, Michael I

Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts.Neural Computation, 3(1):79–87, 1991

work page 1991
[18]

Ode transformer: An ordinary differential equation-inspired model for neural machine translation, 2021

Bei Li, Quan Du, Tao Zhou, Shuhan Zhou, Xin Zeng, Tong Xiao, and Jingbo Zhu. Ode transformer: An ordinary differential equation-inspired model for neural machine translation, 2021

work page 2021
[19]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. 10

work page 2023
[20]

Large-scale wasserstein gradient flows.Advances in Neural Information Processing Systems, 34, 2021

Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale wasserstein gradient flows.Advances in Neural Information Processing Systems, 34, 2021

work page 2021
[21]

Dense backpropagation improves training for sparse mixture- of-experts

Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Thérien, Sambit Sahu, Tom Gold- stein, and Supriyo Chakraborty. Dense backpropagation improves training for sparse mixture- of-experts. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026
[22]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free, 2025

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free, 2025

work page 2025
[23]

Transformers as unrolled inference in probabilistic laplacian eigenmaps

Aditya Ravuri and Neil D Lawrence. Transformers as unrolled inference in probabilistic laplacian eigenmaps. InNeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling, 2025

work page 2025
[24]

Outrageously large neural networks: The sparsely-gated mixture-of- experts layer

Noam Shazeer, *Azalia Mirhoseini, *Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of- experts layer. InInternational Conference on Learning Representations, 2017

work page 2017
[25]

Heejune Sheen, Siyu Chen, Tianhao Wang, and Harrison H. Zhou. Implicit regularization of gradient flow on one-layer softmax attention, 2024

work page 2024
[26]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

work page 2021
[27]

Variational inference, entropy, and orthogonality: A unified theory of mixture-of-experts, 2026

Ye Su and Yong Liu. Variational inference, entropy, and orthogonality: A unified theory of mixture-of-experts, 2026

work page 2026
[28]

Xue-Cheng Tai, Hao Liu, Lingfeng Li, and Raymond H. Chan. A mathematical explanation of transformers for large language models and gpts, 2025

work page 2025
[29]

Trans- formers as support vector machines

Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak. Trans- formers as support vector machines. InNeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023

work page 2023
[30]

Neural ode transformers: Analyzing internal dynamics and adaptive fine-tuning, 2025

Anh Tong, Thanh Nguyen-Tang, Dongeun Lee, Duc Nguyen, Toan Tran, David Hall, Cheong- woong Kang, and Jaesik Choi. Neural ode transformers: Analyzing internal dynamics and adaptive fine-tuning, 2025

work page 2025
[31]

Transformer dissection: A unified understanding of transformer’s attention via the lens of kernel, 2019

Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, and Ruslan Salakhutdinov. Transformer dissection: A unified understanding of transformer’s attention via the lens of kernel, 2019

work page 2019
[32]

Implicit bias and fast conver- gence rates for self-attention.Transactions on Machine Learning Research, 2025

Bhavya Vasudeva, Puneesh Deora, and Christos Thrampoulidis. Implicit bias and fast conver- gence rates for self-attention.Transactions on Machine Learning Research, 2025

work page 2025
[33]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017
[34]

Simulation of the von mises fisher distribution.Communications in Statistics - Simulation and Computation, 23(1):157–164, 1994

Andrew T.A Wood. Simulation of the von mises fisher distribution.Communications in Statistics - Simulation and Computation, 23(1):157–164, 1994

work page 1994
[35]

Efficient streaming language models with attention sinks, 2024

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks, 2024

work page 2024
[36]

Spherical latent spaces for stable variational autoencoders

Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018

work page 2018
[37]

Unifying learning dynamics and generalization in transformers scaling law, 2026

Chiwun Yang. Unifying learning dynamics and generalization in transformers scaling law, 2026

work page 2026
[38]

X z 1z′=z −q ϕℓ(z′|xℓ) qϕℓ(z|xℓ)Aℓ(z) # ∂gϕℓ(xℓ)z′ ∂ϕℓ (48) = X z′

Biao Zhang and Rico Sennrich. Root mean square layer normalization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. 11 A Proofs A.1 SVFlow Evidence Lower Bound We provide the detailed derivation of the instantaneous EL...

work page 2019
