Transformer as an Euler Discretization of Score-based Variational Flow
Pith reviewed 2026-05-08 06:17 UTC · model grok-4.3
The pith
The Transformer architecture is exactly the forward Euler discretization of spherical Score-based Variational Flow.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Forward Euler discretization of spherical SVFlow exactly recovers the Transformer. The state update at each layer is the discrete step that integrates the SVFlow vector field, and the vector field itself is the variational-posterior-weighted average of score functions. Multi-head attention realizes this vector field via a von Mises-Fisher (vMF) kernel that smooths the posterior over tokens; the MoE or FFN block computes a relaxed, network-based approximation of the same field; and the residual-normalization step performs the spherical retraction required to keep the trajectory on the manifold. The resulting discrete dynamics therefore inherit the variational consistency of the continuous flow.
What carries the argument
Spherical Score-based Variational Flow (SVFlow), a continuous-time dynamical system whose vector field is the variational posterior-weighted average of conditional log-likelihood scores; its forward Euler discretization supplies the exact layer update rule.
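To make the update rule concrete, the following is a sketch and not the paper's own notation. The symbols $x_\ell$ (state at layer $\ell$), $h$ (Euler step size), $\mu_z$ (unit-norm component directions), $\kappa$ (vMF concentration), $q$ (posterior under a uniform prior), and the retraction $R$ are assumptions introduced here for illustration, as is the choice of a vMF likelihood $p(x \mid z) \propto \exp(\kappa\, \mu_z^\top x)$.

```latex
\begin{align}
q(z \mid x) &= \frac{\exp\!\left(\kappa\, \mu_z^\top x\right)}
                    {\sum_{z'} \exp\!\left(\kappa\, \mu_{z'}^\top x\right)}
  && \text{(posterior weights: a softmax over scaled inner products)} \\
\nabla_x \log p(x \mid z) &= \kappa\, \mu_z
  && \text{(score of the assumed vMF likelihood)} \\
v(x) &= \left(I - x x^\top\right) \sum_z q(z \mid x)\, \kappa\, \mu_z
  && \text{(posterior-weighted score, projected to the tangent space)} \\
x_{\ell+1} &= R\!\left(x_\ell + h\, v(x_\ell)\right), \qquad R(y) = \frac{y}{\lVert y \rVert}
  && \text{(forward Euler step followed by retraction)}
\end{align}
```

Under these assumptions the first line has the same form as attention weights and the last line has the same form as a residual update followed by normalization. Whether the paper's definitions coincide with this sketch is not established here.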
If this is right
- Multi-head attention computes a kernel-smoothed approximation to the SVFlow vector field at each layer.
- The residual-normalization block maintains the spherical geometry required by the continuous flow.
- MoE and FFN layers serve as relaxed, network-based approximations to the same vector field.
- Variational consistency supplies an implicit regularization that explains stable training of attention without explicit penalties.
- SVFlow-derived metrics on prefix-shuffled inputs correlate with downstream task performance and exhibit depth-dependent sensitivity.
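The page does not define prefix shuffling precisely. One plausible reading, sketched below, is that the first `prefix_len` tokens of an input are permuted while the remainder is left in order. The function name `shuffle_prefix` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def shuffle_prefix(tokens, prefix_len, seed=0):
    """Return a copy of `tokens` whose first `prefix_len` items are permuted.

    Assumption: "prefix shuffling" means a random permutation of the leading
    tokens; the paper may define it differently.
    """
    if not 0 <= prefix_len <= len(tokens):
        raise ValueError("prefix_len must lie in [0, len(tokens)]")
    rng = random.Random(seed)          # local RNG so results are reproducible
    prefix = list(tokens[:prefix_len])
    rng.shuffle(prefix)                # in-place permutation of the prefix only
    return prefix + list(tokens[prefix_len:])
```

A metric computed on the original and the shuffled input can then be compared layer by layer; the suffix is untouched, so any difference is attributable to the reordered prefix.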
Where Pith is reading between the lines
- Architectures that replace attention or normalization with other operators can be evaluated by checking whether they still correspond to an Euler step on a comparable flow.
- The continuous-time perspective suggests that training Transformers at different depths or with different step sizes may be interpretable as numerical integration of the same underlying ODE.
- Changing the underlying manifold or the form of the score function could generate new discrete architectures that inherit the same stability properties.
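The numerical-integration reading in the list above can be illustrated with a toy experiment: integrate one fixed vector field over the same total time with different depths, and check that the endpoints approach a fine-step reference. This is a minimal sketch under stated assumptions. The vector field `vmf_field` (a softmax-weighted average of directions `mus`, projected to the tangent space) is an illustrative stand-in, not the paper's SVFlow definition, and all names are hypothetical.

```python
import numpy as np


def vmf_field(x, mus, kappa):
    """Assumed toy field: posterior-weighted average of vMF scores kappa * mu_z."""
    logits = kappa * (mus @ x)
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w /= w.sum()
    drift = kappa * (w @ mus)
    return drift - (drift @ x) * x      # remove the radial part (tangent projection)


def integrate(x0, mus, kappa, total_time, depth):
    """Forward Euler with `depth` steps, renormalizing after every step."""
    h = total_time / depth
    x = x0 / np.linalg.norm(x0)
    for _ in range(depth):
        y = x + h * vmf_field(x, mus, kappa)   # residual-style update
        x = y / np.linalg.norm(y)              # retraction to the unit sphere
    return x


rng = np.random.default_rng(0)
mus = rng.normal(size=(5, 8))
mus /= np.linalg.norm(mus, axis=1, keepdims=True)
x0 = rng.normal(size=8)

# A very deep integration serves as a stand-in for the continuous-time solution.
reference = integrate(x0, mus, kappa=2.0, total_time=1.0, depth=4096)
errors = [
    np.linalg.norm(integrate(x0, mus, 2.0, 1.0, depth) - reference)
    for depth in (4, 16, 64)
]
```

In this toy setting `errors` holds the endpoint discrepancy for depths 4, 16, and 64. A trained Transformer has different parameters at every layer, so the analogy holds only to the extent that those layers approximate a common field.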
Load-bearing premise
The variational posterior over next tokens can be faithfully approximated by the von Mises-Fisher kernel inside multi-head attention, and the layer-norm-plus-residual block exactly matches the spherical retraction.
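The second half of this premise can be probed numerically. The sketch below compares a residual update followed by projection onto the unit sphere with the exact geodesic step (the exponential map). It assumes plain unit-norm projection; standard LayerNorm also subtracts the mean and applies a learned gain and bias, and RMSNorm without gain differs from unit-norm projection by a factor of the square root of the dimension, so this is an idealization. Function names are hypothetical.

```python
import numpy as np


def exp_map(x, v, h):
    """Exact geodesic step on the unit sphere from x along tangent v for time h."""
    speed = np.linalg.norm(v)
    if speed == 0.0:
        return x.copy()
    return np.cos(h * speed) * x + np.sin(h * speed) * (v / speed)


def residual_then_normalize(x, v, h):
    """Residual update followed by projection back onto the unit sphere."""
    y = x + h * v
    return y / np.linalg.norm(y)


def retraction_gap(h, dim=16, seed=0):
    """Distance between the two updates for a random unit-speed tangent direction."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    x /= np.linalg.norm(x)
    v = rng.normal(size=dim)
    v -= (v @ x) * x            # make v tangent to the sphere at x
    v /= np.linalg.norm(v)      # unit speed, so h is the geodesic distance
    return np.linalg.norm(exp_map(x, v, h) - residual_then_normalize(x, v, h))
```

For a unit-speed tangent direction the two updates agree up to a gap that shrinks roughly like the cube of `h` for small `h`, so the match is close for small steps but not exact. Calling `retraction_gap` at a few step sizes shows how quickly the idealized block departs from the geodesic as the step grows.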
What would settle it
Measure whether the attention weights deviate systematically from the posterior weights that would be produced by the vMF kernel on the same token embeddings; a large, consistent mismatch would falsify the claimed equivalence.
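A minimal sketch of such a measurement follows, assuming the vMF posterior weights reduce to a softmax over `kappa` times the cosine similarity between a query and the keys (uniform prior, shared concentration). The function names, the use of KL divergence as the mismatch score, and the way `attn_row` is obtained from a model are all assumptions made here for illustration.

```python
import numpy as np


def vmf_posterior_weights(query, keys, kappa):
    """Softmax of kappa * cosine similarity: vMF posterior under a uniform prior."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = kappa * (k @ q)
    w = np.exp(logits - logits.max())   # numerically stable softmax
    return w / w.sum()


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same tokens."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))


def attention_vs_vmf_gap(attn_row, query, keys, kappa):
    """Mismatch between one observed attention row and the vMF-implied weights."""
    return kl_divergence(attn_row, vmf_posterior_weights(query, keys, kappa))
```

In practice `attn_row` would be one row of a head's attention matrix and `query`, `keys` the corresponding embeddings. Since `kappa` is a free parameter, a fair test would fit it per head before judging whether the remaining gap is large and systematic.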
Original abstract
Despite the Transformer's dominance across machine learning, its architecture remains largely heuristic and lacks a unified theoretical foundation. We introduce Score-based Variational Flow (SVFlow), a continuous-time dynamical system for representation learning in which the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores, and provide a principled basis for regularization through variational consistency. We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry. This unification explains why attention trains stably without explicit regularization while MoE requires auxiliary balancing losses. Experiments on pre-trained language models with prefix shuffling show that SVFlow-induced metrics correlate with task performance, reveal depth-dependent sensitivity, and reflect the intrinsic dynamics of attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Score-based Variational Flow (SVFlow) as a continuous-time dynamical system for representation learning, where the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores. It claims that the forward Euler discretization of the spherical SVFlow exactly recovers the Transformer architecture, with multi-head attention approximating the SVFlow vector field via a vMF kernel-smoothed posterior, MoE/FFN providing a relaxed approximation, and the residual-normalization block implementing a relaxed retraction to maintain spherical geometry. Experiments on pre-trained language models using prefix shuffling demonstrate that SVFlow-induced metrics correlate with task performance and reveal depth-dependent sensitivities.
Significance. If the central claim of exact recovery holds, the paper would offer a significant theoretical unification of the Transformer architecture with continuous dynamical systems, providing a principled explanation for its training stability without explicit regularization and the need for auxiliary losses in MoE. The experimental validation suggests that SVFlow metrics could serve as diagnostic tools for model analysis. The work builds on ideas from score-based models and variational inference, potentially opening avenues for deriving new architectures from continuous flows.
major comments (3)
- [Abstract] The abstract asserts that 'forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture' yet immediately qualifies multi-head attention as approximating 'via a vMF kernel-smoothed posterior' and the residual-normalization as a 'relaxed retraction'. This tension indicates that the recovery may depend on additional modeling assumptions rather than being exact, which is central to the paper's unification claim and requires explicit resolution.
- [Sections 3 and 4] The mappings in Sections 3 and 4 equate the vMF kernel in attention to the variational posterior over next-token scores and the layer-norm plus residual to the spherical retraction operator. However, these are presented as direct correspondences without deriving the necessity of the vMF form or the exact geometric equivalence from the SVFlow ODE definition alone. If these are choices rather than consequences, the discretization yields an analogous but not identical architecture.
- [SVFlow Definition] The SVFlow vector field is constructed using the same conditional log-likelihood scores that the Transformer is trained to predict. This setup risks circularity, where the Euler discretization recovers the Transformer by construction through the shared score definition, rather than providing an independent derivation of the architecture from the continuous system.
minor comments (2)
- [Experiments] The description of the prefix shuffling experiments could be expanded to include precise definitions of how the SVFlow-induced metrics are calculated from the model outputs.
- [Notation] Ensure consistent use of notation for the spherical geometry and retraction operators throughout the manuscript.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the scope of our unification claim. We address each major point below, revising the manuscript to resolve ambiguities in the abstract and strengthen the derivations in Sections 3 and 4. The core claim remains that the Transformer block structure arises exactly from the forward Euler discretization under the specified SVFlow components.
Point-by-point responses
- Referee: [Abstract] The abstract asserts that 'forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture' yet immediately qualifies multi-head attention as approximating 'via a vMF kernel-smoothed posterior' and the residual-normalization as a 'relaxed retraction'. This tension indicates that the recovery may depend on additional modeling assumptions rather than being exact, which is central to the paper's unification claim and requires explicit resolution.
  Authors: We agree that the original abstract wording created ambiguity between 'exact recovery' and the qualifiers 'approximating' and 'relaxed'. The exactness refers to the mathematical equivalence of the discretized ODE step to the Transformer block when the vector field uses the vMF-smoothed posterior and the retraction is implemented via residual-plus-normalization. The qualifiers describe the practical realization of these SVFlow elements. We have revised the abstract to read: 'We show that the forward Euler discretization of spherical SVFlow recovers the Transformer architecture exactly, with multi-head attention implementing the vector field via a vMF kernel-smoothed posterior and the residual-normalization block implementing a relaxed retraction to the sphere.' This removes the tension while preserving the claim.
  revision: yes
- Referee: [Sections 3 and 4] The mappings in Sections 3 and 4 equate the vMF kernel in attention to the variational posterior over next-token scores and the layer-norm plus residual to the spherical retraction operator. However, these are presented as direct correspondences without deriving the necessity of the vMF form or the exact geometric equivalence from the SVFlow ODE definition alone. If these are choices rather than consequences, the discretization yields an analogous but not identical architecture.
  Authors: We have expanded Sections 3 and 4 with explicit derivations. In Section 3, we show that the vMF kernel arises directly from the exponential-family form of the variational posterior over score directions on the sphere, which is the maximum-entropy distribution consistent with the SVFlow variational objective; this is not an arbitrary choice but the canonical one for the spherical geometry. In Section 4, we derive that the residual connection followed by normalization is the first-order Taylor approximation to the spherical retraction operator needed to enforce the manifold constraint after the Euler update. While other kernels are mathematically possible, the vMF form is the one that yields the attention mechanism exactly, making the discretized architecture identical to the Transformer rather than merely analogous. These additions clarify the derivations from the ODE.
  revision: yes
- Referee: [SVFlow Definition] The SVFlow vector field is constructed using the same conditional log-likelihood scores that the Transformer is trained to predict. This setup risks circularity, where the Euler discretization recovers the Transformer by construction through the shared score definition, rather than providing an independent derivation of the architecture from the continuous system.
  Authors: The SVFlow is introduced as a standalone continuous-time dynamical system whose vector field is the posterior-weighted average of conditional scores (gradients of log-likelihoods), defined without reference to any discrete model. The Transformer is shown to emerge when this continuous system is discretized and the scores are supplied by a network trained to predict them. This is not circularity but the intended unification: the continuous flow provides the independent dynamical principle, and the architecture is the discretization that realizes it. We have added a clarifying paragraph in the introduction emphasizing that the ODE is primary and the Transformer is its exact discrete counterpart under the stated approximations.
  revision: partial
Circularity Check
SVFlow vector field defined via vMF-smoothed conditional scores; Euler step recovers Transformer by construction once vMF and retraction are posited as exact matches
specific steps
- self-definitional [Abstract and Sections 3-4]
  "We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry."
  The SVFlow vector field is defined (Section 3) as evolving according to a variational posterior-weighted average of conditional log-likelihood scores. The paper then asserts that attention computes precisely this average once a vMF kernel is chosen and that layer-norm+residual is the retraction. With those identifications granted, the Euler discretization step reproduces the Transformer equations by algebraic substitution; the 'recovery' is therefore definitional rather than an independent derivation from a prior continuous dynamics.
- ansatz smuggled in via citation [Section 3 (definition of spherical SVFlow)]
  "the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores"
  The continuous-time vector field is constructed so that its discretization will match attention once the vMF kernel and spherical retraction are inserted. No derivation from first principles shows why the variational posterior must be smoothed by vMF or why the geometry must be spherical; both are ansatzes chosen to produce the Transformer block.
full rationale
The central claim is that one forward Euler step on spherical SVFlow yields the standard Transformer block. This holds only because the paper defines the SVFlow vector field itself as a variational-posterior-weighted average of conditional log-likelihood scores, then states that multi-head attention implements exactly that average via a vMF kernel while residual+layer-norm implements the spherical retraction. Both correspondences are introduced as modeling choices rather than derived from an independent continuous-time definition; once they are granted, the discretization step is tautological. No external benchmark or parameter-free derivation is supplied to show the mapping is forced rather than fitted. The remainder of the paper (experiments on pre-trained models) is downstream of this equivalence and does not break the circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- Euler step size
- vMF concentration parameter
axioms (2)
- domain assumption: The state lives on the unit sphere and the retraction after each step preserves this geometry
- ad hoc to paper: The variational posterior can be represented by a vMF kernel-smoothed distribution over tokens
invented entities (1)
- Score-based Variational Flow (SVFlow): no independent evidence