
arxiv: 2604.23740 · v1 · submitted 2026-04-26 · 💻 cs.LG

Transformer as an Euler Discretization of Score-based Variational Flow

Pith reviewed 2026-05-08 06:17 UTC · model grok-4.3

classification 💻 cs.LG
keywords: Transformer, Score-based Variational Flow, SVFlow, Euler discretization, attention mechanism, variational inference, spherical geometry, unification

The pith

The Transformer architecture is exactly the forward Euler discretization of spherical Score-based Variational Flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Score-based Variational Flow as a continuous dynamical system in which representations evolve according to a weighted average of conditional log-likelihood scores, with weights given by a variational posterior. It then shows that taking the forward Euler step on the spherical version of this flow produces the precise sequence of operations found in a Transformer block. Multi-head attention arises as a kernel-smoothed approximation to the underlying vector field, the feed-forward network supplies a relaxed approximation, and the residual-plus-layer-norm block implements a retraction that keeps the state on the sphere. This view supplies a single dynamical principle that explains both the architecture and why attention trains stably while mixture-of-experts variants need extra balancing losses.

Core claim

Forward Euler discretization of spherical SVFlow exactly recovers the Transformer. The state update at each layer is the discrete step that integrates the SVFlow vector field, where the vector field itself is the variational-posterior-weighted average of score functions. Multi-head attention realizes this vector field via a von Mises-Fisher (vMF) kernel that smooths the posterior over tokens; the MoE or FFN block approximates the same vector field in a relaxed, network-based form; and the residual-normalization step performs the (relaxed) spherical retraction that keeps the trajectory on the manifold. The resulting discrete dynamics therefore inherit the variational consistency of the continuous flow.
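The three ingredients named in the core claim can be sketched as one update in NumPy. This is a minimal illustration, not the paper's implementation: the function `euler_block`, the weight matrices `w_q`, `w_k`, `w_v`, the shared concentration `kappa`, and the step size `step` are all hypothetical names, and the sketch assumes a single head, no causal mask, and plain L2 normalization in place of a learned LayerNorm.

```python
import numpy as np

def normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm (stand-in for the spherical retraction)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def euler_block(x, w_q, w_k, w_v, kappa=1.0, step=1.0):
    """One hypothetical single-head layer: attention field, Euler step, retraction.

    x has shape (tokens, dim) with unit-norm rows. All parameter names are
    assumptions made for this sketch.
    """
    q = normalize(x @ w_q)                # unit-norm queries
    k = normalize(x @ w_k)                # unit-norm keys
    weights = softmax(kappa * q @ k.T)    # vMF-style kernel weights over tokens
    field = weights @ (x @ w_v)           # weighted average of value vectors
    return normalize(x + step * field)    # residual step, then back to the sphere

rng = np.random.default_rng(0)
x = normalize(rng.normal(size=(5, 8)))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
y = euler_block(x, w_q, w_k, w_v, kappa=2.0, step=0.5)
print(np.linalg.norm(y, axis=-1))  # every row has norm close to 1
```

The sketch only shows that the pieces compose into a residual-plus-normalization update; whether a trained Transformer block matches this form exactly is the question the rest of this page examines.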

What carries the argument

Spherical Score-based Variational Flow (SVFlow), a continuous-time dynamical system whose vector field is the variational posterior-weighted average of conditional log-likelihood scores; its forward Euler discretization supplies the exact layer update rule.
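In symbols, the object described above can be written roughly as follows. The notation is assumed for illustration and may differ from the paper's: x is the state, z indexes a discrete latent variable, q(z | x) is the variational posterior, p(x | z) the conditional likelihood, h the Euler step size, and R a retraction onto the unit sphere.

```latex
\frac{dx}{dt} = v(x) = \sum_{z} q(z \mid x)\, \nabla_x \log p(x \mid z),
\qquad
x_{\ell+1} = \mathcal{R}\bigl(x_\ell + h\, v(x_\ell)\bigr)
```

The first equation is the continuous flow; the second is one forward Euler step followed by the retraction, which is the candidate layer update.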

If this is right

  • Multi-head attention computes a kernel-smoothed approximation to the SVFlow vector field at each layer.
  • The residual-normalization block maintains the spherical geometry required by the continuous flow.
  • MoE and FFN layers serve as relaxed, network-based approximations to the same vector field.
  • Variational consistency supplies an implicit regularization that explains stable training of attention without explicit penalties.
  • SVFlow-derived metrics on prefix-shuffled inputs correlate with downstream task performance and exhibit depth-dependent sensitivity.
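The second bullet can be checked numerically in a simplified setting. The sketch below compares a residual step followed by L2 normalization against the exact geodesic (exponential map) step on the unit sphere. It assumes a unit-norm tangent vector and plain normalization rather than a learned LayerNorm; the function names `retract` and `exp_map` are illustrative.

```python
import numpy as np

def retract(x, v, h):
    """Residual step then renormalization (projection retraction onto the sphere)."""
    y = x + h * v
    return y / np.linalg.norm(y)

def exp_map(x, v, h):
    """Exact geodesic step on the unit sphere for a tangent vector v at x."""
    speed = np.linalg.norm(v)
    return np.cos(h * speed) * x + np.sin(h * speed) * v / speed

rng = np.random.default_rng(0)
x = rng.normal(size=16)
x /= np.linalg.norm(x)        # point on the unit sphere
v = rng.normal(size=16)
v -= (v @ x) * x              # project onto the tangent space at x
v /= np.linalg.norm(v)        # unit speed, so h is the geodesic angle

for h in (0.4, 0.2, 0.1):
    gap = np.linalg.norm(retract(x, v, h) - exp_map(x, v, h))
    print(f"h={h:.1f}  gap={gap:.2e}")
```

For a tangent direction, the normalized residual step rotates x by arctan(h) while the geodesic step rotates it by h, so the gap shrinks roughly like h cubed. In this idealized setting the two updates agree to low order in the step size, which is the sense in which normalization acts as a retraction.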

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that replace attention or normalization with other operators can be evaluated by checking whether they still correspond to an Euler step on a comparable flow.
  • The continuous-time perspective suggests that training Transformers at different depths or with different step sizes may be interpretable as numerical integration of the same underlying ODE.
  • Changing the underlying manifold or the form of the score function could generate new discrete architectures that inherit the same stability properties.

Load-bearing premise

The variational posterior over next tokens can be faithfully approximated by the von Mises-Fisher kernel inside multi-head attention, and the layer-norm-plus-residual block exactly matches the spherical retraction.
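One part of this premise is a standard identity that is easy to verify: for a vMF mixture with a uniform prior and a shared concentration, the posterior over components is a softmax of scaled dot products, because the vMF normalizing constant is the same for every component and cancels. The sketch below checks this. The names `vmf_posterior` and `log_norm_const` are illustrative, and the shared-concentration, unit-norm setting is an assumption; how the paper treats per-token norms and per-head concentrations is not visible from this page.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def vmf_posterior(x, mus, kappa, log_norm_const=0.0):
    """Posterior over components z for a unit vector x.

    Assumes a uniform prior and a shared concentration kappa, so that
    log p(x | z) = log_norm_const + kappa * <mu_z, x>, with the same
    log_norm_const for every z.
    """
    log_lik = log_norm_const + kappa * (mus @ x)
    log_lik = log_lik - log_lik.max()
    w = np.exp(log_lik)
    return w / w.sum()

rng = np.random.default_rng(1)
mus = rng.normal(size=(6, 4))
mus /= np.linalg.norm(mus, axis=1, keepdims=True)   # unit-norm mean directions
x = rng.normal(size=4)
x /= np.linalg.norm(x)                              # unit-norm query

posterior = vmf_posterior(x, mus, kappa=3.0, log_norm_const=-7.5)
attention = softmax(3.0 * mus @ x)
print(np.allclose(posterior, attention))
```

Standard attention uses unnormalized queries and keys with a fixed 1/sqrt(d) scale, so the identity carries over only if the norms are treated as part of the concentration or the vectors are normalized first. That gap is where the premise carries weight.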

What would settle it

Measure whether the attention weights deviate systematically from the posterior weights that would be produced by the vMF kernel on the same token embeddings; a large, consistent mismatch would falsify the claimed equivalence.
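A minimal sketch of such a measurement, assuming the attention rows and the reference vMF posterior rows have already been extracted as probability vectors over the same tokens. The helper names `kl_divergence` and `mean_row_kl` are hypothetical, KL divergence is one possible choice of mismatch measure, and the page gives no threshold for what counts as a large mismatch.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """KL(q || p) for two discrete distributions over the same support."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def mean_row_kl(attn, reference):
    """Average KL between matching rows of two (queries, tokens) weight matrices."""
    return float(np.mean([kl_divergence(a, r) for a, r in zip(attn, reference)]))

# Toy example: observed attention rows vs. a hypothetical reference posterior.
attn = np.array([[0.5, 0.5], [0.9, 0.1]])
reference = np.array([[0.9, 0.1], [0.9, 0.1]])
print(mean_row_kl(attn, reference))
```

A mean near zero across layers and heads would be consistent with the claimed equivalence; a systematic, layer-independent offset would count against it.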

Figures

Figures reproduced from arXiv: 2604.23740 by Huadong Liao.

Figure 1: Effect of regularization strength β on a Gaussian SVFlow. (a): Synthetic 2D dataset, colored by class. (b): Training with only the variational consistency objective (β → ∞). The vector field (arrows) is orthogonal to the ELBO contours (background), following the steepest descent direction. All points converge to a single ELBO optimum (red star), indicating posterior collapse onto a single Gaussian componen…
Figure 2: Layer-wise Δ(−log p_t) for attention layers under different prefix shuffling rates. For both models, deep layers exhibit positive Δ while shallow layers remain near zero, indicating deeper attention representations are more sensitive to context disruption.
Value and Output Projection as Conditional Score. Similarly, let the conditional distribution p(x|z) be vMF with parameters κ′_z μ′_z = W_{o,h} W^⊤_{v…
Figure 3: Layer-wise evolution of divergence KL(q∥p) and concentrations KL(q∥U) and KL(p∥U) under baseline (solid) and highest shuffle rate (dashed) conditions. The y-axis is logarithmic. While KL(q∥U) remains in a similar moderate range across all three models, KL(p∥U) differs dramatically: it stays low and flat for Llama3.2, rises gradually for Qwen2.5, and surges at early layers for Qwen3. Consequently, KL(q∥p) r…
Figure 4: Layer-wise vMF concentration κ (log scale) for Llama3.2, Qwen2.5, and Qwen3 under baseline (solid) and perturbed (dashed) conditions. κ determines KL(p∥U) and differs by orders of magnitude across models; perturbation leaves these orders unchanged, indicating that the sharpness of p is unaffected by context shuffling.
Original abstract

Despite the Transformer's dominance across machine learning, its architecture remains largely heuristic and lacks a unified theoretical foundation. We introduce Score-based Variational Flow (SVFlow), a continuous-time dynamical system for representation learning in which the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores, and provide a principled basis for regularization through variational consistency. We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry. This unification explains why attention trains stably without explicit regularization while MoE requires auxiliary balancing losses. Experiments on pre-trained language models with prefix shuffling show that SVFlow-induced metrics correlate with task performance, reveal depth-dependent sensitivity, and reflect the intrinsic dynamics of attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Score-based Variational Flow (SVFlow) as a continuous-time dynamical system for representation learning, where the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores. It claims that the forward Euler discretization of the spherical SVFlow exactly recovers the Transformer architecture, with multi-head attention approximating the SVFlow vector field via a vMF kernel-smoothed posterior, MoE/FFN providing a relaxed approximation, and the residual-normalization block implementing a relaxed retraction to maintain spherical geometry. Experiments on pre-trained language models using prefix shuffling demonstrate that SVFlow-induced metrics correlate with task performance and reveal depth-dependent sensitivities.

Significance. If the central claim of exact recovery holds, the paper would offer a significant theoretical unification of the Transformer architecture with continuous dynamical systems, providing a principled explanation for its training stability without explicit regularization and the need for auxiliary losses in MoE. The experimental validation suggests that SVFlow metrics could serve as diagnostic tools for model analysis. The work builds on ideas from score-based models and variational inference, potentially opening avenues for deriving new architectures from continuous flows.

major comments (3)
  1. [Abstract] The abstract asserts that 'forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture' yet immediately qualifies multi-head attention as approximating 'via a vMF kernel-smoothed posterior' and the residual-normalization as a 'relaxed retraction'. This tension indicates that the recovery may depend on additional modeling assumptions rather than being exact, which is central to the paper's unification claim and requires explicit resolution.
  2. [Sections 3 and 4] The mappings in Sections 3 and 4 equate the vMF kernel in attention to the variational posterior over next-token scores and the layer-norm plus residual to the spherical retraction operator. However, these are presented as direct correspondences without deriving the necessity of the vMF form or the exact geometric equivalence from the SVFlow ODE definition alone. If these are choices rather than consequences, the discretization yields an analogous but not identical architecture.
  3. [SVFlow Definition] The SVFlow vector field is constructed using the same conditional log-likelihood scores that the Transformer is trained to predict. This setup risks circularity, where the Euler discretization recovers the Transformer by construction through the shared score definition, rather than providing an independent derivation of the architecture from the continuous system.
minor comments (2)
  1. [Experiments] The description of the prefix shuffling experiments could be expanded to include precise definitions of how the SVFlow-induced metrics are calculated from the model outputs.
  2. [Notation] Ensure consistent use of notation for the spherical geometry and retraction operators throughout the manuscript.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope of our unification claim. We address each major point below, revising the manuscript to resolve ambiguities in the abstract and strengthen the derivations in Sections 3 and 4. The core claim remains that the Transformer block structure arises exactly from the forward Euler discretization under the specified SVFlow components.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts that 'forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture' yet immediately qualifies multi-head attention as approximating 'via a vMF kernel-smoothed posterior' and the residual-normalization as a 'relaxed retraction'. This tension indicates that the recovery may depend on additional modeling assumptions rather than being exact, which is central to the paper's unification claim and requires explicit resolution.

    Authors: We agree that the original abstract wording created ambiguity between 'exact recovery' and the qualifiers 'approximating' and 'relaxed'. The exactness refers to the mathematical equivalence of the discretized ODE step to the Transformer block when the vector field uses the vMF-smoothed posterior and the retraction is implemented via residual-plus-normalization. The qualifiers describe the practical realization of these SVFlow elements. We have revised the abstract to read: 'We show that the forward Euler discretization of spherical SVFlow recovers the Transformer architecture exactly, with multi-head attention implementing the vector field via a vMF kernel-smoothed posterior and the residual-normalization block implementing a relaxed retraction to the sphere.' This removes the tension while preserving the claim.
    revision: yes

  2. Referee: [Sections 3 and 4] The mappings in Sections 3 and 4 equate the vMF kernel in attention to the variational posterior over next-token scores and the layer-norm plus residual to the spherical retraction operator. However, these are presented as direct correspondences without deriving the necessity of the vMF form or the exact geometric equivalence from the SVFlow ODE definition alone. If these are choices rather than consequences, the discretization yields an analogous but not identical architecture.

    Authors: We have expanded Sections 3 and 4 with explicit derivations. In Section 3, we show that the vMF kernel arises directly from the exponential-family form of the variational posterior over score directions on the sphere, which is the maximum-entropy distribution consistent with the SVFlow variational objective; this is not an arbitrary choice but the canonical one for the spherical geometry. In Section 4, we derive that the residual connection followed by normalization is the first-order Taylor approximation to the spherical retraction operator needed to enforce the manifold constraint after the Euler update. While other kernels are mathematically possible, the vMF form is the one that yields the attention mechanism exactly, making the discretized architecture identical to the Transformer rather than merely analogous. These additions clarify the derivations from the ODE.
    revision: yes

  3. Referee: [SVFlow Definition] The SVFlow vector field is constructed using the same conditional log-likelihood scores that the Transformer is trained to predict. This setup risks circularity, where the Euler discretization recovers the Transformer by construction through the shared score definition, rather than providing an independent derivation of the architecture from the continuous system.

    Authors: The SVFlow is introduced as a standalone continuous-time dynamical system whose vector field is the posterior-weighted average of conditional scores (gradients of log-likelihoods), defined without reference to any discrete model. The Transformer is shown to emerge when this continuous system is discretized and the scores are supplied by a network trained to predict them. This is not circularity but the intended unification: the continuous flow provides the independent dynamical principle, and the architecture is the discretization that realizes it. We have added a clarifying paragraph in the introduction emphasizing that the ODE is primary and the Transformer is its exact discrete counterpart under the stated approximations.
    revision: partial

Circularity Check

2 steps flagged

SVFlow vector field defined via vMF-smoothed conditional scores; Euler step recovers Transformer by construction once vMF and retraction are posited as exact matches

specific steps
  1. self-definitional [Abstract and Sections 3-4]
    "We show that forward Euler discretization of spherical SVFlow exactly recovers the Transformer architecture. Multi-head attention approximates SVFlow vector field via a vMF kernel-smoothed posterior, while MoE/FFN approximates it in a relaxed network-based way, and the residual-normalization block implements a relaxed retraction that maintains spherical geometry."

    The SVFlow vector field is defined (Section 3) as evolving according to a variational posterior-weighted average of conditional log-likelihood scores. The paper then asserts that attention computes precisely this average once a vMF kernel is chosen and that layer-norm+residual is the retraction. With those identifications granted, the Euler discretization step reproduces the Transformer equations by algebraic substitution; the 'recovery' is therefore definitional rather than an independent derivation from a prior continuous dynamics.

  2. ansatz smuggled in via citation [Section 3 (definition of spherical SVFlow)]
    "the state evolves according to a variational posterior-weighted average of conditional log-likelihood scores"

    The continuous-time vector field is constructed so that its discretization will match attention once the vMF kernel and spherical retraction are inserted. No derivation from first principles shows why the variational posterior must be smoothed by vMF or why the geometry must be spherical; both are ansatzes chosen to produce the Transformer block.

full rationale

The central claim is that one forward Euler step on spherical SVFlow yields the standard Transformer block. This holds only because the paper defines the SVFlow vector field itself as a variational-posterior-weighted average of conditional log-likelihood scores, then states that multi-head attention implements exactly that average via a vMF kernel while residual+layer-norm implements the spherical retraction. Both correspondences are introduced as modeling choices rather than derived from an independent continuous-time definition; once they are granted, the discretization step is tautological. No external benchmark or parameter-free derivation is supplied to show the mapping is forced rather than fitted. The remainder of the paper (experiments on pre-trained models) is downstream of this equivalence and does not break the circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on introducing SVFlow as a new dynamical system whose discretization yields the Transformer; the paper therefore adds one invented continuous-time object and several modeling approximations whose independent justification is not visible from the abstract.

free parameters (2)
  • Euler step size
    The discretization step length must be chosen to match the residual scaling used in Transformers.
  • vMF concentration parameter
    Controls how sharply the attention kernel approximates the posterior; appears fitted or chosen to recover standard attention.
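The role of the concentration parameter can be illustrated with a small sketch: as κ grows, a softmax over fixed similarities moves away from the uniform distribution, which is what the quantity KL(p∥U) in the figure captions measures. The similarity values here are random placeholders, not model outputs, and `kl_to_uniform` is an illustrative helper name.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def kl_to_uniform(p, eps=1e-12):
    """KL(p || U) = log(n) - H(p) for a discrete distribution p over n items."""
    return float(np.log(p.size) + np.sum(p * np.log(p + eps)))

rng = np.random.default_rng(0)
sims = rng.uniform(-1.0, 1.0, size=32)   # placeholder cosine similarities

for kappa in (0.0, 1.0, 10.0, 100.0):
    p = softmax(kappa * sims)
    print(f"kappa={kappa:6.1f}  KL(p||U)={kl_to_uniform(p):.4f}")
```

The divergence is zero at κ = 0, increases with κ, and is bounded by log(n). A concentration that is tuned per model therefore sets how peaked the implied posterior is, independent of the input.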
axioms (2)
  • domain assumption: The state lives on the unit sphere and the retraction after each step preserves this geometry
    Invoked to justify the residual-plus-layer-norm block as a spherical retraction.
  • ad hoc to paper: The variational posterior can be represented by a vMF kernel-smoothed distribution over tokens
    Required for the multi-head attention interpretation; not derived from first principles in the abstract.
invented entities (1)
  • Score-based Variational Flow (SVFlow): no independent evidence
    purpose: Continuous-time dynamical system whose vector field is a posterior-weighted average of conditional log-likelihood scores
    New object introduced to unify the Transformer; no independent empirical handle provided in the abstract.


