Clustering in Deep Stochastic Transformers
3 papers cite this work.
3 representative citing papers (2026)
- Kinetic theory for Transformers and the lost-in-the-middle phenomenon: A mean-field kinetic-theory derivation produces a closed-form U-shaped token-retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.
- Uniform Scaling Limits in AdamW-Trained Transformers: AdamW-trained Transformer hidden states and backpropagated variables converge uniformly in L^2 to a forward-backward ODE system (McKean-Vlasov in the non-causal case) at rate O(L^{-1} + L^{-1/3} H^{-1/2}) as the depth L and number of heads H increase, with bounds independent of the number of tokens.
- Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models: Transformers converge pathwise to a stochastic particle system and an SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when the common noise is coercive relative to the self-attention drift.
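As a rough illustration of the rate claimed in the second citing paper, the sketch below evaluates the bound shape L^{-1} + L^{-1/3} H^{-1/2} for a few depth/head configurations. This is a hedged numerical sketch, not code from any of the papers; the constants are omitted and the specific (L, H) pairs are arbitrary choices for illustration.

```python
# Hedged illustration (not from the cited papers): evaluate the claimed
# convergence-rate shape O(L^{-1} + L^{-1/3} H^{-1/2}) to see how the
# depth term and the head term trade off as both grow.

def rate_bound(L: int, H: int) -> float:
    """Bound shape L^{-1} + L^{-1/3} * H^{-1/2}, with constants omitted."""
    return L**-1 + L**(-1 / 3) * H**-0.5

# The head term L^{-1/3} H^{-1/2} dominates the depth term L^{-1} here,
# so halving the bound requires growing H faster than L.
for L, H in [(12, 8), (24, 16), (48, 32), (96, 64)]:
    print(f"L={L:3d}  H={H:3d}  bound ~ {rate_bound(L, H):.4f}")
```

Under this bound, doubling both L and H shrinks the dominant term by a factor of about 2^{1/3} * 2^{1/2} ≈ 1.78, consistent with the slower-than-1/L decay the rate expresses.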