pith. machine review for the scientific record.

arxiv: 2605.09213 · v1 · submitted 2026-05-09 · 🧮 math.AP · cs.LG · math.PR

Recognition: 2 theorem links · Lean Theorem

Kinetic theory for Transformers and the lost-in-the-middle phenomenon

Borjan Geshkovski, Mitia Duerinckx, Stefano Rossi

Pith reviewed 2026-05-12 02:32 UTC · model grok-4.3

classification 🧮 math.AP · cs.LG · math.PR
keywords causal self-attention · mean-field limit · lost-in-the-middle · transformers · kinetic theory · interacting particle systems · cumulant expansion · Glauber calculus

The pith

Causal self-attention dynamics produce a U-shaped retrieval profile that explains lost-in-the-middle for uniform tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats decoder Transformer attention as a system of non-exchangeable interacting particles with a causal triangular structure. It establishes a quantitative mean-field limit together with next-order correlation estimates by adapting cumulant expansions and Glauber calculus to this setting. When input tokens are drawn independently and uniformly, the limiting correlation equations admit an explicit closed-form solution. The resulting retrieval probabilities, viewed as a function of source position, form a U-shape: elevated at the start and end of the prompt, with a single interior minimum when a smallness condition on the parameters holds. This supplies a rigorous kinetic-theory account of the empirically observed difficulty in retrieving middle-context information.
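
To make the particle-system reading concrete, the sketch below simulates causal self-attention dynamics of the kind the paper studies: tokens are angles on the circle, position j attends only to positions k ≤ j, and an ALiBi-style penalty discounts distant positions. The specific drift, the logit form, and all parameter values are our assumptions for illustration; the paper's system (1.1) is not reproduced in this review.

```python
# A minimal sketch, not the paper's system (1.1): causal self-attention
# dynamics on the circle with an ALiBi-style causal penalty. Position j
# attends only to positions k <= j; beta and lam play the roles of the
# parameters named in the figure captions.
import numpy as np

def step(theta, beta=1.0, lam=1.0, dt=0.01):
    N = len(theta)
    out = theta.copy()
    for j in range(N):
        k = np.arange(j + 1)                      # causal mask: only k <= j
        logits = beta * np.cos(theta[j] - theta[k]) - lam * (j - k) / N
        w = np.exp(logits - logits.max())
        w /= w.sum()                              # softmax attention weights
        # drift toward attended angles (tangential component on the circle)
        out[j] += dt * np.sum(w * np.sin(theta[k] - theta[j]))
    return out

rng = np.random.default_rng(0)
theta_init = rng.uniform(0.0, 2 * np.pi, size=64)  # iid uniform tokens, N = 64
theta = theta_init.copy()
for _ in range(500):
    theta = step(theta)
```

Scattering theta against theta_init reproduces the kind of crosses-versus-hollow-points picture described in Figure 3 below.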

Core claim

By viewing causal self-attention as a non-exchangeable particle system and performing a quantitative mean-field limit, the authors obtain a closed-form expression for the limiting pairwise correlations when tokens are iid uniform. These correlations determine the token retrieval profile, which is provably U-shaped with primacy and recency effects and a unique interior minimum under an explicit smallness condition on the model parameters.
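
As a schematic of what the shape claim asserts (this is not the paper's closed form, which the review does not reproduce): a retrieval profile over rescaled source positions is U-shaped with primacy and recency exactly when it decreases up to a unique interior minimum and increases thereafter.

```latex
% Schematic statement only; S is a stand-in for the paper's retrieval
% profile and \sigma^\ast for its interior minimizer.
\[
  \exists\,\sigma^\ast \in (0,1):\quad
  S'(\sigma) < 0 \ \text{for } \sigma \in (0,\sigma^\ast),
  \qquad
  S'(\sigma) > 0 \ \text{for } \sigma \in (\sigma^\ast,1),
\]
\[
  \text{so that } S(0) > S(\sigma^\ast) \ \text{(primacy)}
  \quad\text{and}\quad
  S(1) > S(\sigma^\ast) \ \text{(recency)}.
\]
```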

What carries the argument

The limiting correlation equation obtained from the mean-field limit of the causal self-attention particle system, solved in closed form for iid uniform tokens.

If this is right

  • Retrieval probability is strictly higher for tokens near the beginning or end of the prompt than for those in the middle.
  • The interior minimum is unique once the smallness condition on parameters is met.
  • The U-shaped profile arises directly from the triangular causal dependency structure of the attention mechanism.
  • Quantitative bounds control the distance between the finite-particle system and its mean-field limit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Architectural changes that reduce the effective interaction strength might flatten the U and improve middle-context recall.
  • The same mean-field analysis could be applied to non-uniform token distributions to predict how vocabulary statistics affect the lost-in-the-middle effect.
  • The kinetic-theory perspective suggests analogous limits for other attention variants or for encoder-decoder models with different dependency graphs.
  • Empirical checks of the smallness condition on real trained models could indicate whether the derived regime is the one governing practical lost-in-the-middle failures.

Load-bearing premise

Input tokens are independent and identically distributed uniformly, together with a smallness condition on the interaction strength that permits the closed-form solution.

What would settle it

A direct numerical computation of attention weights on long sequences of iid uniform tokens that yields a retrieval profile without a unique interior minimum, or without the predicted U-shape, would contradict the closed-form result.
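
A purely illustrative version of such a check, reusing the `step` dynamics, `theta_init`, and `theta` from the sketch near the top of this page: score retrieval of source position k by the circular proximity of the evolved final angle to θk(0), a proxy suggested by the Figure 3 caption rather than the paper's definition, then test the shape of the resulting profile.

```python
# Illustrative probe only: the retrieval score is our proxy, and `step`,
# theta_init, and theta come from the earlier sketch.
import numpy as np

def retrieval_profile(theta_init, theta_final, j_out=-1):
    # Circular distance of the evolved angle at output position j_out
    # to each initial angle theta_init[k]; higher score = closer.
    d = np.angle(np.exp(1j * (theta_final[j_out] - theta_init)))
    return -np.abs(d)

profile = retrieval_profile(theta_init, theta)
k_min = int(np.argmin(profile))
interior_min_unique = 0 < k_min < len(profile) - 1
u_shaped = profile[0] > profile[k_min] and profile[-1] > profile[k_min]
print(f"interior minimum at k = {k_min}; U-shaped endpoints: {u_shaped}")
```

A profile that robustly failed these checks across seeds, lengths, and parameters satisfying the smallness condition would be the kind of counter-evidence described above.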

Figures

Figures reproduced from arXiv: 2605.09213 by Borjan Geshkovski, Mitia Duerinckx, Stefano Rossi.

Figure 1
Figure 1. The lost-in-the-middle retrieval task reproduced from the paper: the model is asked to write a high-quality answer using only the provided search results (some of which may be irrelevant); for the example question "who got the first nobel prize in physics" the desired answer is Wilhelm Conrad Röntgen. view at source ↗
Figure 2
Figure 2. The profile predicted by Theorem 1.3 in the explicit regime β = λ = 1 and M = 8. Each panel plots the centered correction σ0 ↦ St(σ0) − min St from (1.21). view at source ↗
Figure 3
Figure 3. Simulation of the particle system (1.1) for N = 64 and β = λ = 1, initialized near four small angular clusters. Crosses mark the initial angles θk(0), while hollow points mark the evolved angles θj(t); the labels follow the first and last positions. The plot is read by comparing evolved angles with initial ones: for a fixed output position j, proximity of θj(t) to θk(0) is the trajectory-level analogue o… view at source ↗
read the original abstract

We study causal self-attention dynamics -- a toy model for decoder Transformers -- which we interpret as a non-exchangeable interacting particle system. Adapting cumulant expansions to the triangular causal dependency structure of the model, and appealing to non-hierarchical methods to estimate correlations using Glauber calculus, we prove a quantitative mean-field limit result and a next-order characterization of correlations. For iid uniformly distributed tokens, the limiting correlation equation can be solved in closed form and we obtain a rigorous explanation of the empirically observed \emph{lost-in-the-middle} phenomenon: the token retrieval profile, as a function of the source position in the prompt, is $\mathsf{U}$-shaped, with primacy, recency, and a unique interior minimum under an explicit smallness condition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript models causal self-attention in decoder Transformers as a non-exchangeable interacting particle system. By adapting cumulant expansions to the triangular causal structure and using Glauber calculus for non-hierarchical correlation estimates, it proves a quantitative mean-field limit together with a next-order characterization. For iid uniformly distributed tokens the limiting correlation equation is solved in closed form, yielding a U-shaped token retrieval profile (primacy, recency, unique interior minimum) that explains the lost-in-the-middle phenomenon under an explicit smallness condition.

Significance. If the mean-field limit and closed-form solution hold, the work supplies a rigorous, essentially parameter-free mathematical account of an empirically important Transformer behavior using tools from kinetic theory. The adaptation of cumulant methods to causal triangular structures is technically novel and the closed-form derivation strengthens explanatory power; these features could seed further PDE-based analysis of attention dynamics.

major comments (2)
  1. [Abstract] Abstract and statement of the main quantitative limit: the smallness condition is load-bearing both for closed-form solvability of the limiting correlation equation and for the validity of the mean-field approximation via cumulant expansions; the manuscript provides no explicit scaling of this condition with sequence length N, leaving open whether the U-shaped profile remains valid at the moderate prompt lengths (N ~ 10^2–10^3) where lost-in-the-middle is routinely observed.
  2. [Limiting correlation equation] Derivation of the limiting correlation equation (via adapted cumulant expansions): the quantitative error bounds rely on the smallness assumption being satisfied; the paper should supply a concrete test or counter-example showing whether the U-shape persists when the smallness parameter is only marginally satisfied, as this directly affects the claimed rigorous explanation.
minor comments (2)
  1. [Introduction] Notation for the non-exchangeable particle system and the causal triangular structure could be introduced more explicitly in the introduction to aid readers outside kinetic theory.
  2. [Discussion] A short discussion of how the iid-uniform assumption might be relaxed while retaining the qualitative U-shape would strengthen the link to practical Transformers.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We appreciate the positive assessment of the work's significance and novelty in adapting kinetic theory tools to causal self-attention. We address the major comments point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and statement of the main quantitative limit: the smallness condition is load-bearing both for closed-form solvability of the limiting correlation equation and for the validity of the mean-field approximation via cumulant expansions; the manuscript provides no explicit scaling of this condition with sequence length N, leaving open whether the U-shaped profile remains valid at the moderate prompt lengths (N ~ 10^2–10^3) where lost-in-the-middle is routinely observed.

    Authors: We thank the referee for pointing out this aspect. The smallness condition is independent of the sequence length N, as it ensures the convergence of the cumulant expansion in the mean-field limit taken as N tends to infinity. The quantitative mean-field limit provides error bounds that vanish as N increases under this condition, which supports the relevance of the U-shaped profile for large but finite N such as 10^2 to 10^3. We will revise the abstract and add a discussion in the introduction on the finite-N implications to make the applicability clearer, without claiming validity outside the smallness regime. revision: partial

  2. Referee: [Limiting correlation equation] Derivation of the limiting correlation equation (via adapted cumulant expansions): the quantitative error bounds rely on the smallness assumption being satisfied; the paper should supply a concrete test or counter-example showing whether the U-shape persists when the smallness parameter is only marginally satisfied, as this directly affects the claimed rigorous explanation.

    Authors: We agree that understanding the behavior at the boundary of the smallness condition is important. However, the manuscript provides a rigorous explanation precisely when the smallness condition is satisfied, leading to the closed-form U-shape. When the parameter is only marginally satisfied, the error bounds in the mean-field limit may not be small, and the profile could deviate, but this is consistent with the statement of the result. As the work is a theoretical derivation using PDE and kinetic theory methods, we do not include numerical tests or counter-examples. We will add a remark clarifying that the U-shape is a feature of the limiting equation under the smallness assumption and that violations may lead to different behaviors, but the current analysis does not extend there. revision: partial

Circularity Check

0 steps flagged

No significant circularity: derivation proceeds from particle system to mean-field limit to closed-form PDE solution

full rationale

The paper defines a causal self-attention particle system, adapts cumulant expansions to its triangular structure, proves a quantitative mean-field limit, and then solves the resulting limiting correlation equation in closed form for iid uniform tokens under an explicit smallness condition. The U-shaped retrieval profile (primacy, recency, interior minimum) is obtained by direct solution of that PDE rather than by fitting, renaming, or self-referential definition. No load-bearing step reduces to a prior self-citation, fitted parameter, or ansatz smuggled from the authors' own work; the smallness condition is an explicit hypothesis required for both the limit and the closed form, not a hidden tautology. The derivation is therefore self-contained and does not lean on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the mean-field limit for a non-exchangeable particle system with triangular causal interactions. The smallness condition is an explicit technical hypothesis rather than a fitted parameter. No new particles or forces are postulated.

axioms (2)
  • domain assumption Tokens are iid and uniformly distributed on the vocabulary.
    Invoked to obtain the closed-form solution of the limiting correlation equation.
  • ad hoc to paper The interaction strength satisfies an explicit smallness condition.
    Required for the unique interior minimum and the quantitative mean-field limit.

pith-pipeline@v0.9.0 · 5430 in / 1373 out tokens · 31388 ms · 2026-05-12T02:32:14.772744+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

  1. Antonio Álvarez-López, Borjan Geshkovski, and Domènec Ruiz-Balet. Perceptrons and localization of attention’s mean-field landscape. arXiv preprint arXiv:2601.21366, 2026.

  2. Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael M. Bronstein, Petar Veličković, and Razvan Pascanu. Why do LLMs attend to the first token? In Conference on Language Modeling, 2025.

  3. Didier Bresch, Mitia Duerinckx, and Pierre-Emmanuel Jabin. A duality method for mean-field limits with singular interactions. Preprint, arXiv:2402.04695.

  4. Didier Bresch, Pierre-Emmanuel Jabin, and Juan Soler. A new approach to the mean-field limit of Vlasov-Fokker-Planck equations. Anal. PDE, 18(4):1037–1064, 2025.

  5. Didier Bresch, Pierre-Emmanuel Jabin, and Zhenfu Wang. Mean field limit and quantitative estimates with singular attractive kernels. Duke Mathematical Journal, 172(13):2591–2641, 2023.

  6. Giuseppe Bruno, Federico Pasqualotto, and Andrea Agazzi. Emergence of meta-stable clustering in mean-field transformer models. In The Thirteenth International Conference on Learning Representations, 2025.

  7. Giuseppe Bruno, Federico Pasqualotto, and Andrea Agazzi. A multiscale analysis of mean-field transformers in the moderate interaction regime. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  8. Valérie Castin, Pierre Ablin, José Antonio Carrillo, and Gabriel Peyré. A unified perspective on the dynamics of deep transformers. arXiv preprint arXiv:2501.18322, 2025.

  9. Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and Philippe Rigollet. Quantitative clustering in mean-field transformer models. arXiv preprint arXiv:2504.14697, 2025.

  10. Borun D Chowdhury. Lost in the middle at birth: An exact theory of transformer position bias. arXiv preprint arXiv:2603.10123, 2026.

  11. Enrique Queipo de Llano, Alvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael M. Bronstein, Yann LeCun, and Ravid Shwartz-Ziv. Attention sinks and compression valleys in LLMs are two sides of the same coin. In The Fourteenth International Conference on Learning Representations, 2026.

  12. Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning, pages 2793–2803. PMLR, 2021.

  13. Mitia Duerinckx. On the size of chaos via Glauber calculus in the classical mean-field dynamics. Communications in Mathematical Physics, 382(1):613–653, 2021.

  14. Mitia Duerinckx and Pierre-Emmanuel Jabin. Correlation estimates for Brownian particles with singular interactions. Preprint, arXiv:2510.01507.

  15. Lev Fedorov, Michaël E Sander, Romuald Elie, Pierre Marion, and Mathieu Laurière. Clustering in deep stochastic transformers. arXiv preprint arXiv:2601.21942, 2026.

  16. Borjan Geshkovski, Hugo Koubbi, Yury Polyanskiy, and Philippe Rigollet. Dynamic metastability in the self-attention model. arXiv preprint arXiv:2410.06833, 2024.

  17. Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems, 36:57026–57037, 2023.

  18. Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. Bulletin of the American Mathematical Society, 62(3):427–479, 2025.

  19. Borjan Geshkovski, Philippe Rigollet, and Domènec Ruiz-Balet. Measure-to-measure interpolation using transformers. arXiv preprint arXiv:2411.04551, 2024.

  20. François Golse. On the dynamics of large particle systems in the mean field limit. In Adrian Muntean, Jens Rademacher, and Antonios Zagaris, editors, Macroscopic and Large Scale Phenomena: Coarse Graining, Mean Field Limits and Ergodicity, pages 1–144. Springer International Publishing, Cham, 2016.

  21. Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In The Thirteenth International Conference on Learning Representations, 2025.

  22. Hanna Herasimchyk, Robin Labryga, Tomislav Prusina, and Sören Laue. A residual-aware theory of position bias in transformers. arXiv preprint arXiv:2602.16837, 2026.

  23. Elias Hess-Childs and Keefer Rowan. Higher-order propagation of chaos in L2 for interacting diffusions. Probab. Math. Phys., 6(2):581–646, 2025.

  24. Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, and Tomas Pfister. Found in the middle: Calibrating positional attention bias improves long context utilization. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Comput…

  25. Pierre-Emmanuel Jabin, David Poyato, and Juan Soler. Mean-field limit of non-exchangeable systems. Communications on Pure and Applied Mathematics, 78(4):651–741, 2025.

  26. Pierre-Emmanuel Jabin and Zhenfu Wang. Mean field limit for stochastic particle systems. In Nicola Bellomo, Pierre Degond, and Eitan Tadmor, editors, Active Particles, Volume 1: Advances in Theory, Models, and Applications, pages 379–402. Springer International Publishing, Cham, 2017.

  27. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon…

  28. Nikita Karagodin, Yury Polyanskiy, and Philippe Rigollet. Clustering in causal attention masking. Advances in Neural Information Processing Systems, 37:115652–115681, 2024.

  29. Hugo Koubbi, Borjan Geshkovski, and Philippe Rigollet. Homogenized transformers. arXiv preprint arXiv:2604.01978, 2026.

  30. Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024.

  31. Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh, and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. Advances in Neural Information Processing Systems, 35:27198–27211, 2022.

  32. T. Paul, M. Pulvirenti, and S. Simonella. On the Size of Chaos in the Mean Field Dynamics. Arch. Ration. Mech. Anal., 231(1):285–317, 2019.

  33. Yury Polyanskiy, Philippe Rigollet, and Andrew Yao. Synchronization of mean-field models on the circle. arXiv preprint arXiv:2507.22857, 2025.

  34. Ofir Press, Noah Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022.

  35. Matthew Rosenzweig and Sylvia Serfaty. Global-in-time mean-field convergence for singular Riesz-type diffusive flows. The Annals of Applied Probability, 33(2):954–998, 2023.

  36. Sylvia Serfaty. Mean field limit for Coulomb-type flows. Duke Mathematical Journal, 169(15):2887–2935, 2020.

  37. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding, 2024.

  38. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  39. Xinyi Wu, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the emergence of position bias in transformers. In Forty-second International Conference on Machine Learning, 2025.

  40. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024.

  41. Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.