Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

Alexandre V. Morozov; Anirvan M. Sengupta; Matthew Smart; Nilava Metya; Soumya Ganguly

arxiv: 2605.29351 · v1 · pith:LMHVEVYYnew · submitted 2026-05-28 · 💻 cs.LG · math.DS · stat.ML

Matthew Smart, Soumya Ganguly, Nilava Metya, Alexandre V. Morozov, Anirvan M. Sengupta

Pith reviewed 2026-06-29 09:01 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · stat.ML

keywords attention · empirical Bayes · particle dynamics · in-context learning · transformer · denoising · posterior inference · diffusion

The pith

Minimal attention-only transformers admit a two-stage empirical Bayes interpretation where attention computes kernel-weighted posterior means and depth refines the context distribution via particle dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention-only transformers under all-token corruption perform in-context empirical Bayes inference in two distinct stages. A single attention step produces a kernel-weighted posterior mean relative to the empirical distribution induced by the context tokens. Depth then evolves this distribution through particle dynamics on a context-induced energy landscape, while a long-range skip connection supplies the original noisy input as the query for the final posterior step. Effective denoising arises from a fixed kernel bandwidth and finite integration horizon alone, without any explicit noise schedule, and the framework yields a posterior-mean recovery guarantee for suitable priors along with a link to reverse-diffusion dynamics.

Core claim

Minimal attention-only transformers under all-token corruption admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The context itself induces a depth-dependent energy landscape governing in-context inference, and effective denoising emerges without an explicit noise schedule using only a fixed kernel bandwidth and finite integration horizon.
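The single attention step described in the core claim can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel with inverse bandwidth `beta`, and the function name `attention_posterior_mean`, the context size, and the query values are hypothetical choices made here.

```python
import numpy as np

def attention_posterior_mean(query, context, beta):
    """Kernel-weighted mean of the context tokens for one query.

    query:   (d,) noisy token used as the attention query
    context: (N, d) tokens defining the empirical distribution
    beta:    inverse kernel bandwidth (Gaussian kernel is an assumption)
    """
    # Squared distance from the query to every context token.
    sq_dist = np.sum((context - query) ** 2, axis=1)
    # Softmax over -beta/2 * distance^2, shifted for numerical stability.
    logits = -0.5 * beta * sq_dist
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # The output is a convex combination of the context tokens.
    return weights @ context, weights

rng = np.random.default_rng(0)
context = rng.normal(size=(500, 2))   # illustrative context tokens
query = np.array([0.5, -0.3])         # illustrative noisy query
mean, weights = attention_posterior_mean(query, context, beta=4.0)
```

Setting `beta=0.0` makes every weight equal, so the output collapses to the plain context mean; larger `beta` concentrates the weights on tokens near the query.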

What carries the argument

The two-stage empirical Bayes view in which a kernel-weighted posterior mean is computed from the context empirical distribution, with depth evolving that distribution through particle dynamics on an induced energy landscape.

If this is right

Depth refines the empirical distribution through particle dynamics on the induced energy landscape.
Effective denoising arises from a fixed kernel bandwidth and finite integration horizon without any explicit noise schedule.
A principled relationship between depth and noise level follows from the finite integration horizon.
The empirical estimator converges to the Bayes-optimal predictor for a class of well-behaved priors under asymptotic conditions.
Attention receives a statistical interpretation as sample-based posterior estimation without explicit density modeling.
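The division of labor listed above, with depth refining the particles and a skip connection supplying the original query, can be sketched as follows. This is a hedged illustration under assumptions not stated in the text: a Gaussian kernel, a residual update with step size `h`, and shared `beta` across layers. The names `kernel_mean` and `two_stage_denoise` and all parameter values are hypothetical.

```python
import numpy as np

def kernel_mean(queries, keys, beta):
    """Row-wise softmax-weighted mean of `keys` under a Gaussian kernel."""
    sq = np.sum((queries[:, None, :] - keys[None, :, :]) ** 2, axis=-1)
    logits = -0.5 * beta * sq
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ keys

def two_stage_denoise(noisy, beta, depth, h):
    """Stage 1 evolves the particles; Stage 2 queries them with the raw input."""
    z = noisy.copy()
    for _ in range(depth):
        # Stage 1: residual self-attention step (assumed form, step size h).
        z = z + h * (kernel_mean(z, z, beta) - z)
    # Stage 2: long-range skip, the untouched noisy tokens act as queries.
    return kernel_mean(noisy, z, beta), z

rng = np.random.default_rng(0)
clean = rng.normal(size=(400, 1))                    # illustrative N(0, 1) prior
noisy = clean + 0.5 * rng.normal(size=(400, 1))      # noise variance 0.25
denoised, particles = two_stage_denoise(noisy, beta=4.0, depth=5, h=0.2)
```

With `h` between 0 and 1, each Stage 1 step is a convex combination of existing particles, so the particle cloud can only contract toward regions the context already supports. How far it should contract for a given noise level is the depth-noise relationship the paper derives; the values here are arbitrary.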

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The distinct role assigned to skip connections suggests they could be tuned separately from standard attention layers in practical architectures.
The particle-dynamics stage may generalize to other in-context tasks such as few-shot classification or regression.
Training objectives could be modified to explicitly encourage the emergence of the predicted energy landscape.
The reverse-diffusion connection opens the possibility of importing sampling techniques from diffusion models into transformer inference.

Load-bearing premise

The context tokens induce a depth-dependent energy landscape that governs the in-context inference process.

What would settle it

Direct computation showing that attention weights fail to match the kernel weights with respect to the empirical context distribution, or that increasing depth fails to produce the predicted refinement of the posterior mean estimate.

Figures

Figures reproduced from arXiv: 2605.29351 by Alexandre V. Morozov, Anirvan M. Sengupta, Matthew Smart, Nilava Metya, Soumya Ganguly.

**Figure 1.** (a) Collective in-context denoising: Multilayer attention implements a two-stage EB procedure. Depth iteratively refines a particle prior (Stage 1), while a long-range skip, acting as an Attention Residual (AttnRes), performs posterior averaging against the initial noisy input (Stage 2). (b) Particle dynamics: Discrete self-attention updates move corrupted tokens from their initial noised distribution towar…

**Figure 2.** Attention as discretized reverse diffusion. (a) Particles initialized from a noised distribution ρ_T are iteratively updated across layers using Gaussian attention steps (t = ℓh) to approximately recover the clean prior ρ_0 = N(0, 1). (b) Variance dynamics for Gaussian denoising under different kernel bandwidths β. Parameters: 20 seeds, N = 5000, L_0 = 200, σ² = 0.25.

**Figure 3.** Dynamic energy landscape. (a) Ground truth GMM prior. (b) Corrupted input prompt (N = 5,000). (c) Dynamic refinement: As model depth ℓ increases, Stage 1 self-attention iteratively sharpens the particle prior (white cloud). This process dynamically sculpts an associative memory landscape E(Z^(ℓ), x̃) which determines the posterior averaging step in Stage 2; Eq. (2)

**Figure 4.** Multilayer attention-based denoiser approaches Bayes-optimal

**Figure 5.** (a) Mean squared error (MSE) as a function of depth

Original abstract

We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.
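The statement that one attention step computes a posterior mean with respect to the empirical context distribution can be written out explicitly. The derivation below is a sketch under assumptions the abstract does not spell out: additive isotropic Gaussian noise of variance σ², a Gaussian kernel, and a bandwidth matched to the noise through β = 1/σ². The symbols x̃, z_j, ρ̂_N, and β are notation chosen here for illustration.

```latex
\begin{align*}
  \tilde{x} &= x + \sigma\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),
  \qquad \hat{\rho}_N = \frac{1}{N}\sum_{j=1}^{N}\delta_{z_j}, \\[4pt]
  \mathbb{E}_{\hat{\rho}_N}\!\left[x \mid \tilde{x}\right]
  &= \frac{\sum_{j=1}^{N} z_j \exp\!\left(-\lVert \tilde{x} - z_j\rVert^{2} / (2\sigma^{2})\right)}
          {\sum_{k=1}^{N} \exp\!\left(-\lVert \tilde{x} - z_k\rVert^{2} / (2\sigma^{2})\right)} \\[4pt]
  &= \sum_{j=1}^{N} \operatorname{softmax}_j\!\left(-\tfrac{\beta}{2}\lVert \tilde{x} - z_k\rVert^{2}\right)_{k=1}^{N} z_j,
  \qquad \beta = \frac{1}{\sigma^{2}}.
\end{align*}
```

The first line places the empirical measure of the context tokens in the role of the prior; the second is Bayes' rule with a Gaussian likelihood, where the factor 1/N cancels between numerator and denominator. Expanding the squared distance shows that the term in ‖x̃‖² cancels inside the softmax, and when all context tokens z_j share the same norm the logits reduce to the familiar dot-product form β x̃·z_j.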

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The two-stage empirical Bayes framing with particle dynamics for attention is a new angle worth checking, but the recovery guarantee rests on unspecified asymptotics whose link to finite transformers is unclear.

The main takeaway is that this paper frames attention in minimal attention-only transformers under all-token corruption as a two-stage empirical Bayes process. A single attention step does kernel-weighted posterior mean estimation from the context tokens, depth refines the distribution via particle dynamics, and the long-range skip carries the noisy input for the second-stage inference. This gives distinct roles to depth and residuals and claims effective denoising emerges from fixed bandwidth plus finite integration without any explicit noise schedule.

What is actually new is the explicit split into these two stages plus the depth-noise relationship derived from the particle dynamics, along with the link to reverse-diffusion limits. The abstract positions this as a statistical view of in-context inference via sample-based posterior estimation rather than density modeling. That organizing story is cleaner than many prior kernel or mean-field takes on attention.

The soft spot is the posterior-mean recovery guarantee. It is stated for well-behaved priors where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions, yet the abstract leaves the regime unspecified: context length to infinity at fixed depth, depth to infinity at fixed context, or joint scaling. If the guarantee only holds in the infinite-context limit while the particle-dynamics analysis stays at finite depth, it does not directly underwrite the finite-setting denoising claim. The derivations would need to be inspected to see whether that gap is real or just an omission from the abstract.

This is for readers working on theoretical accounts of in-context learning and transformer internals. It shows honest engagement with empirical Bayes and diffusion ideas and the claims are structured enough to be checked. It deserves a serious referee even if the asymptotics require tightening.

Referee Report

1 major / 0 minor

Summary. The manuscript studies minimal attention-only transformers under all-token corruption and interprets them as a two-stage empirical Bayes procedure. A single attention step is claimed to compute a kernel-weighted posterior mean with respect to the empirical distribution induced by the context. Depth is interpreted as performing particle dynamics to refine this distribution (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2). The authors claim that effective denoising emerges without an explicit noise schedule using only a fixed kernel bandwidth and finite integration horizon, and they establish a posterior-mean recovery guarantee for a class of well-behaved priors under asymptotic conditions, connecting the dynamics to reverse-diffusion limits.

Significance. If the derivations and recovery guarantee hold with the stated conditions made precise, the work supplies a statistical interpretation of attention as in-context sample-based posterior estimation. This could clarify the distinct roles of depth (particle refinement) and residuals (query carrying) in transformers and isolate a minimal setting where context induces a depth-dependent energy landscape for inference, without requiring explicit density modeling or noise schedules.

major comments (1)

[Abstract] The posterior-mean recovery guarantee is stated only under unspecified 'asymptotic conditions.' The manuscript must explicitly identify the scaling regime (context length n→∞ at fixed depth, depth L→∞ at fixed n, joint scaling, or another limit) because this choice determines whether the guarantee applies to the finite-depth, finite-context denoising behavior that the two-stage framework isolates.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the need to make the asymptotic regime explicit. We agree that this clarification strengthens the presentation and will revise the abstract and theorem statements accordingly.

Referee: [Abstract] The posterior-mean recovery guarantee is stated only under unspecified 'asymptotic conditions.' The manuscript must explicitly identify the scaling regime (context length n→∞ at fixed depth, depth L→∞ at fixed n, joint scaling, or another limit) because this choice determines whether the guarantee applies to the finite-depth, finite-context denoising behavior that the two-stage framework isolates.

Authors: We agree that the scaling regime must be stated explicitly. The recovery guarantee is derived under the regime n → ∞ with depth L held fixed (and kernel bandwidth fixed). This limit is the natural one for the two-stage view: the single attention step produces an empirical posterior mean whose error vanishes as the context size grows, while the subsequent particle dynamics (Stage 1) and residual query (Stage 2) operate at finite depth. We will revise the abstract to read “under the asymptotic regime of context length n → ∞ at fixed depth” and will add the same clarification to the theorem statement and its proof sketch.

Revision: yes
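The regime named in this response, context length n → ∞ with everything else fixed, can be probed with a small numerical check. The sketch below is an illustration under simplifying assumptions that go beyond the text: a scalar N(0, 1) prior (as in the Figure 2 caption), noise variance 0.25, a kernel bandwidth matched to the noise, and clean prior draws as context rather than the corrupted tokens the paper studies. The function name and the query value are hypothetical.

```python
import numpy as np

def empirical_posterior_mean(x_noisy, samples, sigma2):
    """Posterior mean of x given x_noisy under the empirical prior of `samples`."""
    logits = -0.5 * (x_noisy - samples) ** 2 / sigma2
    w = np.exp(logits - logits.max())   # shift for numerical stability
    return float(np.sum(w * samples) / np.sum(w))

sigma2 = 0.25                       # noise variance
x_noisy = 0.8                       # a single noisy query (illustrative)
bayes = x_noisy / (1.0 + sigma2)    # exact posterior mean for a N(0, 1) prior

rng = np.random.default_rng(0)
errors = {}
for n in (10, 1_000, 100_000):
    samples = rng.normal(size=n)    # clean prior draws (simplifying assumption)
    estimate = empirical_posterior_mean(x_noisy, samples, sigma2)
    errors[n] = abs(estimate - bayes)
```

For a Gaussian prior the Bayes-optimal predictor has the closed form x̃ / (1 + σ²), which gives a reference value to compare against. The gap in `errors` is expected to shrink as n grows, though any single seed need not decrease monotonically.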

Circularity Check

0 steps flagged

No circularity: interpretive framework with external-style guarantee

The paper advances a two-stage empirical Bayes reading of attention and states a posterior-mean recovery guarantee under asymptotic conditions. No quoted step reduces a claimed prediction or result to a fitted parameter or self-citation by construction; the central claims are presented as interpretive mappings and convergence statements rather than closed algebraic identities. The derivation chain therefore remains self-contained against the supplied text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Ledger populated from abstract statements only. No explicit free parameters or invented entities are named. The two domain assumptions listed are the load-bearing premises required for the claimed interpretation and guarantees.

axioms (2)

Domain assumption: The context itself induces a depth-dependent energy landscape governing in-context inference. Invoked to isolate the minimal setting and to explain how depth refines the distribution.
Domain assumption: A class of well-behaved priors exists for which the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Required for the posterior-mean recovery guarantee.

pith-pipeline@v0.9.1-grok · 5731 in / 1424 out tokens · 34524 ms · 2026-06-29T09:01:05.221115+00:00 · methodology


2026

[12] [12]

Mean field simulation for monte carlo integration

Pierre Del Moral. Mean field simulation for monte carlo integration. Monographs on Statistics and Applied Probability, 126 0 (26): 0 6, 2013

2013

[13] [13]

Vlasov equations

Roland L’vovich Dobrushin. Vlasov equations. Functional Analysis and Its Applications, 13 0 (2): 0 115--123, 1979

1979

[14] [14]

Tweedie’s formula and selection bias

Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106 0 (496): 0 1602--1614, 2011

2011

[15] [15]

On the rate of convergence in W asserstein distance of the empirical measure

Nicolas Fournier and Arnaud Guillin. On the rate of convergence in W asserstein distance of the empirical measure. Probab. Theory Related Fields, 162 0 (3-4): 0 707--738, 2015. ISSN 0178-8051,1432-2064. doi:10.1007/s00440-014-0583-7. URL https://doi.org/10.1007/s00440-014-0583-7

work page doi:10.1007/s00440-014-0583-7 2015

[16] [16]

Fukunaga and L

K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21 0 (1): 0 32--40, 1975. doi:10.1109/TIT.1975.1055330

work page doi:10.1109/tit.1975.1055330 1975

[17] [17]

The emergence of clusters in self-attention dynamics

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emergence of clusters in self-attention dynamics. Advances in Neural Information Processing Systems, 36: 0 57026--57037, 2023

2023

[18] [18]

A mathematical perspective on transformers

Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. Bulletin of the American Mathematical Society, 62 0 (3): 0 427--479, 2025

2025

[19] [19]

Energy-based transformers are scalable learners and thinkers

Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, and Tariq Iqbal. Energy-based transformers are scalable learners and thinkers. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=ZBj3Qp1bYg

2026

[20] [20]

The fast gauss transform

Leslie Greengard and John Strain. The fast gauss transform. SIAM Journal on Scientific and Statistical Computing, 12 0 (1): 0 79--94, 1991

1991

[21] [21]

Manifold denoising

Matthias Hein and Markus Maier. Manifold denoising. Advances in neural information processing systems, 19, 2006

2006

[22] [22]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 0 6840--6851, 2020

2020

[23] [23]

Energy transformer

Benjamin Hoover, Yuchen Liang, Bao Pham, Rameswar Panda, Hendrik Strobelt, Duen Horng Chau, Mohammed J Zaki, and Dmitry Krotov. Energy transformer. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=MbwVNEx9KS

2023

[24] [24]

DiScoFormer: Plug-In Density and Score Estimation with Transformers

Vasily Ilin and Peter Sushko. Discoformer: Plug-in density and score estimation with transformers, 2026. URL https://arxiv.org/abs/2511.05924

work page internal anchor Pith review Pith/arXiv arXiv 2026

[25] [25]

Constrained denoising, empirical bayes, and optimal transport

Adam Quinn Jaffe, Nikolaos Ignatiadis, and Bodhisattva Sen. Constrained denoising, empirical bayes, and optimal transport. arXiv preprint arXiv:2506.09986, 2025

work page arXiv 2025

[26] [26]

Johnstone and Bernard W

Iain M. Johnstone and Bernard W. Silverman. Empirical bayes selection of wavelet thresholds. Annals of Statistics, 33, 2005. ISSN 00905364. doi:10.1214/009053605000000345

work page doi:10.1214/009053605000000345 2005

[27] [27]

Dense associative memory for pattern recognition

Dmitry Krotov and John J Hopfield. Dense associative memory for pattern recognition. Advances in neural information processing systems, 29, 2016

2016

[28] [28]

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [29]

Albergo, Nicholas M

Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In Computer Vision -- ECCV 2024: 18th European Conference, Milan, Italy, September 29--October 4, 2024, Proceedings, Part LXXVII, pages 23--40, Berlin, Heid...

work page doi:10.1007/978-3-031-72980-5_2 2024

[30] [30]

An empirical bayes estimator of the mean of a normal population

Koichi Miyasawa. An empirical bayes estimator of the mean of a normal population. Bull. Inst. Internat. Statist, 38 0 (181-188): 0 1--2, 1961

1961

[31] [31]

Generative Modeling from Black-box Corruptions via Self-Consistent Stochastic Interpolants

Chirag Modi, Jiequn Han, Eric Vanden-Eijnden, and Joan Bruna. Generative modeling from black-box corruptions via self-consistent stochastic interpolants. arXiv preprint arXiv:2512.10857, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

On estimating regression

EA Nadaraya. On estimating regression. theor. Probab. Appl, 9 0 (1), 1964

1964

[33] [33]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195--4205, October 2023

2023

[34] [34]

a fl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Thomas Adler, David P. Kreil, Michael K. Kopp, G \

Hubert Ramsauer, Bernhard Sch \" a fl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Thomas Adler, David P. Kreil, Michael K. Kopp, G \" u nter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. Hopfield networks is all you need. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event,...

2021

[35] [35]

Least squares estimation without priors or supervision

Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural computation, 23 0 (2): 0 374--420, 2011

2011

[36] [36]

Variational inference with normalizing flows

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530--1538. PMLR, 2015

2015

[37] [37]

The mean-field dynamics of transformers

Philippe Rigollet. The mean-field dynamics of transformers. arXiv preprint arXiv:2512.01868, 2025

work page arXiv 2025

[38] [38]

An empirical bayes approach to statistics

Herbert E Robbins. An empirical bayes approach to statistics. In Breakthroughs in Statistics: Foundations and basic theory, pages 388--394. Springer, 1956

1956

[39] [39]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj\"orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684--10695, June 2022

2022

[40] [40]

From softmax to score: Transformers can effectively implement in-context denoising steps

Paul Rosu, Lawrence Carin, and Xiang Cheng. From softmax to score: Transformers can effectively implement in-context denoising steps. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4QRoLzD11x

2025

[41] [41]

Sinkformers: Transformers with doubly stochastic attention

Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyr \'e . Sinkformers: Transformers with doubly stochastic attention. In International Conference on Artificial Intelligence and Statistics, pages 3515--3530. PMLR, 2022

2022

[42] [42]

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=din0lGfZFd

2025

[43] [43]

Emergent properties of collective gene-expression patterns in multicellular systems

Matthew Smart and Anton Zilman. Emergent properties of collective gene-expression patterns in multicellular systems. Cell Reports Physical Science, 4 0 (2), 2023

2023

[44] [44]

In-context denoising with one-layer transformers: Connections between attention and associative memory retrieval

Matthew Smart, Alberto Bietti, and Anirvan M Sengupta. In-context denoising with one-layer transformers: Connections between attention and associative memory retrieval. In International Conference on Machine Learning, pages 55950--55971. PMLR, 2025

2025

[45] [45]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

2021

[46] [46]

Estimation of the mean of a multivariate normal distribution

Charles M Stein. Estimation of the mean of a multivariate normal distribution. The annals of Statistics, pages 1135--1151, 1981

1981

[47] [47]

Topics in Propagation of Chaos

Alain-Sol Sznitman. Topics in propagation of chaos. In \'Ecole d'\'Et\'e de P robabilit\'es de S aint- F lour XIX ---1989 , volume 1464 of Lecture Notes in Math., pages 165--251. Springer, Berlin, 1991. ISBN 3-540-53841-0. doi:10.1007/BFb0085169. URL https://doi.org/10.1007/BFb0085169

work page doi:10.1007/bfb0085169 1989

[48] [48]

Attention Residuals

Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[49] [49]

Solving empirical bayes via transformers

Anzo Teh, Mark Jabbour, and Yury Polyanskiy. Solving empirical bayes via transformers. arXiv preprint arXiv:2502.09844, 2025

work page arXiv 2025

[50] [50]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

2017

[51] [51]

Springer, Berlin, Heidelberg, 2009

C\'edric Villani. Optimal transport, volume 338 of Grundlehren der mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer-Verlag, Berlin, 2009. ISBN 978-3-540-71049-3. doi:10.1007/978-3-540-71050-9. URL https://doi.org/10.1007/978-3-540-71050-9. Old and new

work page doi:10.1007/978-3-540-71050-9 2009

[52] [52]

A connection between score matching and denoising autoencoders

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23 0 (7): 0 1661--1674, 2011. doi:10.1162/NECO_a_00142

work page doi:10.1162/neco_a_00142 2011

[53] [53]

Transformers learn in-context by gradient descent

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, Jo \ a o Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151--35174. PMLR, 2023

2023

[54] [54]

Attention-only transformers via unrolled subspace denoising

Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu, and Yi Ma. Attention-only transformers via unrolled subspace denoising. In International Conference on Machine Learning, pages 63840--63859. PMLR, 2025

2025

[55] [55]

Smooth regression analysis

Geoffrey S Watson. Smooth regression analysis. Sankhy \=a : The Indian Journal of Statistics, Series A , pages 359--372, 1964

1964

[56] [56]

o mformer: A nystr \

Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nystr \"o mformer: A nystr \"o m-based algorithm for approximating self-attention. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 14138--14148, 2021

2021

[57] [57]

Looped transformers are better at learning learning algorithms

Liu Yang, Kangwook Lee, Robert D Nowak, and Dimitris Papailiopoulos. Looped transformers are better at learning learning algorithms. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=HHbRxoDTxE

2024