A dynamical systems view of training generative models and the memorization phenomenon

Chiranjib Bhattacharya; Siva Athreya; Vivek S. Borkar

arxiv: 2605.19483 · v1 · pith:BAUYJYRKnew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-20 06:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords: memorization, generative models, stochastic gradient descent, two time scales, dynamical systems, model collapse, training dynamics, double descent

The pith

Memorization in generative models is attributed purely to training dynamics: two distinct time scales in constant-step SGD, acting together with a model of collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper offers a dynamical systems account of memorization in generative models, where the model produces the same or similar outputs for extended periods during training. It relies on a stylized loss function that depends strongly on some variables and weakly on others, which naturally creates fast and slow adjustment rates under constant-step stochastic gradient descent. Drawing on prior models of collapse and two-time-scale dynamics, the analysis shows how these rates interact to produce prolonged output repetition. A reader would care because this view treats memorization as a direct consequence of standard training dynamics rather than an external failure.

Core claim

A stylized loss function with strong dependence on certain variables and weak dependence on the rest induces two distinct time scales in constant step size SGD. When this dynamics is combined with a mathematical model of the collapse phenomenon, the generative model yields the same or similar outputs for significant stretches of time.
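The time-scale separation described above can be illustrated with a toy simulation. The sketch below is not the paper's model: the quadratic loss `f(x, y) = 0.5*x**2 + 0.5*eps*y**2`, the parameter `eps`, the noise level, and the helper names `run_sgd` and `first_time_below` are all assumptions chosen for illustration. It only shows that, under one constant step size, the strongly coupled variable `x` settles in a handful of steps while the weakly coupled variable `y` drifts for hundreds.

```python
import numpy as np

def run_sgd(steps=2000, step_size=0.1, eps=0.01, noise=0.01, seed=0):
    """Constant-step SGD on the toy loss f(x, y) = 0.5*x**2 + 0.5*eps*y**2.

    x is the 'strong' variable (gradient of order 1) and y is the 'weak'
    variable (gradient of order eps). Both the loss and eps are assumptions.
    """
    rng = np.random.default_rng(seed)
    x, y = 1.0, 1.0
    xs, ys = [x], [y]
    for _ in range(steps):
        # Noisy gradient in the strong direction.
        gx = x + noise * rng.standard_normal()
        # Noisy gradient in the weak direction, scaled down by eps.
        gy = eps * (y + noise * rng.standard_normal())
        x -= step_size * gx
        y -= step_size * gy
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

def first_time_below(traj, level):
    """Index of the first iterate whose magnitude is below `level`, or None."""
    idx = np.flatnonzero(np.abs(traj) < level)
    return int(idx[0]) if idx.size else None

xs, ys = run_sgd()
print("fast variable halves after", first_time_below(xs, 0.5), "steps")
print("slow variable halves after", first_time_below(ys, 0.5), "steps")
```

In this toy setting the effective step for `y` is `step_size * eps`, so a quantity that depends mostly on `y` would look nearly frozen over any window in which `x` has already equilibrated.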

What carries the argument

Stylized loss function with precise strong-weak variable dependencies that creates two time scales in SGD, analyzed together with collapse dynamics.

If this is right

Memorization is explained solely through the training dynamics of constant-step SGD.
The same two-time-scale mechanism accounts for the double descent phenomenon in the same setting.
Collapse dynamics interact with the time-scale separation to sustain stretches of similar outputs.
The explanation applies without needing details of the data distribution or network architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of time scales may appear in other high-dimensional optimization tasks that use constant-step SGD.
Monitoring the diversity of generated samples over successive training intervals could provide an early diagnostic for emerging memorization.
Loss functions engineered to reduce strong-weak dependency gaps might shorten or eliminate the repetition periods.
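The diversity-monitoring idea in the list above could be prototyped roughly as follows. This is a minimal sketch under stated assumptions: samples are taken to be fixed-length numeric vectors, diversity is measured by mean pairwise Euclidean distance, and the names `mean_pairwise_distance` and `flag_low_diversity` as well as the threshold are hypothetical. Nothing here comes from the paper.

```python
import numpy as np

def mean_pairwise_distance(samples):
    """Average Euclidean distance between distinct rows of an (n, d) array."""
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[0]
    if n < 2:
        return 0.0
    diffs = samples[:, None, :] - samples[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # The diagonal is zero, so dividing by n*(n-1) averages over distinct pairs.
    return float(dists.sum() / (n * (n - 1)))

def flag_low_diversity(batches, threshold):
    """Indices of checkpoints whose generated batch has diversity below threshold."""
    return [i for i, batch in enumerate(batches)
            if mean_pairwise_distance(batch) < threshold]

# Synthetic stand-ins for samples drawn at three training checkpoints.
rng = np.random.default_rng(0)
diverse_a = rng.standard_normal((20, 4))
collapsed = 1.0 + 0.01 * rng.standard_normal((20, 4))  # near-identical outputs
diverse_b = rng.standard_normal((20, 4))
print(flag_low_diversity([diverse_a, collapsed, diverse_b], threshold=0.5))
```

For images or text, the Euclidean distance would need to be replaced by a distance in some embedding space; that choice is left open here.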

Load-bearing premise

The loss function in SGD has a strong dependence on some variables and a weak dependence on the rest in a precise sense.
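The abstract does not spell out the "precise sense" of this dependence. One common way to write such a premise, given here purely as an illustrative assumption and not as the paper's definition, splits the parameters into a strongly coupled block $x$ and a weakly coupled block $y$, with a small parameter $\varepsilon$, a constant step size $a$, and noise terms $\xi_n$, $\zeta_n$ (all symbols hypothetical):

```latex
\begin{align*}
f(x, y) &= g(x) + \varepsilon\, h(x, y), \qquad 0 < \varepsilon \ll 1, \\
x_{n+1} &= x_n - a \left[ \nabla g(x_n) + \varepsilon\, \nabla_x h(x_n, y_n) + \xi_{n+1} \right], \\
y_{n+1} &= y_n - a\, \varepsilon \left[ \nabla_y h(x_n, y_n) + \zeta_{n+1} \right].
\end{align*}
```

Under this assumed form, $x$ moves with effective step size $a$ and $y$ with effective step size $a\varepsilon$, which is the sense in which a single constant step size can carry two time scales.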

What would settle it

Training runs on a loss with the described strong-weak split that show continuous output variation with no prolonged repetition periods would falsify the proposed link to memorization.

Original abstract

Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. This relies purely on the dynamic aspects of the training phase. Specifically, we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD that is commonly used in machine learning. This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization wherein a generative model that is concurrently being tuned yields the same or similar outputs for significant stretches of time. This gives a novel perspective on the aforementioned phenomena reported in machine learning literature and their interrelationships, using a dynamical systems viewpoint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper synthesizes the authors' prior results on collapse and two-time-scale SGD to give a dynamical explanation for memorization, but the stylized loss model is applied without showing it fits generative objectives.

The main point is that this paper uses a dynamical systems lens to connect memorization in generative models to the training process itself. It claims that constant-step SGD can produce stretches where the model outputs similar things because of two distinct time scales in the dynamics. It does this by using a result of Austin 2016 to motivate a stylized loss function that depends strongly on some variables and weakly on others, which leads to fast and slow evolution under SGD. The authors combine that with a prior model for collapse in generative models by one of them (Borkar 2025a) and use results from Azizian et al. 2024 to analyze the behavior. The result is an explanation for why memorization happens as a dynamical effect rather than something about the data or architecture.

This integration is the paper's strength. It pulls together double descent, collapse, and memorization into one framework based on training dynamics. That could be helpful for researchers trying to intervene in large model training to avoid these issues.

The soft spot is the stylized loss. The paper motivates it from Austin but does not show that typical generative losses exhibit the precise strong-weak dependence needed. Without that, or some verification for ELBO or similar objectives, the two-time-scale story may not apply directly. The work also builds directly on two earlier papers by one of the authors, so the new element is the joint application and the narrative around memorization.

This paper is for people who study the theory of optimization in machine learning and want to see how different training phenomena relate. It demonstrates clear thinking in linking these ideas, even if the assumptions could use more backing. I recommend sending it for peer review. A good referee would examine whether the loss model holds up for the generative setting and whether the analysis adds enough beyond the cited works.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to provide a dynamical-systems explanation of the memorization phenomenon in generative models. It posits that a stylized loss function with strong dependence on a subset of variables and weak dependence on the remainder (motivated by Austin 2016) produces two distinct time scales under constant-step-size SGD; this two-scale behavior is then combined with the collapse model from Borkar 2025a (analyzed via Azizian et al. 2024) to account for stretches of similar model outputs during training.

Significance. If the stylized loss model is shown to be a faithful abstraction of standard generative objectives, the work could offer a unified dynamical account linking memorization, collapse, and double descent. The approach correctly invokes existing results on two-time-scale SGD and stochastic approximation, which is a methodological strength, but the incremental contribution is primarily interpretive rather than deriving new theorems or providing fresh verification.

major comments (2)

[Abstract / Model Description] Abstract and model description: the manuscript invokes Austin 2016 to motivate a stylized loss with 'strong dependence on some variables and weak dependence on the rest in a precise sense,' yet supplies no Hessian-block analysis, eigenvalue separation argument, or reference establishing that this separation holds for generative-model objectives such as the ELBO, GAN minimax, or diffusion score-matching losses. Because the two-time-scale claim and the subsequent link to collapse/memorization rest directly on this separation, the absence of justification for the stylized model in the generative setting is load-bearing.
[Analysis Section] Analysis of memorization regime: the explanation reduces the memorization phenomenon to the interaction of the two-time-scale dynamics with the collapse model already derived in Borkar 2025a. Without a new verification step, simulation, or explicit mapping showing how the memorization regime emerges distinctly from quantities defined in the prior collapse paper, the account risks being a direct reapplication rather than an independent derivation.
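The Hessian eigenvalue separation that the first major comment asks for can be checked numerically on small problems. The sketch below is an assumption-laden illustration, not the referee's or the authors' procedure: it uses a finite-difference Hessian, a hypothetical toy loss `toy_loss` with two strong and two weak directions, and hypothetical helpers `numerical_hessian` and `largest_spectral_gap`. For a real generative objective the Hessian would have to come from automatic differentiation or Hessian-vector products.

```python
import numpy as np

def numerical_hessian(loss, theta, h=1e-4):
    """Central finite-difference Hessian of a scalar loss at theta."""
    theta = np.asarray(theta, dtype=float)
    d = theta.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ei[i] = h
            ej = np.zeros(d)
            ej[j] = h
            H[i, j] = (loss(theta + ei + ej) - loss(theta + ei - ej)
                       - loss(theta - ei + ej) + loss(theta - ei - ej)) / (4 * h * h)
    return 0.5 * (H + H.T)  # symmetrize to remove rounding asymmetry

def largest_spectral_gap(H):
    """Return (k, ratio): k eigenvalues sit above the largest consecutive gap."""
    eig = np.sort(np.abs(np.linalg.eigvalsh(H)))[::-1]
    eig = np.maximum(eig, 1e-12)  # guard against division by zero
    ratios = eig[:-1] / eig[1:]
    k = int(np.argmax(ratios))
    return k + 1, float(ratios[k])

def toy_loss(theta):
    """Hypothetical loss: curvature 1 and 2 in two directions, 1e-3 in two others."""
    strong = 0.5 * (theta[0] ** 2 + 2.0 * theta[1] ** 2)
    weak = 0.5 * 1e-3 * (theta[2] ** 2 + theta[3] ** 2)
    return strong + weak

H = numerical_hessian(toy_loss, np.zeros(4))
n_strong, gap = largest_spectral_gap(H)
print("strong directions:", n_strong, "gap ratio:", gap)
```

A large gap ratio at some index would be consistent with a strong-weak split at that point in parameter space; a spectrum without a clear gap would argue against applying the stylized model there.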

minor comments (2)

[Title] Title contains typographical errors: 'Adynamical' should be 'A dynamical' and 'generativemodels' should be 'generative models'.
[References] Citations to Borkar 2025a and Borkar 2026 appear as in-preparation or forthcoming works; the manuscript should clarify their status and ensure they are publicly available or properly referenced for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below with clarifications on the manuscript's approach and note revisions to strengthen the justification and novelty of the analysis.

Referee: [Abstract / Model Description] Abstract and model description: the manuscript invokes Austin 2016 to motivate a stylized loss with 'strong dependence on some variables and weak dependence on the rest in a precise sense,' yet supplies no Hessian-block analysis, eigenvalue separation argument, or reference establishing that this separation holds for generative-model objectives such as the ELBO, GAN minimax, or diffusion score-matching losses. Because the two-time-scale claim and the subsequent link to collapse/memorization rest directly on this separation, the absence of justification for the stylized model in the generative setting is load-bearing.

Authors: We acknowledge that the manuscript does not include a dedicated Hessian-block analysis or eigenvalue separation proof tailored to specific generative objectives such as the ELBO or diffusion score-matching losses. The stylized loss is introduced as a modeling assumption motivated by Austin 2016 to capture a common high-dimensional structure in machine learning losses, consistent with its prior use in explaining double descent. This separation is treated as a plausible abstraction rather than a rigorously derived property for every generative loss. In revision we will expand the model description section with a short discussion of why such separation is expected in overparameterized settings, citing relevant empirical and theoretical work on loss landscapes in deep generative models. A complete derivation for all listed objectives lies outside the interpretive scope of the paper. revision: partial
Referee: [Analysis Section] Analysis of memorization regime: the explanation reduces the memorization phenomenon to the interaction of the two-time-scale dynamics with the collapse model already derived in Borkar 2025a. Without a new verification step, simulation, or explicit mapping showing how the memorization regime emerges distinctly from quantities defined in the prior collapse paper, the account risks being a direct reapplication rather than an independent derivation.

Authors: The manuscript's contribution lies in combining the two-time-scale SGD dynamics with the existing collapse model to furnish a dynamical-systems account of the specific memorization stretches observed in generative training. While the collapse analysis is taken from Borkar 2025a, the two-time-scale analysis from Borkar 2026, and the constant-step-size SGD results from Azizian et al. 2024, the explicit linkage to prolonged similar outputs during constant-step training of generative models is the novel interpretive step. In the revised version we will insert an explicit mapping subsection that derives, step by step, how the fast and slow variables produce the memorization regime from the quantities already defined in the collapse paper, thereby clarifying the distinct role of the two-scale interaction. revision: yes

Circularity Check

1 step flagged

Memorization explanation reduces to authors' prior collapse and two-time-scale models via self-citation

specific steps

self citation load bearing [Abstract]
"Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. ... This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization"

The paper presents its explanation of memorization as a novel dynamical-systems perspective, yet the load-bearing steps are the direct invocation of the collapse model from Borkar [2025a] and the two-time-scale dynamics from Borkar [2026] (same author group). The memorization regime is therefore obtained by applying quantities and models already defined in those earlier self-cited works rather than deriving them anew from the stylized loss or external data.

full rationale

The paper's central system-theoretic account of memorization is framed as relying purely on training dynamics, but the derivation explicitly combines a stylized loss (motivated externally by Austin 2016) with the authors' own prior collapse model (Borkar 2025a) and two-time-scale SGD analysis (Borkar 2026). The abstract states that the two-time-scale fact 'has been used to explain the double descent phenomenon in SGD in Borkar [2026]' and that the memorization analysis proceeds 'in conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a]'. This makes the claimed explanation load-bearing on self-citations whose content is not re-derived or independently validated here, reducing the novel contribution to an application of previously defined quantities and results by the same author group.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The account rests on a single stylized loss-function assumption drawn from Austin 2016 and on the correctness of two prior mathematical models published by one co-author; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption The loss function for SGD has a strong dependence on some variables and weak dependence on the rest in a precise sense.
Invoked to produce two distinct time scales in constant-step-size SGD.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1] Abascal, J. A.
[2] Anderson, B. D. Stochastic Processes and their Applications.
[3] Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., Siahkoohi, A., and Baraniuk, R. The Twelfth International Conference on Learning Representations, May 7-11, 2024, Vienna.
[4] Austin, T. Israel Journal of Mathematics.
[5] Azizian, W., Iutzeler, F., Malick, J., and Mertikopoulos, P. 2024.
[6] Baptista, R., Dasgupta, A., Kovachki, N. B., Oberai, A., and Stuart, A. M. 2025.
[7] Belkin, M., Hsu, D., Ma, S., and Mandal, S. Proceedings of the National Academy of Sciences.
[8] Belkin, M., Hsu, D., and Xu, J. SIAM Journal on Mathematics of Data Science.
[9] Biswas, A. and Borkar, V. S. Journal of Mathematical Analysis and Applications, 2009.
[10] Benveniste, A. and M\'. 1990.
[11] Berglund, N. and Gentz, B. Springer: Berlin Heidelberg.
[12] Billingsley, P.
[13] Bonnaire, T., Urfin, R., Biroli, G., and M. Why diffusion models don't memorize: the role of implicit dynamical regularization in training.
[14] Borkar, V. S.
[15] Borkar, V. S. Proceedings of the 61st Allerton Conference on Communication, Control and Computing, University of Illinois at Urbana-Champaign, Sept. 17-19, 2025. arXiv preprint arXiv:2506.09401.
[16] Borkar, V. S. Systems and Control Letters, 1997.
[17] Borkar, V. S. Stochastic Processes and their Applications.
[18] Borkar, V. S. Systems & Control Letters.
[19] Breiman, L.
[20] Buchanan. On the edge of memorization in diffusion models. arXiv preprint arXiv:2508.17689, 2025.
[21] Chen, L., Min, Y., Belkin, M., and Karbasi, A. Advances in Neural Information Processing Systems.
[22] Chen, C., Liu, D., and Xu, C. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8425-8434.
[23] Chen, Y., Ma, X., Zou, D., and Jiang, Y. G. Thirteenth International Conference on Learning Representations, Singapore.
[24] Cherkassky, V. and Lee, E. H. IEEE Transactions on Neural Networks and Learning Systems 169.
[25] Danskin, J. M.
[26] d'Ascoli, S., Sagun, L., and Biroli, G. Advances in Neural Information Processing Systems.
[27] Davies, X., Langosco, L., and Krueger, D. 2023.
[28] Dohmatob, E., Feng, Y., and Kempe, J. 2024. arXiv:2402.07712.
[29] Dohmatob, E., Feng, Y., Yang, P., and Kempe, J. Forty-first International Conference on Machine Learning, 2024b.
[30] Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, Vancouver, BC.
[31] Freidlin, M. I. and Wentzell, A. D. 2012.
[32] Föllmer. Stochastic Differential Systems Filtering and Control: Proceedings of the IFIP-WG 7/1 Working Conference Marseille-Luminy, France, March 12-17, 1984, pp. 156-163. Springer: Berlin Heidelberg.
[33] Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Sleight, H., Hughes, J., Korbak, T., Agrawal, R., Pai, D., Gromov, A., et al. 2024.
[34] Gu, X., Du, C., Pang, T., Li, C., Lin, M., and Wang, Y. 2023.
[35] Haussmann, U. G. and Pardoux, E. The Annals of Probability.
[36] Heckel, R. and Yilmaz, F. F. 2020.
[37] Hintersdorf, D., Struppek, L., Kersting, K., Dziedzic, A., and Boenisch, F. Advances in Neural Information Processing Systems.
[38] Hwang, C.-R. Proceedings of the American Mathematical Society.
[39] Kiefer, J. and Wolfowitz, J. Annals of Mathematical Statistics.
[40] Kim, J., Kim, S., and Lee, J. S. 2025.
[41] Kuzborskij, I. and Szepesv. On the role of optimization in double descent: A least squares study.
[42] Li, X., Shen, Z., Hsieh, Y. P., and He, N. Preprint.
[43] Loog, M., Viering, T., Mey, A., Krijthe, J. H., and Tax, D. M. Proceedings of the National Academy of Sciences.
[44] Mandt, S., Hoffman, M. D., and Blei, D. M. Journal of Machine Learning Research, 2017.
[45] Marchi, M., Soatto, S., Chaudhari, P., and Tabuada, P. 2024.
[46] Mei, S. and Montanari, A. Communications on Pure and Applied Mathematics 75(4).
[47] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
[48] Mukherjee, S., Wu, Q., and Zhou, D.-X. Bernoulli 16(1).
[49] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Journal of Statistical Mechanics: Theory and Experiment.
[50] Olmin, A. and Lindsten, F. 2024.
[51] Pezeshki, M., Mitra, A., Bengio, Y., and Lajoie, G. Fortieth International Conference on Machine Learning, pp. 17669-17690. PMLR.
[52] Pham, B., Raya, G., Negri, M., Zaki, M. J., Ambrogioni, L., and Krotov, D. 2025.
[53] Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. 2022.
[54] Schaeffer, R., Khona, M., Robertson, Z., Boopathy, A., Pistunova, K., Rocks, J. W., Fiete, I. R., and Koyejo, O. 2023.
[55] Sheu, S.-J. SIAM Journal on Mathematical Analysis, 1986.
[56] Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., and Gal, Y. Nature.
[57] Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., and Anderson, R. 2023.
[58] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. 2020.
[59] Spall, J. C.
[60] Stephenson, C. and Lee, T. 2021.
[61] Suresh, A. T., Thangaraj, A., and Khandavally, A. N. K. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (Y. Li, S. Mandt, S. Agrawal and E. Khan, eds.), PMLR vol. 258.
[62] Varma, V., Shah, R., Kenton, Z., and Kram\'. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.
[63] Varre, A., Y\". Learning in-context n-grams with transformers: sub-n-grams are near-stationary points.
[64] Wang, H., Han, Y., and Zou, D. ICML 2024 Workshop on Foundation Models in the Wild.
[65] Wen, Y., Liu, Y., Chen, C., and Lyu, L. The Twelfth International Conference on Learning Representations, Vienna.
[66] Wu, Y. H., Marion, P., Biau, G., and Boyer, C. Proceedings of the 38th Annual Conference on Learning Theory.
[67] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M. H. ACM Computing Surveys.
[68] Ye, Z., Zhu, Q., Tao, M., and Chen, M. 2025.
[69] Zhu, L., Liu, C., Radhakrishnan, A., and Belkin, M. 2022.
[70] Zhu, L., Liu, C., Radhakrishnan, A., and Belkin, M. 2023.

[1] [1]

A , title =

Abascal, J. A , title =

work page

[2] [2]

D , title =

Anderson, B. D , title =. Stochastic Processes and their Applications , volume =

work page

[3] [3]

and Casco-Rodriguez, J

Alemohammad, S. and Casco-Rodriguez, J. and Luzi, L. and Humayun, A. I. and Babaei, H. and LeJeune, D. and Siahkoohi, A. and Baraniuk, R. , title =. The Twelfth International Conference on Learning Representations, May 7-11, 2024, Vienna , pages =

work page 2024

[4] [4]

Israel Journal of Mathematics , volume =

Austin, T , title =. Israel Journal of Mathematics , volume =

work page

[5] [5]

and Iutzeler, F

Azizian, W. and Iutzeler, F. and Malick, J. and Mertikopoulos, P. , title =. 2024 , eprint =

work page 2024

[6] [6]

and Dasgupta, A

Baptista, R. and Dasgupta, A. and Kovachki, N. B. and Oberai, A. and Stuart, A. M. , title =. 2025 , eprint =

work page 2025

[7] [7]

and Hsu, D

Belkin, M. and Hsu, D. and Ma, S. and Mandal, S. , title =. Proceedings of the National Academy of Sciences , volume =

work page

[8] [8]

and Hsu, D

Belkin, M. and Hsu, D. and Xu, J. , title =. SIAM Journal on Mathematics of Data Science , volume =

work page

[9] [9]

and Borkar, V

Biswas, A. and Borkar, V. S. , title=. Journal of Mathematical Analysis and Applications , volume=. 2009 , pages=

work page 2009

[10] [10]

Benveniste, A. and M\'. 1990 , title =

work page 1990

[11] [11]

and Gentz, B

Berglund, N. and Gentz, B. , title =. Springer: Berlin Heidelberg , year =

work page

[12] [12]

Billingsley, P , title =

work page

[13] [13]

and Urfin, R

Bonnaire, T. and Urfin, R. and Biroli, G. and M. Why diffusion models don't memorize: the role of implicit dynamical regularization in training , journal =

work page

[14] [14]

S , title =

Borkar, V. S , title =

work page

[15] [15]

S , title =

Borkar, V. S , title =. Proccedns of the 61st Allerton Conference on Communication, Control and Computing, Uni. of Illinois at Urbana-Champaign, Sept. 17-19, 2025, arXiv preprint arXiv:2506.09401 , year =

work page arXiv 2025

[16] [16]

Borkar, V. S. , title =. Systems and Control Letters , volume=. 1997 , pages =

work page 1997

[17] [17]

S , title =

Borkar, V. S , title =. Stochastic Processes and their Applications , pages =

work page

[18] [18]

S , title =

Borkar, V. S , title =. Systems & Control Letters , volume =

work page

[19] [19]

Breiman, L , title =

work page

[20] [20]

On the edge of memorization in diffusion models.arXiv preprint arXiv:2508.17689, 2025

Buchanan,. On the edge of memorization in diffusion models , year =. 2508.17689 , archivePrefix =

work page arXiv

[21] [21]

and Min, Y

Chen, L. and Min, Y. and Belkin, M. and Karbasi, A. , title =. Advances in Neural Information Processing Systems , volume =

work page

[22] [22]

and Liu, D

Chen, C. and Liu, D. and Xu, C. , title =. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8425-8434 , pages =

work page

[23] [23]

and Ma, X

Chen, Y. and Ma, X. and Zou, D. and Jiang, Y.G. , title =. Thirteenth International Conference on Learning Representations, Singapore , year =

work page

[24] [24]

and Lee, E

Cherkassky, V. and Lee, E. H. , title =. IEEE Transactions on Neural Networks and Learning Systems 169 , pages =

work page

[25] [25]

Danskin, J. M. , title=

work page

[26] [26]

and Sagun, L

d'Ascoli, S. and Sagun, L. and Biroli, G. , title =. Advances in neural information processing systems , volume =

work page

[27] [27]

and Langosco, L

Davies, X. and Langosco, L. and Krueger, D. , title =. 2023 , eprint =

work page 2023

[28] [28]

and Feng, Y

Dohmatob, E. and Feng, Y. and Kempe, J. , title =. 2024 , note =. 2402.07712 , archivePrefix =

work page arXiv 2024

[29] [29]

and Feng, Y

Dohmatob, E. and Feng, Y. and Yang, P. and Kempe, J. , title =. Forty-first International Conference on Machine Learning, 2024b , year =

work page

[30] [30]

Flaxman, A. D. and Kalai, A. T. and McMahan, H. B. , title =. Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, Vancouver, BC , year =

work page

[31] [31]

Freidlin, M. I. and Wentzell, A. D. , title =. 2012 , publisher =

work page 2012

[32] [32]

Stochastic Differential Systems Filtering and Control: Proceedings of the IFIP-WG 7/1 Working Conference Marseille-Luminy, France, March 12--17, 1984 (pp

Föllmer , title =. Stochastic Differential Systems Filtering and Control: Proceedings of the IFIP-WG 7/1 Working Conference Marseille-Luminy, France, March 12--17, 1984 (pp. 156-163). Springer: Berlin Heidelberg , pages =

work page 1984

[33] [33]

and Schaeffer, R

Gerstgrasser, M. and Schaeffer, R. and Dey, A. and Rafailov, R. and Sleight, H. and Hughes, J. and Korbak, T. and Agrawal, R. and Pai, D. and Gromov, A. et al. , title =. 2024 , eprint =

work page 2024

[34] [34]

and Du, C

Gu, X. and Du, C. and Pang, T. and Li, C. and Lin, M. and Wang, Y. , title =. 2023 , eprint =

work page 2023

[35] [35]

Haussmann, U. G. and Pardoux, E. , title =. The Annals of Probability , pages =

[36]

Heckel, R. and Yilmaz, F. F. (2020).

[37]

Hintersdorf, D., Struppek, L., Kersting, K., Dziedzic, A., and Boenisch, F. Advances in Neural Information Processing Systems.

[38]

Hwang, C.-R. Proceedings of the American Mathematical Society.

[39]

Kiefer, J. and Wolfowitz, J. Annals of Mathematical Statistics.

[40]

Kim, J., Kim, S., and Lee, J.S. (2025).

[41]

Kuzborskij, I., Szepesvári, C., et al. On the role of optimization in double descent: A least squares study.

[42]

Li, X., Shen, Z., Hsieh, Y. P., and He, N. Preprint.

[43]

Loog, M., Viering, T., Mey, A., Krijthe, J. H., and Tax, D. M. Proceedings of the National Academy of Sciences.

[44]

Mandt, S., Hoffman, M. D., and Blei, D. M. (2017). Journal of Machine Learning Research.

[45]

Marchi, M., Soatto, S., Chaudhari, P., and Tabuada, P. (2024).

[46]

Mei, S. and Montanari, A. Communications on Pure and Applied Mathematics 75(4).

[47]

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A.

[48]

Mukherjee, S., Wu, Q., and Zhou, D.-X. Bernoulli 16(1).

[49]

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. Journal of Statistical Mechanics: Theory and Experiment.

[50]

Olmin, A. and Lindsten, F. (2024).

[51]

Pezeshki, M., Mitra, A., Bengio, Y., and Lajoie, G. Fortieth International Conference on Machine Learning, 17669-17690. PMLR.

[52]

Pham, B., Raya, G., Negri, M., Zaki, M.J., Ambrogioni, L., and Krotov, D. (2025).

[53]

Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. (2022).

[54]

Schaeffer, R., Khona, M., Robertson, Z., Boopathy, A., Pistunova, K., Rocks, J. W., Fiete, I. R., and Koyejo, O. (2023).

[55]

Sheu, S.-J. (1986). SIAM Journal on Mathematical Analysis.

[56]

Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., and Gal, Y. Nature.

[57]

Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., and Anderson, R. (2023).

[58]

Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., and Poole, B. (2020).

[59]

Spall, J. C.

[60]

Stephenson, C. and Lee, T. (2021).

[61]

Suresh, A. T., Thangaraj, A., and Khandavally, A. N. K. Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (Y. Li, S. Mandt, S. Agrawal and E. Khan, eds.), PMLR vol. 258.

[62]

Varma, V., Shah, R., Kenton, Z., Kramár, J., et al. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.

[63]

Varre, A., Yüce, G., et al. Learning in-context n-grams with transformers: sub-n-grams are near-stationary points.

[64]

Wang, H., Han, Y., and Zou, D. ICML 2024 Workshop on Foundation Models in the Wild.

[65]

Wen, Y., Liu, Y., Chen, C., and Lyu, L. The Twelfth International Conference on Learning Representations, Vienna.

[66]

Wu, Y.H., Marion, P., Biau, G., and Boyer, C. Proceedings of the 38th Annual Conference on Learning Theory.

[67]

Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M. H. ACM Computing Surveys.

[68]

Ye, Z., Zhu, Q., Tao, M., and Chen, M. (2025).

[69]

Zhu, L., Liu, C., Radhakrishnan, A., and Belkin, M. (2022).

[70]

Zhu, L., Liu, C., Radhakrishnan, A., and Belkin, M. (2023).
