A dynamical systems view of training generative models and the memorization phenomenon
Pith reviewed 2026-05-20 06:56 UTC · model grok-4.3
The pith
Memorization in generative models arises purely from two distinct time scales in constant-step SGD.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A stylized loss function with strong dependence on certain variables and weak dependence on the rest induces two distinct time scales in constant step size SGD. When this dynamics is combined with a mathematical model of the collapse phenomenon, the generative model yields the same or similar outputs for significant stretches of time.
What carries the argument
A stylized loss function with a precise strong-weak split in its dependence on the variables, which creates two time scales in SGD and is analyzed together with collapse dynamics.
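As an illustration only, and not the paper's actual model, one way to picture the strong-weak split is to assume the parameters divide into a block x and a block y with loss L(x, y) = f(x) + ε g(x, y) for a small ε > 0. The symbols f, g, ε, the step size a, and the noise terms M are all assumptions introduced for this sketch. Constant-step SGD would then read:

```latex
\begin{aligned}
x_{n+1} &= x_n - a\,\bigl[\nabla f(x_n) + \varepsilon\,\nabla_x g(x_n, y_n) + M^{x}_{n+1}\bigr],\\
y_{n+1} &= y_n - a\,\bigl[\varepsilon\,\nabla_y g(x_n, y_n) + M^{y}_{n+1}\bigr].
\end{aligned}
```

Under these assumptions the drift of x is of order a while the drift of y is of order aε, so y moves on a slower time scale than x. How the noise terms scale, and what "precise sense" the paper uses, is not specified in this summary.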
If this is right
- Memorization is explained solely through the training dynamics of constant-step SGD.
- The same two-time-scale mechanism accounts for the double descent phenomenon in the same setting.
- Collapse dynamics interact with the time-scale separation to sustain stretches of similar outputs.
- The explanation applies without needing details of the data distribution or network architecture.
Where Pith is reading between the lines
- The same separation of time scales may appear in other high-dimensional optimization tasks that use constant-step SGD.
- Monitoring the diversity of generated samples over successive training intervals could provide an early diagnostic for emerging memorization.
- Loss functions engineered to reduce strong-weak dependency gaps might shorten or eliminate the repetition periods.
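The diversity-monitoring idea in the list above could be prototyped along the following lines. This is a minimal sketch under stated assumptions: generated samples are represented as fixed-length numeric vectors, diversity is measured as mean pairwise Euclidean distance, and the threshold is chosen by hand. The function names, the example batches, and the threshold are hypothetical and do not come from the paper.

```python
import math

def mean_pairwise_distance(samples):
    """Average Euclidean distance over all unordered pairs of sample vectors."""
    n = len(samples)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(samples[i], samples[j])
    return total / (n * (n - 1) / 2)

def flag_low_diversity(checkpoints, threshold):
    """Return indices of checkpoints whose sample diversity is below threshold."""
    return [k for k, batch in enumerate(checkpoints)
            if mean_pairwise_distance(batch) < threshold]

# Hypothetical batches of generated samples at three training checkpoints.
checkpoints = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],  # spread out
    [(0.5, 0.5), (0.5, 0.6), (0.6, 0.5)],  # tightly clustered
    [(0.5, 0.5), (0.5, 0.5), (0.5, 0.5)],  # identical outputs
]
flagged = flag_low_diversity(checkpoints, threshold=0.2)
print(flagged)
```

A run of consecutive flagged checkpoints would be the kind of prolonged repetition the summary describes. For real models a perceptual or embedding-space distance would likely be more meaningful than raw Euclidean distance.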
Load-bearing premise
The loss function in SGD has a strong dependence on some variables and a weak dependence on the rest in a precise sense.
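A crude numerical proxy for this premise, offered as a sketch and not as the paper's definition, is to estimate the per-coordinate curvature of a loss and look for a large gap between a high-curvature group and a low-curvature group. The helper names, the toy loss, and the use of diagonal curvature only (ignoring off-diagonal Hessian blocks) are all assumptions made for illustration.

```python
def diag_curvatures(loss, point, h=1e-3):
    """Central finite-difference estimate of the second derivative per coordinate."""
    base = loss(point)
    out = []
    for i in range(len(point)):
        up = list(point)
        up[i] += h
        dn = list(point)
        dn[i] -= h
        out.append((loss(up) - 2.0 * base + loss(dn)) / (h * h))
    return out

def largest_gap_split(curvs):
    """Sort by curvature magnitude and split at the largest ratio between neighbours."""
    order = sorted(range(len(curvs)), key=lambda i: -abs(curvs[i]))
    best_k, best_ratio = 1, 0.0
    for k in range(1, len(order)):
        hi, lo = abs(curvs[order[k - 1]]), abs(curvs[order[k]])
        ratio = hi / lo if lo > 0 else float("inf")
        if ratio > best_ratio:
            best_k, best_ratio = k, ratio
    return sorted(order[:best_k]), sorted(order[best_k:]), best_ratio

def toy_loss(theta):
    # Hypothetical loss: strong dependence on theta[0:2], weak on theta[2:4].
    strong = 0.5 * (theta[0] ** 2 + 2.0 * theta[1] ** 2)
    weak = 0.5 * 1e-3 * (theta[2] ** 2 + theta[3] ** 2)
    return strong + weak

curvs = diag_curvatures(toy_loss, [0.3, -0.2, 0.5, 0.1])
strong_idx, weak_idx, ratio = largest_gap_split(curvs)
print(strong_idx, weak_idx, ratio)
```

On this toy loss the split recovers the two groups with a curvature ratio near 1000. Whether a comparable gap exists for real generative objectives is exactly the open question raised in the referee report below.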
What would settle it
Training runs on a loss with the described strong-weak split that show continuous output variation with no prolonged repetition periods would falsify the proposed link to memorization.
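A toy starting point for such a run is sketched below, assuming a two-variable quadratic loss L(x, y) = 0.5 x^2 + 0.5 ε y^2 with additive Gaussian gradient noise. The loss, the parameter values, and the function names are hypothetical choices for illustration. The sketch shows only the separation of time scales; it does not model collapse or memorization.

```python
import random

def run_sgd(steps=20000, a=0.01, eps=0.01, noise=0.01, seed=0):
    """Constant-step SGD on the toy loss L(x, y) = 0.5*x**2 + 0.5*eps*y**2."""
    rng = random.Random(seed)
    x, y = 1.0, 1.0
    xs, ys = [x], [y]
    for _ in range(steps):
        gx = x + noise * rng.gauss(0.0, 1.0)        # noisy gradient, strong variable
        gy = eps * y + noise * rng.gauss(0.0, 1.0)  # noisy gradient, weak variable
        x -= a * gx
        y -= a * gy
        xs.append(x)
        ys.append(y)
    return xs, ys

def first_time_below(path, level):
    """Index of the first iterate whose magnitude is below level, or None."""
    for n, v in enumerate(path):
        if abs(v) < level:
            return n
    return None

xs, ys = run_sgd()
t_fast = first_time_below(xs, 0.5)
t_slow = first_time_below(ys, 0.5)
print(t_fast, t_slow)
```

With these settings the strongly coupled variable halves in tens of steps while the weakly coupled one takes thousands, roughly a factor of 1/ε apart. A falsification test in the sense described above would additionally need a generative model and a measure of output repetition, which this sketch does not provide.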
Original abstract
Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. This relies purely on the dynamic aspects of the training phase. Specifically, we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD that is commonly used in machine learning. This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization wherein a generative model that is concurrently being tuned yields the same or similar outputs for significant stretches of time. This gives a novel perspective on the aforementioned phenomena reported in machine learning literature and their interrelationships, using a dynamical systems viewpoint.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to provide a dynamical-systems explanation of the memorization phenomenon in generative models. It posits that a stylized loss function with strong dependence on a subset of variables and weak dependence on the remainder (motivated by Austin 2016) produces two distinct time scales under constant-step-size SGD; this two-scale behavior is then combined with the collapse model from Borkar 2025a (analyzed via Azizian et al. 2024) to account for stretches of similar model outputs during training.
Significance. If the stylized loss model is shown to be a faithful abstraction of standard generative objectives, the work could offer a unified dynamical account linking memorization, collapse, and double descent. The approach correctly invokes existing results on two-time-scale SGD and stochastic approximation, which is a methodological strength, but the incremental contribution is primarily interpretive rather than deriving new theorems or providing fresh verification.
major comments (2)
- [Abstract / Model Description] Abstract and model description: the manuscript invokes Austin 2016 to motivate a stylized loss with 'strong dependence on some variables and weak dependence on the rest in a precise sense,' yet supplies no Hessian-block analysis, eigenvalue separation argument, or reference establishing that this separation holds for generative-model objectives such as the ELBO, GAN minimax, or diffusion score-matching losses. Because the two-time-scale claim and the subsequent link to collapse/memorization rest directly on this separation, the absence of justification for the stylized model in the generative setting is load-bearing.
- [Analysis Section] Analysis of memorization regime: the explanation reduces the memorization phenomenon to the interaction of the two-time-scale dynamics with the collapse model already derived in Borkar 2025a. Without a new verification step, simulation, or explicit mapping showing how the memorization regime emerges distinctly from quantities defined in the prior collapse paper, the account risks being a direct reapplication rather than an independent derivation.
minor comments (2)
- [Title] Title contains typographical errors: 'Adynamical' should be 'A dynamical' and 'generativemodels' should be 'generative models'.
- [References] Citations to Borkar 2025a and Borkar 2026 appear as in-preparation or forthcoming works; the manuscript should clarify their status and ensure they are publicly available or properly referenced for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below with clarifications on the manuscript's approach and note revisions to strengthen the justification and novelty of the analysis.
Point-by-point responses
- Referee: [Abstract / Model Description] Abstract and model description: the manuscript invokes Austin 2016 to motivate a stylized loss with 'strong dependence on some variables and weak dependence on the rest in a precise sense,' yet supplies no Hessian-block analysis, eigenvalue separation argument, or reference establishing that this separation holds for generative-model objectives such as the ELBO, GAN minimax, or diffusion score-matching losses. Because the two-time-scale claim and the subsequent link to collapse/memorization rest directly on this separation, the absence of justification for the stylized model in the generative setting is load-bearing.
  Authors: We acknowledge that the manuscript does not include a dedicated Hessian-block analysis or eigenvalue separation proof tailored to specific generative objectives such as the ELBO or diffusion score-matching losses. The stylized loss is introduced as a modeling assumption motivated by Austin 2016 to capture a common high-dimensional structure in machine learning losses, consistent with its prior use in explaining double descent. This separation is treated as a plausible abstraction rather than a rigorously derived property for every generative loss. In revision we will expand the model description section with a short discussion of why such separation is expected in overparameterized settings, citing relevant empirical and theoretical work on loss landscapes in deep generative models. A complete derivation for all listed objectives lies outside the interpretive scope of the paper.
  Revision: partial
- Referee: [Analysis Section] Analysis of memorization regime: the explanation reduces the memorization phenomenon to the interaction of the two-time-scale dynamics with the collapse model already derived in Borkar 2025a. Without a new verification step, simulation, or explicit mapping showing how the memorization regime emerges distinctly from quantities defined in the prior collapse paper, the account risks being a direct reapplication rather than an independent derivation.
  Authors: The manuscript's contribution lies in combining the two-time-scale SGD dynamics with the existing collapse model to furnish a dynamical-systems account of the specific memorization stretches observed in generative training. While the collapse analysis is taken from Borkar 2025a and the two-scale results from Azizian et al. 2024, the explicit linkage to prolonged similar outputs during constant-step training of generative models is the novel interpretive step. In the revised version we will insert an explicit mapping subsection that derives, step by step, how the fast and slow variables produce the memorization regime from the quantities already defined in the collapse paper, thereby clarifying the distinct role of the two-scale interaction.
  Revision: yes
Circularity Check
Memorization explanation reduces to authors' prior collapse and two-time-scale models via self-citation
specific steps
- Self-citation, load-bearing [Abstract]
"Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. ... This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization"
The paper presents its explanation of memorization as a novel dynamical-systems perspective, yet the load-bearing steps are the direct invocation of the collapse model from Borkar [2025a] and the two-time-scale dynamics from Borkar [2026] (same author group). The memorization regime is therefore obtained by applying quantities and models already defined in those earlier self-cited works rather than deriving them anew from the stylized loss or external data.
full rationale
The paper's central system-theoretic account of memorization is framed as relying purely on training dynamics, but the derivation explicitly combines a stylized loss (motivated externally by Austin 2016) with the authors' own prior collapse model (Borkar 2025a) and two-time-scale SGD analysis (Borkar 2026). The abstract states that the two-time-scale fact 'has been used to explain the double descent phenomenon in SGD in Borkar [2026]' and that the memorization analysis proceeds 'in conjunction with a mathematical model for collapse phenomenon in SGD developed in Borkar [2025a]'. This makes the claimed explanation load-bearing on self-citations whose content is not re-derived or independently validated here, reducing the novel contribution to an application of previously defined quantities and results by the same author group.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The loss function for SGD has a strong dependence on some variables and a weak dependence on the rest in a precise sense.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
- [2] Anderson, B. D. In: Stochastic Processes and their Applications.
- [3] Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Humayun, A. I., Babaei, H., LeJeune, D., Siahkoohi, A., and Baraniuk, R. In: The Twelfth International Conference on Learning Representations, May 7-11, 2024, Vienna.
- [4] Austin, T. In: Israel Journal of Mathematics, 2016.
- [5] Azizian, W., Iutzeler, F., Malick, J., and Mertikopoulos, P. Eprint, 2024.
- [6] Baptista, R., Dasgupta, A., Kovachki, N. B., Oberai, A., and Stuart, A. M. Eprint, 2025.
- [7] Belkin, M., Hsu, D., Ma, S., and Mandal, S. In: Proceedings of the National Academy of Sciences.
- [8] Belkin, M., Hsu, D., and Xu, J. In: SIAM Journal on Mathematics of Data Science.
- [9] Biswas, A. and Borkar, V. S. In: Journal of Mathematical Analysis and Applications, 2009.
- [10] Benveniste, A. and M\'. 1990.
- [11]
- [12] Billingsley, P.
- [13] Bonnaire, T., Urfin, R., Biroli, G., and M. Why diffusion models don't memorize: the role of implicit dynamical regularization in training.
- [14]
- [15] Borkar, V. S. In: Proceedings of the 61st Allerton Conference on Communication, Control and Computing, Univ. of Illinois at Urbana-Champaign, Sept. 17-19, 2025. arXiv preprint arXiv:2506.09401.
- [16] Borkar, V. S. In: Systems and Control Letters, 1997.
- [17]
- [18]
- [19] Breiman, L.
- [20] Buchanan. On the edge of memorization in diffusion models. arXiv preprint arXiv:2508.17689, 2025.
- [21] Chen, L., Min, Y., Belkin, M., and Karbasi, A. In: Advances in Neural Information Processing Systems.
- [22] Chen, C., Liu, D., and Xu, C. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8425-8434.
- [23]
- [24] Cherkassky, V. and Lee, E. H. In: IEEE Transactions on Neural Networks and Learning Systems 169.
- [25] Danskin, J. M.
- [26] d'Ascoli, S., Sagun, L., and Biroli, G. In: Advances in Neural Information Processing Systems.
- [27] Davies, X., Langosco, L., and Krueger, D. Eprint, 2023.
- [28] Dohmatob, E., Feng, Y., and Kempe, J. Eprint 2402.07712, 2024.
- [29] Dohmatob, E., Feng, Y., Yang, P., and Kempe, J. In: Forty-first International Conference on Machine Learning, 2024.
- [30] Flaxman, A. D., Kalai, A. T., and McMahan, H. B. In: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, Vancouver, BC.
- [31] Freidlin, M. I. and Wentzell, A. D. 2012.
- [32] Föllmer. In: Stochastic Differential Systems Filtering and Control: Proceedings of the IFIP-WG 7/1 Working Conference Marseille-Luminy, France, March 12-17, 1984, pp. 156-163. Springer, Berlin Heidelberg.
- [33] Gerstgrasser, M., Schaeffer, R., Dey, A., Rafailov, R., Sleight, H., Hughes, J., Korbak, T., Agrawal, R., Pai, D., Gromov, A., et al. Eprint, 2024.
- [34]
- [35] Haussmann, U. G. and Pardoux, E. In: The Annals of Probability.
- [36]
- [37] Hintersdorf, D., Struppek, L., Kersting, K., Dziedzic, A., and Boenisch, F. In: Advances in Neural Information Processing Systems.
- [38] Hwang, C.-R. In: Proceedings of the American Mathematical Society.
- [39] Kiefer, J. and Wolfowitz, J. In: Annals of Mathematical Statistics.
- [40]
- [41] Kuzborskij, I. and Szepesv. On the role of optimization in double descent: A least squares study.
- [42]
- [43] Loog, M., Viering, T., Mey, A., Krijthe, J. H., and Tax, D. M. In: Proceedings of the National Academy of Sciences.
- [44] Mandt, S., Hoffman, M. D., and Blei, D. M. In: Journal of Machine Learning Research, 2017.
- [45] Marchi, M., Soatto, S., Chaudhari, P., and Tabuada, P. Eprint, 2024.
- [46] Mei, S. and Montanari, A. In: Communications on Pure and Applied Mathematics 75(4).
- [47] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
- [48]
- [49] Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. In: Journal of Statistical Mechanics: Theory and Experiment.
- [50]
- [51] Pezeshki, M., Mitra, A., Bengio, Y., and Lajoie, G. In: Fortieth International Conference on Machine Learning, PMLR, pp. 17669-17690.
- [52] Pham, B., Raya, G., Negri, M., Zaki, M. J., Ambrogioni, L., and Krotov, D. Eprint, 2025.
- [53] Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Eprint, 2022.
- [54] Schaeffer, R., Khona, M., Robertson, Z., Boopathy, A., Pistunova, K., Rocks, J. W., Fiete, I. R., and Koyejo, O. Eprint, 2023.
- [55]
- [56] Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., and Gal, Y. In: Nature.
- [57] Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., and Anderson, R. Eprint, 2023.
- [58] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Eprint, 2020.
- [59]
- [60]
- [61] Suresh, A. T., Thangaraj, A., and Khandavally, A. N. K. In: Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (Y. Li, S. Mandt, S. Agrawal and E. Khan, eds.), PMLR vol. 258.
- [62] Varma, V., Shah, R., Kenton, Z., and Kram\'. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.
- [63] Varre, A., Y\". Learning in-context n-grams with transformers: sub-n-grams are near-stationary points.
- [64] Wang, H., Han, Y., and Zou, D. In: ICML 2024 Workshop on Foundation Models in the Wild, 2024.
- [65] Wen, Y., Liu, Y., Chen, C., and Lyu, L. In: The Twelfth International Conference on Learning Representations, Vienna.
- [66] Wu, Y. H., Marion, P., Biau, G., and Boyer, C. In: Proceedings of the 38th Annual Conference on Learning Theory.
- [67] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., and Yang, M. H. In: ACM Computing Surveys.
- [68]
- [69] Zhu, L., Liu, C., Radhakrishnan, A., and Belkin, M. Eprint, 2022.
- [70] Zhu, L., Liu, C., Radhakrishnan, A., and Belkin, M. Eprint, 2023.