Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces
Pith reviewed 2026-05-10 14:58 UTC · model grok-4.3
The pith
Stochastic gradient descent in infinite-dimensional Hilbert spaces is approximated, with second-order weak error in the step size, by an SDE driven by cylindrical Brownian motion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The discrete dynamics of SGD in infinite-dimensional Hilbert spaces can be approximated by an SDE driven by cylindrical Brownian motion. The analysis extends diffusion-approximation results from Euclidean spaces by addressing two difficulties: establishing well-posedness of the stochastic evolution equation through structural conditions on the covariance operator, and performing the comparison in a weak sense via a suitable class of smooth functionals on the Hilbert space. The discrepancy between SGD and the limiting SDE, when evaluated through these functionals, is of second order in the step size.
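In symbols, the two dynamics being compared can be sketched as follows; the notation (step size η, sampled sub-objective f_γ, covariance Q, cylindrical Brownian motion W) is assumed here, not quoted from the paper:

```latex
% Sketch of the two dynamics (notation assumed, not quoted from the paper):
% discrete SGD with step size \eta and sampled sub-objective f_{\gamma_k},
% versus the limiting SDE driven by cylindrical Brownian motion W.
\[
  \theta_{k+1} \;=\; \theta_k \;-\; \eta\, \nabla f_{\gamma_k}(\theta_k),
  \qquad
  \mathrm{d}X_t \;=\; -\nabla f(X_t)\,\mathrm{d}t
     \;+\; \sqrt{\eta}\; Q^{1/2}\,\mathrm{d}W_t .
\]
```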
What carries the argument
The limiting stochastic evolution equation driven by cylindrical Brownian motion, together with the class of smooth functionals used to measure weak discrepancy of order two in the step size.
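The order-two weak estimate carried by these functionals can be written schematically as follows (notation assumed, not quoted from the paper):

```latex
% Schematic form of the second-order weak estimate (notation assumed):
% \Phi ranges over the admissible smooth test functionals on the Hilbert
% space, \theta_k are the SGD iterates, X the SDE solution, \eta the step size.
\[
  \bigl|\, \mathbb{E}\,\Phi(\theta_k) \;-\; \mathbb{E}\,\Phi(X_{k\eta}) \,\bigr|
  \;\le\; C(T,\Phi)\,\eta^{2},
  \qquad 0 \le k\eta \le T .
\]
```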
If this is right
- The continuous-time limit permits the use of stochastic analysis tools to study SGD behavior in infinite-dimensional optimization problems.
- Numerical experiments confirm the predicted second-order weak convergence behavior between the discrete and continuous dynamics.
- The framework directly extends previous diffusion approximations that were restricted to Euclidean spaces.
- The result applies to inverse problems whose unknowns lie in Hilbert spaces, such as those arising in scientific computing.
Where Pith is reading between the lines
- The second-order weak approximation could be used to derive explicit convergence rates or stability estimates for SGD by analyzing the associated SDE instead of the discrete recursion.
- Similar diffusion limits may exist for other stochastic optimization schemes when the parameter space is infinite-dimensional.
- The technique of testing against smooth functionals could be adapted to obtain quantitative error bounds in related settings such as stochastic approximation on manifolds or Banach spaces.
Load-bearing premise
The covariance operator must satisfy appropriate structural conditions so that the stochastic evolution equation driven by cylindrical Brownian motion is well-posed.
What would settle it
A numerical experiment in which the weak error between SGD trajectories and the SDE solution fails to decrease quadratically with the step size, or a counterexample in which the SDE ceases to be well-posed once the covariance conditions are removed, would falsify the central approximation result.
Original abstract
Inverse problems in scientific computing often require optimization over infinite-dimensional Hilbert spaces. A commonly used solver in such settings is stochastic gradient descent (SGD), where gradients are approximated using randomly sampled sub-objective functions. In this work we study the continuous-time limit of SGD in the small step-size regime. We show that the discrete dynamics can be approximated by a stochastic differential equation (SDE) driven by cylindrical Brownian motion. The analysis extends diffusion-approximation results previously established in Euclidean spaces to the infinite-dimensional setting. Two analytical difficulties arise in this extension. First, the cylindrical nature of the noise requires establishing well-posedness of the resulting stochastic evolution equation through appropriate structural conditions on the covariance operator. Second, since the randomness in SGD originates from discrete sampling while the limiting equation is driven by Gaussian noise, the comparison between the two dynamics must be carried out in a weak sense. We therefore introduce a suitable class of smooth functionals on the Hilbert space and prove that the discrepancy between SGD and the limiting SDE, when evaluated through these functionals, is of second order in the step size. Numerical experiments confirm the predicted convergence behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends diffusion-approximation results for stochastic gradient descent (SGD) from Euclidean spaces to infinite-dimensional Hilbert spaces. It claims that the discrete SGD iterates can be approximated by a stochastic evolution equation driven by cylindrical Brownian motion, with the discrepancy between the discrete dynamics and the limiting SDE being of second order in the step size when evaluated on expectations of a suitable class of smooth test functionals. The analysis requires establishing well-posedness of the limiting equation under structural conditions on the covariance operator and proceeds via weak-convergence arguments to handle the mismatch between discrete sampling noise and Gaussian driving noise.
Significance. If the central derivations hold, the result supplies a rigorous continuous-time limit for SGD in function spaces, which is directly relevant to optimization arising in inverse problems and PDE-constrained settings. The work correctly identifies the two technical obstacles (cylindrical noise and weak-sense comparison) and supplies the necessary technical machinery; the second-order weak approximation on smooth functionals is a concrete strengthening of first-order limits that appear in the Euclidean literature.
major comments (2)
- [Abstract and well-posedness section] The structural conditions imposed on the covariance operator to guarantee existence of the stochastic convolution with cylindrical Brownian motion are stated but never verified for the noise that actually arises from random sampling of sub-objectives. In typical inverse-problem settings the covariance is only bounded or compact; without an explicit check that it maps into the required Hilbert-Schmidt or trace-class space after regularization, the limiting SDE is not known to be well-posed and the subsequent weak-convergence argument cannot be invoked.
- [Main approximation theorem] The second-order claim for the weak error: the manuscript asserts that the discrepancy, when tested against smooth functionals, is O(h^2) where h is the step size. Because the full proof is not reproduced in the supplied abstract, it is impossible to confirm that the Itô-Taylor expansion or generator comparison used to obtain the second-order term remains valid once the cylindrical noise and the infinite-dimensional geometry are taken into account; an explicit statement of the precise regularity assumed on the test functionals and on the drift/diffusion coefficients is needed.
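The well-posedness condition at issue in the first major comment can be stated schematically as follows (notation assumed, not quoted from the manuscript):

```latex
% Schematic trace-class / stochastic-convolution condition (notation assumed).
% If Q has eigenvalues q_j, the cylindrical stochastic convolution exists
% as a mean-square continuous process when
\[
  \operatorname{Tr} Q \;=\; \sum_{j \ge 1} q_j \;<\; \infty,
  \qquad\text{or, more generally,}\qquad
  \int_0^T \bigl\| S(t)\, Q^{1/2} \bigr\|_{\mathrm{HS}}^{2}\,\mathrm{d}t \;<\; \infty,
\]
% where S(t) is the semigroup generated by the linear part of the drift
% and HS denotes the Hilbert-Schmidt norm. This is the condition the
% referee asks to see verified for the sampling-induced covariance.
```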
minor comments (2)
- [Notation and assumptions] The precise definition of the class of admissible test functionals (e.g., the required Fréchet differentiability order and growth conditions) should be stated once in a dedicated notation subsection rather than scattered across lemmas.
- [Numerical section] Numerical experiments are mentioned but the discretization of the Hilbert space and the approximation of the cylindrical noise are not described; adding a short paragraph on the finite-dimensional truncation used would improve reproducibility.
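A minimal sketch of the finite-dimensional truncation the second minor comment asks for: keep the first K spectral modes of the covariance and drive each mode with an independent scaled Brownian increment. The quadratic objective, the eigenvalue decay `q(k) = k**-2`, and all function names here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def truncated_noise_increment(rng, K, dt, q=lambda k: k**-2.0):
    """One increment of Q^{1/2} dW truncated to K modes.
    q(k) are the (assumed) eigenvalues of the covariance operator Q;
    summability of q(k) is the trace-class condition for well-posedness."""
    lam = np.array([q(k) for k in range(1, K + 1)])
    return np.sqrt(lam * dt) * rng.standard_normal(K)

def sgd_step(theta, grad_sample, h):
    """One SGD step: theta_{k+1} = theta_k - h * (sampled sub-objective gradient)."""
    return theta - h * grad_sample(theta)

def sde_step(x, grad_full, h, rng, K):
    """Euler-Maruyama step for dX = -grad f(X) dt + sqrt(h) Q^{1/2} dW."""
    return x - h * grad_full(x) + np.sqrt(h) * truncated_noise_increment(rng, K, h)

rng = np.random.default_rng(0)
K, h = 50, 1e-2
A = 1.0 / np.arange(1, K + 1)      # spectrum of a compact, positive operator
grad_full = lambda th: A * th      # gradient of f(theta) = 0.5 <theta, A theta>

theta = np.ones(K)                 # SGD iterate
x = theta.copy()                   # SDE iterate, same initial condition
for _ in range(100):
    # sub-objective gradient: full gradient plus zero-mean sampling noise
    noisy = lambda th: A * th + rng.normal(0.0, 0.1, K) / np.arange(1, K + 1)
    theta = sgd_step(theta, noisy, h)
    x = sde_step(x, grad_full, h, rng, K)

# the two trajectories stay close for small h
print(float(np.linalg.norm(theta - x)))
```

Reporting K, the assumed eigenvalue decay, and the time-stepping scheme, as in this sketch, is the kind of detail the comment says would make the experiments reproducible.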
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive major comments. These have highlighted areas where additional clarification will improve the manuscript. We respond to each point below and indicate the planned revisions.
Point-by-point responses
Referee: [Abstract and well-posedness section] The structural conditions imposed on the covariance operator to guarantee existence of the stochastic convolution with cylindrical Brownian motion are stated but never verified for the noise that actually arises from random sampling of sub-objectives. In typical inverse-problem settings the covariance is only bounded or compact; without an explicit check that it maps into the required Hilbert-Schmidt or trace-class space after regularization, the limiting SDE is not known to be well-posed and the subsequent weak-convergence argument cannot be invoked.
Authors: We agree that the structural conditions on the covariance operator (ensuring the stochastic convolution exists as a mild solution) are stated as assumptions without an explicit verification tied to the random-sampling noise. In the manuscript these conditions are formulated in a general form that applies once the second-moment operator of the gradient noise satisfies the required Hilbert-Schmidt or trace-class property. For typical inverse-problem settings the covariance is indeed compact, but the sampling of sub-objectives often incorporates smoothing from the forward operator or regularization, which upgrades the covariance to the necessary class. To make this transparent we will add a dedicated remark in the well-posedness section that verifies the conditions under standard assumptions on the loss (Lipschitz gradients) and uniform random sampling; an illustrative example with a compact forward operator will be included. This revision directly addresses the applicability concern while preserving the generality of the framework. revision: yes
Referee: [Main approximation theorem] The second-order claim for the weak error: the manuscript asserts that the discrepancy, when tested against smooth functionals, is O(h^2) where h is the step size. Because the full proof is not reproduced in the supplied abstract, it is impossible to confirm that the Itô-Taylor expansion or generator comparison used to obtain the second-order term remains valid once the cylindrical noise and the infinite-dimensional geometry are taken into account; an explicit statement of the precise regularity assumed on the test functionals and on the drift/diffusion coefficients is needed.
Authors: The complete proof of the second-order weak error appears in Sections 3–4 of the full manuscript and proceeds via an Itô-Taylor expansion of the continuous process combined with a generator comparison between the discrete SGD increments and the limiting evolution. The cylindrical noise is handled by working in the reproducing-kernel Hilbert space induced by the covariance operator, which is assumed Hilbert-Schmidt; this replaces the finite-dimensional Itô calculus with the corresponding infinite-dimensional version while preserving the cancellation of first-order terms due to zero-mean noise. To make the argument immediately verifiable we will insert, immediately before the statement of the main theorem, an explicit list of the regularity hypotheses: the test functional is twice Fréchet differentiable with bounded and continuous first and second derivatives, the drift satisfies a global Lipschitz condition, and the diffusion coefficient (square root of the covariance) is Lipschitz with linear growth. These assumptions are already used throughout the proofs but were not collected in one place; adding the list will resolve the concern without altering the result. revision: partial
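Under the rebuttal's own description, the promised hypothesis list would read roughly as follows (a paraphrased sketch, not the manuscript's wording):

```latex
% Regularity hypotheses as described in the rebuttal (paraphrased sketch).
\begin{itemize}
  \item $\Phi \in C_b^2(H)$: the test functional is twice Fr\'echet
        differentiable with bounded, continuous first and second derivatives;
  \item $\|\nabla f(u) - \nabla f(v)\| \le L\,\|u - v\|$ for all $u, v \in H$
        (globally Lipschitz drift);
  \item $Q^{1/2}$ is Hilbert--Schmidt, Lipschitz in the state variable, and of
        linear growth: $\|Q^{1/2}(u)\|_{\mathrm{HS}} \le C\,(1 + \|u\|)$.
\end{itemize}
```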
Circularity Check
No circularity: the derivation applies standard techniques to a new setting.
Full rationale
The paper extends existing diffusion-approximation results for SGD from Euclidean spaces to infinite-dimensional Hilbert spaces by establishing well-posedness of an SDE driven by cylindrical Brownian motion under structural conditions on the covariance operator, then proving second-order weak convergence on smooth functionals. No equations, definitions, or claims in the provided text reduce the target approximation result to a fitted parameter, a self-referential definition, or a load-bearing self-citation chain; the argument rests on classical stochastic-analysis tools applied to the infinite-dimensional case without tautological reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: structural conditions on the covariance operator are imposed to guarantee well-posedness of the limiting SDE.
Reference graph
Works this paper leans on
- [1] S. Arridge, Optical tomography in medical imaging, Inverse Problems, 15 (1999), pp. R41--R93.
- [2] S. Arridge, P. Maass, O. Öktem, and C.-B. Schönlieb, Solving inverse problems using data-driven models, Acta Numerica, 28 (2019), pp. 1--174.
- [3] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Review, 60 (2018), pp. 223--311.
- [4] H. Cartan, Differential Forms, Courier Corporation, 2012.
- [5] D. L. Colton and R. Kress, Inverse Acoustic and Electromagnetic Scattering Theory, vol. 93, Springer, 1998.
- [6] G. Da Prato and J. Zabczyk, Stochastic Equations in Infinite Dimensions, Encyclopedia of Mathematics and its Applications, Cambridge University Press, 2nd ed., 2014.
- [7] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, vol. 375, Springer, 1996.
- [8] L. Gawarecki and V. Mandrekar, Stochastic Differential Equations in Infinite Dimensions: With Applications to Stochastic Partial Differential Equations, Springer, 2010.
- [9] M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better: Stability of stochastic gradient descent, in Proceedings of the 33rd International Conference on Machine Learning, PMLR, 2016, pp. 1225--1234.
- [10] J. P. Kaipio and E. Somersalo, Statistical and Computational Inverse Problems, Springer, 2005.
- [11] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On large-batch training for deep learning: Generalization gap and sharp minima, arXiv:1609.04836, 2016.
- [12] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, Efficient backprop, in Neural Networks: Tricks of the Trade, Springer, 2002, pp. 9--50.
- [13] Q. Li, C. Tai, and W. E, Stochastic modified equations and adaptive stochastic gradient algorithms, in Proceedings of the 34th International Conference on Machine Learning, vol. 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 2101--2110.
- [14] Q. Li, C. Tai, and W. E, Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations, Journal of Machine Learning Research, 20 (2019), pp. 1--47.
- [15] Z. Li, S. Malladi, and S. Arora, On the validity of modeling SGD with stochastic differential equations (SDEs), Advances in Neural Information Processing Systems, 34 (2021), pp. 12712--12725.
- [16] G. J. Lord, C. E. Powell, and T. Shardlow, An Introduction to Computational Stochastic PDEs, Cambridge Texts in Applied Mathematics, Cambridge University Press, 2014.
- [17] S. Mandt, M. D. Hoffman, and D. M. Blei, Stochastic gradient descent as approximate Bayesian inference, Journal of Machine Learning Research, 18 (2017), pp. 1--35.
- [18] X. Mao, Stochastic Differential Equations and Applications, Horwood Publishing, Chichester, 2nd ed., 2008.
- [19] S. Mei, A. Montanari, and P.-M. Nguyen, A mean field view of the landscape of two-layer neural networks, Proceedings of the National Academy of Sciences, 115 (2018), pp. E7665--E7671.
- [20] F. Natterer, The Mathematics of Computerized Tomography, SIAM, 2001.
- [21] F. Pfeiffer, X-ray ptychography, Nature Photonics, 12 (2018), pp. 9--17.
- [22] C. Prévôt and M. Röckner, A Concise Course on Stochastic Partial Differential Equations, Springer, 2007.
- [23] H. Robbins and S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics, 22 (1951), pp. 400--407.
- [24] J. Sirignano and K. Spiliopoulos, Mean field analysis of deep neural networks, Mathematics of Operations Research, 47 (2022), pp. 120--152.
- [25] A. M. Stuart, Inverse problems: a Bayesian perspective, Acta Numerica, 19 (2010), pp. 451--559.
- [26] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, On the importance of initialization and momentum in deep learning, in Proceedings of the 30th International Conference on Machine Learning, PMLR, 2013, pp. 1139--1147.
- [27] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM, 2005.
- [28] C. R. Vogel, Computational Methods for Inverse Problems, SIAM, 2002.
- [29] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, arXiv:1611.03530, 2016.