Recognition: unknown
When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold
Pith reviewed 2026-05-08 05:17 UTC · model grok-4.3
The pith
Dynamic preconditioners preserve the Polyak-Ruppert CLT for averaged SGD when their stabilization rate exceeds (α+1)/2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The scaled dynamic remainder √n R_n vanishes in L² whenever β > (α+1)/2, and there exist preconditioner sequences satisfying the polynomial rate hypotheses for which it does not vanish when β ≤ (α+1)/2. Under standard regularity conditions this yields the CLT √n (x̄_n - x*) → N(0, H^{-1} S H^{-1}) for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.
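To make the claim concrete, here is a minimal simulation sketch, not taken from the paper: averaged SGD on a quadratic with a toy preconditioner path P_t → H^{-1} whose one-step changes decay like t^{-2}, comfortably above the threshold for every α ∈ (1/2, 1). The quadratic objective, the toy path, and all constants are illustrative assumptions.

```python
# Minimal sketch (assumed toy setup, not the paper's algorithms): averaged SGD
# on f(x) = 0.5 x^T H x with a preconditioner path P_t -> H^{-1} whose one-step
# changes are O(t^{-2}), i.e. beta = 2 > (alpha+1)/2 for every alpha in (1/2, 1).
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, n_reps = 2, 5000, 300
H = np.diag([1.0, 3.0])              # Hessian; minimizer is x* = 0
alpha, gamma0 = 0.7, 0.5             # step size gamma_t = gamma0 * t^{-alpha}
P_inf = np.linalg.inv(H)             # limit of the preconditioner path

scaled_avgs = np.empty((n_reps, d))
for r in range(n_reps):
    x, xbar = rng.normal(size=d), np.zeros(d)
    for t in range(1, n_steps + 1):
        grad = H @ x + rng.normal(size=d)          # noise covariance S = I
        P_t = P_inf + (0.5 / t) * np.eye(d)        # one-step changes O(t^{-2})
        x = x - gamma0 * t ** (-alpha) * (P_t @ grad)
        xbar += (x - xbar) / t                     # running Polyak-Ruppert average
    scaled_avgs[r] = np.sqrt(n_steps) * xbar

print("empirical covariance of sqrt(n) * xbar_n:\n", np.cov(scaled_avgs.T))
print("sandwich H^{-1} S H^{-1}:\n", P_inf @ np.eye(d) @ P_inf)
```

With the stabilization rate above the threshold, the empirical covariance should approach the preconditioner-free sandwich diag(1, 1/9); the theorem predicts this agreement can fail once the path's one-step changes decay no faster than t^{-(α+1)/2}.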
What carries the argument
The preconditioner-isolating decomposition of the averaged error that confines P_t to a dynamic remainder R_n, together with the effective inverse drift matrix M_t = (P_t H)^{-1} whose operator-norm differences satisfy ||M_t - M_{t-1}||_op ≲ t^{-β}.
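The polynomial hypothesis is easy to probe empirically. The sketch below tracks ||M_t - M_{t-1}||_op along one plausible SA-RMSProp-style recursion with gain ρ_t = c/t (the recursion, the stand-in gradients, and all constants are assumptions; the paper's exact variant may differ) and fits the decay exponent β in log-log space.

```python
# Sketch: estimate the exponent beta in ||M_t - M_{t-1}||_op ~ t^{-beta}
# for M_t = (P_t H)^{-1}, along an assumed RMSProp-style path with gain c/t.
import numpy as np

rng = np.random.default_rng(1)
d, n_steps, c, eps = 3, 20000, 1.0, 1e-8
H = np.diag([1.0, 2.0, 4.0])

v = np.ones(d)                           # running second-moment estimate
M_prev, diffs, ts = None, [], []
for t in range(1, n_steps + 1):
    g = rng.normal(size=d)               # stand-in stochastic gradient
    v = (1 - c / t) * v + (c / t) * g**2
    P = np.diag(1.0 / np.sqrt(v + eps))  # RMSProp-style preconditioner
    M = np.linalg.inv(P @ H)             # effective inverse drift matrix
    if M_prev is not None and t % 20 == 0:
        diffs.append(np.linalg.norm(M - M_prev, 2))   # operator norm
        ts.append(t)
    M_prev = M

beta_hat = -np.polyfit(np.log(ts), np.log(diffs), 1)[0]
print("fitted beta:", beta_hat)          # ~1, above (alpha+1)/2 for all alpha < 1
```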
Load-bearing premise
The one-step change of the effective inverse drift matrix decays at a polynomial rate t^{-β} with β strictly greater than (α+1)/2.
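A back-of-envelope check of where the exponent comes from, assuming (this is a heuristic, not the paper's decomposition) that the averaged remainder R_n aggregates one-step preconditioner changes weighted by γ_t^{-1} and the last-iterate error, with the standard rate ||x_t - x*||_{L²} = O(√γ_t) and γ_t ≍ t^{-α}:

```latex
% Heuristic only: the exact form of R_n is given in the paper; here we assume
% it aggregates one-step changes of M_t weighted by 1/gamma_t and the
% last-iterate error.
\[
\sqrt{n}\,\bigl\|R_n\bigr\|_{L^2}
\;\lesssim\; \frac{1}{\sqrt{n}} \sum_{t=1}^{n}
      \gamma_t^{-1}\,\bigl\|M_t - M_{t-1}\bigr\|_{\mathrm{op}}\,
      \bigl\|x_t - x^*\bigr\|_{L^2}
\;\lesssim\; \frac{1}{\sqrt{n}} \sum_{t=1}^{n} t^{\alpha}\, t^{-\beta}\, t^{-\alpha/2}
\;\asymp\; n^{\,\alpha/2 - \beta + 1/2}.
\]
% The exponent alpha/2 - beta + 1/2 is negative exactly when
% beta > (alpha+1)/2, reproducing the stabilization threshold.
```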
What would settle it
Construct sequences of preconditioners satisfying the polynomial hypotheses with β exactly equal to (α+1)/2 and check whether √n R_n fails to converge to zero in L².
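Under the heuristic scaling above, the boundary behaviour can at least be previewed numerically: the proxy s_n = n^{-1/2} Σ_{t≤n} t^{α/2-β} plateaus at a positive constant when β = (α+1)/2 and decays to zero when β exceeds it. A quick sketch (the proxy sum is an assumption inherited from the heuristic, not the paper's counterexample construction):

```python
# Probe the boundary beta = (alpha+1)/2 through the heuristic proxy
# s_n = n^{-1/2} * sum_{t<=n} t^{alpha/2 - beta} for sqrt(n) R_n.
import numpy as np

alpha = 0.7
boundary = (alpha + 1) / 2
for beta in (boundary, boundary + 0.1):
    for n in (10**3, 10**4, 10**5, 10**6):
        t = np.arange(1, n + 1, dtype=float)
        s_n = np.sum(t ** (alpha / 2 - beta)) / np.sqrt(n)
        print(f"beta={beta:.2f}  n={n:>8}  s_n={s_n:.3f}")
# At the boundary the summand is t^{-1/2}, so s_n -> 2 (non-vanishing);
# for beta = boundary + 0.1 it decays like n^{-0.1} toward zero.
```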
read the original abstract
Polyak-Ruppert averaging yields an asymptotically normal estimator with sandwich covariance $H^{-1}SH^{-1}$, the foundation of online inference. When the gradient step is preconditioned by a data-driven matrix $P_t$, we ask how fast $P_t$ must stabilize for the central limit theorem (CLT) to remain valid. We resolve this via an exact preconditioner-isolating decomposition of the averaged error that confines $P_t$ to a dynamic remainder $R_n$, leaving the martingale and Taylor terms preconditioner-free. Let $M_t = (P_t H)^{-1}$ denote the effective inverse drift matrix, with $\|M_t - M_{t-1}\|_{\mathrm{op}} \lesssim t^{-\beta}$ and step-size exponent $\alpha \in (1/2, 1)$. We identify a stabilization-rate threshold $\beta > (\alpha+1)/2$ and prove that, within the class of polynomial rate hypotheses used in our upper bound, it cannot be weakened: the dynamic remainder $\sqrt{n}\,R_n$ vanishes in $L^2$ whenever $\beta > (\alpha+1)/2$, and we exhibit sequences satisfying those hypotheses for which it does not vanish when $\beta \le (\alpha+1)/2$. A single stabilization argument certifies three SA variants - SA-AdaGrad, SA-RMSProp, and SA-ONS - with gain $\rho_t = c/t$, each delivering one-step $L^2(\mathrm{op})$ stabilization of order $t^{-1}$, yielding the CLT $\sqrt{n}(\bar{x}_n - x^*) \to N(0, H^{-1}SH^{-1})$; under bounded inputs the pathwise rate $\beta = 1$ further preserves the $n^{-1/6}$ Wasserstein rate at $\alpha^* = 2/3$. Under standard regularity conditions, Wald-type online inference remains valid for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes conditions under which dynamic preconditioning in stochastic approximation algorithms preserves the Polyak-Ruppert central limit theorem. Through an exact decomposition isolating the dynamic remainder R_n driven by changes in the effective inverse drift matrix M_t = (P_t H)^{-1}, the authors show that under polynomial stabilization ||M_t - M_{t-1}||_op ≲ t^{-β} and step sizes with exponent α ∈ (1/2, 1), the term √n R_n vanishes in L² precisely when β > (α + 1)/2. This threshold is sharp within the class of polynomial hypotheses, as demonstrated by matching lower bounds via explicit sequence constructions. The result is applied to certify the CLT for SA-AdaGrad, SA-RMSProp, and SA-ONS, each achieving β = 1, and extends to Wasserstein rates under additional boundedness assumptions.
Significance. If the results hold, this work provides a sharp, actionable criterion for the stabilization rate of data-driven preconditioners to ensure valid asymptotic normality and online inference in adaptive stochastic gradient methods. By confirming that common adaptive schemes like AdaGrad and RMSProp satisfy the threshold, it supports the use of Wald-type confidence intervals for dynamically preconditioned averaged SGD without additional modifications. The isolation of the preconditioner effect from standard martingale terms offers a clean framework for analyzing other adaptive variants. The exact decomposition and matching upper/lower bounds within the polynomial class are notable strengths.
minor comments (2)
- [Notation and §2] The notation for the operator norm and the precise statement of the polynomial rate hypothesis on M_t could be introduced earlier and used consistently to aid readability.
- [Applications] In the applications section, a short explicit verification that SA-ONS achieves the one-step L²(op) stabilization of order t^{-1} under the stated gain ρ_t = c/t would be helpful; a heuristic version of that check follows this list.
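For concreteness, here is one way such a verification could go, assuming (the recursion is inferred from the abstract, not quoted from the paper) a gain-averaged second-moment update A_t = (1 - ρ_t) A_{t-1} + ρ_t g_t g_tᵀ with ρ_t = c/t, bounded gradients ||g_t|| ≤ G, and eigenvalues of A_t confined to [λ, Λ]:

```latex
% Heuristic verification under the assumed recursion (not the paper's proof):
\[
\|A_t - A_{t-1}\|_{\mathrm{op}}
  = \rho_t\,\|g_t g_t^{\top} - A_{t-1}\|_{\mathrm{op}}
  \le \frac{c\,(G^2 + \Lambda)}{t},
\]
\[
\|A_t^{-1} - A_{t-1}^{-1}\|_{\mathrm{op}}
  = \|A_t^{-1}(A_{t-1} - A_t)A_{t-1}^{-1}\|_{\mathrm{op}}
  \le \lambda^{-2}\,\|A_t - A_{t-1}\|_{\mathrm{op}}
  \lesssim t^{-1},
\]
% so P_t = A_t^{-1} stabilizes in one-step operator norm at order t^{-1}, and
% M_t = (P_t H)^{-1} = H^{-1} A_t inherits the same rate: beta = 1 > (alpha+1)/2.
```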
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee summary accurately reflects the main results on the stabilization threshold for dynamic preconditioning to preserve the Polyak-Ruppert CLT, the exact decomposition, and the applications to SA-AdaGrad, SA-RMSProp, and SA-ONS.
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper isolates the dynamic remainder R_n via an exact decomposition that separates it from the preconditioner-free martingale and Taylor terms, which obey the standard Polyak-Ruppert CLT under regularity conditions. The threshold β > (α+1)/2 is obtained by direct analysis of the remainder under the stated polynomial hypothesis ||M_t - M_{t-1}||_op ≲ t^{-β}, with sharpness demonstrated by explicit counterexample sequences that satisfy the same hypotheses. No load-bearing step reduces to a self-definition, a fitted input renamed as prediction, or a self-citation chain; the limiting covariance H^{-1}SH^{-1} follows from the fixed-P sandwich after the remainder vanishes. The result is internally consistent and relies on external stochastic-approximation theory rather than circular inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: standard regularity conditions for stochastic approximation (strong convexity of the objective, bounded variance of the noise, Lipschitz gradients)
- ad hoc to this paper: polynomial rate hypotheses on the stabilization of M_t = (P_t H)^{-1}
Reference graph
Works this paper leans on
- [1] BACH, F. and MOULINES, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26 773–781.
- [2] BERCU, B., GODICHON-BAGGIONI, A. and PORTIER, B. (2020). An efficient stochastic Newton algorithm for parameter estimation in logistic regressions. SIAM Journal on Control and Optimization 58 348–367.
- [3] BHATIA, R. (1997). Matrix Analysis. Graduate Texts in Mathematics 169. Springer.
- [4] BONIS, T. (2020). Stein's method for normal approximation in Wasserstein distances with application to the multivariate central limit theorem. Probability Theory and Related Fields 178 827–860.
- [5] BOTTOU, L., CURTIS, F. E. and NOCEDAL, J. (2018). Optimization methods for large-scale machine learning. SIAM Review 60 223–311.
- [6] BOYER, C. and GODICHON-BAGGIONI, A. (2023). On the asymptotic rate of convergence of stochastic Newton algorithms and their weighted averaged versions. Computational Optimization and Applications 84 921–972.
- [7] BUTYRIN, B., MOULINES, E., NAUMOV, A., SAMSONOV, S., SHAO, Q.-M. and ZHANG, Z.-S. (2025). Improved Central Limit Theorem and Bootstrap Approximations for Linear Stochastic Approximation. arXiv preprint arXiv:2510.12375.
- [8] BYRD, R. H., HANSEN, S. L., NOCEDAL, J. and SINGER, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization 26 1008–1031.
- [9] CÉNAC, P., GODICHON-BAGGIONI, A. and PORTIER, B. (2025). An efficient averaged stochastic Gauss–Newton algorithm for estimating parameters of nonlinear regressions models. Bernoulli 31.
- [10] CHEN, X., LEE, J. D., TONG, X. T. and ZHANG, Y. (2020). Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics 48 251–273.
- [11] DUCHI, J., HAZAN, E. and SINGER, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 2121–2159.
- [12] FANG, Y., XU, J. and YANG, L. (2018). Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research 19 1–21.
- [13] GADAT, S. and PANLOUP, F. (2023). Optimal non-asymptotic analysis of the Ruppert–Polyak averaging stochastic algorithm. Stochastic Processes and their Applications 156 312–348.
- [14] GODICHON-BAGGIONI, A., LU, W. and PORTIER, B. (2024). A full AdaGrad algorithm with O(Nd) operations. arXiv preprint arXiv:2405.01908.
- [15] GODICHON-BAGGIONI, A., LU, W. and PORTIER, B. (2024). Online estimation of the inverse of the Hessian for stochastic optimization with application to universal stochastic Newton algorithms. arXiv preprint arXiv:2401.10923.
- [16] GODICHON-BAGGIONI, A. and TARRAGO, P. (2025). Non-asymptotic analysis of adaptive stochastic gradient algorithms and applications. Transactions on Machine Learning Research.
- [17] GODICHON-BAGGIONI, A. and WERGE, N. (2025). On adaptive stochastic optimization for streaming data: A Newton's method with O(dN) operations. Journal of Machine Learning Research 26 1–49.
- [18] HALL, P. and HEYDE, C. C. (1980). Martingale Limit Theory and Its Application. Academic Press, New York.
- [19] HAZAN, E., AGARWAL, A. and KALE, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning 69 169–192.
- [20] JAGANNATH, A., JONES-MCCORMICK, T. and SARANGIAN, V. (2025). High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes. arXiv preprint arXiv:2511.03952.
- [21] KINGMA, D. P. and BA, J. (2015). Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
- [22]
- [23] LEE, S., LIAO, Y., SEO, M. H. and SHIN, Y. (2022). Fast and robust online inference with stochastic gradient descent via random scaling. In Proceedings of the AAAI Conference on Artificial Intelligence 7381–7389.
- [24] LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point Estimation, 2nd ed. Springer Texts in Statistics. Springer.
- [25] LELUC, R. and PORTIER, F. (2023). Asymptotic analysis of conditioned stochastic gradient descent. Transactions on Machine Learning Research.
- [26] LI, X.-L. (2018). Preconditioned stochastic gradient descent. IEEE Transactions on Neural Networks and Learning Systems 29 1454–1466.
- [27] MOU, W., LI, C. J., WAINWRIGHT, M. J., BARTLETT, P. L. and JORDAN, M. I. (2020). On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration. In Conference on Learning Theory 2947–2997. PMLR.
- [28] POLYAK, B. T. and JUDITSKY, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30 838–855.
- [29] RIO, E. (2009). Upper bounds for minimal distances in the central limit theorem. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 45 802–817.
- [30] ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics 22 400–407.
- [31] ROBBINS, H. and SIEGMUND, D. (1971). A convergence theorem for non negative almost supermartingales and some applications. Optimizing Methods in Statistics 233–257.
- [32] RUPPERT, D. (1988). Efficient estimations from a slowly convergent Robbins–Monro process. Technical Report No. 781, Cornell University, School of Operations Research and Industrial Engineering.
- [33] SAMSONOV, S., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2024). Gaussian Approximation and Multiplier Bootstrap for Polyak–Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning. In Advances in Neural Information Processing Systems 37.
- [34] SCHRAUDOLPH, N. N., YU, J. and GÜNTER, S. (2007). A stochastic quasi-Newton method for online convex optimization. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics 436–443. PMLR.
- [35]
- [36] SHAO, Q.-M. and ZHANG, Z.-S. (2022). Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms. Bernoulli 28 1548–1576.
- [37] SHESHUKOVA, M., SAMSONOV, S., BELOMESTNY, D., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2025). Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent. arXiv preprint arXiv:2502.06719.
- [38] SHI, N., LI, D., HONG, M. and SUN, R. (2021). RMSProp converges with proper hyper-parameter. In International Conference on Learning Representations.
- [39] SURENDRAN, S., FERMANIAN, A., GODICHON-BAGGIONI, A. and LE CORFF, S. (2024). Non-asymptotic analysis of biased adaptive stochastic approximation. In Advances in Neural Information Processing Systems 37 12897–12943.
- [40] TANG, K., LIU, W., ZHANG, Y. and CHEN, X. (2023). Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality. arXiv preprint arXiv:2305.17665.
- [41] TIELEMAN, T. and HINTON, G. (2012). Lecture 6.5 - RMSProp: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning.
- [42] VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
- [43] WOLBERG, W. H., STREET, W. N. and MANGASARIAN, O. L. (1995). Breast Cancer Wisconsin (Diagnostic) Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- [44] XIE, S., WANG, T., REDDI, S., KUMAR, S. and LI, Z. (2025). Structured Preconditioners in Adaptive Optimization: A Unified Analysis. In International Conference on Machine Learning.
- [45] ZHU, W., CHEN, X. and WU, W. B. (2023). Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association 118 393–404.