pith. machine review for the scientific record. sign in

arxiv: 2604.23498 · v1 · submitted 2026-04-26 · 🧮 math.ST · math.OC· stat.ML· stat.TH

Recognition: unknown

When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold

Sunyoung An, Xiaoming Huo

Authors on Pith no claims yet

Pith reviewed 2026-05-08 05:17 UTC · model grok-4.3

classification 🧮 math.ST math.OCstat.MLstat.TH
keywords dynamic preconditioningPolyak-Ruppert averagingcentral limit theoremstochastic approximationstabilization rateadaptive SGDonline inference
0
0 comments X

The pith

Dynamic preconditioners preserve the Polyak-Ruppert CLT for averaged SGD when their stabilization rate exceeds (α+1)/2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper determines the minimal speed at which a time-varying preconditioner must settle down for the averaged stochastic gradient estimator to retain its usual asymptotic normality with sandwich covariance. It isolates the effect of the changing preconditioner into a single dynamic remainder term via an exact decomposition of the averaged error, leaving the martingale and Taylor terms preconditioner-free. This remainder vanishes in mean square precisely when the operator-norm change of the effective inverse drift matrix decays faster than t to the power minus (α+1)/2, where α is the step-size exponent between 1/2 and 1. The threshold is shown to be sharp inside the class of polynomial-rate assumptions, and a single stabilization argument then certifies the central limit theorem for three common adaptive methods.

Core claim

The dynamic remainder √n R_n vanishes in L² whenever β > (α+1)/2, and sequences exist satisfying the polynomial rate hypotheses for which it does not vanish when β ≤ (α+1)/2. Under standard regularity conditions this yields the CLT √n (x̄_n - x*) → N(0, H^{-1} S H^{-1}) for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.

What carries the argument

The preconditioner-isolating decomposition of the averaged error that confines P_t to a dynamic remainder R_n, together with the effective inverse drift matrix M_t = (P_t H)^{-1} whose operator-norm differences satisfy ||M_t - M_{t-1}||_op ≲ t^{-β}.

Load-bearing premise

The effective inverse drift matrix changes at a polynomial rate faster than t to the minus (α+1)/2.

What would settle it

Construct sequences of preconditioners satisfying the polynomial hypotheses with β exactly equal to (α+1)/2 and check whether √n R_n fails to converge to zero in L².

read the original abstract

Polyak-Ruppert averaging yields an asymptotically normal estimator with sandwich covariance $H^{-1}SH^{-1}$, the foundation of online inference. When the gradient step is preconditioned by a data-driven matrix $P_t$, we ask how fast $P_t$ must stabilize for the central limit theorem (CLT) to remain valid. We resolve this via an exact preconditioner-isolating decomposition of the averaged error that confines $P_t$ to a dynamic remainder $R_n$, leaving the martingale and Taylor terms preconditioner-free. Let $M_t = (P_t H)^{-1}$ denote the effective inverse drift matrix, with $\|M_t - M_{t-1}\|_{\mathrm{op}} \lesssim t^{-\beta}$ and step-size exponent $\alpha \in (1/2, 1)$. We identify a stabilization-rate threshold $\beta > (\alpha+1)/2$ and prove that, within the class of polynomial rate hypotheses used in our upper bound, it cannot be weakened: the dynamic remainder $\sqrt{n}\,R_n$ vanishes in $L^2$ whenever $\beta > (\alpha+1)/2$, and we exhibit sequences satisfying those hypotheses for which it does not vanish when $\beta \le (\alpha+1)/2$. A single stabilization argument certifies three SA variants - SA-AdaGrad, SA-RMSProp, and SA-ONS - with gain $\rho_t = c/t$, each delivering one-step $L^2(\mathrm{op})$ stabilization of order $t^{-1}$, yielding the CLT $\sqrt{n}(\bar{x}_n - x^*) \to N(0, H^{-1}SH^{-1})$; under bounded inputs the pathwise rate $\beta = 1$ further preserves the $n^{-1/6}$ Wasserstein rate at $\alpha^* = 2/3$. Under standard regularity conditions, Wald-type online inference remains valid for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript establishes conditions under which dynamic preconditioning in stochastic approximation algorithms preserves the Polyak-Ruppert central limit theorem. Through an exact decomposition isolating the dynamic remainder R_n driven by changes in the effective inverse drift matrix M_t = (P_t H)^{-1}, the authors show that under polynomial stabilization ||M_t - M_{t-1}||_op ≲ t^{-β} and step sizes with exponent α ∈ (1/2, 1), the term √n R_n vanishes in L² precisely when β > (α + 1)/2. This threshold is sharp within the class of polynomial hypotheses, as demonstrated by matching lower bounds via explicit sequence constructions. The result is applied to certify the CLT for SA-AdaGrad, SA-RMSProp, and SA-ONS, each achieving β = 1, and extends to Wasserstein rates under additional boundedness assumptions.

Significance. If the results hold, this work provides a sharp, actionable criterion for the stabilization rate of data-driven preconditioners to ensure valid asymptotic normality and online inference in adaptive stochastic gradient methods. By confirming that common adaptive schemes like AdaGrad and RMSProp satisfy the threshold, it supports the use of Wald-type confidence intervals for dynamically preconditioned averaged SGD without additional modifications. The isolation of the preconditioner effect from standard martingale terms offers a clean framework for analyzing other adaptive variants. The exact decomposition and matching upper/lower bounds within the polynomial class are notable strengths.

minor comments (2)
  1. [Notation and §2] The notation for the operator norm and the precise statement of the polynomial rate hypothesis on M_t could be introduced earlier and used consistently to aid readability.
  2. [Applications] In the applications section, a short explicit verification that SA-ONS achieves the one-step L²(op) stabilization of order t^{-1} under the stated gain ρ_t = c/t would be helpful.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee summary accurately reflects the main results on the stabilization threshold for dynamic preconditioning to preserve the Polyak-Ruppert CLT, the exact decomposition, and the applications to SA-AdaGrad, SA-RMSProp, and SA-ONS.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper isolates the dynamic remainder R_n via an exact decomposition that separates it from the preconditioner-free martingale and Taylor terms, which obey the standard Polyak-Ruppert CLT under regularity conditions. The threshold β > (α+1)/2 is obtained by direct analysis of the remainder under the stated polynomial hypothesis ||M_t - M_{t-1}||_op ≲ t^{-β}, with sharpness demonstrated by explicit counterexample sequences that satisfy the same hypotheses. No load-bearing step reduces to a self-definition, a fitted input renamed as prediction, or a self-citation chain; the limiting covariance H^{-1}SH^{-1} follows from the fixed-P sandwich after the remainder vanishes. The result is internally consistent and relies on external stochastic-approximation theory rather than circular inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard stochastic-approximation regularity conditions (strong convexity, moment bounds) and on the polynomial stabilization-rate hypotheses introduced for the upper and lower bounds.

axioms (2)
  • domain assumption Standard regularity conditions for stochastic approximation (strong convexity of the objective, bounded variance of the noise, Lipschitz gradients)
    Invoked to ensure the preconditioner-free martingale and Taylor terms obey the classical Polyak-Ruppert CLT.
  • ad hoc to paper Polynomial rate hypotheses on the stabilization of M_t = (P_t H)^{-1}
    Used both for the upper bound on the dynamic remainder and for constructing the matching lower-bound sequences.

pith-pipeline@v0.9.0 · 5686 in / 1636 out tokens · 52860 ms · 2026-05-08T05:17:20.219538+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    and MOULINES, E

    BACH, F. and MOULINES, E. (2013). Non-strongly-convex smooth stochastic approximation with conver- gence rate O(1/n). InAdvances in Neural Information Processing Systems26773–781

  2. [2]

    and PORTIER, B

    BERCU, B., GODICHON-BAGGIONI, A. and PORTIER, B. (2020). An efficient stochastic Newton algorithm for parameter estimation in logistic regressions.SIAM Journal on Control and Optimization58348– 367

  3. [3]

    (1997).Matrix Analysis.Graduate Texts in Mathematics169

    BHATIA, R. (1997).Matrix Analysis.Graduate Texts in Mathematics169. Springer

  4. [4]

    BONIS, T. (2020). Stein’s method for normal approximation in Wasserstein distances with application to the multivariate central limit theorem.Probability Theory and Related Fields178827–860

  5. [5]

    BOTTOU, L., CURTIS, F. E. and NOCEDAL, J. (2018). Optimization methods for large-scale machine learn- ing.SIAM Review60223–311

  6. [6]

    and GODICHON-BAGGIONI, A

    BOYER, C. and GODICHON-BAGGIONI, A. (2023). On the asymptotic rate of convergence of stochastic Newton algorithms and their weighted averaged versions.Computational Optimization and Applica- tions84921–972

  7. [7]

    and ZHANG, Z.-S

    BUTYRIN, B., MOULINES, E., NAUMOV, A., SAMSONOV, S., SHAO, Q.-M. and ZHANG, Z.-S. (2025). Improved Central Limit Theorem and Bootstrap Approximations for Linear Stochastic Approximation. arXiv preprint arXiv:2510.12375

  8. [8]

    H., HANSEN, S

    BYRD, R. H., HANSEN, S. L., NOCEDAL, J. and SINGER, Y. (2016). A stochastic quasi-Newton method for large-scale optimization.SIAM Journal on Optimization261008–1031

  9. [9]

    and PORTIER, B

    CÉNAC, P., GODICHON-BAGGIONI, A. and PORTIER, B. (2025). An efficient averaged stochastic Gauss– Newton algorithm for estimating parameters of nonlinear regressions models.Bernoulli31

  10. [10]

    D., TONG, X

    CHEN, X., LEE, J. D., TONG, X. T. and ZHANG, Y. (2020). Statistical inference for model parameters in stochastic gradient descent.The Annals of Statistics48251–273

  11. [11]

    and SINGER, Y

    DUCHI, J., HAZAN, E. and SINGER, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization.Journal of Machine Learning Research122121–2159

  12. [12]

    and YANG, L

    FANG, Y., XU, J. and YANG, L. (2018). Online bootstrap confidence intervals for the stochastic gradient descent estimator.Journal of Machine Learning Research191–21

  13. [13]

    and PANLOUP, F

    GADAT, S. and PANLOUP, F. (2023). Optimal non-asymptotic analysis of the Ruppert–Polyak averaging stochastic algorithm.Stochastic Processes and their Applications156312–348

  14. [14]

    and PORTIER, B

    GODICHON-BAGGIONI, A., LU, W. and PORTIER, B. (2024). A full AdaGrad algorithm withO(N d) operations.arXiv preprint arXiv:2405.01908

  15. [15]

    and PORTIER, B

    GODICHON-BAGGIONI, A., LU, W. and PORTIER, B. (2024). Online estimation of the inverse of the Hessian for stochastic optimization with application to universal stochastic Newton algorithms.arXiv preprint arXiv:2401.10923

  16. [16]

    and TARRAGO, P

    GODICHON-BAGGIONI, A. and TARRAGO, P. (2025). Non-asymptotic analysis of adaptive stochastic gra- dient algorithms and applications.Transactions on Machine Learning Research

  17. [17]

    and WERGE, N

    GODICHON-BAGGIONI, A. and WERGE, N. (2025). On adaptive stochastic optimization for streaming data: A Newton’s method withO(dN)operations.Journal of Machine Learning Research261–49

  18. [18]

    and HEYDE, C

    HALL, P. and HEYDE, C. C. (1980).Martingale Limit Theory and Its Application. Academic Press, New York

  19. [19]

    and KALE, S

    HAZAN, E., AGARWAL, A. and KALE, S. (2007). Logarithmic regret algorithms for online convex opti- mization.Machine Learning69169–192

  20. [20]

    High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes.arXiv preprint arXiv:2511.03952, 2025

    JAGANNATH, A., JONES-MCCORMICK, T. and SARANGIAN, V. (2025). High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes.arXiv preprint arXiv:2511.03952

  21. [21]

    KINGMA, D. P. and BA, J. (2015). Adam: A Method for Stochastic Optimization. InInternational Confer- ence on Learning Representations

  22. [22]

    KONG, S. T. and SRIKANT, R. (2026). Finite-sample Wasserstein error bounds and concentration inequal- ities for nonlinear stochastic approximation.arXiv preprint arXiv:2602.02445

  23. [23]

    LEE, S., LIAO, Y., SEO, M. H. and SHIN, Y. (2022). Fast and robust online inference with stochastic gradient descent via random scaling. InProceedings of the AAAI Conference on Artificial Intelligence 7381–7389

  24. [24]

    LEHMANN, E. L. and CASELLA, G. (1998).Theory of Point Estimation, 2nd ed.Springer Texts in Statistics. Springer

  25. [25]

    and PORTIER, F

    LELUC, R. and PORTIER, F. (2023). Asymptotic analysis of conditioned stochastic gradient descent.Trans- actions on Machine Learning Research. STABILIZATION THRESHOLD FOR DYNAMICALLY PRECONDITIONED SGD25

  26. [26]

    LI, X.-L. (2018). Preconditioned stochastic gradient descent.IEEE Transactions on Neural Networks and Learning Systems291454–1466

  27. [27]

    J., WAINWRIGHT, M

    MOU, W., LI, C. J., WAINWRIGHT, M. J., BARTLETT, P. L. and JORDAN, M. I. (2020). On linear stochas- tic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration. InConference on Learning Theory2947–2997. PMLR

  28. [28]

    POLYAK, B. T. and JUDITSKY, A. B. (1992). Acceleration of stochastic approximation by averaging.SIAM Journal on Control and Optimization30838–855

  29. [29]

    RIO, E. (2009). Upper bounds for minimal distances in the central limit theorem.Annales de l’Institut Henri Poincaré, Probabilités et Statistiques45802–817

  30. [30]

    and MONRO, S

    ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method.The Annals of Mathematical Statistics22400–407

  31. [31]

    and SIEGMUND, D

    ROBBINS, H. and SIEGMUND, D. (1971). A convergence theorem for non negative almost supermartingales and some applications.Optimizing Methods in Statistics233–257

  32. [32]

    RUPPERT, D. (1988). Efficient estimations from a slowly convergent Robbins–Monro process Technical Report No. 781, Cornell University, School of Operations Research and Industrial Engineering

  33. [33]

    and NAUMOV, A

    SAMSONOV, S., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2024). Gaussian Ap- proximation and Multiplier Bootstrap for Polyak–Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning. InAdvances in Neural Information Processing Systems37

  34. [34]

    N., YU, J

    SCHRAUDOLPH, N. N., YU, J. and GÜNTER, S. (2007). A stochastic quasi-Newton method for online convex optimization. InProceedings of the Eleventh International Conference on Artificial Intelligence and Statistics436–443. PMLR

  35. [35]

    and XI, Y

    SCOTT, M., XU, T., TANG, Z., PICHETTE-EMMONS, A., YE, Q., SAAD, Y. and XI, Y. (2025). Design- ing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability.arXiv preprint arXiv:2511.19716

  36. [36]

    and ZHANG, Z.-S

    SHAO, Q.-M. and ZHANG, Z.-S. (2022). Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms.Bernoulli281548–1576

  37. [37]

    Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

    SHESHUKOVA, M., SAMSONOV, S., BELOMESTNY, D., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2025). Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent.arXiv preprint arXiv:2502.06719

  38. [38]

    and SUN, R

    SHI, N., LI, D., HONG, M. and SUN, R. (2021). RMSProp converges with proper hyper-parameter. In International Conference on Learning Representations

  39. [39]

    and LECORFF, S

    SURENDRAN, S., FERMANIAN, A., GODICHON-BAGGIONI, A. and LECORFF, S. (2024). Non- asymptotic analysis of biased adaptive stochastic approximation. InAdvances in Neural Information Processing Systems3712897–12943

  40. [40]

    and CHEN, X

    TANG, K., LIU, W., ZHANG, Y. and CHEN, X. (2023). Acceleration of stochastic gradient descent with mo- mentum by averaging: finite-sample rates and asymptotic normality.arXiv preprint arXiv:2305.17665

  41. [41]

    and HINTON, G

    TIELEMAN, T. and HINTON, G. (2012). Lecture 6.5—RMSProp: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning. [42]VAN DERVAART, A. W. (1998).Asymptotic Statistics.Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press

  42. [42]

    and WU, W

    WEI, Z., ZHU, W. and WU, W. B. (2025). Weighted Averaged Stochastic Gradient Descent: Asymptotic Normality and Optimality.arXiv preprint arXiv:2307.06915

  43. [43]

    H., STREET, W

    WOLBERG, W. H., STREET, W. N. and MANGASARIAN, O. L. (1995). Breast Cancer Wisconsin (Di- agnostic) Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Breast+ Cancer+Wisconsin+(Diagnostic)

  44. [44]

    and LI, Z

    XIE, S., WANG, T., REDDI, S., KUMAR, S. and LI, Z. (2025). Structured Preconditioners in Adaptive Optimization: A Unified Analysis. InInternational Conference on Machine Learning

  45. [45]

    When Does Dynamic Preconditioning Preserve the Polyak–Ruppert CLT? A Stabilization Threshold

    ZHU, W., CHEN, X. and WU, W. B. (2023). Online covariance matrix estimation in stochastic gradient descent.Journal of the American Statistical Association118393–404. 26S. AN AND X. HUO Supplement to “When Does Dynamic Preconditioning Preserve the Polyak–Ruppert CLT? A Stabilization Threshold” This supplement contains deferred proofs of all results stated ...