Recognition: unknown
When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold
Pith reviewed 2026-05-08 05:17 UTC · model grok-4.3
The pith
Dynamic preconditioners preserve the Polyak-Ruppert CLT for averaged SGD when their stabilization rate exceeds (α+1)/2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The scaled dynamic remainder √n R_n vanishes in L² whenever β > (α+1)/2, and there exist preconditioner sequences satisfying the polynomial rate hypotheses for which it does not vanish when β ≤ (α+1)/2. Under standard regularity conditions this yields the CLT √n (x̄_n - x*) → N(0, H^{-1} S H^{-1}) for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.
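To make the claim concrete, here is a minimal simulation sketch, not taken from the paper: averaged SGD on a quadratic with a toy preconditioner path P_t → H^{-1} whose one-step changes decay like t^{-2}, comfortably above the threshold for every α ∈ (1/2, 1). The quadratic objective, the toy path, and all constants are illustrative assumptions.

```python
# Minimal sketch (assumed toy setup, not the paper's algorithms): averaged SGD
# on f(x) = 0.5 x^T H x with a preconditioner path P_t -> H^{-1} whose one-step
# changes are O(t^{-2}), i.e. beta = 2 > (alpha+1)/2 for every alpha in (1/2, 1).
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, n_reps = 2, 5000, 300
H = np.diag([1.0, 3.0])              # Hessian; minimizer is x* = 0
alpha, gamma0 = 0.7, 0.5             # step size gamma_t = gamma0 * t^{-alpha}
P_inf = np.linalg.inv(H)             # limit of the preconditioner path

scaled_avgs = np.empty((n_reps, d))
for r in range(n_reps):
    x, xbar = rng.normal(size=d), np.zeros(d)
    for t in range(1, n_steps + 1):
        grad = H @ x + rng.normal(size=d)          # noise covariance S = I
        P_t = P_inf + (0.5 / t) * np.eye(d)        # one-step changes O(t^{-2})
        x = x - gamma0 * t ** (-alpha) * (P_t @ grad)
        xbar += (x - xbar) / t                     # running Polyak-Ruppert average
    scaled_avgs[r] = np.sqrt(n_steps) * xbar

print("empirical covariance of sqrt(n) * xbar_n:\n", np.cov(scaled_avgs.T))
print("sandwich H^{-1} S H^{-1}:\n", P_inf @ np.eye(d) @ P_inf)
```

With the stabilization rate above the threshold, the empirical covariance should approach the preconditioner-free sandwich diag(1, 1/9); the theorem predicts this agreement can fail once the path's one-step changes decay no faster than t^{-(α+1)/2}.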
What carries the argument
The preconditioner-isolating decomposition of the averaged error that confines P_t to a dynamic remainder R_n, together with the effective inverse drift matrix M_t = (P_t H)^{-1} whose operator-norm differences satisfy ||M_t - M_{t-1}||_op ≲ t^{-β}.
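The polynomial hypothesis is easy to probe empirically. The sketch below tracks ||M_t - M_{t-1}||_op along one plausible SA-RMSProp-style recursion with gain ρ_t = c/t (the recursion, the stand-in gradients, and all constants are assumptions; the paper's exact variant may differ) and fits the decay exponent β in log-log space.

```python
# Sketch: estimate the exponent beta in ||M_t - M_{t-1}||_op ~ t^{-beta}
# for M_t = (P_t H)^{-1}, along an assumed RMSProp-style path with gain c/t.
import numpy as np

rng = np.random.default_rng(1)
d, n_steps, c, eps = 3, 20000, 1.0, 1e-8
H = np.diag([1.0, 2.0, 4.0])

v = np.ones(d)                           # running second-moment estimate
M_prev, diffs, ts = None, [], []
for t in range(1, n_steps + 1):
    g = rng.normal(size=d)               # stand-in stochastic gradient
    v = (1 - c / t) * v + (c / t) * g**2
    P = np.diag(1.0 / np.sqrt(v + eps))  # RMSProp-style preconditioner
    M = np.linalg.inv(P @ H)             # effective inverse drift matrix
    if M_prev is not None and t % 20 == 0:
        diffs.append(np.linalg.norm(M - M_prev, 2))   # operator norm
        ts.append(t)
    M_prev = M

beta_hat = -np.polyfit(np.log(ts), np.log(diffs), 1)[0]
print("fitted beta:", beta_hat)          # ~1, above (alpha+1)/2 for all alpha < 1
```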
Load-bearing premise
The one-step change of the effective inverse drift matrix decays at a polynomial rate t^{-β} with β strictly greater than (α+1)/2.
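A back-of-envelope check of where the exponent comes from, assuming (this is a heuristic, not the paper's decomposition) that the averaged remainder R_n aggregates one-step preconditioner changes weighted by γ_t^{-1} and the last-iterate error, with the standard rate ||x_t - x*||_{L²} = O(√γ_t) and γ_t ≍ t^{-α}:

```latex
% Heuristic only: the exact form of R_n is given in the paper; here we assume
% it aggregates one-step changes of M_t weighted by 1/gamma_t and the
% last-iterate error.
\[
\sqrt{n}\,\bigl\|R_n\bigr\|_{L^2}
\;\lesssim\; \frac{1}{\sqrt{n}} \sum_{t=1}^{n}
      \gamma_t^{-1}\,\bigl\|M_t - M_{t-1}\bigr\|_{\mathrm{op}}\,
      \bigl\|x_t - x^*\bigr\|_{L^2}
\;\lesssim\; \frac{1}{\sqrt{n}} \sum_{t=1}^{n} t^{\alpha}\, t^{-\beta}\, t^{-\alpha/2}
\;\asymp\; n^{\,\alpha/2 - \beta + 1/2}.
\]
% The exponent alpha/2 - beta + 1/2 is negative exactly when
% beta > (alpha+1)/2, reproducing the stabilization threshold.
```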
What would settle it
Construct sequences of preconditioners satisfying the polynomial hypotheses with β exactly equal to (α+1)/2 and check whether √n R_n fails to converge to zero in L².
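Under the heuristic scaling above, the boundary behaviour can at least be previewed numerically: the proxy s_n = n^{-1/2} Σ_{t≤n} t^{α/2-β} plateaus at a positive constant when β = (α+1)/2 and decays to zero when β exceeds it. A quick sketch (the proxy sum is an assumption inherited from the heuristic, not the paper's counterexample construction):

```python
# Probe the boundary beta = (alpha+1)/2 through the heuristic proxy
# s_n = n^{-1/2} * sum_{t<=n} t^{alpha/2 - beta} for sqrt(n) R_n.
import numpy as np

alpha = 0.7
boundary = (alpha + 1) / 2
for beta in (boundary, boundary + 0.1):
    for n in (10**3, 10**4, 10**5, 10**6):
        t = np.arange(1, n + 1, dtype=float)
        s_n = np.sum(t ** (alpha / 2 - beta)) / np.sqrt(n)
        print(f"beta={beta:.2f}  n={n:>8}  s_n={s_n:.3f}")
# At the boundary the summand is t^{-1/2}, so s_n -> 2 (non-vanishing);
# for beta = boundary + 0.1 it decays like n^{-0.1} toward zero.
```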
read the original abstract
Polyak-Ruppert averaging yields an asymptotically normal estimator with sandwich covariance $H^{-1}SH^{-1}$, the foundation of online inference. When the gradient step is preconditioned by a data-driven matrix $P_t$, we ask how fast $P_t$ must stabilize for the central limit theorem (CLT) to remain valid. We resolve this via an exact preconditioner-isolating decomposition of the averaged error that confines $P_t$ to a dynamic remainder $R_n$, leaving the martingale and Taylor terms preconditioner-free. Let $M_t = (P_t H)^{-1}$ denote the effective inverse drift matrix, with $\|M_t - M_{t-1}\|_{\mathrm{op}} \lesssim t^{-\beta}$ and step-size exponent $\alpha \in (1/2, 1)$. We identify a stabilization-rate threshold $\beta > (\alpha+1)/2$ and prove that, within the class of polynomial rate hypotheses used in our upper bound, it cannot be weakened: the dynamic remainder $\sqrt{n}\,R_n$ vanishes in $L^2$ whenever $\beta > (\alpha+1)/2$, and we exhibit sequences satisfying those hypotheses for which it does not vanish when $\beta \le (\alpha+1)/2$. A single stabilization argument certifies three SA variants - SA-AdaGrad, SA-RMSProp, and SA-ONS - with gain $\rho_t = c/t$, each delivering one-step $L^2(\mathrm{op})$ stabilization of order $t^{-1}$, yielding the CLT $\sqrt{n}(\bar{x}_n - x^*) \to N(0, H^{-1}SH^{-1})$; under bounded inputs the pathwise rate $\beta = 1$ further preserves the $n^{-1/6}$ Wasserstein rate at $\alpha^* = 2/3$. Under standard regularity conditions, Wald-type online inference remains valid for dynamically preconditioned averaged SGD whose stabilization rate exceeds the threshold.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript establishes conditions under which dynamic preconditioning in stochastic approximation algorithms preserves the Polyak-Ruppert central limit theorem. Through an exact decomposition isolating the dynamic remainder R_n driven by changes in the effective inverse drift matrix M_t = (P_t H)^{-1}, the authors show that under polynomial stabilization ||M_t - M_{t-1}||_op ≲ t^{-β} and step sizes with exponent α ∈ (1/2, 1), the term √n R_n vanishes in L² precisely when β > (α + 1)/2. This threshold is sharp within the class of polynomial hypotheses, as demonstrated by matching lower bounds via explicit sequence constructions. The result is applied to certify the CLT for SA-AdaGrad, SA-RMSProp, and SA-ONS, each achieving β = 1, and extends to Wasserstein rates under additional boundedness assumptions.
Significance. If the results hold, this work provides a sharp, actionable criterion for the stabilization rate of data-driven preconditioners to ensure valid asymptotic normality and online inference in adaptive stochastic gradient methods. By confirming that common adaptive schemes like AdaGrad and RMSProp satisfy the threshold, it supports the use of Wald-type confidence intervals for dynamically preconditioned averaged SGD without additional modifications. The isolation of the preconditioner effect from standard martingale terms offers a clean framework for analyzing other adaptive variants. The exact decomposition and matching upper/lower bounds within the polynomial class are notable strengths.
minor comments (2)
- [Notation and §2] The notation for the operator norm and the precise statement of the polynomial rate hypothesis on M_t could be introduced earlier and used consistently to aid readability.
- [Applications] In the applications section, a short explicit verification that SA-ONS achieves the one-step L²(op) stabilization of order t^{-1} under the stated gain ρ_t = c/t would be helpful; a heuristic version of that check follows this list.
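For concreteness, here is one way such a verification could go, assuming (the recursion is inferred from the abstract, not quoted from the paper) a gain-averaged second-moment update A_t = (1 - ρ_t) A_{t-1} + ρ_t g_t g_tᵀ with ρ_t = c/t, bounded gradients ||g_t|| ≤ G, and eigenvalues of A_t confined to [λ, Λ]:

```latex
% Heuristic verification under the assumed recursion (not the paper's proof):
\[
\|A_t - A_{t-1}\|_{\mathrm{op}}
  = \rho_t\,\|g_t g_t^{\top} - A_{t-1}\|_{\mathrm{op}}
  \le \frac{c\,(G^2 + \Lambda)}{t},
\]
\[
\|A_t^{-1} - A_{t-1}^{-1}\|_{\mathrm{op}}
  = \|A_t^{-1}(A_{t-1} - A_t)A_{t-1}^{-1}\|_{\mathrm{op}}
  \le \lambda^{-2}\,\|A_t - A_{t-1}\|_{\mathrm{op}}
  \lesssim t^{-1},
\]
% so P_t = A_t^{-1} stabilizes in one-step operator norm at order t^{-1}, and
% M_t = (P_t H)^{-1} = H^{-1} A_t inherits the same rate: beta = 1 > (alpha+1)/2.
```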
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee summary accurately reflects the main results on the stabilization threshold for dynamic preconditioning to preserve the Polyak-Ruppert CLT, the exact decomposition, and the applications to SA-AdaGrad, SA-RMSProp, and SA-ONS.
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper isolates the dynamic remainder R_n via an exact decomposition that separates it from the preconditioner-free martingale and Taylor terms, which obey the standard Polyak-Ruppert CLT under regularity conditions. The threshold β > (α+1)/2 is obtained by direct analysis of the remainder under the stated polynomial hypothesis ||M_t - M_{t-1}||_op ≲ t^{-β}, with sharpness demonstrated by explicit counterexample sequences that satisfy the same hypotheses. No load-bearing step reduces to a self-definition, a fitted input renamed as prediction, or a self-citation chain; the limiting covariance H^{-1}SH^{-1} follows from the fixed-P sandwich after the remainder vanishes. The result is internally consistent and relies on external stochastic-approximation theory rather than circular inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: standard regularity conditions for stochastic approximation (strong convexity of the objective, bounded variance of the noise, Lipschitz gradients)
- ad hoc to this paper: polynomial rate hypotheses on the stabilization of M_t = (P_t H)^{-1}
Reference graph
Works this paper leans on
- [1] BACH, F. and MOULINES, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Advances in Neural Information Processing Systems 26 773–781.
- [2] BERCU, B., GODICHON-BAGGIONI, A. and PORTIER, B. (2020). An efficient stochastic Newton algorithm for parameter estimation in logistic regressions. SIAM Journal on Control and Optimization 58 348–367.
- [3] BHATIA, R. (1997). Matrix Analysis. Graduate Texts in Mathematics 169. Springer.
- [4] BONIS, T. (2020). Stein's method for normal approximation in Wasserstein distances with application to the multivariate central limit theorem. Probability Theory and Related Fields 178 827–860.
- [5] BOTTOU, L., CURTIS, F. E. and NOCEDAL, J. (2018). Optimization methods for large-scale machine learning. SIAM Review 60 223–311.
- [6] BOYER, C. and GODICHON-BAGGIONI, A. (2023). On the asymptotic rate of convergence of stochastic Newton algorithms and their weighted averaged versions. Computational Optimization and Applications 84 921–972.
- [7] BUTYRIN, B., MOULINES, E., NAUMOV, A., SAMSONOV, S., SHAO, Q.-M. and ZHANG, Z.-S. (2025). Improved Central Limit Theorem and Bootstrap Approximations for Linear Stochastic Approximation. arXiv preprint arXiv:2510.12375.
- [8] BYRD, R. H., HANSEN, S. L., NOCEDAL, J. and SINGER, Y. (2016). A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization 26 1008–1031.
- [9] CÉNAC, P., GODICHON-BAGGIONI, A. and PORTIER, B. (2025). An efficient averaged stochastic Gauss–Newton algorithm for estimating parameters of nonlinear regressions models. Bernoulli 31.
- [10] CHEN, X., LEE, J. D., TONG, X. T. and ZHANG, Y. (2020). Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics 48 251–273.
- [11] DUCHI, J., HAZAN, E. and SINGER, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 2121–2159.
- [12] FANG, Y., XU, J. and YANG, L. (2018). Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research 19 1–21.
- [13] GADAT, S. and PANLOUP, F. (2023). Optimal non-asymptotic analysis of the Ruppert–Polyak averaging stochastic algorithm. Stochastic Processes and their Applications 156 312–348.
- [14] GODICHON-BAGGIONI, A., LU, W. and PORTIER, B. (2024). A full AdaGrad algorithm with O(Nd) operations. arXiv preprint arXiv:2405.01908.
- [15] GODICHON-BAGGIONI, A., LU, W. and PORTIER, B. (2024). Online estimation of the inverse of the Hessian for stochastic optimization with application to universal stochastic Newton algorithms. arXiv preprint arXiv:2401.10923.
- [16] GODICHON-BAGGIONI, A. and TARRAGO, P. (2025). Non-asymptotic analysis of adaptive stochastic gradient algorithms and applications. Transactions on Machine Learning Research.
- [17] GODICHON-BAGGIONI, A. and WERGE, N. (2025). On adaptive stochastic optimization for streaming data: A Newton's method with O(dN) operations. Journal of Machine Learning Research 26 1–49.
- [18] HALL, P. and HEYDE, C. C. (1980). Martingale Limit Theory and Its Application. Academic Press, New York.
- [19] HAZAN, E., AGARWAL, A. and KALE, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning 69 169–192.
- [20] JAGANNATH, A., JONES-MCCORMICK, T. and SARANGIAN, V. (2025). High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes. arXiv preprint arXiv:2511.03952.
- [21] KINGMA, D. P. and BA, J. (2015). Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
- [22]
- [23] LEE, S., LIAO, Y., SEO, M. H. and SHIN, Y. (2022). Fast and robust online inference with stochastic gradient descent via random scaling. In Proceedings of the AAAI Conference on Artificial Intelligence 7381–7389.
- [24] LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point Estimation, 2nd ed. Springer Texts in Statistics. Springer.
- [25] LELUC, R. and PORTIER, F. (2023). Asymptotic analysis of conditioned stochastic gradient descent. Transactions on Machine Learning Research.
- [26] LI, X.-L. (2018). Preconditioned stochastic gradient descent. IEEE Transactions on Neural Networks and Learning Systems 29 1454–1466.
- [27] MOU, W., LI, C. J., WAINWRIGHT, M. J., BARTLETT, P. L. and JORDAN, M. I. (2020). On linear stochastic approximation: Fine-grained Polyak–Ruppert and non-asymptotic concentration. In Conference on Learning Theory 2947–2997. PMLR.
- [28] POLYAK, B. T. and JUDITSKY, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30 838–855.
- [29] RIO, E. (2009). Upper bounds for minimal distances in the central limit theorem. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques 45 802–817.
- [30] ROBBINS, H. and MONRO, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics 22 400–407.
- [31] ROBBINS, H. and SIEGMUND, D. (1971). A convergence theorem for non negative almost supermartingales and some applications. Optimizing Methods in Statistics 233–257.
- [32] RUPPERT, D. (1988). Efficient estimations from a slowly convergent Robbins–Monro process. Technical Report No. 781, Cornell University, School of Operations Research and Industrial Engineering.
- [33] SAMSONOV, S., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2024). Gaussian Approximation and Multiplier Bootstrap for Polyak–Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning. In Advances in Neural Information Processing Systems 37.
- [34] SCHRAUDOLPH, N. N., YU, J. and GÜNTER, S. (2007). A stochastic quasi-Newton method for online convex optimization. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics 436–443. PMLR.
- [35]
- [36] SHAO, Q.-M. and ZHANG, Z.-S. (2022). Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms. Bernoulli 28 1548–1576.
- [37] SHESHUKOVA, M., SAMSONOV, S., BELOMESTNY, D., MOULINES, E., SHAO, Q.-M., ZHANG, Z.-S. and NAUMOV, A. (2025). Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent. arXiv preprint arXiv:2502.06719.
- [38] SHI, N., LI, D., HONG, M. and SUN, R. (2021). RMSProp converges with proper hyper-parameter. In International Conference on Learning Representations.
- [39] SURENDRAN, S., FERMANIAN, A., GODICHON-BAGGIONI, A. and LE CORFF, S. (2024). Non-asymptotic analysis of biased adaptive stochastic approximation. In Advances in Neural Information Processing Systems 37 12897–12943.
- [40] TANG, K., LIU, W., ZHANG, Y. and CHEN, X. (2023). Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality. arXiv preprint arXiv:2305.17665.
- [41] TIELEMAN, T. and HINTON, G. (2012). Lecture 6.5 - RMSProp: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning.
- [42] VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
- [43] WOLBERG, W. H., STREET, W. N. and MANGASARIAN, O. L. (1995). Breast Cancer Wisconsin (Diagnostic) Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
- [44] XIE, S., WANG, T., REDDI, S., KUMAR, S. and LI, Z. (2025). Structured Preconditioners in Adaptive Optimization: A Unified Analysis. In International Conference on Machine Learning.
- [45] ZHU, W., CHEN, X. and WU, W. B. (2023). Online covariance matrix estimation in stochastic gradient descent. Journal of the American Statistical Association 118 393–404.