pith. sign in

arxiv: 2509.18964 · v3 · submitted 2025-09-23 · 💻 cs.LG · math.OC· stat.ML

Central Limit Theorems for Asynchronous Averaged Q-Learning

Pith reviewed 2026-05-18 14:52 UTC · model grok-4.3

classification 💻 cs.LG math.OCstat.ML
keywords central limit theoremQ-learningasynchronous updatesPolyak-Ruppert averagingreinforcement learningWasserstein distancefunctional central limit theoremBrownian motion
0
0 comments X

The pith

Averaged asynchronous Q-learning satisfies a non-asymptotic central limit theorem with rates depending on key parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proves central limit theorems for Polyak-Ruppert averaged Q-learning when updates occur asynchronously. It supplies explicit rates for how the distribution of the iterates approaches normality in Wasserstein distance, with the rate depending on iteration count, state-action space size, discount factor, and exploration quality. A functional central limit theorem further shows that the partial-sum process converges weakly to Brownian motion. These results matter because they give a precise statistical description of how reliable the learned Q-values become over time in reinforcement learning.

Core claim

The paper establishes a non-asymptotic central limit theorem for Polyak-Ruppert averaged Q-learning under asynchronous updates, in which the Wasserstein distance to the limiting normal distribution depends explicitly on the number of iterations, state-action space size, discount factor, and quality of exploration. It also derives a functional central limit theorem under which the partial-sum process converges weakly to a Brownian motion.

What carries the argument

The non-asymptotic central limit theorem applied to Polyak-Ruppert averaged asynchronous Q-learning iterates, which quantifies distributional convergence to normality with explicit parameter dependence.

If this is right

  • The Wasserstein error shrinks with more iterations and with stronger exploration.
  • The explicit dependence on discount factor and state-action space size permits scaling predictions for larger problems.
  • The functional central limit theorem links the discrete learning trajectory to a continuous Brownian motion limit.
  • The results hold under standard mixing and boundedness conditions on the MDP and step sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same averaging and limit arguments could extend to other temporal-difference methods such as SARSA.
  • The explicit rates could guide adaptive stopping criteria during large-scale training runs.
  • The Brownian motion limit suggests possible diffusion-based approximations for analyzing RL dynamics in continuous time.
  • Empirical checks on standard benchmark environments would test whether the predicted rates appear in practice.

Load-bearing premise

The underlying Markov decision process, step-size schedules, and exploration policy must guarantee sufficient mixing and boundedness for the averaged iterates to satisfy the conditions of the central limit theorem.

What would settle it

Simulate multiple independent runs of asynchronous averaged Q-learning on a small finite MDP, then check whether the empirical distribution of the averaged Q-estimates approaches a Gaussian at the exact Wasserstein rate predicted by the theorem as the iteration count grows.

read the original abstract

This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We prove a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper establishes central limit theorems for Polyak-Ruppert averaged asynchronous Q-learning. It proves a non-asymptotic CLT in which the Wasserstein-distance convergence rate depends explicitly on iteration count, |S|×|A|, discount factor, and exploration quality. It further derives a functional CLT showing that the partial-sum process converges weakly to Brownian motion.

Significance. If the derivations hold, the explicit non-asymptotic rates and the functional CLT supply useful statistical guarantees for averaged Q-learning in the asynchronous setting. The parameter dependencies are a concrete strength that can inform step-size and exploration design. The functional limit adds a process-level result that is not automatic from the finite-dimensional CLT.

major comments (2)
  1. [§3, Theorem 3.1] §3, Theorem 3.1 (non-asymptotic CLT): the theorem statement invokes 'standard conditions' on the MDP, step-size schedule, and exploration policy without listing them explicitly. Because these conditions are load-bearing for the mixing and boundedness arguments that justify the Wasserstein rate, they must be stated in full (including any quantitative mixing-time or variance bounds) so that the claimed dependence on |S|×|A| and exploration quality can be verified.
  2. [§4] §4, proof of the functional CLT: the reduction from the averaged asynchronous iterates to a martingale with the required quadratic-variation process is only sketched. The explicit constant that multiplies the Brownian-motion covariance must be derived from the asynchronous update structure; without this derivation the claimed weak convergence cannot be confirmed to hold at the stated rate.
minor comments (2)
  1. [§2] Notation for the averaged iterate Q-bar_t is introduced in §2 but used without re-definition in the statements of Theorems 3.1 and 4.1; a one-line reminder would improve readability.
  2. [Figure 1] Figure 1 (empirical Wasserstein distance vs. theory) lacks error bars and does not indicate the number of independent runs; this makes it hard to judge whether the observed scaling matches the derived rate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading of the manuscript and the constructive comments. We address each major comment below and describe the revisions we will make to strengthen the presentation.

read point-by-point responses
  1. Referee: [§3, Theorem 3.1] §3, Theorem 3.1 (non-asymptotic CLT): the theorem statement invokes 'standard conditions' on the MDP, step-size schedule, and exploration policy without listing them explicitly. Because these conditions are load-bearing for the mixing and boundedness arguments that justify the Wasserstein rate, they must be stated in full (including any quantitative mixing-time or variance bounds) so that the claimed dependence on |S|×|A| and exploration quality can be verified.

    Authors: We agree that the assumptions need to be stated explicitly for the rates to be verifiable. In the revised manuscript we will insert a dedicated subsection (new Section 2.3) that enumerates all standing assumptions in full, including the quantitative mixing-time bound for the exploration-induced Markov chain (in terms of |S|×|A| and the minimum visit probability) and the uniform variance bound on the temporal-difference noise. These explicit constants will be carried through the proof of Theorem 3.1 so that the dependence on state-action space size and exploration quality is transparent. revision: yes

  2. Referee: [§4] §4, proof of the functional CLT: the reduction from the averaged asynchronous iterates to a martingale with the required quadratic-variation process is only sketched. The explicit constant that multiplies the Brownian-motion covariance must be derived from the asynchronous update structure; without this derivation the claimed weak convergence cannot be confirmed to hold at the stated rate.

    Authors: We accept that the sketch in Section 4 leaves the quadratic-variation calculation implicit. We will expand the argument to derive the precise asymptotic covariance explicitly: starting from the asynchronous update indicator, we compute the long-run quadratic variation as the stationary expectation of the squared noise term weighted by the visit probabilities under the exploration policy. The resulting constant (a scalar multiple of the usual Q-learning covariance) will be stated in closed form and inserted into the statement of the functional CLT, confirming the weak convergence to Brownian motion at the claimed rate. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper derives non-asymptotic and functional central limit theorems for Polyak-Ruppert averaged asynchronous Q-learning by extending standard stochastic approximation results to the asynchronous setting. The Wasserstein convergence rate and weak convergence to Brownian motion are obtained under generic mixing and boundedness conditions on the MDP, step-size schedule, and exploration policy; these assumptions are stated as external to the target CLT and do not embed the claimed rates or limits by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The explicit dependence on iteration count, |S|×|A|, discount factor, and exploration quality follows directly from the martingale structure and averaging analysis rather than from tautological reduction to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claims rest on standard but unstated domain assumptions typical for Q-learning convergence analyses; no free parameters or invented entities are evident from the abstract.

axioms (1)
  • domain assumption The Markov decision process satisfies bounded rewards, sufficient exploration, and mixing conditions that allow asynchronous averaged iterates to obey a central limit theorem.
    Standard prerequisite for such stochastic approximation results but not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5583 in / 1220 out tokens · 45855 ms · 2026-05-18T14:52:29.773693+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation

    stat.ML 2026-05 unverdicted novelty 7.0

    Establishes non-asymptotic Gaussian approximation bounds for federated LSA with explicit communication-heterogeneity trade-offs and introduces an online multiplier bootstrap for last-iterate inference with validity gu...

  2. Gaussian Approximation for Asynchronous Q-learning

    stat.ML 2026-04 unverdicted novelty 7.0

    Derived rates of order up to n^{-1/6} log^4(n S A) for the high-dimensional CLT of averaged asynchronous Q-learning iterates, plus a general martingale-difference CLT.

  3. On Gaussian approximation for entropy-regularized Q-learning with function approximation

    stat.ML 2026-05 unverdicted novelty 5.0

    Establishes n^{-1/4} Gaussian approximation in convex distance for averaged entropy-regularized Q-learning with linear function approximation and polynomial stepsizes.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 3 Pith papers · 3 internal anchors

  1. [1]

    Siddharth Chandak.o(1/k) finite-time bound for non-linear two-time-scale stochastic ap- proximation.arXiv preprint arXiv:2504.19375,

  2. [2]

    Chandak, S

    Siddharth Chandak, Shaan Ul Haque, and Nicholas Bambos. Finite-time bounds for two- time-scale stochastic approximation with arbitrary norm contractions and markovian noise.arXiv preprint arXiv:2503.18391,

  3. [3]

    A L yapunov theory for finite- sample guarantees of asynchronous Q-learning and TD-learning va riants,

    5 Liu Shuhang Chen, Adithya Devraj, Ana Busic, and Sean Meyn. Explicit mean-square error bounds for monte-carlo and linear stochastic approximation. InInternational Conference on Artificial Intelligence and Statistics, pages 4173–4183. PMLR, 2020a. Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. Finite-sample analysis of co...

  4. [4]

    Decoupled functional central limit theorems for two-time-scale stochastic approximation.arXiv preprint arXiv:2412.17070,

    Yuze Han, Xiang Li, Jiadong Liang, and Zhihua Zhang. Decoupled functional central limit theorems for two-time-scale stochastic approximation.arXiv preprint arXiv:2412.17070,

  5. [5]

    A general-purpose theorem for high- probability bounds of stochastic approximation with polyak averaging.arXiv preprint arXiv:2505.21796,

    Sajad Khodadadian and Martin Zubeldia. A general-purpose theorem for high- probability bounds of stochastic approximation with polyak averaging.arXiv preprint arXiv:2505.21796,

  6. [6]

    Nonasymptotic clt and error bounds for two-time-scale stochastic approximation.arXiv preprint arXiv:2502.09884,

    Seo Taek Kong, Sihan Zeng, Thinh T Doan, and R Srikant. Nonasymptotic clt and error bounds for two-time-scale stochastic approximation.arXiv preprint arXiv:2502.09884,

  7. [7]

    Statistical inference for linear stochastic approximation with markovian noise.arXiv preprint arXiv:2505.19102,

    Sergey Samsonov, Marina Sheshukova, Eric Moulines, and Alexey Naumov. Statistical inference for linear stochastic approximation with markovian noise.arXiv preprint arXiv:2505.19102,

  8. [8]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300,

  9. [9]

    Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning

    7 Liu R Srikant. Rates of convergence in the central limit theorem for markov chains, with an application to td learning.arXiv preprint arXiv:2401.15719,

  10. [10]

    Sim-to-Real: Learning Agile Locomotion For Quadruped Robots

    Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots.arXiv preprint arXiv:1804.10332,

  11. [11]

    M. J. Wainwright. Stochastic approximation with cone-contractive operators: Sharpℓ ∞- bounds for q-learning. Technical Report arxiv:1905.06265, UC Berkeley, May

  12. [12]

    and Xie, Q

    Yixuan Zhang and Qiaomin Xie. Constant stepsize q-learning: Distributional convergence, bias and extrapolation.arXiv preprint arXiv:2401.13884,

  13. [13]

    Proof of Theorem 4 We begin by establishing the following lemma

    8 Central Limit Theorems for Asynchronous A veraged Q-Learning Appendix A. Proof of Theorem 4 We begin by establishing the following lemma. A.1. Proof of Lemma 7 Lemma 7Denote∆ k =Q k −Q ∗. For allk∈[K], ifα k ≤1, then∆ k is bounded as follows: ∆↓ k ≤∆ k ≤∆ ↑ k, where∆ ↓ 0 = ∆0 = ∆↑ 0 and the upper and lower bounds evolve according to ∆↑ k+1 = (I−α kD+α k...

  14. [14]

    h 1√ K KX k=1 ∆↑ k ! −h((A −1ΣA−⊤)1/2N(0, I)) # . For anyh∈Lip 1, we have E

    We now decompose Term (3), K−1X k=0 A−1(F(Q k, Yk)−E[F(Q k, Yk)]) = K−1X k=0 A−1(Xk(Yk)−E[X k(Yk+1)|Yk]) = K−1X k=0 A−1(Xk(Yk)−X k+1(Yk+1) +X k+1(Yk+1)−X k(Yk+1) +X k(Yk+1)−E[X k(Yk+1)|Yk]) =A −1(X0(Y0)−X K(YK))| {z } Term (3a) + K−1X k=0 A−1(Xk+1(Yk+1)−X k(Yk+1)) | {z } Term (3b) 12 Central Limit Theorems for Asynchronous A veraged Q-Learning + K−1X k=0 ...

  15. [15]

    The rest of the proof follows from the proof of Theorem 2 in Srikant (2024), with necessary modifications to accommodate our setting

    Note that under Assumption 2, the three conditions in Theorem 8 are satisfied. The rest of the proof follows from the proof of Theorem 2 in Srikant (2024), with necessary modifications to accommodate our setting. To conclude, with the substitutions, we have nX k=1 ∥Σ1/2 ∞ ∥opE[∥Σ−1/2 ∞ mk∥β+2 2 +∥Σ −1/2 ∞ mk∥β 2] (n−k+

  16. [16]

    Appendix B

    (1+β)/2 ≤O n(1−β)/2/((1−γ)ρ) 2+β 17 Liu and nX k=1 1 n−k+ 1 Tr(Mk(Σ−1/2 ∞ E[Σk]Σ−1/2 ∞ −I))≤O 1/(1−γ) 2ρ2 which completes the proof. Appendix B. Proof of Theorem 6 Polish space is a separable and complete function space. It is a crucial structure for ap- plying convergence in distribution results such as FCLT. Recall that we denoteD[0,1] as the Skorokhod ...

  17. [17]

    The following lemma, which establishes the FCLT for 1√ K ϕ6(ζ), is a direct consequence of Theorem 4.2 in Hall and Heyde (2014). Lemma 11For anyζ∈[0,1], 1√ K ⌊ζK⌋X k=1 A−1(Xk−1(Yk)−E[X k−1(Yk)|Yk−1]) w→(A −1ΣA−⊤)1/2B(ζ) whereBis the standard Brownian motion andΣ := P i,j∈ ˜S ˜µ(i)˜P(i, j)(X(j)−E[X(Y 1)|Y0 = i])(X(j)−E[X(Y 1)|Y0 =i]) ⊤. Thus, we have 1√ K ...

  18. [18]

    (2021), Lemma 15 gives a non-asymptotic convergence rate for ∆ k =Q k −Q ∗ under asynchronous updates

    Next, by leveraging the results in Chen et al. (2021), Lemma 15 gives a non-asymptotic convergence rate for ∆ k =Q k −Q ∗ under asynchronous updates. Lastly, Lemma 16 provides a Lipschitz property for the operator F(·, s) defined in eq. (2). Lemma 12Letα i =α(i+b) −β for some problem-dependent constantsα, b >0andβ∈ (0,1). Then the following bounds hold: K...

  19. [19]

    β ·O 1 (ρ(1−γ)) 1 1−β ! ≤O 1 k1+β(ρ(1−γ)) 1 1−β ! ≤O 1 k fork≥(ρ(1−γ)) −1/β(1−β). For the second term, we observe KX i=k+1   i−1Y j=k+1 (I−α jA)   − KX i=k   i−1Y j=k (I−α jA)   =I+ KX i=k   i−1Y j=k+1 (I−α jA)− i−1Y j=k (I−α jA)   =I+ KX i=k  αkA i−1Y j=k+1 (I−α jA)   =O(I) fork≥(ρ(1−γ)) −1/β(1−β). Thus, the second term is of orderα k. Pu...