Pith · machine review for the scientific record

arxiv: 2605.08417 · v1 · submitted 2026-05-08 · 💻 cs.LG · math.OC

Recognition: no theorem link

Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:35 UTC · model grok-4.3

classification: 💻 cs.LG · math.OC
keywords: distributionally robust reinforcement learning · central limit theorem · stochastic approximation · two-time-scale algorithms · approximate Bellman operator · model-free reinforcement learning · Kullback-Leibler ambiguity set

The pith

An approximate distributionally robust RL method satisfies a central limit theorem at the canonical n^{-1/2} rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses bias and computational cost in model-free distributionally robust reinforcement learning by restricting attention to the small-ambiguity regime under Kullback-Leibler sets. It replaces the nonlinear robust Bellman operator with a first-order linear approximation that eliminates the inner adversarial optimization. The fixed point of this approximate equation is learned by a lifted two-time-scale stochastic approximation algorithm, MVSA, that uses only single-sample updates. The central result is that the main sequence of iterates obeys a central limit theorem whose asymptotic covariance matrix is given in closed form.

Core claim

We introduce an approximate robust Bellman equation obtained from a first-order expansion of the robust functional around zero ambiguity radius. We then design the Mean-Variance Stochastic Approximation algorithm, which tracks both mean and variance quantities through two-time-scale lifted dynamics. Under standard step-size conditions the main iterate converges and satisfies a central limit theorem at the canonical n^{-1/2} rate, with an explicitly characterized limiting covariance.

What carries the argument

Mean-Variance Stochastic Approximation (MVSA), a two-time-scale stochastic approximation scheme that maintains separate fast and slow iterates to solve the lifted system arising from the first-order approximate robust Bellman equation.
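The review gives no pseudocode for MVSA, but the two-time-scale structure it describes can be sketched generically: fast iterates track mean and variance statistics of the noisy stream, while the slow (main) iterate chases a target built from them. Everything below — the function name, step-size exponents, and the specific mean-minus-penalized-deviation target — is an illustrative assumption, not the paper's algorithm:

```python
import numpy as np

def two_time_scale_sa(sample, n_steps, penalty=0.1, rng=None):
    """Generic two-time-scale stochastic approximation sketch.

    Fast iterates (m, s) track the first and second moments of the
    stream with step sizes a_n = n^{-0.6}; the slow main iterate x
    tracks a mean-variance target with step sizes b_n = 1/n, so that
    b_n / a_n -> 0 (the usual time-scale separation). Hypothetical
    sketch only; MVSA's actual lifted dynamics are in the paper.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    m, s, x = 0.0, 0.0, 0.0
    for n in range(1, n_steps + 1):
        z = sample(rng)
        a = n ** -0.6              # fast step size
        b = 1.0 / n                # slow step size
        m += a * (z - m)           # fast: track E[Z]
        s += a * (z * z - s)       # fast: track E[Z^2]
        var = max(s - m * m, 0.0)  # variance from the fast iterates
        # slow: single-sample update toward a mean-variance target
        x += b * (m - penalty * np.sqrt(var) - x)
    return x, m, var
```

On an N(1, 1) stream this drives m toward 1, var toward 1, and x toward 1 − 0.1·√1 = 0.9, using only one sample per step — the property the review highlights for MVSA.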

If this is right

  • The algorithm produces asymptotically normal estimators whose covariance can be used to build confidence intervals without additional simulation.
  • Only single-sample transitions are required at each step, removing the need to solve an inner maximization over transition kernels.
  • The same two-time-scale construction can be applied to any approximate Bellman operator that admits a similar mean-variance lifting.
  • Convergence and the CLT hold as long as the step-size sequences satisfy the usual summability conditions for stochastic approximation.
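The first bullet is mechanical once the CLT is granted: if √n (θ_n − θ*) is approximately N(0, Σ), with Σ the paper's closed-form covariance (or a plug-in estimate of it), then per-coordinate 95% intervals are θ_n ± 1.96·√(Σ_ii/n). A minimal sketch, with a hypothetical function name and inputs:

```python
import numpy as np

def clt_confidence_intervals(theta_n, sigma, n, z=1.96):
    """Per-coordinate confidence intervals from an asymptotic
    covariance matrix, assuming sqrt(n) * (theta_n - theta_star)
    is approximately N(0, sigma). Returns an array of (lo, hi)
    rows, one per coordinate."""
    theta_n = np.asarray(theta_n, dtype=float)
    half = z * np.sqrt(np.diag(sigma) / n)   # z * sqrt(Sigma_ii / n)
    return np.stack([theta_n - half, theta_n + half], axis=-1)
```

For example, with Σ = diag(4, 1) and n = 400 iterations, the half-widths are 1.96·0.1 and 1.96·0.05 around the respective coordinates of the final iterate.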

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the ambiguity radius is chosen small enough that the first-order error is negligible relative to statistical noise, the resulting policy should be nearly distributionally robust.
  • The explicit covariance formula opens the door to online variance reduction or adaptive step-size rules that exploit the predicted asymptotic behavior.
  • The same lifting technique might be reusable for other non-linear operators that appear in risk-sensitive or robust variants of reinforcement learning.
  • Numerical experiments on larger state spaces would be needed to check whether the two-time-scale separation remains practical when function approximation is introduced.

Load-bearing premise

The first-order expansion of the robust functional stays accurate enough in the small-ambiguity regime that higher-order remainder terms do not alter the limiting normal distribution of the iterates.
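The review does not reproduce the expansion itself. A standard first-order expansion of a KL-robust expectation in the small-radius regime, consistent with the algorithm's "mean-variance" naming (cf. Lam 2016 in the reference list), reads as follows; this is a sketch of the likely form, not the paper's Eq. (8), and the paper's parametrization of the radius (and hence the stated remainder order) may differ:

```latex
\inf_{Q \,:\, D_{\mathrm{KL}}(Q \,\|\, P) \le \delta} \mathbb{E}_{Q}[V]
  \;=\; \mathbb{E}_{P}[V] \;-\; \sqrt{2\delta \,\operatorname{Var}_{P}(V)} \;+\; O(\delta).
```

Substituting this into the Bellman backup removes the inner optimization over Q and leaves only a mean term and a variance penalty — exactly the pair of statistics a lifted mean-variance iteration would need to track.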

What would settle it

Run the MVSA algorithm on a finite-state MDP with known transition kernel, collect many independent trajectories of the scaled error √n (U_n − U*), and test whether the empirical covariance converges to the paper's predicted matrix; a statistically significant mismatch would falsify the central limit theorem claim.
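On the simplest possible stand-in — Robbins-Monro mean estimation, where the predicted asymptotic variance is known in closed form — this falsification protocol looks as follows; the MDP version would replace the toy recursion with MVSA and compare against the paper's covariance matrix rather than a scalar σ². The recursion, constants, and sample sizes below are illustrative assumptions:

```python
import numpy as np

def empirical_clt_variance(n_traj=2000, n_steps=2000, mu=0.5, sigma=2.0, seed=1):
    """Empirical check of a stochastic-approximation CLT on a toy
    recursion. With step sizes 1/(k+1), the iterate
    x_{k+1} = x_k + (z_k - x_k)/(k+1) equals the running sample mean,
    so sqrt(n) * (x_n - mu) should be approximately N(0, sigma^2).
    Returns the empirical variance of the scaled final errors."""
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal((n_traj, n_steps))
    x_final = z.mean(axis=1)              # final SA iterate per trajectory
    scaled_err = np.sqrt(n_steps) * (x_final - mu)
    return scaled_err.var()               # compare against sigma**2
```

With σ = 2 the returned variance should sit near 4; a persistent, statistically significant gap between the empirical and predicted values is the analogue of the mismatch that would falsify the paper's CLT.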

Figures

Figures reproduced from arXiv: 2605.08417 by Shengbo Wang, Zexi Zhang.

Figure 1
Figure 1. Approximation error ‖U* − Q*‖∞ as a function of the ambiguity radius δ. (The surrounding text describes the inventory-control benchmark: states S := {−B, −B + 1, . . . , 0, . . . , I} with inventory capacity I > 0 and maximum backlog B > 0; order quantities A := {0, 1, . . . , O} with maximum order size O > 0.)
Figure 2
Figure 2. (Left) Estimation error ‖U_n − U*‖∞ on a log-log scale. (Right) Empirical distribution of the scaled error √(n/a)·[U_n(z) − U*(z)] at z = (0, 2) and (0, 3).
Figure 3
Figure 3. (Left) Heatmap of U*_ε(s, a); red markers indicate the greedy action a*(s). (Right) Optimal value function v*(s) = max_a U*_ε(s, a), with optimal actions annotated.
Original abstract

Designing model-free algorithms for distributionally robust reinforcement learning (DRRL) poses fundamental challenges. The robust Bellman operator is nonlinear in the transition kernel, which makes one-sample Bellman updates biased, while the adversarial optimization underlying robustness makes robust evaluation computationally demanding. To address these difficulties, we consider the natural small-ambiguity regime under Kullback--Leibler ambiguity sets and propose an approximate DRRL framework based on a first-order expansion of the relevant robust functional. This yields an approximate robust Bellman equation that removes the adversarial optimization while remaining first-order accurate in the ambiguity radius. To learn the fixed point of this approximate equation, we propose Mean-Variance Stochastic Approximation (MVSA), a model-free algorithm that uses only one-sample updates. This is achieved via a lifted stochastic approximation dynamics and a two-time-scale design. We then prove convergence and a central limit theorem for MVSA: its main iterate satisfies a central limit theorem at the canonical $n^{-1/2}$ scale, with explicitly characterized asymptotic covariances. Finally, we validate our theoretical findings with a numerical experiment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes an approximate framework for distributionally robust RL in the small-ambiguity regime under KL divergence sets, obtained by substituting a first-order expansion of the robust functional into the Bellman operator. This yields a tractable approximate robust Bellman equation free of adversarial optimization. The authors introduce the Mean-Variance Stochastic Approximation (MVSA) algorithm, a model-free two-time-scale stochastic approximation procedure that learns the fixed point of the approximate equation via single-sample updates. They prove almost-sure convergence of the iterates to the fixed point and establish a central limit theorem at the canonical n^{-1/2} rate, with explicitly characterized asymptotic covariance matrices for the main iterate. Numerical experiments on a simple MDP are included to illustrate the theory.

Significance. If the CLT holds, the work supplies a concrete asymptotic normality result for a computationally feasible approximation to DRRL, together with an explicit covariance formula that can support downstream statistical inference. The two-time-scale construction cleanly separates the mean and variance updates while preserving the one-sample property, and the fact that the approximation remainder does not enter the leading mean-field or diffusion terms of the MVSA dynamics is a useful structural observation. These elements together advance the theoretical toolkit for robust RL beyond convergence statements.

major comments (2)
  1. [§4, Theorem 4.2] CLT statement: the proof invokes a general two-time-scale SA CLT; the manuscript should explicitly verify that the specific mean-field drift and noise covariance of the lifted MVSA dynamics satisfy the required Lipschitz, growth, and positive-definiteness conditions, particularly the non-degeneracy of the asymptotic covariance matrix.
  2. [§3.1, Eq. (8)] First-order expansion: while the remainder is correctly stated to be o(ε), the paper should record the precise order of the remainder (e.g., O(ε²)) and confirm that it contributes only higher-order bias to the fixed-point error, leaving the n^{-1/2} CLT scaling and covariance formula for MVSA unaffected.
minor comments (3)
  1. [Abstract] The acronym MVSA is introduced without expansion; spell out “Mean-Variance Stochastic Approximation” on first use.
  2. [§5] Numerical experiment: the plots would be clearer if they included multiple independent runs with shaded standard-error bands, allowing visual comparison with the predicted n^{-1/2} rate and covariance.
  3. [Notation] The lifted state vector (mean and variance iterates) is denoted differently across §3 and §4; adopt a single consistent symbol throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment, and recommendation for minor revision. We address each major comment below and will incorporate the requested clarifications.

Point-by-point responses
  1. Referee: [§4, Theorem 4.2] CLT statement: the proof invokes a general two-time-scale SA CLT; the manuscript should explicitly verify that the specific mean-field drift and noise covariance of the lifted MVSA dynamics satisfy the required Lipschitz, growth, and positive-definiteness conditions, particularly the non-degeneracy of the asymptotic covariance matrix.

    Authors: We agree that explicit verification of the conditions strengthens the application of the general CLT. In the revision we will add a dedicated appendix (or subsection) that directly checks the Lipschitz continuity and linear growth of the mean-field drift, the boundedness and positive-definiteness of the noise covariance, and the non-degeneracy of the limiting covariance matrix for the lifted MVSA dynamics under the paper's standing assumptions. revision: yes

  2. Referee: [§3.1, Eq. (8)] First-order expansion: while the remainder is correctly stated to be o(ε), the paper should record the precise order of the remainder (e.g., O(ε²)) and confirm that it contributes only higher-order bias to the fixed-point error, leaving the n^{-1/2} CLT scaling and covariance formula for MVSA unaffected.

    Authors: We thank the referee for this suggestion. The first-order Taylor expansion of the KL-robust functional indeed produces an O(ε²) remainder. We will revise §3.1 to state this order explicitly, include a brief derivation confirming that the induced bias in the approximate fixed point is O(ε²), and note that this higher-order term does not enter the leading mean-field or diffusion terms, thereby preserving the n^{-1/2} CLT rate and covariance formula already established for MVSA. revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper defines an approximate robust Bellman operator via first-order expansion under KL ambiguity, introduces the MVSA algorithm as a two-time-scale stochastic approximation on the lifted dynamics of that operator, and then proves convergence plus a CLT for the main iterate at the standard n^{-1/2} rate with explicit asymptotic covariance derived from the mean-field and noise terms of the MVSA recursion. None of these steps reduces by construction to a fitted parameter, a self-definition, or a load-bearing self-citation; the CLT is a standard result for the defined dynamics and the covariance formula follows from the linearization of those dynamics rather than from any data-dependent fit or renaming. The remainder of the expansion affects only the distance to the true robust fixed point and is explicitly excluded from the MVSA mean-field and diffusion terms, so the CLT statement remains independent of that remainder.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard stochastic approximation theory for two-time-scale systems and the validity of the first-order Taylor expansion of the robust functional; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • standard math: Standard assumptions for convergence of two-time-scale stochastic approximation (e.g., step-size conditions, bounded moments). Invoked to obtain both convergence and the CLT.
  • domain assumption: The first-order expansion of the KL-robust functional is accurate enough that higher-order terms do not affect the limiting distribution. Central modeling choice stated in the abstract.

pith-pipeline@v0.9.0 · 5486 in / 1270 out tokens · 37480 ms · 2026-05-12T01:35:34.611838+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  1. Borkar, V. S. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
  2. Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294.
  3. Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Operations Research, 72(4):1352–1367.
  4. Chen, Z., Wang, S., and Si, N. (2025). Sample complexity of distributionally robust average-reward reinforcement learning. arXiv preprint arXiv:2505.10007.
  5. Chen, Z., Zhang, S., Doan, T. T., Clarke, J.-P., and Maguluri, S. T. (2022). Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica, 146:110623.
  6. Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39(4):1327–1332.
  7. Iyengar, G. (2005). Robust dynamic programming. Mathematics of Operations Research, 30:257–280.
  8. Konda, V. and Tsitsiklis, J. (2004). Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability, 14.
  9. Kumar, N., Derman, E., Geist, M., Levy, K. Y., and Mannor, S. (2023). Policy gradient for rectangular robust Markov decision processes. In Advances in Neural Information Processing Systems, volume 36, pages 59477–59501. Curran Associates, Inc.
  10. Kushner, H. and Yin, G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Stochastic Modelling and Applied Probability. Springer New York.
  11. Lam, H. (2016). Robust sensitivity analysis for stochastic systems. Mathematics of Operations Research, 41(4):1248–1275.
  12. Li, M., Kuhn, D., and Sutter, T. (2025a). Policy gradient algorithms for robust MDPs with non-rectangular uncertainty sets.
  13. Li, Z., Wang, S., and Si, N. (2025b). Near-optimal sample complexities of divergence-based s-rectangular distributionally robust reinforcement learning. arXiv preprint arXiv:2505.12202.
  14. Liang, Z., Ma, X., Blanchet, J., Zhang, J., and Zhou, Z. (2023). Single-trajectory distributionally robust reinforcement learning. arXiv preprint arXiv:2301.11721.
  15. Liu, Z., Bai, Q., Blanchet, J., Dong, P., Xu, W., Zhou, Z., and Zhou, Z. (2022). Distributionally robust Q-learning. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13623–13643. PMLR.
  16. Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22:551–575.
  17. Mescheder, L., Geiger, A., and Nowozin, S. (2018). Which training methods for GANs do actually converge? In International Conference on Machine Learning, pages 3481–3490. PMLR.
  18. Mokkadem, A. and Pelletier, M. (2006). Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability, 16(3):1671–1702.
  19. Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861.
  20. Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53:780–798.
  21. Panaganti, K. and Kalathil, D. (2022). Sample complexity of robust reinforcement learning with a generative model.
  22. Polyak, B. (1990). New method of stochastic approximation type. Automation and Remote Control.
  23. Robbins, H. and Siegmund, D. (1971). A convergence theorem for non-negative almost supermartingales and some applications. In Optimizing Methods in Statistics, pages 233–257. Elsevier.
  24. Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407.
  25. Shi, L. and Chi, Y. (2022). Distributionally robust model-based offline reinforcement learning with near-optimal sample complexity.
  26. Shi, L., Li, G., Wei, Y., Chen, Y., Geist, M., and Chi, Y. (2023). The curious price of distributional robustness in reinforcement learning with a generative model. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23). Curran Associates, Inc.
  27. Wainwright, M. J. (2019). Basic tail and concentration bounds, pages 21–57. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
  28. Wang, Q., Ho, C. P., and Petrik, M. (2023a). Policy gradient in robust MDPs with global convergence guarantee. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 35763–35797. PMLR.
  29. Wang, Q., Zha, Y., Ho, C. P., and Petrik, M. (2025a). Provable policy gradient for robust average-reward MDPs beyond rectangularity. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research. PMLR.
  30. Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2023b). A finite sample complexity bound for distributionally robust Q-learning.
  31. Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2024). Sample complexity of variance-reduced distributionally robust Q-learning. Journal of Machine Learning Research, 25(341):1–77.
  32. Wang, S., Si, N., Blanchet, J., and Zhou, Z. (2025b). On the foundation of distributionally robust reinforcement learning.
  33. Wang, Y. and Zou, S. (2022). Policy gradient method for robust reinforcement learning. In International Conference on Machine Learning, pages 23484–23526. PMLR.
  34. Wiesemann, W., Kuhn, D., and Rustem, B. (2013). Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183.
  35. Xu, Z., Panaganti, K., and Kalathil, D. M. (2023). Improved sample complexity bounds for distributionally robust reinforcement learning. arXiv preprint arXiv:2303.02783.
  36. Yang, W., Wang, H., Kozuno, T., Jordan, S., and Zhang, Z. (2023). Avoiding model estimation in robust Markov decision processes with a generative model.
  37. Yang, W., Zhang, L., and Zhang, Z. (2022). Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics. The Annals of Statistics, 50.
  38. Zhou, Z., Zhou, Z., Bai, Q., Qiu, L., Blanchet, J., and Glynn, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research. PMLR.