Central Limit Theorem for Two-Time-Scale Approximate Distributionally Robust RL
Pith reviewed 2026-05-12 01:35 UTC · model grok-4.3
The pith
An approximate distributionally robust RL method satisfies a central limit theorem at the canonical n^{-1/2} rate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce an approximate robust Bellman equation obtained from a first-order expansion of the robust functional around zero ambiguity radius. We then design the Mean-Variance Stochastic Approximation algorithm, which tracks both mean and variance quantities through a two-time-scale lifted dynamics. Under standard step-size conditions, the main iterate converges and satisfies a central limit theorem at the canonical n^{-1/2} rate, with an explicitly characterized limiting covariance.
What carries the argument
Mean-Variance Stochastic Approximation (MVSA), a two-time-scale stochastic approximation scheme that maintains separate fast and slow iterates to solve the lifted system arising from the first-order approximate robust Bellman equation.
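The mechanics of a two-time-scale scheme can be illustrated with a generic toy recursion (this is a hypothetical sketch, not the paper's actual MVSA update): the fast iterate estimates a conditional mean from one-sample observations, while the slow iterate treats the fast one as quasi-static and tracks its limit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic two-time-scale stochastic approximation sketch (hypothetical
# dynamics, not the paper's MVSA recursion): the fast iterate y estimates
# the mean of noisy one-sample observations; the slow iterate x tracks y.
x, y = 0.0, 0.0
target_mean = 2.0
for n in range(1, 200_001):
    sample = target_mean + rng.standard_normal()  # one-sample observation
    beta = n ** -0.6    # fast step size
    alpha = n ** -0.9   # slow step size; alpha_n / beta_n -> 0
    y += beta * (sample - y)   # fast time scale: running mean estimate
    x += alpha * (y - x)       # slow time scale: sees y as quasi-static
print(x, y)  # both close to 2.0
```

The step-size exponents (0.6 and 0.9) are illustrative choices that satisfy the usual separation condition alpha_n / beta_n → 0.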
If this is right
- The algorithm produces asymptotically normal estimators whose covariance can be used to build confidence intervals without additional simulation.
- Only single-sample transitions are required at each step, removing the need to solve an inner maximization over transition kernels.
- The same two-time-scale construction can be applied to any approximate Bellman operator that admits a similar mean-variance lifting.
- Convergence and the CLT hold as long as the step-size sequences satisfy the usual summability conditions for stochastic approximation.
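For reference, the "usual summability conditions" for two-time-scale stochastic approximation, in the standard Borkar-style formulation (the paper's precise assumptions may differ), read:

```latex
\sum_{n} \alpha_n = \sum_{n} \beta_n = \infty, \qquad
\sum_{n} \left( \alpha_n^2 + \beta_n^2 \right) < \infty, \qquad
\frac{\alpha_n}{\beta_n} \longrightarrow 0,
```

satisfied, for example, by the slow step size $\alpha_n = n^{-1}$ and the fast step size $\beta_n = n^{-2/3}$.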
Where Pith is reading between the lines
- If the ambiguity radius is chosen small enough that the first-order error is negligible relative to statistical noise, the resulting policy should be nearly distributionally robust.
- The explicit covariance formula opens the door to online variance reduction or adaptive step-size rules that exploit the predicted asymptotic behavior.
- The same lifting technique might be reusable for other non-linear operators that appear in risk-sensitive or robust variants of reinforcement learning.
- Numerical experiments on larger state spaces would be needed to check whether the two-time-scale separation remains practical when function approximation is introduced.
Load-bearing premise
The first-order expansion of the robust functional stays accurate enough in the small-ambiguity regime that higher-order remainder terms do not alter the limiting normal distribution of the iterates.
What would settle it
Run the MVSA algorithm on a finite-state MDP with a known transition kernel, collect many independent replications of the scaled error √n × (main iterate − its limit), and test whether the empirical covariance converges to the paper's predicted matrix; a statistically significant mismatch would falsify the central limit theorem claim.
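This falsification test can be sketched on a toy scalar SA whose CLT variance is known in closed form; for MVSA one would substitute the paper's recursion and its predicted covariance matrix. With step size 1/k the toy recursion reduces exactly to a running sample mean, so √n(θ_n − μ) has variance σ².

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 2.0
n_steps, reps = 5000, 4000

# Toy scalar SA: theta_{k+1} = theta_k + (1/k)(X_k - theta_k) with X_k
# i.i.d. N(mu, sigma^2). Its CLT variance is sigma^2; for MVSA, replace
# the recursion and compare against the paper's predicted covariance.
theta = np.zeros(reps)                 # one iterate per independent run
for k in range(1, n_steps + 1):
    x = mu + sigma * rng.standard_normal(reps)
    theta += (x - theta) / k
scaled_err = np.sqrt(n_steps) * (theta - mu)
print(scaled_err.var())  # close to sigma**2 = 4.0
```

A formal version of the comparison would use a multivariate test (e.g., on the empirical vs. predicted covariance) rather than eyeballing the variance.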
Original abstract
Designing model-free algorithms for distributionally robust reinforcement learning (DRRL) poses fundamental challenges. The robust Bellman operator is nonlinear in the transition kernel, which makes one-sample Bellman updates biased, while the adversarial optimization underlying robustness makes robust evaluation computationally demanding. To address these difficulties, we consider the natural small-ambiguity regime under Kullback--Leibler ambiguity sets and propose an approximate DRRL framework based on a first-order expansion of the relevant robust functional. This yields an approximate robust Bellman equation that removes the adversarial optimization while remaining first-order accurate in the ambiguity radius. To learn the fixed point of this approximate equation, we propose Mean-Variance Stochastic Approximation (MVSA), a model-free algorithm that uses only one-sample updates. This is achieved via a lifted stochastic approximation dynamics and a two-time-scale design. We then prove convergence and a central limit theorem for MVSA: its main iterate satisfies a central limit theorem at the canonical $n^{-1/2}$ scale, with explicitly characterized asymptotic covariances. Finally, we validate our theoretical findings with a numerical experiment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an approximate framework for distributionally robust RL in the small-ambiguity regime under KL divergence sets, obtained by substituting a first-order expansion of the robust functional into the Bellman operator. This yields a tractable approximate robust Bellman equation free of adversarial optimization. The authors introduce the Mean-Variance Stochastic Approximation (MVSA) algorithm, a model-free two-time-scale stochastic approximation procedure that learns the fixed point of the approximate equation via single-sample updates. They prove almost-sure convergence of the iterates to the fixed point and establish a central limit theorem at the canonical n^{-1/2} rate, with explicitly characterized asymptotic covariance matrices for the main iterate. Numerical experiments on a simple MDP are included to illustrate the theory.
Significance. If the CLT holds, the work supplies a concrete asymptotic normality result for a computationally feasible approximation to DRRL, together with an explicit covariance formula that can support downstream statistical inference. The two-time-scale construction cleanly separates the mean and variance updates while preserving the one-sample property, and the fact that the approximation remainder does not enter the leading mean-field or diffusion terms of the MVSA dynamics is a useful structural observation. These elements together advance the theoretical toolkit for robust RL beyond convergence statements.
Major comments (2)
- [§4, Theorem 4.2] CLT statement: the proof invokes a general two-time-scale SA CLT; the manuscript should explicitly verify that the specific mean-field drift and noise covariance of the lifted MVSA dynamics satisfy the required Lipschitz, growth, and positive-definiteness conditions, particularly the non-degeneracy of the asymptotic covariance matrix.
- [§3.1, Eq. (8)] First-order expansion: while the remainder is correctly stated to be o(ε), the paper should record the precise order of the remainder (e.g., O(ε²)) and confirm that it contributes only higher-order bias to the fixed-point error, leaving the n^{-1/2} CLT scaling and covariance formula for MVSA unaffected.
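The remainder's empirical order can be probed numerically. The sketch below assumes the standard KL dual with radius ε and the common mean-variance surrogate E[V] − √(2ε)·σ (the paper's exact parameterization and claimed order are not reproduced here), and estimates the remainder's decay rate by log-log regression.

```python
import numpy as np

# Finite-support test distribution (hypothetical values).
v = np.array([0.0, 1.0, 3.0])
p = np.array([0.5, 0.3, 0.2])
mu = p @ v
sigma = np.sqrt(p @ (v - mu) ** 2)

lam = np.logspace(-2, 4, 20000)  # dual-variable grid

def robust_value(eps):
    # Exact worst-case mean over the KL ball via the dual representation:
    #   inf_{KL(Q||P)<=eps} E_Q[V] = sup_{l>0} -l*log E_P[exp(-V/l)] - l*eps
    obj = -lam * np.log(np.exp(-v[None, :] / lam[:, None]) @ p) - lam * eps
    return obj.max()

eps_grid = np.array([1e-2, 1e-3, 1e-4])
approx = mu - sigma * np.sqrt(2 * eps_grid)      # mean-variance surrogate
exact = np.array([robust_value(e) for e in eps_grid])
remainder = np.abs(exact - approx)
slope = np.polyfit(np.log(eps_grid), np.log(remainder), 1)[0]
print(slope)  # > 1/2: the remainder decays faster than the sqrt(eps) term
```

Under this parameterization the leading correction is of order √ε, so any fitted slope above 1/2 confirms the remainder is of higher order than the retained term.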
Minor comments (3)
- [Abstract] The acronym MVSA is introduced without expansion; spell out “Mean-Variance Stochastic Approximation” on first use.
- [§5] Numerical experiment: the plots would be clearer if they included multiple independent runs with shaded standard-error bands, allowing visual comparison with the predicted n^{-1/2} rate and covariance.
- [Notation] The lifted state vector (mean and variance iterates) is denoted differently across §3 and §4; adopt a single consistent symbol throughout.
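The plotting suggestion in the §5 comment can be sketched as follows; everything here (array names, run counts, the synthetic error data) is a placeholder to be replaced with real per-run MVSA error trajectories.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
ns = np.logspace(2, 5, 30)
# Placeholder: |error| for 50 independent runs at each checkpoint n,
# generated to decay at the n^{-1/2} rate; replace with real MVSA errors.
errs = np.abs(rng.standard_normal((50, ns.size))) / np.sqrt(ns)

mean = errs.mean(axis=0)
se = errs.std(axis=0, ddof=1) / np.sqrt(errs.shape[0])  # standard error
fig, ax = plt.subplots()
ax.loglog(ns, mean, label="mean absolute error over runs")
ax.fill_between(ns, mean - 2 * se, mean + 2 * se, alpha=0.3)  # +/- 2 SE band
ax.loglog(ns, mean[0] * np.sqrt(ns[0] / ns), "--", label=r"$n^{-1/2}$ reference")
ax.set_xlabel("iteration $n$"); ax.set_ylabel("error"); ax.legend()
fig.savefig("mvsa_rate.png")
```

On log-log axes, agreement with the dashed reference line is a direct visual check of the n^{-1/2} rate.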
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment, and recommendation for minor revision. We address each major comment below and will incorporate the requested clarifications.
Point-by-point responses
- Referee: [§4, Theorem 4.2] CLT statement: the proof invokes a general two-time-scale SA CLT; the manuscript should explicitly verify that the specific mean-field drift and noise covariance of the lifted MVSA dynamics satisfy the required Lipschitz, growth, and positive-definiteness conditions, particularly the non-degeneracy of the asymptotic covariance matrix.
  Authors: We agree that explicit verification of the conditions strengthens the application of the general CLT. In the revision we will add a dedicated appendix (or subsection) that directly checks the Lipschitz continuity and linear growth of the mean-field drift, the boundedness and positive-definiteness of the noise covariance, and the non-degeneracy of the limiting covariance matrix for the lifted MVSA dynamics under the paper's standing assumptions. Revision: yes.
- Referee: [§3.1, Eq. (8)] First-order expansion: while the remainder is correctly stated to be o(ε), the paper should record the precise order of the remainder (e.g., O(ε²)) and confirm that it contributes only higher-order bias to the fixed-point error, leaving the n^{-1/2} CLT scaling and covariance formula for MVSA unaffected.
  Authors: We thank the referee for this suggestion. The first-order Taylor expansion of the KL-robust functional indeed produces an O(ε²) remainder. We will revise §3.1 to state this order explicitly, include a brief derivation confirming that the induced bias in the approximate fixed point is O(ε²), and note that this higher-order term does not enter the leading mean-field or diffusion terms, thereby preserving the n^{-1/2} CLT rate and covariance formula already established for MVSA. Revision: yes.
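The non-degeneracy check promised in the first response can be automated generically; the matrix below is a placeholder, not the paper's actual asymptotic covariance.

```python
import numpy as np

# Placeholder candidate for an asymptotic covariance matrix; in practice
# this would be the explicit covariance computed from the linearized
# mean-field drift and noise covariance of the lifted dynamics.
Sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])

assert np.allclose(Sigma, Sigma.T), "covariance must be symmetric"
min_eig = np.linalg.eigvalsh(Sigma).min()
print(min_eig > 0)  # True: positive definite, so the CLT limit is non-degenerate
```

Using `eigvalsh` (for symmetric matrices) avoids the spurious complex eigenvalues that the general `eig` routine can return from rounding error.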
Circularity Check
No significant circularity
Full rationale
The paper defines an approximate robust Bellman operator via first-order expansion under KL ambiguity, introduces the MVSA algorithm as a two-time-scale stochastic approximation on the lifted dynamics of that operator, and then proves convergence plus a CLT for the main iterate at the standard n^{-1/2} rate with explicit asymptotic covariance derived from the mean-field and noise terms of the MVSA recursion. None of these steps reduces by construction to a fitted parameter, a self-definition, or a load-bearing self-citation; the CLT is a standard result for the defined dynamics and the covariance formula follows from the linearization of those dynamics rather than from any data-dependent fit or renaming. The remainder of the expansion affects only the distance to the true robust fixed point and is explicitly excluded from the MVSA mean-field and diffusion terms, so the CLT statement remains independent of that remainder.
Axiom & Free-Parameter Ledger
Axioms (2)
- [standard math] Standard assumptions for convergence of two-time-scale stochastic approximation (e.g., step-size conditions, bounded moments)
- [domain assumption] The first-order expansion of the KL-robust functional is accurate enough that higher-order terms do not affect the limiting distribution