Achieving ε⁻² Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions
Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3
The pith
Single-loop off-policy actor-critic reaches an ε-optimal policy with Õ(ε^{-2}) samples, assuming only that some policy induces an irreducible Markov chain.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the single assumption that there exists a policy inducing an irreducible Markov chain, single-loop, single-timescale off-policy actor-critic with a broad class of policy updates achieves last-iterate convergence with Õ(ε^{-2}) sample complexity for an ε-optimal policy. The argument establishes geometric convergence for the actor and Õ(1/T) convergence for the critic via coupled Lyapunov drift inequalities, which are combined through a cross-domination property.
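To fix a concrete picture of the single-loop, single-timescale template the claim concerns, here is a minimal sketch on a toy two-state MDP: one critic step and one actor step per iteration, with fixed step sizes of the same order. All names are ours, and expected (full-sweep) updates are used for determinism; the paper's algorithm is sampled and off-policy, so this is an illustration of the update structure, not the analyzed method.

```python
import math

# Toy two-state, two-action MDP: action a moves to state a; reward 1 for a=1.
# Single-loop, single-timescale: one critic step and one actor step per
# iteration, with fixed step sizes of the same order.
GAMMA, ALPHA, BETA, T = 0.9, 0.1, 0.5, 2000

def reward(s, a):
    return 1.0 if a == 1 else 0.0

def next_state(s, a):
    return a

theta = [[0.0, 0.0], [0.0, 0.0]]   # actor: softmax logits theta[s][a]
Q = [[0.0, 0.0], [0.0, 0.0]]       # critic: action values Q[s][a]

def policy(s):
    m = max(theta[s])                       # subtract max for stability
    e = [math.exp(t - m) for t in theta[s]]
    z = sum(e)
    return [x / z for x in e]

for _ in range(T):
    # critic: one expected TD(0) step toward r + gamma * V(s') under pi
    for s in (0, 1):
        for a in (0, 1):
            sp = next_state(s, a)
            v_sp = sum(p * q for p, q in zip(policy(sp), Q[sp]))
            Q[s][a] += BETA * (reward(s, a) + GAMMA * v_sp - Q[s][a])
    # actor: one softmax policy-gradient-style step using the current critic
    for s in (0, 1):
        for a in (0, 1):
            theta[s][a] += ALPHA * Q[s][a]

print([round(policy(s)[1], 3) for s in (0, 1)], round(Q[0][1], 2))
```

With these step sizes the last iterate (no averaging) ends up near-greedy on the better action and the critic approaches the value 1/(1-γ) = 10, illustrating the last-iterate behavior the claim is about.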
What carries the argument
Coupled Lyapunov drift framework with cross-domination property that links geometric actor progress to Õ(1/T) critic progress while handling interdependent updates and unbounded iterates.
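Schematically, the mechanism couples two drift inequalities of roughly the following form (notation ours, not the paper's; a toy rendering of the structure described in the abstract):

```latex
% Illustrative coupled Lyapunov drifts (symbols are not the paper's):
%   W_a = actor Lyapunov function, W_c = critic Lyapunov function.
\[
\begin{aligned}
W_a(t+1) &\le (1-\rho)\,W_a(t) + c_1\,W_c(t), \qquad 0<\rho<1,\\
W_c(t+1) &\le \Bigl(1-\tfrac{c_2}{t}\Bigr)W_c(t) + \tfrac{c_3}{t}\,W_a(t) + \tfrac{c_4}{t^2}.
\end{aligned}
\]
```

Cross-domination then says the coupling constants are compatible (in this toy form, roughly $c_1 c_3$ small relative to $\rho\,c_2$), so substituting one inequality into the other still closes: $W_a$ decays geometrically down to an $\tilde{\mathcal{O}}(1/T)$ floor while $W_c = \tilde{\mathcal{O}}(1/T)$.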
If this is right
- The Õ(ε^{-2}) guarantee applies directly to approximate policy iteration and natural policy gradient inside the single-loop setting.
- Last-iterate convergence is obtained without any averaging of the iterates.
- The rate holds without requiring uniform mixing times or uniform exploration constants.
- Off-policy updates are covered even when parameter iterates remain unbounded.
Where Pith is reading between the lines
- The coupled-drift technique may extend to other single-timescale stochastic approximation schemes whose updates are interdependent.
- Single-loop implementations could become the default in practice once the rate is confirmed on standard benchmarks.
- Policy designers may begin checking irreducibility properties explicitly when seeking fast convergence guarantees.
Load-bearing premise
The existence of at least one policy that induces an irreducible Markov chain, plus the coupled Lyapunov drifts and cross-domination property holding for the single-loop off-policy updates.
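For a tabular MDP the premise is directly checkable: a policy's induced chain is irreducible iff the directed graph of its positive-probability transitions is strongly connected. A minimal check (the tabular representation and example MDP are ours, not the paper's):

```python
def induced_chain(P, pi):
    """Transition matrix of the chain induced by policy pi.

    P[s][a][s2] = transition probability; pi[s][a] = action probability.
    """
    n = len(P)
    return [[sum(pi[s][a] * P[s][a][s2] for a in range(len(pi[s])))
             for s2 in range(n)] for s in range(n)]

def is_irreducible(M, tol=1e-12):
    """A finite chain is irreducible iff its positive-transition graph
    is strongly connected: every state reaches every other state."""
    n = len(M)
    def reachable(src):
        seen, stack = {src}, [src]
        while stack:
            s = stack.pop()
            for s2 in range(n):
                if M[s][s2] > tol and s2 not in seen:
                    seen.add(s2)
                    stack.append(s2)
        return seen
    return all(len(reachable(s)) == n for s in range(n))

# Two-state, two-action example: action 0 stays put, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
lazy = [[1.0, 0.0], [1.0, 0.0]]      # always "stay": reducible chain
mixing = [[0.5, 0.5], [0.5, 0.5]]    # mixes actions: irreducible chain
print(is_irreducible(induced_chain(P, lazy)),
      is_irreducible(induced_chain(P, mixing)))
```

Note the assumption asks only for one such policy to exist (here, `mixing`), not for every policy to induce an irreducible chain (`lazy` does not).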
What would settle it
An MDP in which every policy induces a reducible chain yet the single-loop algorithm still attains Õ(ε^{-2}) samples to ε-optimality, or an MDP satisfying the irreducibility assumption in which the observed sample complexity is substantially worse than Õ(ε^{-2}).
Original abstract
In this paper, we establish last-iterate convergence rates for off-policy actor-critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity guarantee for finding an $\epsilon$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded iterates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to prove the first Õ(ε^{-2}) sample complexity guarantee for finding an ε-optimal policy using single-loop, single-timescale off-policy actor-critic methods under minimal assumptions, specifically the existence of a policy that induces an irreducible Markov chain. The analysis relies on a coupled Lyapunov drift framework that establishes geometric convergence for the actor and Õ(1/T) for the critic, combined via a cross-domination property for a broad class of policy updates including approximate policy iteration and natural policy gradient.
Significance. If the central claim holds, this result would represent a notable advance in reinforcement learning theory by achieving near-optimal sample complexity without requiring nested-loop updates or strong assumptions such as uniform ergodicity or exploration. The introduction of the coupled Lyapunov drift framework with cross-domination could be of independent interest for analyzing other coupled iterative algorithms with unbounded iterates.
Major comments (1)
- The cross-domination property (described in the abstract as the mechanism combining the geometric actor drift and Õ(1/T) critic drift) is load-bearing for the Õ(ε^{-2}) rate. It is not evident that this property holds under only the existence of one irreducible policy, because the off-policy behavior distribution used for sampling may be reducible or have arbitrarily slow mixing on some states, rendering the domination constant infinite or T-dependent and invalidating the combined Lyapunov inequality for unbounded iterates.
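The referee's worry about arbitrarily slow mixing can be made concrete with a toy chain (our illustration, not a claim about the paper's construction): for a two-state chain that leaves each state with probability ε, the second eigenvalue is 1 - 2ε, so the time to mix scales like 1/ε, and any constant that depends on it blows up as ε → 0.

```python
# Two-state chain: leave each state w.p. eps, stay w.p. 1 - eps.
# The second eigenvalue is 1 - 2*eps, so the distance to stationarity
# decays like (1 - 2*eps)^t and the mixing time grows like 1/eps.
def steps_to_mix(eps, tol=0.01):
    dist, t = 0.5, 0          # start at TV distance 1/2 from uniform
    while dist > tol:
        dist *= abs(1 - 2 * eps)
        t += 1
    return t

print([steps_to_mix(e) for e in (0.1, 0.01, 0.001)])
```

The step counts grow roughly tenfold as ε shrinks tenfold, which is the kind of T-dependence the referee fears could enter the domination constant.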
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment of the work's significance, and the constructive comment on the cross-domination property. We address the concern directly below.
Point-by-point responses
- Referee: The cross-domination property (described in the abstract as the mechanism combining the geometric actor drift and Õ(1/T) critic drift) is load-bearing for the Õ(ε^{-2}) rate. It is not evident that this property holds under only the existence of one irreducible policy, because the off-policy behavior distribution used for sampling may be reducible or have arbitrarily slow mixing on some states, rendering the domination constant infinite or T-dependent and invalidating the combined Lyapunov inequality for unbounded iterates.
Authors: We appreciate the referee highlighting the need for explicit verification of this load-bearing step. The manuscript establishes the cross-domination property in Lemma 4.5 (Section 4.3) under precisely the stated minimal assumption: the existence of at least one policy that induces an irreducible chain. The proof proceeds by showing that the actor's geometric drift keeps the policy sequence inside a compact set on which the behavior distribution (which has full support on the irreducible chain's transitions) yields a uniform positive lower bound on the visitation probabilities, independent of T. This bound is derived from the irreducibility of the reference policy and does not rely on uniform ergodicity of every intermediate policy or on the behavior distribution itself being irreducible. Consequently, the domination constant remains finite and T-independent, allowing the combined Lyapunov inequality to close for the unbounded iterates. We agree that the current exposition could make the T-independence argument more transparent and will add a short remark plus an expanded proof sketch in the revision.
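The rebuttal's key quantity, a strictly positive lower bound on visitation probabilities, can be illustrated on a toy irreducible chain (our sketch; the rebuttal's Lemma 4.5 itself cannot be verified from this page): power iteration recovers the stationary distribution, and for any finite irreducible chain every stationary probability is positive, so min over states gives a positive constant.

```python
def stationary(M, iters=10_000):
    """Power iteration for the stationary distribution of a finite chain."""
    n = len(M)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[s] * M[s][s2] for s in range(n)) for s2 in range(n)]
    return mu

# Irreducible three-state chain: every stationary probability is positive,
# so min_s mu(s) is a strictly positive visitation lower bound.
M = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]
mu = stationary(M)
print([round(x, 4) for x in mu], round(min(mu), 4))
```

The rebuttal's claim is stronger than this single-chain fact: it asserts the bound holds uniformly over a compact set of policies reached by the algorithm, which is what keeps the domination constant T-independent.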
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper derives the Õ(ε^{-2}) sample complexity via a coupled Lyapunov drift framework: it establishes geometric actor convergence and Õ(1/T) critic convergence, then combines the two through the cross-domination property under the explicit minimal assumption of one irreducible policy. No step in the derivation chain reduces to a fitted parameter, a self-definition, or a self-citation; the rates follow from the stated drift inequalities rather than from a renamed or smuggled-in prior ansatz. The framework is presented as new and of independent interest, with comparisons to external benchmarks rather than load-bearing self-references. The derivation is therefore self-contained relative to the paper's own assumptions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the existence of a policy that induces an irreducible Markov chain.