Achieving ε⁻² Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions
Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3
The pith
Single-loop off-policy actor-critic reaches an ε-optimal policy with Õ(ε^{-2}) samples, assuming only that some policy induces an irreducible Markov chain.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the single assumption that there exists a policy inducing an irreducible Markov chain, single-loop, single-timescale off-policy actor-critic with a broad class of policy updates achieves last-iterate convergence with Õ(ε^{-2}) sample complexity for an ε-optimal policy. The argument establishes geometric convergence for the actor and Õ(1/T) convergence for the critic via coupled Lyapunov drift inequalities, which are combined through a cross-domination property.
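To fix a concrete picture of the single-loop, single-timescale template the claim concerns, here is a minimal sketch on a toy two-state MDP: one critic step and one actor step per iteration, with fixed step sizes of the same order. All names are ours, and expected (full-sweep) updates are used for determinism; the paper's algorithm is sampled and off-policy, so this is an illustration of the update structure, not the analyzed method.

```python
import math

# Toy two-state, two-action MDP: action a moves to state a; reward 1 for a=1.
# Single-loop, single-timescale: one critic step and one actor step per
# iteration, with fixed step sizes of the same order.
GAMMA, ALPHA, BETA, T = 0.9, 0.1, 0.5, 2000

def reward(s, a):
    return 1.0 if a == 1 else 0.0

def next_state(s, a):
    return a

theta = [[0.0, 0.0], [0.0, 0.0]]   # actor: softmax logits theta[s][a]
Q = [[0.0, 0.0], [0.0, 0.0]]       # critic: action values Q[s][a]

def policy(s):
    m = max(theta[s])                       # subtract max for stability
    e = [math.exp(t - m) for t in theta[s]]
    z = sum(e)
    return [x / z for x in e]

for _ in range(T):
    # critic: one expected TD(0) step toward r + gamma * V(s') under pi
    for s in (0, 1):
        for a in (0, 1):
            sp = next_state(s, a)
            v_sp = sum(p * q for p, q in zip(policy(sp), Q[sp]))
            Q[s][a] += BETA * (reward(s, a) + GAMMA * v_sp - Q[s][a])
    # actor: one softmax policy-gradient-style step using the current critic
    for s in (0, 1):
        for a in (0, 1):
            theta[s][a] += ALPHA * Q[s][a]

print([round(policy(s)[1], 3) for s in (0, 1)], round(Q[0][1], 2))
```

With these step sizes the last iterate (no averaging) ends up near-greedy on the better action and the critic approaches the value 1/(1-γ) = 10, illustrating the last-iterate behavior the claim is about.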
What carries the argument
Coupled Lyapunov drift framework with cross-domination property that links geometric actor progress to Õ(1/T) critic progress while handling interdependent updates and unbounded iterates.
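Schematically, the mechanism couples two drift inequalities of roughly the following form (notation ours, not the paper's; a toy rendering of the structure described in the abstract):

```latex
% Illustrative coupled Lyapunov drifts (symbols are not the paper's):
%   W_a = actor Lyapunov function, W_c = critic Lyapunov function.
\[
\begin{aligned}
W_a(t+1) &\le (1-\rho)\,W_a(t) + c_1\,W_c(t), \qquad 0<\rho<1,\\
W_c(t+1) &\le \Bigl(1-\tfrac{c_2}{t}\Bigr)W_c(t) + \tfrac{c_3}{t}\,W_a(t) + \tfrac{c_4}{t^2}.
\end{aligned}
\]
```

Cross-domination then says the coupling constants are compatible (in this toy form, roughly $c_1 c_3$ small relative to $\rho\,c_2$), so substituting one inequality into the other still closes: $W_a$ decays geometrically down to an $\tilde{\mathcal{O}}(1/T)$ floor while $W_c = \tilde{\mathcal{O}}(1/T)$.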
If this is right
- The Õ(ε^{-2}) guarantee applies directly to approximate policy iteration and natural policy gradient inside the single-loop setting.
- Last-iterate convergence is obtained without any averaging of the iterates.
- The rate holds without requiring uniform mixing times or uniform exploration constants.
- Off-policy updates are covered even when parameter iterates remain unbounded.
Where Pith is reading between the lines
- The coupled-drift technique may extend to other single-timescale stochastic approximation schemes whose updates are interdependent.
- Single-loop implementations could become the default in practice once the rate is confirmed on standard benchmarks.
- Policy designers may begin checking irreducibility properties explicitly when seeking fast convergence guarantees.
Load-bearing premise
The existence of at least one policy that induces an irreducible Markov chain, plus the coupled Lyapunov drifts and cross-domination property holding for the single-loop off-policy updates.
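For a tabular MDP the premise is directly checkable: a policy's induced chain is irreducible iff the directed graph of its positive-probability transitions is strongly connected. A minimal check (the tabular representation and example MDP are ours, not the paper's):

```python
def induced_chain(P, pi):
    """Transition matrix of the chain induced by policy pi.

    P[s][a][s2] = transition probability; pi[s][a] = action probability.
    """
    n = len(P)
    return [[sum(pi[s][a] * P[s][a][s2] for a in range(len(pi[s])))
             for s2 in range(n)] for s in range(n)]

def is_irreducible(M, tol=1e-12):
    """A finite chain is irreducible iff its positive-transition graph
    is strongly connected: every state reaches every other state."""
    n = len(M)
    def reachable(src):
        seen, stack = {src}, [src]
        while stack:
            s = stack.pop()
            for s2 in range(n):
                if M[s][s2] > tol and s2 not in seen:
                    seen.add(s2)
                    stack.append(s2)
        return seen
    return all(len(reachable(s)) == n for s in range(n))

# Two-state, two-action example: action 0 stays put, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
lazy = [[1.0, 0.0], [1.0, 0.0]]      # always "stay": reducible chain
mixing = [[0.5, 0.5], [0.5, 0.5]]    # mixes actions: irreducible chain
print(is_irreducible(induced_chain(P, lazy)),
      is_irreducible(induced_chain(P, mixing)))
```

Note the assumption asks only for one such policy to exist (here, `mixing`), not for every policy to induce an irreducible chain (`lazy` does not).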
What would settle it
An MDP in which every policy induces a reducible chain yet the single-loop algorithm still attains Õ(ε^{-2}) samples to ε-optimality, or an MDP satisfying the irreducibility assumption in which the observed sample complexity is substantially worse than Õ(ε^{-2}).
Original abstract
In this paper, we establish last-iterate convergence rates for off-policy actor-critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity guarantee for finding an $\epsilon$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded iterates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to prove the first Õ(ε^{-2}) sample complexity guarantee for finding an ε-optimal policy using single-loop, single-timescale off-policy actor-critic methods under minimal assumptions, specifically the existence of a policy that induces an irreducible Markov chain. The analysis relies on a coupled Lyapunov drift framework that establishes geometric convergence for the actor and Õ(1/T) for the critic, combined via a cross-domination property for a broad class of policy updates including approximate policy iteration and natural policy gradient.
Significance. If the central claim holds, this result would represent a notable advance in reinforcement learning theory by achieving near-optimal sample complexity without requiring nested-loop updates or strong assumptions such as uniform ergodicity or exploration. The introduction of the coupled Lyapunov drift framework with cross-domination could be of independent interest for analyzing other coupled iterative algorithms with unbounded iterates.
Major comments (1)
- The cross-domination property (described in the abstract as the mechanism combining the geometric actor drift and Õ(1/T) critic drift) is load-bearing for the Õ(ε^{-2}) rate. It is not evident that this property holds under only the existence of one irreducible policy, because the off-policy behavior distribution used for sampling may be reducible or have arbitrarily slow mixing on some states, rendering the domination constant infinite or T-dependent and invalidating the combined Lyapunov inequality for unbounded iterates.
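The referee's worry about arbitrarily slow mixing can be made concrete with a toy chain (our illustration, not a claim about the paper's construction): for a two-state chain that leaves each state with probability ε, the second eigenvalue is 1 - 2ε, so the time to mix scales like 1/ε, and any constant that depends on it blows up as ε → 0.

```python
# Two-state chain: leave each state w.p. eps, stay w.p. 1 - eps.
# The second eigenvalue is 1 - 2*eps, so the distance to stationarity
# decays like (1 - 2*eps)^t and the mixing time grows like 1/eps.
def steps_to_mix(eps, tol=0.01):
    dist, t = 0.5, 0          # start at TV distance 1/2 from uniform
    while dist > tol:
        dist *= abs(1 - 2 * eps)
        t += 1
    return t

print([steps_to_mix(e) for e in (0.1, 0.01, 0.001)])
```

The step counts grow roughly tenfold as ε shrinks tenfold, which is the kind of T-dependence the referee fears could enter the domination constant.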
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment of the work's significance, and the constructive comment on the cross-domination property. We address the concern directly below.
Point-by-point responses
- Referee: The cross-domination property (described in the abstract as the mechanism combining the geometric actor drift and Õ(1/T) critic drift) is load-bearing for the Õ(ε^{-2}) rate. It is not evident that this property holds under only the existence of one irreducible policy, because the off-policy behavior distribution used for sampling may be reducible or have arbitrarily slow mixing on some states, rendering the domination constant infinite or T-dependent and invalidating the combined Lyapunov inequality for unbounded iterates.
Authors: We appreciate the referee highlighting the need for explicit verification of this load-bearing step. The manuscript establishes the cross-domination property in Lemma 4.5 (Section 4.3) under precisely the stated minimal assumption: the existence of at least one policy that induces an irreducible chain. The proof proceeds by showing that the actor's geometric drift keeps the policy sequence inside a compact set on which the behavior distribution (which has full support on the irreducible chain's transitions) yields a uniform positive lower bound on the visitation probabilities, independent of T. This bound is derived from the irreducibility of the reference policy and does not rely on uniform ergodicity of every intermediate policy or on the behavior distribution itself being irreducible. Consequently, the domination constant remains finite and T-independent, allowing the combined Lyapunov inequality to close for the unbounded iterates. We agree that the current exposition could make the T-independence argument more transparent and will add a short remark plus an expanded proof sketch in the revision.
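The rebuttal's key quantity, a strictly positive lower bound on visitation probabilities, can be illustrated on a toy irreducible chain (our sketch; the rebuttal's Lemma 4.5 itself cannot be verified from this page): power iteration recovers the stationary distribution, and for any finite irreducible chain every stationary probability is positive, so min over states gives a positive constant.

```python
def stationary(M, iters=10_000):
    """Power iteration for the stationary distribution of a finite chain."""
    n = len(M)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[s] * M[s][s2] for s in range(n)) for s2 in range(n)]
    return mu

# Irreducible three-state chain: every stationary probability is positive,
# so min_s mu(s) is a strictly positive visitation lower bound.
M = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.2, 0.0, 0.8]]
mu = stationary(M)
print([round(x, 4) for x in mu], round(min(mu), 4))
```

The rebuttal's claim is stronger than this single-chain fact: it asserts the bound holds uniformly over a compact set of policies reached by the algorithm, which is what keeps the domination constant T-independent.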
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper derives the Õ(ε^{-2}) sample complexity via a coupled Lyapunov drift framework: it establishes geometric actor convergence and Õ(1/T) critic convergence, then combines the two through the cross-domination property under the explicit minimal assumption of one irreducible policy. No step in the derivation chain reduces to a fitted parameter, a self-definition, or a self-citation; the rates follow from the stated drift inequalities rather than from a renamed or smuggled-in prior ansatz. The framework is presented as new and of independent interest, with comparisons to external benchmarks rather than load-bearing self-references. The derivation is therefore self-contained relative to the paper's own assumptions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the existence of a policy that induces an irreducible Markov chain.