Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
Pith reviewed 2026-05-16 08:10 UTC · model grok-4.3
The pith
Single-timescale actor-critic reaches optimal O(ε^{-2}) sample complexity for ε-optimal policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying STORM to critic updates and uniformly sampling from a buffer of recent samples, the single-timescale actor-critic algorithm achieves an ε-optimal global policy with sample complexity O(ε^{-2}) in finite state-action space discounted MDPs.
What carries the argument
STORM (STOchastic Recursive Momentum) combined with uniform sampling from a small buffer of recent samples to control variance in non-stationary settings.
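A minimal sketch of how those two mechanisms might compose for a tabular critic, assuming standard STORM and a fixed-size buffer; the function names, the momentum schedule `nu`, and the buffer size are illustrative assumptions, not the paper's notation or its Algorithm 1.
```python
import random
from collections import deque

import numpy as np

# Hedged sketch (not the paper's Algorithm 1): a tabular Q critic updated with a
# STORM-style recursive-momentum estimate, where each update draws one transition
# uniformly from a short buffer of recent samples.

def td_semi_gradient(Q, transition, policy, gamma):
    """Semi-gradient of the TD error for one transition (s, a, r, s_next)."""
    s, a, r, s_next = transition
    target = r + gamma * np.dot(policy[s_next], Q[s_next])  # expected next-state value
    g = np.zeros_like(Q)
    g[s, a] = target - Q[s, a]  # nonzero only at the visited state-action pair
    return g

def storm_critic_step(Q, Q_prev, d_prev, buffer, policy, gamma, nu, lr):
    """One STORM-style critic update from a uniformly sampled buffered transition."""
    xi = random.choice(buffer)                    # uniform draw from recent samples
    g_now = td_semi_gradient(Q, xi, policy, gamma)
    g_prev = td_semi_gradient(Q_prev, xi, policy, gamma)
    d = g_now + (1.0 - nu) * (d_prev - g_prev)    # recursive momentum (STORM) estimate
    return Q + lr * d, d

# Only a small window of recent transitions is retained, so the sampling
# distribution tracks the occupancy measure of the current, slowly moving policy.
buffer = deque(maxlen=256)
```
Evaluating the same sample ξ at both the current and previous critic is what lets the momentum term cancel variance rather than add bias, which is the property the buffer mechanism is meant to preserve under a drifting sampling distribution.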
Load-bearing premise
That uniformly sampling from a small buffer of recent samples sufficiently controls the variance from the non-stationary occupancy measure generated by the evolving policy.
What would settle it
Running the algorithm on a simple MDP and measuring whether the number of samples needed to reach ε-optimality scales as 1/ε^2 rather than 1/ε^3 or worse.
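A hedged sketch of that check: estimate, for several accuracy targets ε, how many samples a run needs before the policy is ε-optimal, then read off the slope of log N against log(1/ε). The `samples_to_eps_optimal` stub and the target grid are placeholders, not the paper's experiments.
```python
import numpy as np

def samples_to_eps_optimal(eps: float, seed: int) -> int:
    """Placeholder: run the actor-critic on a small MDP with known optimal value
    until J(pi*) - J(pi_k) <= eps, and return the number of samples consumed."""
    raise NotImplementedError("plug in the actor-critic run here")

def fitted_exponent(targets, seeds=range(5)):
    """Slope of log(samples) vs. log(1/eps); ~2 supports O(eps^-2), ~3 the prior rate."""
    log_inv_eps, log_n = [], []
    for eps in targets:
        n = np.mean([samples_to_eps_optimal(eps, s) for s in seeds])
        log_inv_eps.append(np.log(1.0 / eps))
        log_n.append(np.log(n))
    slope, _intercept = np.polyfit(log_inv_eps, log_n, 1)
    return slope

# Example: fitted_exponent([0.2, 0.1, 0.05, 0.025])
```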
Original abstract
We establish an optimal sample complexity of $O(\epsilon^{-2})$ for obtaining an $\epsilon$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(\epsilon^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer of small fraction of recent samples and uniformly sample from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to establish an optimal sample complexity of O(ε^{-2}) for finding an ε-optimal global policy in infinite-horizon discounted MDPs with finite state-action spaces, using a single-timescale actor-critic algorithm that applies STORM for variance reduction in the critic and maintains a buffer of a small fixed fraction of recent samples from which critic updates are uniformly drawn. This improves on the prior O(ε^{-3}) state of the art while remaining compatible with deep architectures.
Significance. If the central analysis holds, the result would be significant: it would close the gap to the known lower bound for policy optimization and demonstrate that single-timescale methods can achieve optimal rates once non-stationarity is controlled via the proposed buffer mechanism. The explicit construction of a practical, architecture-compatible algorithm with reproducible variance-reduction steps is a strength.
major comments (2)
- [§4.3] §4.3 (Buffer sampling analysis): the proof that uniform sampling from a fixed-fraction buffer (β independent of ε) suffices to bound the occupancy-measure shift induced by policy updates must be verified; if the bias term in the STORM variance reduction (cf. the decomposition after Eq. (12)) retains a factor that scales with the number of actor steps, the total sample complexity reverts to O(ε^{-3}).
- [Theorem 1] Theorem 1 (main sample-complexity statement): the claimed O(ε^{-2}) rate is load-bearing on the assumption that the critic error is driven below ε with only O(ε^{-2}) total samples; the non-stationary sampling argument must explicitly cancel all cross terms without introducing an extra 1/ε factor from distribution drift.
minor comments (2)
- [§3.2] Notation for the buffer fraction β is introduced without an explicit dependence statement; add a sentence clarifying that β is chosen independently of ε.
- [Algorithm 1] Algorithm 1 (pseudocode) omits the precise rule for buffer eviction; this should be stated explicitly to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for acknowledging the potential significance of achieving optimal O(ε^{-2}) sample complexity with a practical single-timescale actor-critic method. We address the two major comments below with clarifications on the buffer analysis and non-stationary bounds. We believe the claims hold and will expand the relevant proof sections for greater transparency in the revision.
Point-by-point responses
-
Referee: [§4.3] §4.3 (Buffer sampling analysis): the proof that uniform sampling from a fixed-fraction buffer (β independent of ε) suffices to bound the occupancy-measure shift induced by policy updates must be verified; if the bias term in the STORM variance reduction (cf. the decomposition after Eq. (12)) retains a factor that scales with the number of actor steps, the total sample complexity reverts to O(ε^{-3}).
Authors: In §4.3 we show that a fixed β (independent of ε) keeps the total-variation distance between the buffer empirical measure and the current occupancy measure at most O(β + η·window), where the actor step-size η = O(ε) and the window length is O(1/β). Because policy drift per actor step is controlled by η, the occupancy shift remains O(β) uniformly over the O(ε^{-2}) actor steps. In the STORM decomposition after Eq. (12), the bias term is multiplied by the momentum factor (1 - Θ(ε)); the resulting telescoping sum therefore contributes only O(ε) to the critic error after O(ε^{-2}) steps and does not re-introduce an extra 1/ε factor. We will insert an auxiliary lemma that explicitly bounds the drift term to make this verification immediate. revision: partial
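In symbols, the accounting claimed in this response reads roughly as follows; the per-step bias order is an assumption of this sketch, chosen to match the stated step sizes, not a quote from the proof.
```latex
% Buffer drift: empirical buffer measure vs. current occupancy measure.
% With actor step size eta = O(eps) and a window of length O(1/beta):
\[
  \mathrm{TV}\!\left(\hat{\mu}_{\mathrm{buffer}},\, d^{\pi_k}\right)
  \;\le\; O(\beta) + O\!\left(\eta \cdot \tfrac{1}{\beta}\right)
  \;=\; O(\beta), \qquad \eta = O(\varepsilon),
\]
% uniformly over the K = O(eps^{-2}) actor steps. The STORM bias telescopes through
% the momentum factor 1 - Theta(eps): assuming a per-step bias of order eps^2,
\[
  \sum_{k=1}^{K} \bigl(1 - \Theta(\varepsilon)\bigr)^{K-k}\, O(\varepsilon^{2})
  \;\le\; \frac{O(\varepsilon^{2})}{\Theta(\varepsilon)} \;=\; O(\varepsilon),
\]
% so no extra 1/eps factor accumulates over the run.
```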
-
Referee: [Theorem 1] Theorem 1 (main sample-complexity statement): the claimed O(ε^{-2}) rate is load-bearing on the assumption that the critic error is driven below ε with only O(ε^{-2}) total samples; the non-stationary sampling argument must explicitly cancel all cross terms without introducing an extra 1/ε factor from distribution drift.
Authors: The proof of Theorem 1 proceeds by decomposing the critic error into a variance-reduced STORM term plus a distribution-drift term. The buffer limits drift to O(β) per update; because β is constant, this contributes only a constant factor to the variance bound. Cross terms that arise when the sampling distribution changes with the policy are canceled by (i) the contraction property of the discounted Bellman operator and (ii) the slow policy update (η = O(ε)), which makes the value-function difference between consecutive policies O(ε). Summing the resulting O(ε) per-step error over O(ε^{-2}) steps yields the desired O(ε) critic accuracy without an extra 1/ε factor. The full expansion appears in Appendix B; we will add a short paragraph in the main text summarizing the cancellation argument. revision: partial
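The recursion carrying this decomposition appears in the paper's Appendix B (Lemma B.7); it is reconstructed here from the excerpt available on this page, with the trailing terms truncated in the source.
```latex
% STORM error recursion under Algorithm 1 (Lemma B.7; trailing terms truncated in the source):
\[
  w_{k+1}^{2} \;\le\; \bigl(1 - \nu_k + 16\, S A\, \beta_k^{2}\bigr)\, w_k^{2}
  \;+\; 2\,\nu_k^{2}\,\sigma^{2} \;+\; c_{B}^{2}\,\eta_k^{4}\, z^{2}\,\cdots
\]
% w_k: STORM error; nu_k: momentum parameter (1 - nu_k = 1 - Theta(eps) per the
% response above); beta_k: buffer fraction; eta_k = O(eps): actor step size.
% Unrolling this over K = O(eps^{-2}) steps is what keeps the accumulated critic
% error at O(eps), the cancellation the referee asks to see made explicit.
```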
Circularity Check
No significant circularity in the sample complexity derivation
full rationale
The paper derives the O(ε^{-2}) bound via analysis of STORM variance reduction plus fixed-fraction buffer sampling to handle non-stationary occupancy measures in the single-timescale AC updates. No quoted equations or self-citations reduce the final rate to a fitted parameter, self-defined quantity, or prior result by construction. The bound follows from standard MDP finite-space assumptions and the algorithm's update rules without the central claim collapsing to an input. This is the expected non-circular outcome for a theorem-based complexity analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: finite state and action spaces in infinite-horizon discounted MDPs
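For reference, the standard objects this assumption refers to (textbook definitions, not quoted from the paper):
```latex
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
  |\mathcal{S}| < \infty,\; |\mathcal{A}| < \infty,\; \gamma \in [0, 1),
\]
\[
  J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right],
  \qquad
  \pi \text{ is } \varepsilon\text{-optimal if } J(\pi^{*}) - J(\pi) \le \varepsilon.
\]
```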
Forward citations
Cited by 1 Pith paper
- Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions: single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.
Reference graph
Works this paper leans on
- [1]
- [2] Priyank Agrawal and Shipra Agrawal. "Optimistic Q-learning for average reward and episodic reinforcement learning". arXiv preprint arXiv:2407.13743 (2024).
- [3] Priyank Agrawal, Shipra Agrawal, and Azmat Azati. "Q-learning with Posterior Sampling". arXiv preprint arXiv:2506.00917 (2025).
- [4] Shun-ichi Amari. "Natural Gradient Works Efficiently in Learning". Neural Computation 10 (1998), pp. 251–276.
- [5] Yossi Arjevani et al. "Lower bounds for non-convex stochastic optimization". Mathematical Programming 199.1 (2023), pp. 165–214.
- [6] Peter Auer, Thomas Jaksch, and Ronald Ortner. "Near-optimal Regret Bounds for Reinforcement Learning". Advances in Neural Information Processing Systems 21. Curran Associates, Inc., 2008.
- [7] Peter Auer and Ronald Ortner. "Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning". Advances in Neural Information Processing Systems 19. MIT Press, 2006.
- [8] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. "Minimax Regret Bounds for Reinforcement Learning". Proceedings of the 34th International Conference on Machine Learning, PMLR 70 (2017), pp. 263–272.
- [9] J. Andrew Bagnell and Jeff Schneider. "Covariant policy search". Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI '03), 2003, pp. 1019–1024.
- [10] Jonathan Baxter and Peter L. Bartlett. "Infinite-horizon policy-gradient estimation". Journal of Artificial Intelligence Research 15 (2001), pp. 319–350.
- [11] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. 3rd edition. Athena Scientific, 2007.
- [12] Jalaj Bhandari and Daniel Russo. "Global optimality guarantees for policy gradient methods". Operations Research (2024).
- [13] Shalabh Bhatnagar, Vivek S. Borkar, and Soumyajit Guin. "Actor-Critic or Critic-Actor? A Tale of Two Time Scales". IEEE Control Systems Letters 7 (2023), pp. 2671–2676.
- [14] Shalabh Bhatnagar, Richard S. Sutton, et al. "Natural actor-critic algorithms". Automatica 45.11 (2009), pp. 2471–2482.
- [15] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, 2022.
- [16] D. Di Castro and R. Meir. "A Convergent Online Single Time Scale Actor Critic Algorithm". arXiv:0909.2934 (2009).
- [17] Tianyi Chen, Yuejiao Sun, and Wotao Yin. "Closing the Gap: Tighter Analysis of Alternating Stochastic Gradient Methods for Bilevel Problems". Advances in Neural Information Processing Systems 34 (2021), pp. 25294–25307.
- [18]
- [19] Xuyang Chen and Lin Zhao. "Finite-time analysis of single-timescale actor-critic". Advances in Neural Information Processing Systems 36 (2024).
- [20] Ashok Cutkosky. "Lecture Notes for EC525: Optimization for Machine Learning" (2022).
- [21] Ashok Cutkosky and Francesco Orabona. "Momentum-Based Variance Reduction in Non-Convex SGD". Advances in Neural Information Processing Systems 32 (2019).
- [22]
- [23]
- [24] Mudit Gaur et al. "Closing the Gap: Achieving Global Convergence (Last Iterate) of Actor-Critic under Markovian Sampling with Neural Network Parametrization". Proceedings of the 41st International Conference on Machine Learning, PMLR 235 (2024).
- [25] Saeed Ghadimi and Guanghui Lan. "Stochastic first- and zeroth-order methods for nonconvex stochastic programming". SIAM Journal on Optimization 23.4 (2013), pp. 2341–2368.
- [26]
- [27] Chi Jin et al. "Is Q-learning provably efficient?". Advances in Neural Information Processing Systems 31 (2018).
- [28] Sham Kakade. "A Natural Policy Gradient". Advances in Neural Information Processing Systems 14 (2001), pp. 1531–1538.
- [29] Diederik P. Kingma. "Adam: A Method for Stochastic Optimization". arXiv preprint arXiv:1412.6980 (2014).
- [30] Vijay R. Konda and John N. Tsitsiklis. "Actor-Critic Algorithms". Neural Information Processing Systems, 1999.
- [31]
- [32]
- [33] Navdeep Kumar, Priyank Agrawal, Kfir Yehuda Levy, et al. "Policy Gradient with Tree Search (PGTS) in Reinforcement Learning Evades Local Maxima". The Second Tiny Papers Track at ICLR 2024.
- [34] Navdeep Kumar, Priyank Agrawal, Giorgia Ramponi, et al. "On the Convergence of Single-Timescale Actor-Critic". The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [35] P. Langley. "Crafting Papers on Machine Learning". Proceedings of the 17th International Conference on Machine Learning (ICML 2000). Morgan Kaufmann, 2000, pp. 1207–1216.
- [36]
- [37]
- [38] Ilya Loshchilov and Frank Hutter. "Decoupled Weight Decay Regularization". arXiv preprint arXiv:1711.05101 (2017).
- [39]
- [40] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". Nature 518.7540 (2015), pp. 529–533.
- [41]
- [42] OpenAI et al. "Solving Rubik's Cube with a Robot Hand". arXiv:1910.07113 (2019).
- [43] Prashansa Panda and Shalabh Bhatnagar. "Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation". Proceedings of the AAAI Conference on Artificial Intelligence 39.19 (2025), pp. 19813–19820.
- [44] Jan Peters and Stefan Schaal. "Natural Actor-Critic". Neurocomputing 71 (2008), pp. 1180–1190.
- [45] Boris T. Polyak. "Some methods of speeding up the convergence of iteration methods". USSR Computational Mathematics and Mathematical Physics 4.5 (1964), pp. 1–17.
- [46] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics, 1994.
- [47] Julian Schrittwieser et al. "Mastering Atari, Go, chess and shogi by planning with a learned model". Nature 588.7839 (2020), pp. 604–609.
- [48] John Schulman, Xi Chen, and Pieter Abbeel. "Equivalence Between Policy Gradients and Soft Q-Learning". arXiv:1704.06440 (2017).
- [49] John Schulman, Sergey Levine, et al. "Trust region policy optimization". International Conference on Machine Learning, PMLR (2015), pp. 1889–1897.
- [50] Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. "Benchmarking optimizers for large language model pretraining". arXiv preprint arXiv:2509.01440 (2025).
- [51] David Silver et al. "Mastering the game of Go without human knowledge". Nature 550.7676 (2017), pp. 354–359.
- [52] Richard S. Sutton et al. "Policy gradient methods for reinforcement learning with function approximation". Advances in Neural Information Processing Systems (1999), pp. 1057–1063.
- [53] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018.
- [54]
- [55] Qiuhao Wang, Chin Pang Ho, and Marek Petrik. "On the Convergence of Policy Gradient in Robust MDPs". arXiv:2212.10439 (2022).
- [56] Ronald J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning 8 (1992), pp. 229–256.
- [57]
- [58] Lin Xiao. "On the Convergence Rates of Policy Gradient Methods". arXiv:2201.07443 (2022).
- [59] Pan Xu, Felicia Gao, and Quanquan Gu. "An improved convergence analysis of stochastic variance-reduced policy gradient". Uncertainty in Artificial Intelligence, PMLR (2020), pp. 541–551.
- [60]
- [61] Zhuoran Yang et al. "On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost". arXiv:1907.06246 (2019).
- [62] Huizhuo Yuan et al. "Mars: Unleashing the power of variance reduction for training large models". arXiv:2411.10438 (2024).
- [63] Rui Yuan, Robert M. Gower, and Alessandro Lazaric. "A general sample complexity analysis of vanilla policy gradient". arXiv:2107.11433 (2022).
- [64] Kaiqing Zhang et al. "Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies". arXiv:1906.08383 [math.OC].
- [65] (garbled entry; the recoverable text is the paper's appendix roadmap: Appendix A: Related Work; Appendix B: Deriving Recursions, with B.1 Actor Recursion, B.2 Critic Recursion, B.3 STORM Recursion, B.4 On Buffer, B.5 On Sufficient Exploration Assumption)
- [66] (garbled entry; the recoverable text is the paper's Lemma B.7, the STORM recursion: w_{k+1}^2 ≤ (1 − ν_k + 16SAβ_k^2) w_k^2 + 2ν_k^2 σ^2 + c_B^2 η_k^4 z^2 …, built on the update h_{k+1} = U_{k+1}(ξ_{k+1}) + (1 − ν_k)[h_k − U_k(ξ_{k+1})], where U_m(ξ_k) = [R(s,a) + γ Σ_{a′} π_m(a′|s′) Q_m(s′,a′) − Q_m(s,a)] 1{s = s_k, a = a_k} and V_k = E_{ξ∼B_k}[U_m(ξ)] = b_k ⊙ (T^{π_m} Q_m − Q_m) for all m)
- [67] (garbled entry; the recoverable text is the paper's Assumption B.12, Sufficient Exploration (Kumar, Agrawal, Ramponi, et al. 2025): there exists λ > 0 such that ⟨Q^π − Q, D^π(I − γP^π)(Q^π − Q)⟩ ≥ λ ∥Q^π − Q∥_2^2, where P^π((s′,a′),(s,a)) = P(s′|s,a) π(a′|s′) and D^π((s′,a′),(s,a)) = 1{(s′,a′) = (s,a)} d^π(s) π(a|s))