Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
Pith reviewed 2026-05-16 08:10 UTC · model grok-4.3
The pith
Single-timescale actor-critic reaches optimal O(ε^{-2}) sample complexity for ε-optimal policies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying STORM to critic updates and uniformly sampling from a buffer of recent samples, the single-timescale actor-critic algorithm achieves an ε-optimal global policy with sample complexity O(ε^{-2}) in finite state-action space discounted MDPs.
What carries the argument
STORM (STOchastic Recursive Momentum) combined with uniform sampling from a small buffer of recent samples to control variance in non-stationary settings.
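A minimal sketch of how those two mechanisms might compose for a tabular critic, assuming standard STORM and a fixed-size buffer; the function names, the momentum schedule `nu`, and the buffer size are illustrative assumptions, not the paper's notation or its Algorithm 1.
```python
import random
from collections import deque

import numpy as np

# Hedged sketch (not the paper's Algorithm 1): a tabular Q critic updated with a
# STORM-style recursive-momentum estimate, where each update draws one transition
# uniformly from a short buffer of recent samples.

def td_semi_gradient(Q, transition, policy, gamma):
    """Semi-gradient of the TD error for one transition (s, a, r, s_next)."""
    s, a, r, s_next = transition
    target = r + gamma * np.dot(policy[s_next], Q[s_next])  # expected next-state value
    g = np.zeros_like(Q)
    g[s, a] = target - Q[s, a]  # nonzero only at the visited state-action pair
    return g

def storm_critic_step(Q, Q_prev, d_prev, buffer, policy, gamma, nu, lr):
    """One STORM-style critic update from a uniformly sampled buffered transition."""
    xi = random.choice(buffer)                    # uniform draw from recent samples
    g_now = td_semi_gradient(Q, xi, policy, gamma)
    g_prev = td_semi_gradient(Q_prev, xi, policy, gamma)
    d = g_now + (1.0 - nu) * (d_prev - g_prev)    # recursive momentum (STORM) estimate
    return Q + lr * d, d

# Only a small window of recent transitions is retained, so the sampling
# distribution tracks the occupancy measure of the current, slowly moving policy.
buffer = deque(maxlen=256)
```
Evaluating the same sample ξ at both the current and previous critic is what lets the momentum term cancel variance rather than add bias, which is the property the buffer mechanism is meant to preserve under a drifting sampling distribution.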
Load-bearing premise
That uniformly sampling from a small buffer of recent samples sufficiently controls the variance from the non-stationary occupancy measure generated by the evolving policy.
What would settle it
Running the algorithm on a simple MDP and measuring whether the number of samples needed to reach ε-optimality scales as 1/ε^2 rather than 1/ε^3 or worse.
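A hedged sketch of that check: estimate, for several accuracy targets ε, how many samples a run needs before the policy is ε-optimal, then read off the slope of log N against log(1/ε). The `samples_to_eps_optimal` stub and the target grid are placeholders, not the paper's experiments.
```python
import numpy as np

def samples_to_eps_optimal(eps: float, seed: int) -> int:
    """Placeholder: run the actor-critic on a small MDP with known optimal value
    until J(pi*) - J(pi_k) <= eps, and return the number of samples consumed."""
    raise NotImplementedError("plug in the actor-critic run here")

def fitted_exponent(targets, seeds=range(5)):
    """Slope of log(samples) vs. log(1/eps); ~2 supports O(eps^-2), ~3 the prior rate."""
    log_inv_eps, log_n = [], []
    for eps in targets:
        n = np.mean([samples_to_eps_optimal(eps, s) for s in seeds])
        log_inv_eps.append(np.log(1.0 / eps))
        log_n.append(np.log(n))
    slope, _intercept = np.polyfit(log_inv_eps, log_n, 1)
    return slope

# Example: fitted_exponent([0.2, 0.1, 0.05, 0.025])
```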
Original abstract
We establish an optimal sample complexity of $O(\epsilon^{-2})$ for obtaining an $\epsilon$-optimal global policy using a single-timescale actor-critic (AC) algorithm in infinite-horizon discounted Markov decision processes (MDPs) with finite state-action spaces, improving upon the prior state of the art of $O(\epsilon^{-3})$. Our approach applies STORM (STOchastic Recursive Momentum) to reduce variance in the critic updates. However, because samples are drawn from a nonstationary occupancy measure induced by the evolving policy, variance reduction via STORM alone is insufficient. To address this challenge, we maintain a buffer of small fraction of recent samples and uniformly sample from it for each critic update. Importantly, these mechanisms are compatible with existing deep learning architectures and require only minor modifications, without compromising practical applicability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to establish an optimal sample complexity of O(ε^{-2}) for finding an ε-optimal global policy in infinite-horizon discounted MDPs with finite state-action spaces, using a single-timescale actor-critic algorithm that applies STORM for variance reduction in the critic and maintains a buffer of a small fixed fraction of recent samples from which critic updates are uniformly drawn. This improves on the prior O(ε^{-3}) state of the art while remaining compatible with deep architectures.
Significance. If the central analysis holds, the result would be significant: it would close the gap to the known lower bound for policy optimization and demonstrate that single-timescale methods can achieve optimal rates once non-stationarity is controlled via the proposed buffer mechanism. The explicit construction of a practical, architecture-compatible algorithm with reproducible variance-reduction steps is a strength.
major comments (2)
- [§4.3] §4.3 (Buffer sampling analysis): the proof that uniform sampling from a fixed-fraction buffer (β independent of ε) suffices to bound the occupancy-measure shift induced by policy updates must be verified; if the bias term in the STORM variance reduction (cf. the decomposition after Eq. (12)) retains a factor that scales with the number of actor steps, the total sample complexity reverts to O(ε^{-3}).
- [Theorem 1] Theorem 1 (main sample-complexity statement): the claimed O(ε^{-2}) rate is load-bearing on the assumption that the critic error is driven below ε with only O(ε^{-2}) total samples; the non-stationary sampling argument must explicitly cancel all cross terms without introducing an extra 1/ε factor from distribution drift.
minor comments (2)
- [§3.2] Notation for the buffer fraction β is introduced without an explicit dependence statement; add a sentence clarifying that β is chosen independently of ε.
- [Algorithm 1] Algorithm 1 (pseudocode) omits the precise rule for buffer eviction; this should be stated explicitly to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for acknowledging the potential significance of achieving optimal O(ε^{-2}) sample complexity with a practical single-timescale actor-critic method. We address the two major comments below with clarifications on the buffer analysis and non-stationary bounds. We believe the claims hold and will expand the relevant proof sections for greater transparency in the revision.
Point-by-point responses
-
Referee: [§4.3] §4.3 (Buffer sampling analysis): the proof that uniform sampling from a fixed-fraction buffer (β independent of ε) suffices to bound the occupancy-measure shift induced by policy updates must be verified; if the bias term in the STORM variance reduction (cf. the decomposition after Eq. (12)) retains a factor that scales with the number of actor steps, the total sample complexity reverts to O(ε^{-3}).
Authors: In §4.3 we show that a fixed β (independent of ε) keeps the total-variation distance between the buffer empirical measure and the current occupancy measure at most O(β + η·window), where the actor step-size η = O(ε) and the window length is O(1/β). Because policy drift per actor step is controlled by η, the occupancy shift remains O(β) uniformly over the O(ε^{-2}) actor steps. In the STORM decomposition after Eq. (12), the bias term is multiplied by the momentum factor (1 - Θ(ε)); the resulting telescoping sum therefore contributes only O(ε) to the critic error after O(ε^{-2}) steps and does not re-introduce an extra 1/ε factor. We will insert an auxiliary lemma that explicitly bounds the drift term to make this verification immediate. revision: partial
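In symbols, the accounting claimed in this response reads roughly as follows; the per-step bias order is an assumption of this sketch, chosen to match the stated step sizes, not a quote from the proof.
```latex
% Buffer drift: empirical buffer measure vs. current occupancy measure.
% With actor step size eta = O(eps) and a window of length O(1/beta):
\[
  \mathrm{TV}\!\left(\hat{\mu}_{\mathrm{buffer}},\, d^{\pi_k}\right)
  \;\le\; O(\beta) + O\!\left(\eta \cdot \tfrac{1}{\beta}\right)
  \;=\; O(\beta), \qquad \eta = O(\varepsilon),
\]
% uniformly over the K = O(eps^{-2}) actor steps. The STORM bias telescopes through
% the momentum factor 1 - Theta(eps): assuming a per-step bias of order eps^2,
\[
  \sum_{k=1}^{K} \bigl(1 - \Theta(\varepsilon)\bigr)^{K-k}\, O(\varepsilon^{2})
  \;\le\; \frac{O(\varepsilon^{2})}{\Theta(\varepsilon)} \;=\; O(\varepsilon),
\]
% so no extra 1/eps factor accumulates over the run.
```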
-
Referee: [Theorem 1] Theorem 1 (main sample-complexity statement): the claimed O(ε^{-2}) rate is load-bearing on the assumption that the critic error is driven below ε with only O(ε^{-2}) total samples; the non-stationary sampling argument must explicitly cancel all cross terms without introducing an extra 1/ε factor from distribution drift.
Authors: The proof of Theorem 1 proceeds by decomposing the critic error into a variance-reduced STORM term plus a distribution-drift term. The buffer limits drift to O(β) per update; because β is constant, this contributes only a constant factor to the variance bound. Cross terms that arise when the sampling distribution changes with the policy are canceled by (i) the contraction property of the discounted Bellman operator and (ii) the slow policy update (η = O(ε)), which makes the value-function difference between consecutive policies O(ε). Summing the resulting O(ε) per-step error over O(ε^{-2}) steps yields the desired O(ε) critic accuracy without an extra 1/ε factor. The full expansion appears in Appendix B; we will add a short paragraph in the main text summarizing the cancellation argument. revision: partial
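The recursion carrying this decomposition appears in the paper's Appendix B (Lemma B.7); it is reconstructed here from the excerpt available on this page, with the trailing terms truncated in the source.
```latex
% STORM error recursion under Algorithm 1 (Lemma B.7; trailing terms truncated in the source):
\[
  w_{k+1}^{2} \;\le\; \bigl(1 - \nu_k + 16\, S A\, \beta_k^{2}\bigr)\, w_k^{2}
  \;+\; 2\,\nu_k^{2}\,\sigma^{2} \;+\; c_{B}^{2}\,\eta_k^{4}\, z^{2}\,\cdots
\]
% w_k: STORM error; nu_k: momentum parameter (1 - nu_k = 1 - Theta(eps) per the
% response above); beta_k: buffer fraction; eta_k = O(eps): actor step size.
% Unrolling this over K = O(eps^{-2}) steps is what keeps the accumulated critic
% error at O(eps), the cancellation the referee asks to see made explicit.
```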
Circularity Check
No significant circularity in the sample complexity derivation
full rationale
The paper derives the O(ε^{-2}) bound via analysis of STORM variance reduction plus fixed-fraction buffer sampling to handle non-stationary occupancy measures in the single-timescale AC updates. No quoted equations or self-citations reduce the final rate to a fitted parameter, self-defined quantity, or prior result by construction. The bound follows from standard MDP finite-space assumptions and the algorithm's update rules without the central claim collapsing to an input. This is the expected non-circular outcome for a theorem-based complexity analysis.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: finite state and action spaces in infinite-horizon discounted MDPs
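For reference, the standard objects this assumption refers to (textbook definitions, not quoted from the paper):
```latex
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
  |\mathcal{S}| < \infty,\; |\mathcal{A}| < \infty,\; \gamma \in [0, 1),
\]
\[
  J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right],
  \qquad
  \pi \text{ is } \varepsilon\text{-optimal if } J(\pi^{*}) - J(\pi) \le \varepsilon.
\]
```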
Forward citations
Cited by 1 Pith paper
- Achieving $\epsilon^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions: single-loop actor-critic achieves the first Õ(ε^{-2}) sample complexity for ε-optimal policies under minimal irreducibility assumptions.
Reference graph
Works this paper leans on
- [1]
- [2] Priyank Agrawal and Shipra Agrawal. "Optimistic Q-learning for average reward and episodic reinforcement learning". arXiv preprint arXiv:2407.13743 (2024).
- [3] Priyank Agrawal, Shipra Agrawal, and Azmat Azati. "Q-learning with Posterior Sampling". arXiv preprint arXiv:2506.00917 (2025).
- [4] Shun-ichi Amari. "Natural Gradient Works Efficiently in Learning". Neural Computation 10 (1998), pp. 251–276.
- [5] Yossi Arjevani et al. "Lower bounds for non-convex stochastic optimization". Mathematical Programming 199.1 (2023), pp. 165–214.
- [6] Peter Auer, Thomas Jaksch, and Ronald Ortner. "Near-optimal Regret Bounds for Reinforcement Learning". Advances in Neural Information Processing Systems 21. Curran Associates, Inc., 2008.
- [7] Peter Auer and Ronald Ortner. "Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning". Advances in Neural Information Processing Systems 19. MIT Press, 2006.
- [8] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. "Minimax Regret Bounds for Reinforcement Learning". Proceedings of the 34th International Conference on Machine Learning, PMLR 70 (2017), pp. 263–272.
- [9] J. Andrew Bagnell and Jeff Schneider. "Covariant policy search". Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI '03), 2003, pp. 1019–1024.
- [10] Jonathan Baxter and Peter L. Bartlett. "Infinite-horizon policy-gradient estimation". Journal of Artificial Intelligence Research 15 (2001), pp. 319–350.
- [11] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Vol. II. 3rd edition. Athena Scientific, 2007.
- [12] Jalaj Bhandari and Daniel Russo. "Global optimality guarantees for policy gradient methods". Operations Research (2024).
- [13] Shalabh Bhatnagar, Vivek S. Borkar, and Soumyajit Guin. "Actor-Critic or Critic-Actor? A Tale of Two Time Scales". IEEE Control Systems Letters 7 (2023), pp. 2671–2676.
- [14] Shalabh Bhatnagar, Richard S. Sutton, et al. "Natural actor-critic algorithms". Automatica 45.11 (2009), pp. 2471–2482.
- [15] Vivek S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency, 2022.
- [16] D. Di Castro and R. Meir. "A Convergent Online Single Time Scale Actor Critic Algorithm". arXiv:0909.2934 (2009).
- [17] Tianyi Chen, Yuejiao Sun, and Wotao Yin. "Closing the Gap: Tighter Analysis of Alternating Stochastic Gradient Methods for Bilevel Problems". Advances in Neural Information Processing Systems 34 (2021), pp. 25294–25307.
- [18]
- [19] Xuyang Chen and Lin Zhao. "Finite-time analysis of single-timescale actor-critic". Advances in Neural Information Processing Systems 36 (2024).
- [20] Ashok Cutkosky. "Lecture Notes for EC525: Optimization for Machine Learning" (2022).
- [21] Ashok Cutkosky and Francesco Orabona. "Momentum-Based Variance Reduction in Non-Convex SGD". Advances in Neural Information Processing Systems 32 (2019).
- [22]
- [23]
- [24] Mudit Gaur et al. "Closing the Gap: Achieving Global Convergence (Last Iterate) of Actor-Critic under Markovian Sampling with Neural Network Parametrization". Proceedings of the 41st International Conference on Machine Learning, PMLR 235 (2024).
- [25] Saeed Ghadimi and Guanghui Lan. "Stochastic first- and zeroth-order methods for nonconvex stochastic programming". SIAM Journal on Optimization 23.4 (2013), pp. 2341–2368.
- [26]
- [27] Chi Jin et al. "Is Q-learning provably efficient?". Advances in Neural Information Processing Systems 31 (2018).
- [28] Sham Kakade. "A Natural Policy Gradient". Advances in Neural Information Processing Systems 14 (2001), pp. 1531–1538.
- [29] Diederik P. Kingma. "Adam: A Method for Stochastic Optimization". arXiv preprint arXiv:1412.6980 (2014).
- [30] Vijay R. Konda and John N. Tsitsiklis. "Actor-Critic Algorithms". Neural Information Processing Systems, 1999.
- [31]
- [32]
- [33] Navdeep Kumar, Priyank Agrawal, Kfir Yehuda Levy, et al. "Policy Gradient with Tree Search (PGTS) in Reinforcement Learning Evades Local Maxima". The Second Tiny Papers Track at ICLR 2024.
- [34] Navdeep Kumar, Priyank Agrawal, Giorgia Ramponi, et al. "On the Convergence of Single-Timescale Actor-Critic". The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [35] P. Langley. "Crafting Papers on Machine Learning". Proceedings of the 17th International Conference on Machine Learning (ICML 2000). Morgan Kaufmann, 2000, pp. 1207–1216.
- [36]
- [37]
- [38] Ilya Loshchilov and Frank Hutter. "Decoupled Weight Decay Regularization". arXiv preprint arXiv:1711.05101 (2017).
- [39]
- [40] Volodymyr Mnih et al. "Human-level control through deep reinforcement learning". Nature 518.7540 (2015), pp. 529–533.
- [41]
- [42] OpenAI et al. "Solving Rubik's Cube with a Robot Hand". arXiv:1910.07113 (2019).
- [43] Prashansa Panda and Shalabh Bhatnagar. "Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation". Proceedings of the AAAI Conference on Artificial Intelligence 39.19 (2025), pp. 19813–19820.
- [44] Jan Peters and Stefan Schaal. "Natural Actor-Critic". Neurocomputing 71 (2008), pp. 1180–1190.
- [45] Boris T. Polyak. "Some methods of speeding up the convergence of iteration methods". USSR Computational Mathematics and Mathematical Physics 4.5 (1964), pp. 1–17.
- [46] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics, 1994.
- [47] Julian Schrittwieser et al. "Mastering Atari, Go, chess and shogi by planning with a learned model". Nature 588.7839 (2020), pp. 604–609.
- [48] John Schulman, Xi Chen, and Pieter Abbeel. "Equivalence Between Policy Gradients and Soft Q-Learning". arXiv:1704.06440 (2017).
- [49] John Schulman, Sergey Levine, et al. "Trust region policy optimization". International Conference on Machine Learning, PMLR (2015), pp. 1889–1897.
- [50] Andrei Semenov, Matteo Pagliardini, and Martin Jaggi. "Benchmarking optimizers for large language model pretraining". arXiv preprint arXiv:2509.01440 (2025).
- [51] David Silver et al. "Mastering the game of Go without human knowledge". Nature 550.7676 (2017), pp. 354–359.
- [52] Richard S. Sutton et al. "Policy gradient methods for reinforcement learning with function approximation". Advances in Neural Information Processing Systems (1999), pp. 1057–1063.
- [53] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition. The MIT Press, 2018.
- [54]
- [55] Qiuhao Wang, Chin Pang Ho, and Marek Petrik. "On the Convergence of Policy Gradient in Robust MDPs". arXiv:2212.10439 (2022).
- [56] Ronald J. Williams. "Simple statistical gradient-following algorithms for connectionist reinforcement learning". Machine Learning 8 (1992), pp. 229–256.
- [57]
- [58] Lin Xiao. "On the Convergence Rates of Policy Gradient Methods". arXiv:2201.07443 (2022).
- [59] Pan Xu, Felicia Gao, and Quanquan Gu. "An improved convergence analysis of stochastic variance-reduced policy gradient". Uncertainty in Artificial Intelligence, PMLR (2020), pp. 541–551.
- [60]
- [61] Zhuoran Yang et al. "On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost". arXiv:1907.06246 (2019).
- [62] Huizhuo Yuan et al. "Mars: Unleashing the power of variance reduction for training large models". arXiv:2411.10438 (2024).
- [63] Rui Yuan, Robert M. Gower, and Alessandro Lazaric. "A general sample complexity analysis of vanilla policy gradient". arXiv:2107.11433 (2022).
- [64] Kaiqing Zhang et al. "Global Convergence of Policy Gradient Methods to (Almost) Locally Optimal Policies". arXiv:1906.08383 [math.OC].
- [65] (garbled entry; the recoverable text is the paper's appendix roadmap: Appendix A: Related Work; Appendix B: Deriving Recursions, with B.1 Actor Recursion, B.2 Critic Recursion, B.3 STORM Recursion, B.4 On Buffer, B.5 On Sufficient Exploration Assumption)
- [66] (garbled entry; the recoverable text is the paper's Lemma B.7, the STORM recursion: w_{k+1}^2 ≤ (1 − ν_k + 16SAβ_k^2) w_k^2 + 2ν_k^2 σ^2 + c_B^2 η_k^4 z^2 …, built on the update h_{k+1} = U_{k+1}(ξ_{k+1}) + (1 − ν_k)[h_k − U_k(ξ_{k+1})], where U_m(ξ_k) = [R(s,a) + γ Σ_{a′} π_m(a′|s′) Q_m(s′,a′) − Q_m(s,a)] 1{s = s_k, a = a_k} and V_k = E_{ξ∼B_k}[U_m(ξ)] = b_k ⊙ (T^{π_m} Q_m − Q_m) for all m)
- [67] (garbled entry; the recoverable text is the paper's Assumption B.12, Sufficient Exploration (Kumar, Agrawal, Ramponi, et al. 2025): there exists λ > 0 such that ⟨Q^π − Q, D^π(I − γP^π)(Q^π − Q)⟩ ≥ λ ∥Q^π − Q∥_2^2, where P^π((s′,a′),(s,a)) = P(s′|s,a) π(a′|s′) and D^π((s′,a′),(s,a)) = 1{(s′,a′) = (s,a)} d^π(s) π(a|s))