pith. machine review for the scientific record.

arxiv: 2605.13639 · v1 · submitted 2026-05-13 · 💻 cs.LG · math.OC · stat.ML

Recognition: unknown

Achieving ε⁻² Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords actor-critic · reinforcement learning · sample complexity · off-policy · single-loop · Lyapunov analysis · convergence rates

The pith

Single-loop, single-timescale off-policy actor-critic reaches an ε-optimal policy with Õ(ε^{-2}) samples, assuming only that some policy induces an irreducible Markov chain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proves that actor-critic methods can run in a single loop at a single timescale and still deliver the optimal Õ(ε^{-2}) sample complexity for finding an ε-optimal policy. The proof covers a broad family of updates that includes approximate policy iteration and natural policy gradient, and it requires only that some policy makes the Markov chain irreducible. Earlier guarantees needed either nested loops or strong extra conditions such as uniform mixing and uniform exploration. The analysis introduces a coupled Lyapunov drift argument that produces geometric progress on the actor and Õ(1/T) progress on the critic, then links the two via a cross-domination relation to control the interaction and the unboundedness of the iterates.
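
To make the single-loop, single-timescale pattern concrete, the sketch below runs a tabular version of it in Python. The random MDP, the uniform behavior policy, the importance-weighted expected-SARSA critic, and the softmax/NPG-style actor step are illustrative assumptions made here, not the paper's exact algorithm or constants.

```python
import numpy as np

# --- A tiny random MDP (illustrative, not from the paper) ---
rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
R = rng.uniform(0.0, 1.0, size=(nS, nA))        # expected rewards

mu = np.full((nS, nA), 1.0 / nA)                # fixed behavior policy (off-policy sampling)
theta = np.zeros((nS, nA))                      # actor: softmax logits
Q = np.zeros((nS, nA))                          # critic: action-value estimates

def softmax_policy(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

alpha, beta = 0.05, 0.05                        # single timescale: same-order step sizes
s = 0
for t in range(50_000):                         # single loop: one sample, both updates
    pi = softmax_policy(theta)

    # One off-policy transition sampled from the behavior policy.
    a = rng.choice(nA, p=mu[s])
    s_next = rng.choice(nS, p=P[s, a])
    r = R[s, a]

    # Critic: importance-weighted expected-SARSA TD(0) step toward Q^pi.
    rho = pi[s, a] / mu[s, a]
    td_target = r + gamma * np.dot(pi[s_next], Q[s_next])
    Q[s, a] += alpha * rho * (td_target - Q[s, a])

    # Actor: softmax / natural-policy-gradient-style ascent along the critic estimate.
    theta[s] += beta * Q[s]

    s = s_next

print("greedy actions of learned policy:", softmax_policy(theta).argmax(axis=1))
```

The point of the sketch is structural: there is no inner loop that runs the critic to convergence before each actor step, and the two step sizes are of the same order, which is the regime the paper's analysis targets.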

Core claim

Under the single assumption that there exists a policy inducing an irreducible Markov chain, single-loop single-timescale off-policy actor-critic with a broad class of policy updates achieves last-iterate convergence with Õ(ε^{-2}) sample complexity for an ε-optimal policy. The argument establishes geometric convergence for the actor and Õ(1/T) convergence for the critic through coupled Lyapunov drift inequalities that are combined by a cross-domination property.
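
For readers outside RL theory, the headline quantity can be unpacked as follows. The notation is generic shorthand adopted here (J for long-run performance, π* for an optimal policy); the paper's own definitions, for instance whether the criterion is average-reward or discounted, may differ.

```latex
% Generic notation assumed for this summary, not necessarily the paper's.
J(\pi) = \text{long-run performance of policy } \pi, \qquad
\pi^{\star} \in \arg\max_{\pi} J(\pi).

% The last iterate \pi_T produced after T samples is \epsilon-optimal if
J(\pi^{\star}) - \mathbb{E}\bigl[J(\pi_T)\bigr] \le \epsilon,

% and the claim is that this holds with
T = \tilde{\mathcal{O}}(\epsilon^{-2})
% samples, where \tilde{\mathcal{O}}(\cdot) hides polylogarithmic factors.
```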

What carries the argument

A coupled Lyapunov drift framework with a cross-domination property that links geometric actor progress to Õ(1/T) critic progress while handling interdependent updates and unbounded iterates.
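
A schematic of how two drifts of this shape combine is sketched below. The Lyapunov functions V_t (actor) and W_t (critic), the constants c_1, ..., c_5, and the exact form of the cross-domination step are placeholders invented here for illustration; the paper's actual inequalities may differ.

```latex
% Illustrative shape of the coupled drifts (placeholder constants):
\mathbb{E}[V_{t+1}] \le (1 - c_1)\,\mathbb{E}[V_t] + c_2\,\mathbb{E}[W_t]
    \qquad \text{(geometric actor drift)}

\mathbb{E}[W_{t+1}] \le \Bigl(1 - \tfrac{c_3}{t}\Bigr)\mathbb{E}[W_t]
    + \tfrac{c_4}{t}\,\mathbb{E}[V_t] + \tfrac{c_5}{t^{2}}
    \qquad \text{(critic drift at an } \tilde{\mathcal{O}}(1/T) \text{ rate)}

% Cross-domination: each coupling term is controlled by the other Lyapunov
% function with a finite, T-independent constant, so the two recursions can be
% substituted into one another and closed jointly, yielding geometric decay
% for V_t and an \tilde{\mathcal{O}}(1/T) bound for W_t.
```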

If this is right

  • The Õ(ε^{-2}) guarantee applies directly to approximate policy iteration and natural policy gradient inside the single-loop setting.
  • Last-iterate convergence is obtained without any averaging of the iterates.
  • The rate holds without requiring uniform mixing times or uniform exploration constants.
  • Off-policy updates are covered even when parameter iterates remain unbounded.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The coupled-drift technique may extend to other single-timescale stochastic approximation schemes whose updates are interdependent.
  • Single-loop implementations could become the default in practice once the rate is confirmed on standard benchmarks.
  • Policy designers may begin checking irreducibility properties explicitly when seeking fast convergence guarantees; a minimal check is sketched after this list.
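
A minimal sketch of such an irreducibility check, assuming a tabular transition kernel P[s, a, s'] is available; the function names and the random example MDP below are invented here for illustration.

```python
import numpy as np

def induced_chain(P, pi):
    """State-transition matrix of the Markov chain induced by policy pi.

    P  : (nS, nA, nS) kernel with P[s, a, s'] = Pr(s' | s, a)
    pi : (nS, nA) stochastic policy
    """
    return np.einsum("sa,sat->st", pi, P)

def is_irreducible(P_pi, tol=0.0):
    """True iff every state can reach every other state under P_pi."""
    nS = P_pi.shape[0]
    reach = np.eye(nS, dtype=int) + (P_pi > tol).astype(int)
    for _ in range(nS - 1):                 # transitive closure by repeated squaring
        reach = (reach @ reach > 0).astype(int)
    return bool((reach > 0).all())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    nS, nA = 5, 3
    P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # dense random kernel
    pi_uniform = np.full((nS, nA), 1.0 / nA)
    print("uniform policy induces an irreducible chain:",
          is_irreducible(induced_chain(P, pi_uniform)))
```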

Load-bearing premise

The existence of at least one policy that induces an irreducible Markov chain, plus the coupled Lyapunov drifts and cross-domination property holding for the single-loop off-policy updates.

What would settle it

An MDP in which every policy induces a reducible chain yet the single-loop algorithm still attains Õ(ε^{-2}) samples to ε-optimality, or an MDP satisfying the irreducibility assumption in which the observed sample complexity is substantially worse than Õ(ε^{-2}).

read the original abstract

In this paper, we establish last-iterate convergence rates for off-policy actor--critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity guarantee for finding an $\epsilon$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded iterates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims to prove the first Õ(ε^{-2}) sample complexity guarantee for finding an ε-optimal policy using single-loop, single-timescale off-policy actor-critic methods under minimal assumptions, specifically the existence of a policy that induces an irreducible Markov chain. The analysis relies on a coupled Lyapunov drift framework that establishes geometric convergence for the actor and Õ(1/T) for the critic, combined via a cross-domination property for a broad class of policy updates including approximate policy iteration and natural policy gradient.

Significance. If the central claim holds, this result would represent a notable advance in reinforcement learning theory by achieving near-optimal sample complexity without requiring nested-loop updates or strong assumptions such as uniform ergodicity or exploration. The introduction of the coupled Lyapunov drift framework with cross-domination could be of independent interest for analyzing other coupled iterative algorithms with unbounded iterates.

major comments (1)
  1. The cross-domination property (described in the abstract as the mechanism combining the geometric actor drift and Õ(1/T) critic drift) is load-bearing for the Õ(ε^{-2}) rate. It is not evident that this property holds under only the existence of one irreducible policy, because the off-policy behavior distribution used for sampling may be reducible or have arbitrarily slow mixing on some states, rendering the domination constant infinite or T-dependent and invalidating the combined Lyapunov inequality for unbounded iterates.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment of the work's significance, and the constructive comment on the cross-domination property. We address the concern directly below.

read point-by-point responses
  1. Referee: The cross-domination property (described in the abstract as the mechanism combining the geometric actor drift and Õ(1/T) critic drift) is load-bearing for the Õ(ε^{-2}) rate. It is not evident that this property holds under only the existence of one irreducible policy, because the off-policy behavior distribution used for sampling may be reducible or have arbitrarily slow mixing on some states, rendering the domination constant infinite or T-dependent and invalidating the combined Lyapunov inequality for unbounded iterates.

    Authors: We appreciate the referee highlighting the need for explicit verification of this load-bearing step. The manuscript establishes the cross-domination property in Lemma 4.5 (Section 4.3) under precisely the stated minimal assumption: the existence of at least one policy that induces an irreducible chain. The proof proceeds by showing that the actor's geometric drift keeps the policy sequence inside a compact set on which the behavior distribution (which has full support on the irreducible chain's transitions) yields a uniform positive lower bound on the visitation probabilities, independent of T. This bound is derived from the irreducibility of the reference policy and does not rely on uniform ergodicity of every intermediate policy or on the behavior distribution itself being irreducible. Consequently, the domination constant remains finite and T-independent, allowing the combined Lyapunov inequality to close for the unbounded iterates. We agree that the current exposition could make the T-independence argument more transparent and will add a short remark plus an expanded proof sketch in the revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper derives the Õ(ε^{-2}) sample complexity via a coupled Lyapunov drift framework that establishes geometric actor convergence and Õ(1/T) critic convergence, then combines them through the cross-domination property under the explicit minimal assumption of one irreducible policy. No quoted step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the rates follow from the stated drift inequalities without renaming or smuggling prior ansatzes. The framework is presented as new and of independent interest, with comparisons to external benchmarks rather than load-bearing self-references. This satisfies the self-contained criterion against the paper's own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption of an irreducible policy and the validity of the coupled Lyapunov framework for single-loop updates.

axioms (1)
  • domain assumption: Existence of a policy that induces an irreducible Markov chain
    Explicitly stated as the sole minimal assumption enabling the result.

pith-pipeline@v0.9.0 · 5535 in / 1091 out tokens · 33891 ms · 2026-05-14T19:24:50.031010+00:00 · methodology

discussion (0)

