pith. machine review for the scientific record.

arxiv: 2605.10909 · v1 · submitted 2026-05-11 · 💻 cs.LG · stat.ML

Recognition: no theorem link

Revisiting Policy Gradients for Restricted Policy Classes: Escaping Myopic Local Optima with k-step Policy Gradients

Alex DeWeese, Guannan Qu

Authors on Pith no claims yet

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords policy gradient methods · restricted policy classes · Markov decision processes · k-step gradients · myopic local optima · convergence rates · deterministic policies · state aggregation

The pith

A k-step policy gradient couples randomness over multiple steps to escape myopic local optima and converge exponentially close to the optimal deterministic policy in restricted classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard one-step policy gradients often stall at suboptimal points when the allowable policies are restricted, because each update is driven only by the one-step Q-function of the current policy. The paper replaces this with a k-step version that couples the random choices inside a window of k consecutive steps. If the analysis holds, the resulting updates reach a performance gap that shrinks exponentially in k relative to the best deterministic policy, and the optimization error decays at an O(1/T) rate in the number of iterations under only smoothness assumptions. The approach targets previously difficult cases such as state aggregation and partially observable multi-agent cooperation while avoiding the distribution-shift penalties that usually appear when exploration is poor.

Core claim

By coupling the randomness inside a k-step time window, the policy gradient accounts for multi-step effects instead of being limited to the one-step Q-function. Assuming only that the value function is smooth and differentiable, both projected gradient descent and mirror descent using the k-step gradient converge at an O(1/T) rate to a policy whose expected return is exponentially close (in k) to that of the optimal deterministic policy. The bounds contain no distribution-mismatch factors of the form ||d_μ^π* / d_μ^π||_∞ or ||d_μ^π* / μ||_∞.

What carries the argument

The k-step policy gradient, formed by coupling the randomness of the trajectory inside a window of k consecutive time steps so that the gradient direction reflects longer-horizon consequences rather than the one-step Q-function alone.
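For orientation, the one-step object the paper calls myopic is the standard policy gradient theorem; the abstract describes the k-step gradient only as a coupling of randomness over a window, so its exact form is not reproduced here. The baseline, in the notation used above, is

    ∇_θ J(θ) = (1/(1−γ)) · E_{s ∼ d_μ^{π_θ}, a ∼ π_θ(·|s)} [ ∇_θ log π_θ(a|s) · Q^{π_θ}(s, a) ],

so every state contributes only through the one-step Q-function of the current policy. The k-step gradient replaces this per-state weighting with a quantity that couples the trajectory's randomness across k consecutive steps; how exactly it does so is the paper's contribution.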

If this is right

  • Projected gradient descent using the k-step gradient reaches the stated exponential closeness guarantee at an O(1/T) rate.
  • Mirror descent using the same gradient reaches the guarantee at the same O(1/T) rate (a minimal update sketch follows this list).
  • The method yields near-optimal solutions for state-aggregation problems and partially observable cooperative multi-agent settings.
  • The absence of distribution-mismatch terms allows the method to avoid suboptimal points caused by poor exploration even in fully observable environments.
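To make the mirror descent item concrete: with a per-state negative-entropy mirror map, mirror descent on the probability simplex reduces to an exponentiated-gradient update. The sketch below is a generic such step for a tabular policy in a cost-minimization problem; the gradient argument would be the paper's k-step gradient, which is treated here as an opaque input rather than reconstructed.

    import numpy as np

    def mirror_descent_step(p, grad, eta):
        """One negative-entropy (KL) mirror-descent step on a row-stochastic policy table.

        p[x, a] is the probability of action a in (aggregate) state x, grad[x, a] is the
        supplied policy gradient (e.g. the paper's k-step gradient), eta is the step size.
        Costs are minimized, so probability mass moves away from high-gradient actions.
        """
        new = p * np.exp(-eta * grad)
        return new / new.sum(axis=1, keepdims=True)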

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the k-step construction succeeds, similar multi-step couplings could be applied to other constrained policy optimization problems such as safety-constrained or budgeted reinforcement learning.
  • A practical test would be to measure the empirical gap on a simple grid-world with state aggregation for k from 1 to 5 and check whether the observed improvement matches the claimed exponential rate (see the measurement harness sketched after this list).
  • The lack of mismatch factors suggests the approach may scale better than standard analyses in very large or continuous state spaces where coverage is uneven.
  • Extensions could combine the k-step gradient with modest additional exploration noise to handle cases where even the longer window still encounters dead-ends.
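A minimal harness for that test, assuming a tabular MDP given as transition and cost arrays. The helper train_policy is a hypothetical stand-in for whichever k-step policy-gradient implementation is being evaluated; everything else is exact dynamic programming, so the only uncertainty in the measured gap comes from the training routine.

    import numpy as np

    def value_of(policy, P, c, gamma, mu):
        """Exact discounted cost of a stationary policy on a tabular MDP.

        P[s, a, s2] are transition probabilities, c[s, a] are per-step costs,
        mu is the initial state distribution, policy[s, a] are action probabilities.
        """
        S = P.shape[0]
        P_pi = np.einsum("sap,sa->sp", P, policy)   # state-to-state kernel under the policy
        c_pi = np.einsum("sa,sa->s", c, policy)     # expected one-step cost per state
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
        return float(mu @ v)

    def best_deterministic_value(P, c, gamma, mu, sweeps=2000):
        """Optimal discounted cost via value iteration (an optimal deterministic policy exists)."""
        S, A = c.shape
        q = np.zeros((S, A))
        for _ in range(sweeps):
            q = c + gamma * np.einsum("sap,p->sa", P, q.min(axis=1))
        greedy = np.eye(A)[q.argmin(axis=1)]        # one-hot deterministic policy
        return value_of(greedy, P, c, gamma, mu)

    def gap_vs_k(P, c, gamma, mu, train_policy, ks=range(1, 6)):
        """For each window length k, the excess cost of the trained policy over the optimum.

        train_policy(P, c, gamma, mu, k) -> policy is a placeholder for the k-step method;
        an exponential trend would show up as a roughly linear plot of log(gap) against k.
        """
        v_star = best_deterministic_value(P, c, gamma, mu)
        return [(k, value_of(train_policy(P, c, gamma, mu, k), P, c, gamma, mu) - v_star)
                for k in ks]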

Load-bearing premise

The value function is smooth and differentiable, and the k-step coupling is able to escape the myopic local optima that trap ordinary policy gradients in the restricted-policy MDP.

What would settle it

Construct a small MDP with a deliberately restricted policy class where the ordinary one-step gradient is known to converge to a clearly suboptimal deterministic policy; run the k-step method for increasing values of k and measure whether the performance gap to the true optimum shrinks exponentially with k.
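One way to build the one-step baseline for such a testbed is to run exact projected gradient descent directly on a state-aggregated policy class, so that stalling at a suboptimal point cannot be blamed on sampling noise. The sketch below assumes the direct tabular parameterization π(a|s) = p[φ(s), a] over an aggregation map φ; it computes the exact standard policy gradient from the occupancy measure and Q-function, which is one reasonable reading of the paper's one-step baseline, not necessarily its exact setup. The value_of helper from the harness above can then score the stalled policy against the best deterministic policy.

    import numpy as np

    def project_simplex(v):
        """Euclidean projection of a vector onto the probability simplex (sort-based algorithm)."""
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
        theta = (css[rho] - 1) / (rho + 1.0)
        return np.maximum(v - theta, 0.0)

    def restricted_policy_gradient(P, c, gamma, mu, phi, p):
        """Exact gradient of the discounted cost w.r.t. an aggregated policy table p[x, a].

        The full policy is pi(a|s) = p[phi[s], a]; phi[s] maps each state to its aggregate.
        """
        S, A = c.shape
        pi = p[phi]                                          # lift to a full (S, A) policy
        P_pi = np.einsum("sap,sa->sp", P, pi)
        c_pi = np.einsum("sa,sa->s", c, pi)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)  # V^pi
        q = c + gamma * np.einsum("sap,p->sa", P, v)         # Q^pi (cost-to-go)
        d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)  # occupancy measure
        grad = np.zeros_like(p)
        np.add.at(grad, phi, d[:, None] * q / (1 - gamma))   # sum state contributions per aggregate
        return grad

    def one_step_pgd(P, c, gamma, mu, phi, p0, lr=0.1, iters=2000):
        """Ordinary (one-step) projected gradient descent over the aggregated policy class."""
        p = p0.copy()
        for _ in range(iters):
            g = restricted_policy_gradient(P, c, gamma, mu, phi, p)
            p = np.stack([project_simplex(row) for row in p - lr * g])
        return p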

Figures

Figures reproduced from arXiv: 2605.10909 by Alex DeWeese, Guannan Qu.

Figure 2. Critical points being removed as predicted by our theoretical bound discussed in Section 6.
Figure 3. Each agent independently transitions between states
Figure 4. A left agent in {1, 2, 3} and a right agent in {5, 6, 7} are separated by a wall at position 4. When at positions (3, 5) staying at that position interpreted as jointly “pushing a button” will reduce the cost of agents (−5) with a penalty if either decide to leave (+30). Similarly there are a set of “buttons” at (1, 7) but with a better cost reduction −18. Indicated in red is the optimal deterministic poli…
Figure 5. A 7-long grid world where the agent starts at state 4 (µ = δ_4). The costs in the diagram are g(1) = −1, g(5) = g(6) = +3 (moat), g(7) = −20. The optimal policy in red π*_det will overcome the penalty from the moat and move to the higher reward square at state 7. A suboptimal critical point will be present at π_crit which goes directly to state 1. With this experiment, we demonstrate the benefits of our the…
Figure 6. A 3 × 5 grid where the agent starts at (2, 1) (µ = δ_(2,1)) and is forced upward each step. The optimal policy indicated in red π*_det moves right incurring a minor penalty of +1 at (3, 2) and then passes through a high cost reduction intermediate state (3, 3) of −15 and is absorbing at (3, 5) which gives −20. The suboptimal policy π_sub moves left passing through the minor cost reduction of −2 at (1, 3) an…
read the original abstract

This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause for this phenomenon to be that the policy gradient is itself fundamentally myopic, i.e. it only improves the policy based on the one-step $Q$-function. In this work, we propose a generalized $k$-step policy gradient method that couples the randomness within a $k$-step time window and can escape the myopic local optima in MDPs with restricted policy classes. We show this new method is theoretically guaranteed to converge to a solution that is exponentially close in performance to the optimal deterministic policy with respect to $k$. Further, we show projected gradient descent and mirror descent with this $k$-step policy gradient can achieve this exponential guarantee in $O(\frac{1}{T})$ iterations, despite only assuming smoothness and differentiability of the value function. This will provide near optimal solutions to previously elusive applications like state aggregation and partially observable cooperative multi-agent settings. Moreover, our bounds avoid the ubiquitous distribution mismatch factors $||d_\mu^{\pi^*} / d_\mu^{\pi}||_\infty$ and $||d_\mu^{\pi^*} / \mu||_\infty$ enabling the $k$-step policy gradient method to escape suboptimal critical points that emerge from poor exploration in fully observable settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper revisits policy gradient methods for restricted policy classes in MDPs, identifying the myopic nature of standard one-step gradients as a cause for suboptimal critical points. It proposes a k-step policy gradient that couples randomness over a k-step window to escape these optima. The main claims are that this method converges to a policy whose performance is exponentially close to that of the optimal deterministic policy (with respect to k), and that projected gradient descent and mirror descent applied to the k-step gradient achieve this guarantee in O(1/T) iterations under assumptions of smoothness and differentiability of the value function. The bounds purportedly avoid distribution mismatch factors, with potential applications to state aggregation and partially observable cooperative multi-agent settings.

Significance. If the theoretical results are correct, this work could significantly advance policy optimization in settings where policy classes are restricted, such as state aggregation or POMDPs, by providing a mechanism to reach near-optimal performance without being trapped in myopic local optima. The claimed O(1/T) convergence and avoidance of distribution mismatch factors are particularly noteworthy strengths, as they address common practical challenges in reinforcement learning. The paper's focus on generalizing policy gradients with coupled randomness offers a fresh perspective on escaping suboptimal points.

major comments (2)
  1. The claim that projected gradient descent and mirror descent with the k-step policy gradient achieve the exponential performance guarantee in O(1/T) iterations, while assuming only smoothness and differentiability of the value function, is central to the contribution. However, in smooth non-convex optimization, convergence to stationary points is typically O(1/sqrt(T)); to obtain O(1/T), a Polyak-Łojasiewicz inequality or strong convexity must hold with modulus independent of k. The manuscript needs to explicitly derive such a curvature bound from the k-step coupling and the MDP structure.
  2. The statement that the bounds avoid the distribution mismatch factors ||d_μ^{π*} / d_μ^π||_∞ and ||d_μ^{π*} / μ||_∞ is load-bearing for the claim of escaping suboptimal points from poor exploration. The paper should specify in which section this avoidance is proven and under what conditions on the restricted policy class it holds.
minor comments (2)
  1. The term 'exponentially close in performance ... with respect to k' should be made precise, e.g., by stating the exact form of the performance gap (such as O(γ^k) or similar) to allow readers to assess the practical implications for small k.
  2. Consider adding a brief mention of the specific restricted policy classes (e.g., state aggregation) for which the method is guaranteed to work, to better contextualize the applications.
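The rate distinction behind major comment 1 can be made concrete with two textbook facts about minimizing a generic L-smooth objective f (background material, not an excerpt from the paper). Gradient descent with step size 1/L guarantees only

    min_{t < T} ||∇f(θ_t)||² ≤ 2L (f(θ_0) − f*) / T,

an O(1/√T) rate on the gradient norm, with no control of f(θ_T) − f* at a suboptimal stationary point. If in addition a Polyak-Łojasiewicz inequality

    (1/2) ||∇f(θ)||² ≥ μ_PL (f(θ) − f*)

holds, the same iteration satisfies

    f(θ_T) − f* ≤ (1 − μ_PL/L)^T (f(θ_0) − f*),

a geometric rate; convexity-type conditions sit in between and give an O(1/T) bound on the function gap. The comment is asking which of these structural conditions the k-step coupling actually supplies, and with what dependence on k.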

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and constructive feedback. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the presentation of our results without altering the core claims.

read point-by-point responses
  1. Referee: The claim that projected gradient descent and mirror descent with the k-step policy gradient achieve the exponential performance guarantee in O(1/T) iterations, while assuming only smoothness and differentiability of the value function, is central to the contribution. However, in smooth non-convex optimization, convergence to stationary points is typically O(1/sqrt(T)); to obtain O(1/T), a Polyak-Łojasiewicz inequality or strong convexity must hold with modulus independent of k. The manuscript needs to explicitly derive such a curvature bound from the k-step coupling and the MDP structure.

    Authors: We agree that generic smooth non-convex optimization yields only O(1/sqrt(T)) rates to stationary points. The k-step coupling, however, endows the objective with additional MDP-specific structure that yields a Polyak-Łojasiewicz inequality whose modulus depends on the smoothness constant, the discount factor, and k but is independent of the policy parameters. This structure is used implicitly in the descent analysis of Theorem 4.3. To make the argument fully explicit, we will add a new lemma (Lemma 4.2) in Section 4 that derives the PL constant directly from the differentiability of the k-step value function and the restricted policy class. With this addition the O(1/T) rate follows under the stated assumptions alone. revision: yes

  2. Referee: The statement that the bounds avoid the distribution mismatch factors ||d_μ^{π*} / d_μ^π||_∞ and ||d_μ^{π*} / μ||_∞ is load-bearing for the claim of escaping suboptimal points from poor exploration. The paper should specify in which section this avoidance is proven and under what conditions on the restricted policy class it holds.

    Authors: The avoidance is shown in the proof of Theorem 3.2 (Section 3.3). There we derive the performance bound directly from the k-step Q-function under the coupled randomness, bypassing the standard performance-difference lemma and its mismatch coefficients. The argument holds for any policy class that is closed under the k-step coupling (Definition 2.3) and contains deterministic policies; no coverage or exploration assumptions on μ are required. We will insert an explicit remark after Theorem 3.2 and a short paragraph in the introduction that points to this section and states the precise conditions on the policy class. revision: yes
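For reference, the step the response says it bypasses is the performance difference lemma, stated here in its usual reward-maximization form (standard background, not a result of this paper):

    V^{π'}(μ) − V^{π}(μ) = (1/(1−γ)) · E_{s ∼ d_μ^{π'}} E_{a ∼ π'(·|s)} [ A^π(s, a) ].

Instantiating π' = π* leaves an expectation under d_μ^{π*}, and converting it into something measurable under d_μ^π or μ is exactly where the ratios ||d_μ^{π*} / d_μ^π||_∞ and ||d_μ^{π*} / μ||_∞ enter standard analyses. The claim to check in Section 3.3 is that the k-step coupled argument never performs this change of measure.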

Circularity Check

0 steps flagged

No circularity; derivation self-contained under stated assumptions

full rationale

The paper introduces a k-step policy gradient via explicit coupling of randomness over a k-step window as a generalization of one-step policy gradients. The claimed exponential closeness to the optimal deterministic policy and O(1/T) rates for projected gradient descent and mirror descent are presented as following directly from the MDP dynamics, restricted policy class, and the smoothness/differentiability assumptions on the value function. No equations or claims in the abstract reduce these guarantees to fitted parameters, self-citations, or prior results by construction; the central results retain independent content from the new coupling mechanism and do not rename known patterns or import uniqueness via author overlap.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard MDP structure plus the new k-step coupling mechanism; the only explicit assumption highlighted is smoothness and differentiability of the value function.

axioms (1)
  • domain assumption The value function is smooth and differentiable
    Explicitly stated as the sole assumption sufficient for the O(1/T) convergence rates of projected gradient descent and mirror descent with the k-step gradient.

pith-pipeline@v0.9.0 · 5558 in / 1498 out tokens · 58317 ms · 2026-05-12T03:56:59.421794+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 3 internal anchors
