pith. machine review for the scientific record.

arxiv: 2605.06377 · v1 · submitted 2026-05-07 · 💻 cs.GT · cs.LG · cs.MA

Recognition: unknown

Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

Maryam Kamgarpour, Philip Jordan

Pith reviewed 2026-05-08 04:16 UTC · model grok-4.3

classification 💻 cs.GT · cs.LG · cs.MA
keywords independent learning · Nash equilibrium · partially observable Markov games · Markov potential games · decentralized multi-agent learning · filter stability · decoupled dynamics

The pith

Agents in partially observable Markov potential games can learn an approximate Nash equilibrium independently using only their own recent observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that when agents interact only through rewards while their underlying states evolve independently, they can still reach a stable joint outcome without any communication or shared data. Each player runs a local learning procedure that depends only on a short window of its own action and observation history. Because the full-information version is a potential game and beliefs forget the distant past (filter stability), the partial-observability setting reduces to a simpler surrogate problem whose sample and computational costs grow only quasi-polynomially, rather than exponentially, in the number of players.
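
As a rough operational picture of what "independent" rules out, consider the sketch below: each agent keeps its own table keyed by a short window of its own actions and observations, and never touches other agents' data. The toy environment, window length, and tabular Q-update are illustrative assumptions for exposition, not the paper's algorithm.

```python
# Illustrative sketch (not the paper's algorithm): independent learners that
# condition only on a finite window of their own (action, observation) pairs.
import random
from collections import defaultdict

N_AGENTS, N_ACTIONS = 3, 2
WINDOW = 2                      # finite history window length w (illustrative)
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

class ToyDecoupledPOMG:
    """Each agent has its own hidden state; transitions depend only on that
    agent's action (decoupled dynamics), while rewards couple the agents."""
    def reset(self):
        self.states = [random.randrange(2) for _ in range(N_AGENTS)]
        return [s ^ (random.random() < 0.1) for s in self.states]   # noisy obs

    def step(self, actions):
        # decoupled transitions: agent i's next state depends only on (s_i, a_i)
        self.states = [(s + a) % 2 for s, a in zip(self.states, actions)]
        obs = [s ^ (random.random() < 0.1) for s in self.states]
        # coupling through rewards only: congestion-style payoff (a potential-
        # game flavour), penalising agents whose action is crowded
        rewards = [1.0 - actions.count(a) / N_AGENTS for a in actions]
        return obs, rewards

def act(q, window):
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: q[(window, a)])

env = ToyDecoupledPOMG()
qs = [defaultdict(float) for _ in range(N_AGENTS)]       # one table per agent
for episode in range(200):
    obs = env.reset()
    windows = [((None, o),) * WINDOW for o in obs]        # pad initial windows
    for t in range(50):
        actions = [act(qs[i], windows[i]) for i in range(N_AGENTS)]
        obs, rewards = env.step(actions)
        for i in range(N_AGENTS):
            new_w = (windows[i] + ((actions[i], obs[i]),))[-WINDOW:]
            best_next = max(qs[i][(new_w, a)] for a in range(N_ACTIONS))
            key = (windows[i], actions[i])
            qs[i][key] += ALPHA * (rewards[i] + GAMMA * best_next - qs[i][key])
            windows[i] = new_w                            # only local data used
```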

Core claim

In a POMG whose dynamics are decoupled across agents and whose fully observed counterpart is a Markov potential game, an independent algorithm lets each player restrict its policy to a finite history window and still jointly converge to an approximate Nash equilibrium. The filter stability assumption guarantees that these finite-history policies are sufficiently close to optimal, so that the original POMG can be replaced by a near-potential surrogate Markov game for which known independent learning methods apply directly.
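
Spelled out in generic notation (not lifted from the paper): the target is the usual ε-Nash condition, with each policy restricted to a length-w window of that player's own actions and observations. Which deviation class the guarantee quantifies over, and how the filter-stability error enters, is exactly where the paper's fine print lives.

```latex
% Generic epsilon-Nash target over finite-window policies (illustrative
% notation, not the paper's): each pi_i maps a window of agent i's own last
% w actions and observations to its next action,
\[
\pi_i : \big(a_{i,t-w}, o_{i,t-w}, \ldots, a_{i,t-1}, o_{i,t-1}\big) \;\mapsto\; a_{i,t},
\]
% and the joint policy pi = (pi_1, ..., pi_n) is an epsilon-Nash equilibrium
% if no player can gain more than epsilon by a unilateral deviation:
\[
V_i\big(\pi_i^{\dagger}, \pi_{-i}\big) \;\le\; V_i\big(\pi_i, \pi_{-i}\big) + \varepsilon
\qquad \text{for all players } i \text{ and all deviations } \pi_i^{\dagger}.
\]
```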

What carries the argument

The surrogate near-potential Markov game formed by restricting each agent's policy to a finite window of its own action-observation history, under the filter stability assumption that makes long-past observations irrelevant.
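
One common way to make "near-potential" precise (the paper's own constant and definition may differ): a single function Φ whose unilateral differences track every player's value differences up to a slack κ, with κ controlled by the finite-window approximation error.

```latex
% Generic near-potential condition (constant kappa illustrative): a single
% function Phi tracks every player's unilateral value changes up to kappa,
\[
\Big| \big( V_i(\pi_i', \pi_{-i}) - V_i(\pi_i, \pi_{-i}) \big)
    - \big( \Phi(\pi_i', \pi_{-i}) - \Phi(\pi_i, \pi_{-i}) \big) \Big|
\;\le\; \kappa
\qquad \text{for all } i,\ \pi_i,\ \pi_i',\ \pi_{-i};
\]
% kappa = 0 recovers an exact (Markov) potential game, and here kappa is
% meant to be controlled by the finite-window approximation error.
```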

If this is right

  • Independent learners achieve quasi-polynomial sample and computational complexity instead of exponential scaling with the number of agents.
  • No information sharing or central coordinator is needed for the convergence guarantee.
  • Any Markov potential game that satisfies the decoupled-dynamics and filter-stability conditions inherits the same independent-learning result.
  • Optimal policies in the POMG can be replaced by finite-history policies while preserving approximate equilibrium properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same finite-history reduction might apply to other decentralized multi-agent problems once a suitable stability condition on beliefs can be verified.
  • Potential-game structure can compensate for severe decentralization of information in ways that general-sum games cannot.
  • Empirical tests in sensor-network or robotics settings with controllable observability would show how quickly filter stability must hold for the method to remain practical.

Load-bearing premise

The belief that an agent maintains over the hidden state becomes insensitive to observations beyond a short recent window.
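
A standard way to write this premise down (the paper's formal definition may differ in the exact metric and conditioning): beliefs computed from two different priors, conditioned on the same recent action-observation window, merge as the window grows. The symbols below (w, δ(w), the TV norm) are generic, not the paper's notation.

```latex
% Generic filter-stability statement (the paper's formal definition may
% differ): posteriors over the hidden state started from two priors mu, nu
% and conditioned on the same length-w action-observation window merge,
\[
\sup_{\mu,\nu}\,
\big\| \mathbb{P}_{\mu}\big(s_t \in \cdot \mid a_{t-w:t-1}, o_{t-w:t-1}\big)
     - \mathbb{P}_{\nu}\big(s_t \in \cdot \mid a_{t-w:t-1}, o_{t-w:t-1}\big)
\big\|_{\mathrm{TV}} \;\le\; \delta(w),
\qquad \delta(w) \to 0 \ \text{as } w \to \infty.
\]
% A policy that discards everything before the window then loses at most a
% value term driven by delta(w).
```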

What would settle it

A concrete POMG instance in which the gap between the value of the best finite-history policy and that of the optimal full-history policy stays bounded away from zero no matter how long the window, so that the surrogate game fails to stay near-potential and independent learners need not converge.
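
Short of constructing such an instance, one cheap probe of the premise is to watch whether Bayes filters started from different priors actually merge on a candidate model. The two-state hidden Markov model below is entirely made up for illustration; only the test pattern matters.

```python
# Illustrative probe of filter stability on a toy 2-state hidden Markov model.
# The transition/observation matrices are assumptions; the point is the test:
# if beliefs from different priors do NOT merge, finite-history policies
# cannot be near-optimal and the surrogate construction loses its footing.
import random

P_TRANS = [[0.9, 0.1],   # P(s' | s): rows index the current state
           [0.2, 0.8]]
P_OBS = [[0.8, 0.2],     # P(o | s): rows index state, columns index observation
         [0.3, 0.7]]

def bayes_update(belief, obs):
    # predict step: push the belief through the transition kernel
    pred = [sum(belief[s] * P_TRANS[s][s2] for s in range(2)) for s2 in range(2)]
    # correct step: reweight by the observation likelihood and normalise
    post = [pred[s2] * P_OBS[s2][obs] for s2 in range(2)]
    z = sum(post)
    return [p / z for p in post]

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

random.seed(0)
state = 0
b1, b2 = [0.99, 0.01], [0.01, 0.99]        # two very different priors
for t in range(30):
    state = 0 if random.random() < P_TRANS[state][0] else 1
    obs = 0 if random.random() < P_OBS[state][0] else 1
    b1, b2 = bayes_update(b1, obs), bayes_update(b2, obs)
    print(f"t={t:2d}  TV gap between the two filters: {tv(b1, b2):.4f}")
# A gap that decays (roughly geometrically) is the behaviour filter stability
# asks for; a gap stuck away from zero is the kind of instance that would
# settle the question negatively.
```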

read the original abstract

We study Nash equilibrium learning in partially observable Markov games (POMGs), a multi-agent reinforcement learning framework in which agents cannot fully observe the underlying state. Prior work in this setting relies on centralization or information sharing, and suffers from sample and computational complexity that scales exponentially in the number of players. We focus on a subclass of POMGs with independent state transitions, where agents remain coupled through their rewards, and assume that the underlying fully observed Markov game is a Markov potential game. For this class, we present an independent learning algorithm in which players, observing only their own actions and observations and without communication, jointly converge to an approximate Nash equilibrium. Due to partial observability, optimal policies may in general depend on the full action-observation history. Under a filter stability assumption, we show that policies based on finite history windows provide sufficient approximation guarantees. This enables us to approximate the POMG by a surrogate Markov game that is near-potential, leading to quasi-polynomial sample and computational complexity for independent Nash equilibrium learning in the underlying POMG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper studies Nash equilibrium learning in partially observable Markov games (POMGs) with decoupled dynamics where the underlying fully observed game is a Markov potential game. It proposes an independent learning algorithm in which agents, each observing only their local actions and observations and without communication, converge to an approximate Nash equilibrium. Under a filter stability assumption, finite-history-window policies are shown to yield sufficient approximation guarantees, allowing the POMG to be reduced to a near-potential surrogate Markov game that admits quasi-polynomial sample and computational complexity.

Significance. If the filter stability assumption admits explicit quantitative bounds that control the deviation from potentiality, the result would constitute a meaningful advance: it supplies the first independent (communication-free) algorithm for approximate NE in this POMG subclass whose complexity scales quasi-polynomially rather than exponentially in the number of players. The reduction via finite-history surrogates and the exploitation of the potential-game structure are technically attractive and could influence subsequent work on decentralized MARL.

major comments (1)
  1. [Abstract] Abstract (and the filter-stability paragraph): the central reduction to a 'near-potential' surrogate Markov game rests on the claim that finite-history policies are sufficiently close to optimal under filter stability. No quantitative bound is supplied on the resulting deviation from exact potentiality, nor are concrete conditions or examples given under which filter stability holds for POMGs with decoupled dynamics. Because this deviation directly determines whether the surrogate remains amenable to the quasi-polynomial independent-learning result, the assumption is load-bearing for the complexity claim.
minor comments (1)
  1. The abstract states that the algorithm achieves 'quasi-polynomial sample and computational complexity' but does not indicate the precise dependence on horizon length, observation-space cardinality, or the filter-stability parameters.
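
For calibration, "quasi-polynomial in a parameter N" conventionally means a bound of the shape below; the referee's request amounts to asking which problem parameters (players, horizon, observation alphabet, window length, filter-stability constants) sit inside that exponent. This is the generic definition, not the paper's theorem.

```latex
% Generic meaning of "quasi-polynomial in N" (not the paper's exact bound):
\[
\mathrm{cost}(N) \;\le\; \exp\!\big( C (\log N)^{c} \big)
\;=\; N^{\,C (\log N)^{c-1}}
\qquad \text{for constants } C \ge 1,\ c \ge 1,
\]
% as opposed to the exp(Theta(N)) blow-up of treating the N players' joint
% action space as a single decision variable.
```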

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for recognizing the potential significance of the result. We address the single major comment below and will revise the manuscript accordingly to strengthen the quantitative aspects of the filter-stability analysis.

read point-by-point responses
  1. Referee: [Abstract] Abstract (and the filter-stability paragraph): the central reduction to a 'near-potential' surrogate Markov game rests on the claim that finite-history policies are sufficiently close to optimal under filter stability. No quantitative bound is supplied on the resulting deviation from exact potentiality, nor are concrete conditions or examples given under which filter stability holds for POMGs with decoupled dynamics. Because this deviation directly determines whether the surrogate remains amenable to the quasi-polynomial independent-learning result, the assumption is load-bearing for the complexity claim.

    Authors: We agree that an explicit quantitative link between filter stability, finite-history approximation error, and the resulting deviation from exact potentiality would make the complexity claim more transparent. In the current manuscript the approximation of finite-window policies is controlled via the filter-stability rate (Definition 2.3 and Lemma 3.1), which yields a sub-optimality gap that decays exponentially in the window length; the surrogate game is then shown to be ε-near-potential with ε depending on this gap. However, the dependence is stated only implicitly. In the revision we will add a new corollary (Corollary 4.3) that explicitly bounds the potential-function deviation by O(δ + (1-γ)^w), where δ is the filter-stability contraction factor, γ the discount, and w the window length. This bound is polynomial in the relevant parameters and therefore preserves the quasi-polynomial sample and computational complexity up to an additional poly(1/ε) factor. We will also include a short appendix example (Appendix C) of a two-player decoupled POMG in which each agent receives a noisy local-state observation; under standard mixing assumptions on the observation kernel the filter contracts at rate 1-η for some η > 0, satisfying the assumption with an explicit δ. These additions directly address the load-bearing nature of the assumption without altering the main theorems. revision: yes

Circularity Check

0 steps flagged

Derivation is non-circular: it rests on stated external assumptions and standard potential-game results

full rationale

The paper assumes the underlying fully-observed Markov game is a Markov potential game and invokes an external filter-stability condition to justify that finite-history policies yield a near-potential surrogate Markov game. It then applies known independent-learning algorithms for potential games to obtain the quasi-polynomial complexity bound. No step reduces a claimed prediction or first-principles result to a fitted parameter, self-citation chain, or definitional tautology; the load-bearing approximations rest on stated assumptions whose validity is independent of the present derivation, and the final complexity claim follows from the surrogate's properties rather than from any internal renaming or self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim depends on three domain assumptions drawn from game theory and the POMDP literature rather than on new postulates or fitted parameters; the potential-game and decoupled-dynamics conditions are written out formally after the list.

axioms (3)
  • domain assumption The underlying fully observed Markov game is a Markov potential game
    Invoked to ensure the existence of a shared potential function that supports Nash equilibrium learning.
  • domain assumption Filter stability assumption
    Used to justify that finite action-observation histories suffice for policy approximation.
  • domain assumption Independent state transitions (decoupled dynamics)
    Assumed so that agents' state evolution does not depend on others' actions.
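
A generic formal rendering of the potential-game and decoupled-dynamics assumptions, with illustrative notation and initial-state dependence suppressed; the paper's own statements may differ in details.

```latex
% Markov potential game (fully observed counterpart), initial-state
% dependence suppressed: one function Phi tracks all unilateral deviations,
\[
V_i(\pi_i', \pi_{-i}) - V_i(\pi_i, \pi_{-i})
\;=\; \Phi(\pi_i', \pi_{-i}) - \Phi(\pi_i, \pi_{-i})
\qquad \text{for all } i,\ \pi_i,\ \pi_i',\ \pi_{-i}.
\]
% Decoupled (independent) dynamics: the joint state factorises and each
% factor is driven only by its own agent's action,
\[
\mathbb{P}\big(s_{t+1} \mid s_t, a_t\big)
\;=\; \prod_{i=1}^{n} \mathbb{P}_i\big(s_{i,t+1} \mid s_{i,t}, a_{i,t}\big),
\]
% while rewards r_i(s_t, a_t) may depend on everyone, which is the only
% channel coupling the agents.
```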

pith-pipeline@v0.9.0 · 5487 in / 1330 out tokens · 34603 ms · 2026-05-08T04:16:20.347803+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 1 canonical work page

  [1] Mikko Lauri, David Hsu, and Joni Pajarinen. Partially observable Markov decision processes in robotics: A survey. IEEE Transactions on Robotics, 39(1):21–40, 2022.
  [2] Haoyu Bai, Shaojun Cai, Nan Ye, David Hsu, and Wee Sun Lee. Intention-aware online POMDP planning for autonomous driving in a crowd. In International Conference on Robotics and Automation, pages 454–460. IEEE, 2015.
  [3] Yuchen Xiao, Weihao Tan, Joshua Hoffman, Tian Xia, and Christopher Amato. Asynchronous multi-agent deep reinforcement learning under partial observability. The International Journal of Robotics Research, 44(8):1257–1286, 2025.
  [4] Jian Yao, Zeyu Zhang, Li Xia, Jun Yang, and Qianchuan Zhao. Solving imperfect information Poker games using Monte Carlo search and POMDP models. In 2020 IEEE 9th Data Driven Control and Learning Systems Conference (DDCLS), pages 1060–1065. IEEE, 2020.
  [5] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming, 1994.
  [6] Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, 1987.
  [7] Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. Advances in Neural Information Processing Systems, 2016.
  [8] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Planning and learning in partially observable systems via filter stability. In Proceedings of the 55th Annual ACM Symposium on Theory of Computing, pages 349–362, 2023.
  [9] Qinghua Liu, Alan Chung, Csaba Szepesvári, and Chi Jin. When is partially observable reinforcement learning not scary? In Conference on Learning Theory, pages 5175–5220. PMLR, 2022.
  [10] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Learning in observable POMDPs, without computationally intractable oracles. Advances in Neural Information Processing Systems, 2022.
  [11] Ali Devran Kara and Serdar Yüksel. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. Mathematics of Operations Research, 48(4):2066–2093, 2023.
  [12] Semih Cayci, Niao He, and R Srikant. Finite-time analysis of natural actor-critic for POMDPs. SIAM Journal on Mathematics of Data Science, 6(4):869–896, 2024.
  [13] Ameya Anjarlekar, S Rasoul Etesami, and R Srikant. Scalable policy-based RL algorithms for POMDPs. Advances in Neural Information Processing Systems, 2025.
  [14] Philip Jordan and Maryam Kamgarpour. Model-based learning of near-optimal finite-window policies in POMDPs. arXiv preprint arXiv:2604.01024, 2026.
  [15–16] Rosemary Emery-Montemerlo, Geoff Gordon, Jeff Schneider, and Sebastian Thrun. Approximate solutions for partially observable stochastic games with common payoffs. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS 2004, pages 136–143. IEEE, 2004.
  [17] Shayegan Omidshafiei, Jason Pazis, Christopher Amato, Jonathan P How, and John Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. In International Conference on Machine Learning, pages 2681–2690. PMLR, 2017.
  [18] Adam Lerer, Hengyuan Hu, Jakob Foerster, and Noam Brown. Improving policies via search in cooperative partially observable games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7187–7194, 2020.
  [19] Runyu Lu, Yuanheng Zhu, and Dongbin Zhao. Divergence-regularized discounted aggregation: Equilibrium finding in multiplayer partially observable stochastic games. In The Thirteenth International Conference on Learning Representations, 2025.
  [20] Qinghua Liu, Csaba Szepesvári, and Chi Jin. Sample-efficient reinforcement learning of partially observable Markov games. Advances in Neural Information Processing Systems, 2022.
  [21] Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. Communications of the ACM, 52(2):89–97, 2009.
  [22] Sergio Valcarcel Macua, Javier Zazo, and Santiago Zazo. Learning parametric closed-loop policies for Markov potential games. In International Conference on Learning Representations, 2018.
  [23] Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum Markov games with a large number of players sample-efficiently? In International Conference on Learning Representations, 2022.
  [24] Stefanos Leonardos, Will Overman, Ioannis Panageas, and Georgios Piliouras. Global convergence of multi-agent policy gradient in Markov potential games. In International Conference on Learning Representations, 2022.
  [25] Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, and Mihailo Jovanovic. Independent policy gradient for large-scale Markov potential games: Sharper rates, function approximation, and game-agnostic convergence. In International Conference on Machine Learning, pages 5166–
  [26] Chinmay Maheshwari, Manxi Wu, Druv Pai, and Shankar Sastry. Independent and decentralized learning in Markov potential games. IEEE Transactions on Automatic Control, 2025.
  [27] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
  [28] Jilles Steeve Dibangoye, Christopher Amato, Olivier Buffet, and François Charpillet. Optimally solving Dec-POMDPs as continuous-state MDPs. Journal of Artificial Intelligence Research, 55:443–497, 2016.
  [29] Jilles Dibangoye and Olivier Buffet. Learning to act in decentralized partially observable MDPs. In International Conference on Machine Learning, pages 1233–1242. PMLR, 2018.
  [30] Weichao Mao, Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Decentralized learning of finite-memory policies in Dec-POMDPs. IFAC-PapersOnLine, 56(2):2601–2607, 2023.
  [31] Xiangyu Liu and Kaiqing Zhang. Partially observable multi-agent RL with (quasi-)efficiency: The blessing of information sharing. In International Conference on Machine Learning, pages 22370–22419. PMLR, 2023.
  [32] Eitan Altman, Konstantin Avratchenkov, Nicolas Bonneau, Mérouane Debbah, Rachid El-Azouzi, and Daniel Sadoc Menasché. Constrained stochastic games in wireless networks. In IEEE GLOBECOM 2007 - IEEE Global Telecommunications Conference, pages 315–320. IEEE, 2007.
  [33] Eitan Altman, Konstantin Avrachenkov, Ishai Menache, Gregory Miller, Balakrishna J Prabhu, and Adam Shwartz. Dynamic discrete power control in cellular networks. IEEE Transactions on Automatic Control, 54(10):2328–2340, 2009.
  [34] S Rasoul Etesami, Walid Saad, Narayan B Mandayam, and H Vincent Poor. Stochastic games for the smart grid energy management with prospect prosumers. IEEE Transactions on Automatic Control, 63(8):2327–2342, 2018.
  [35] Runyu Zhang, Yuyang Zhang, Rohit Konda, Bryce Ferguson, Jason Marden, and Na Li. Markov games with decoupled dynamics: Price of anarchy and sample complexity. In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 8100–8107. IEEE, 2023.
  [36] S Rasoul Etesami. Learning stationary Nash equilibrium policies in n-player stochastic games with independent chains. SIAM Journal on Control and Optimization, 62(2):799–825, 2024.
  [37] Xin Guo, Xinyu Li, Chinmay Maheshwari, Shankar Sastry, and Manxi Wu. Markov α-potential games. IEEE Transactions on Automatic Control, 2025.
  [38] Xin Guo, Xinyu Li, and Yufei Zhang. An α-potential game framework for n-player dynamic games. SIAM Journal on Control and Optimization, 63(4):2964–3005, 2025.
  [39] Runyu Zhang, Jincheng Mei, Bo Dai, Dale Schuurmans, and Na Li. On the global convergence rates of decentralized softmax gradient play in Markov potential games. Advances in Neural Information Processing Systems, 2022.
  [40] Dheeraj Narasimha, Kiyeob Lee, Dileep Kalathil, and Srinivas Shakkottai. Multi-agent learning via Markov potential games in marketplaces for distributed energy resources. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 6350–6357. IEEE, 2022.
  [41] Ramon van Handel. Hidden Markov models. Unpublished lecture notes, 2008.
  [42] Ali Kara and Serdar Yuksel. Near optimality of finite memory feedback policies in partially observed Markov decision processes. Journal of Machine Learning Research, 23(11):1–46, 2022.
  [43] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pages 321–384, 2021.
  [44] Constantinos Daskalakis, Dylan J Foster, and Noah Golowich. Independent policy gradient methods for competitive reinforcement learning. Advances in Neural Information Processing Systems, 2020.
  [45] Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In International Conference on Machine Learning, pages 551–560. PMLR, 2020.
  [46] Martin Zinkevich, Amy Greenwald, and Michael Littman. Cyclic equilibria in Markov games. Advances in Neural Information Processing Systems, 2005.
  [47] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert Kappen. On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning, 2012.
  [48] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2):209–232, 2002.
  [49] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.