Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework
Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3
The pith
Natural policy gradient is exactly a doubly smoothed and averaged form of policy iteration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Natural policy gradient admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, doubly smoothed policy iteration (DSPI) obtains each policy by applying a regularized greedy step to a weighted average of past Q-functions. Using only monotonicity and contraction of smoothed Bellman operators, DSPI converges globally and geometrically. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) for computing an ε-optimal policy, without modifying the MDP or using adaptive stepsizes. The unregularized case terminates in finitely many steps, and the framework extends to linear function approximation and stochastic shortest path problems.
What carries the argument
Doubly smoothed policy iteration (DSPI): each policy is obtained by applying a regularized greedy step to a weighted average of past Q-functions, unifying natural policy gradient and related methods as special cases.
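To make the DSPI step concrete, here is a minimal tabular sketch, assuming a softmax (entropy-regularized) greedy step, a constant averaging weight `BETA`, a temperature `TAU`, and exact policy evaluation; the random MDP and all constants are illustrative choices, not the paper's algorithm or its actual averaging scheme.

```python
# Minimal DSPI-style iteration on a tiny random tabular MDP (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
S, A, GAMMA = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards r(s, a)

def q_of(pi):
    """Exact Q^pi: Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) V^pi(s')."""
    P_pi = np.einsum('sa,sap->sp', pi, P)    # state transition kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(S) - GAMMA * P_pi, r_pi)
    return R + GAMMA * (P @ V)

def softmax_greedy(Qbar, tau):
    """Regularized (entropy-smoothed) greedy step: pi(a|s) proportional to exp(Qbar(s,a)/tau)."""
    Z = np.exp((Qbar - Qbar.max(axis=1, keepdims=True)) / tau)
    return Z / Z.sum(axis=1, keepdims=True)

TAU, BETA = 0.01, 0.5
pi = np.full((S, A), 1.0 / A)                # uniform initial policy
Qbar = q_of(pi)                              # running weighted average of past Q-functions
for _ in range(300):
    pi = softmax_greedy(Qbar, TAU)           # regularized greedy step applied to the average
    Qbar = (1.0 - BETA) * Qbar + BETA * q_of(pi)   # fold the newest Q^pi into the average

# Sanity check against plain value iteration (the softmax smoothing leaves a small bias).
V_star = np.zeros(S)
for _ in range(2000):
    V_star = (R + GAMMA * (P @ V_star)).max(axis=1)
print("sup-norm gap to V*:", np.max(np.abs(V_star - (pi * q_of(pi)).sum(axis=1))))
```

Setting `BETA = 1` reduces the update to a policy-iteration-like step on the latest Q-function, while smaller `BETA` keeps more history, which is the averaging highlighted below as a possible stabilization mechanism.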
If this is right
- Natural policy gradient and policy dual averaging achieve O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) iterations to ε-optimality without MDP changes or adaptive stepsizes (a worked evaluation of this bound follows the list).
- Dual-averaged policy iteration terminates after finitely many steps in the unregularized case.
- The convergence result extends to discounted MDPs with linear function approximation.
- The same framework applies to stochastic shortest path problems.
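For scale, a worked evaluation of the bound with constants omitted (the abstract does not state the hidden constant), taking γ = 0.99 and ε = 10^{-2}:

$$
(1-\gamma)^{-1}\log\bigl((1-\gamma)^{-1}\epsilon^{-1}\bigr)
= 100\,\log(100\cdot 100)
= 100\,\log(10^{4})
\approx 100 \times 9.21
\approx 9.2\times 10^{2}\ \text{iterations}.
$$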
Where Pith is reading between the lines
- The Bellman-operator unification may let researchers transfer similar geometric rates to other mirror-descent-style policy updates not covered in the paper.
- Averaging historical Q-functions could be tested as a practical stabilization technique inside existing actor-critic implementations.
- The framework suggests that global convergence in policy optimization often stems from the combination of smoothing and averaging rather than from any single algorithm choice.
Load-bearing premise
Smoothed Bellman operators are monotone and contractive.
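This premise can be spot-checked numerically. Below is a minimal sketch, assuming the smoothed operator is the standard log-sum-exp (soft) Bellman operator with temperature `TAU`, a stand-in for the paper's operator, whose exact definition is not reproduced here; the check samples random value functions and verifies γ-contraction in the sup norm and monotonicity.

```python
# Numerical spot-check of monotonicity and gamma-contraction for a soft
# (log-sum-exp) Bellman operator; an assumed stand-in, not the paper's definition.
import numpy as np

rng = np.random.default_rng(1)
S, A, GAMMA, TAU = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0.0, 1.0, size=(S, A))

def soft_bellman(V):
    """(T V)(s) = tau * log sum_a exp( (r(s,a) + gamma * E[V(s')]) / tau )."""
    Q = (R + GAMMA * (P @ V)) / TAU
    m = Q.max(axis=1)
    return TAU * (m + np.log(np.exp(Q - m[:, None]).sum(axis=1)))

for _ in range(1000):
    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    # gamma-contraction in the sup norm
    assert np.max(np.abs(soft_bellman(V1) - soft_bellman(V2))) <= GAMMA * np.max(np.abs(V1 - V2)) + 1e-9
    # monotonicity: pointwise V_lo <= V_hi implies T V_lo <= T V_hi
    V_lo, V_hi = np.minimum(V1, V2), np.maximum(V1, V2)
    assert np.all(soft_bellman(V_lo) <= soft_bellman(V_hi) + 1e-12)
print("gamma-contraction and monotonicity hold on all sampled pairs")
```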
What would settle it
In a small tabular MDP, running natural policy gradient until an ε-optimal policy is reached and observing an iteration count that exceeds the O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) bound, beyond its hidden constant, as γ → 1 or ε → 0 would falsify the claimed complexity.
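A hedged sketch of that test, assuming exact policy evaluation, a uniform start, and the textbook softmax-NPG update π_{k+1} ∝ π_k exp(η Q^{π_k}) with a fixed stepsize `ETA`; because the abstract does not give the constant in front of the bound, the comparison is order-of-magnitude only.

```python
# Falsification sketch: count softmax-NPG iterations to eps-optimality on a
# small random MDP and compare with the claimed bound (constant omitted).
import numpy as np

rng = np.random.default_rng(2)
S, A, GAMMA, EPS, ETA = 6, 4, 0.9, 1e-3, 10.0
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0.0, 1.0, size=(S, A))

def q_of(pi):
    """Exact Q^pi for the tabular MDP."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(S) - GAMMA * P_pi, r_pi)
    return R + GAMMA * (P @ V)

V_star = np.zeros(S)                         # optimal values via value iteration
for _ in range(5000):
    V_star = (R + GAMMA * (P @ V_star)).max(axis=1)

logits = np.zeros((S, A))                    # log-policy up to a constant (uniform start)
for k in range(1, 100_000):
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    Q = q_of(pi)
    if np.max(V_star - (pi * Q).sum(axis=1)) <= EPS:
        break                                # pi is eps-optimal
    logits = logits + ETA * Q                # softmax-NPG in log space: pi <- pi * exp(eta * Q)

bound = np.log(1.0 / ((1.0 - GAMMA) * EPS)) / (1.0 - GAMMA)
print(f"iterations to eps-optimality: {k}; bound without its constant: {bound:.0f}")
```

Repeating this over a grid of γ and ε and checking how the iteration count scales against the bound is the actual falsification test; a single run only illustrates the mechanics.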
read the original abstract
In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ for computing an $\epsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces doubly smoothed policy iteration (DSPI) as a Bellman-operator framework that exactly reformulates natural policy gradient (NPG) and policy dual averaging methods. DSPI obtains each policy via a regularized greedy step applied to a weighted average of historical Q-functions. The authors prove that, relying solely on the monotonicity and contraction properties of the associated smoothed Bellman operators, DSPI converges globally and geometrically in a distribution-free manner. This yields an iteration complexity of O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) for ε-optimal policies in discounted MDPs, with finite termination in the unregularized case. The framework is extended to linear function approximation and stochastic shortest path problems.
Significance. If the derivations hold, this work offers a significant unification of policy optimization algorithms under a single operator-theoretic lens, providing rigorous convergence guarantees without requiring MDP modifications, additional regularization, or adaptive stepsizes. The explicit use of monotonicity and contraction for distribution-free rates is a strength, and the reduction of standard NPG to this form clarifies its theoretical underpinnings. The extensions to function approximation and SSP further broaden its applicability.
minor comments (3)
- §3.2, Eq. (8): the recursive definition of the weighted Q-average could be accompanied by an explicit closed-form expression to make the link to dual averaging immediate for readers (a hedged unrolling is sketched after this list).
- §4.3: the extension to linear function approximation states that the same operator properties carry over, but the contraction modulus now depends on the approximation error; a short remark quantifying this dependence would strengthen the claim.
- §4.2: the finite-termination argument for the unregularized case (dual-averaged policy iteration) is stated for finite MDPs; an explicit sentence noting that the argument relies on the finite policy space would prevent misreading.
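On the first comment: a hedged sketch of the kind of closed form likely intended, assuming Eq. (8) has the standard dual-averaging recursion $\bar{Q}_k = (1-\beta_k)\bar{Q}_{k-1} + \beta_k Q^{\pi_{k-1}}$ (the paper's equation is not reproduced here). Unrolling the recursion gives explicit weights on the historical Q-functions:

$$
\bar{Q}_k \;=\; \Bigl(\prod_{j=1}^{k}(1-\beta_j)\Bigr)\bar{Q}_0
\;+\; \sum_{i=1}^{k}\beta_i\Bigl(\prod_{j=i+1}^{k}(1-\beta_j)\Bigr) Q^{\pi_{i-1}},
$$

so the weights are nonnegative and sum to one whenever $\beta_1 = 1$, making the dual-averaging interpretation (an explicit weighted average of past Q-functions) immediate.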
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our manuscript and for the recommendation of minor revision. We are pleased that the Bellman-operator unification of natural policy gradient and related methods, together with the distribution-free convergence guarantees, was viewed as a significant contribution. Since the report raises no major criticisms, we have no substantive points to rebut; we will address the three minor comments, along with any additional editorial feedback, in the revised version.
Circularity Check
No significant circularity detected in the derivation
full rationale
The manuscript derives the exact equivalence of natural policy gradient to doubly smoothed policy iteration (DSPI) by direct substitution of the mirror-map regularizer into the DSPI update, without any self-referential definition or fitted parameters. Convergence is established from the independent, standard properties of monotonicity and contraction (modulus γ) of the induced smoothed Bellman operators; these hold for the general operator framework and are not constructed from the target rates or algorithm-specific quantities. The resulting iteration complexity O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) follows by composing these operator bounds, and the unregularized case reduces to the classical policy-improvement argument. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs; the framework rests on standard, externally established Bellman-operator theory.
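The composition step has a standard shape. As a hedged reconstruction, assume the iterates contract toward $Q^*$ at rate $\gamma$ and that rewards in $[0,1]$ give $\|Q_0 - Q^*\|_\infty \le (1-\gamma)^{-1}$:

$$
\|Q_k - Q^*\|_\infty \le \gamma^k \|Q_0 - Q^*\|_\infty \le \frac{\gamma^k}{1-\gamma} \le \epsilon
\quad\text{once}\quad
k \ge \frac{\log\bigl((1-\gamma)^{-1}\epsilon^{-1}\bigr)}{\log(1/\gamma)},
$$

and since $\log(1/\gamma) \ge 1-\gamma$ for $\gamma \in (0,1)$, it suffices that $k \ge (1-\gamma)^{-1}\log\bigl((1-\gamma)^{-1}\epsilon^{-1}\bigr)$, which is the quoted complexity. The paper's actual argument tracks policies rather than raw Q-iterates, so this only indicates the shape of the calculation.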
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Smoothed Bellman operators are monotone and contractive.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (relevance unclear): "DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient... as special cases"
Reference graph
Works this paper leans on
- [1] Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76.
- [2] Alfano, C. and Rebeschini, P. (2023). Linear convergence for natural policy gradient with log-linear policy parametrization. In International Conference on Learning Representations.
- [3] Alfano, C., Yuan, R., and Rebeschini, P. (2023). A novel framework for policy mirror descent with general parameterization and linear convergence. Advances in Neural Information Processing Systems, 36:30681–30725.
- [4] Bellman, R. (1957). Dynamic Programming. Princeton University Press.
- [5] Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA.
- [6] Bertsekas, D. P. and Tsitsiklis, J. N. (1991). An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595.
- [7] Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
- [8] Bhandari, J. and Russo, D. (2021). On the linear convergence of policy gradient methods for finite MDPs. In Banerjee, A. and Fukumizu, K., editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2386–2394. PMLR.
- [9] Bhandari, J. and Russo, D. (2024). Global optimality guarantees for policy gradient methods. Operations Research, 72(5):1906–1927.
- [10] Bhandari, J., Russo, D., and Singal, R. (2018). A finite-time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692. PMLR.
- [11] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- [12] Cayci, S., He, N., and Srikant, R. (2024). Convergence of entropy-regularized natural policy gradient with linear function approximation. SIAM Journal on Optimization, 34(3):2729–2755.
- [13] Cen, S., Cheng, C., Chen, Y., Wei, Y., and Chi, Y. (2022). Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578.
- [14] Chen, L., Luo, H., and Rosenberg, A. (2022). Policy optimization for stochastic shortest path. In Conference on Learning Theory, pages 982–1046. PMLR.
- [15] Chen, Z. and Maguluri, S. T. (2025). An approximate policy iteration viewpoint of actor–critic algorithms. Automatica, 179:112395.
- [16] Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Operations Research, 72(4):1352–1367.
- [17] Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR.
- [18] Gheshlaghi Azar, M., Munos, R., and Kappen, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91:325–349.
- [19] Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE.
- [20] Gupta, A. and Mahajan, A. (2026). Operator-theoretic foundations and policy gradient methods for general MDPs with unbounded costs. Preprint arXiv:2603.17875.
- [21] Howard, R. (1964). Dynamic Programming and Markov Processes. Massachusetts Institute of Technology.
- [22] Ju, C. and Lan, G. (2022). Policy optimization over general state and action spaces. Preprint arXiv:2211.16715.
- [23] Ju, C. and Lan, G. (2026). Strongly-polynomial time and validation analysis of policy gradient methods. Mathematical Programming, pages 1–45.
- [24] Kakade, S. M. (2001). A natural policy gradient. Advances in Neural Information Processing Systems, 14.
- [25] Khodadadian, S., Jhunjhunwala, P. R., Varma, S. M., and Maguluri, S. T. (2022). On linear and super-linear convergence of natural policy gradient algorithm. Systems & Control Letters, 164:105214.
- [26] Lan, G. (2023). Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198(1):1059–1106.
- [27] Lan, G., Li, Y., and Zhao, T. (2023). Block policy mirror descent. SIAM Journal on Optimization, 33(3):2341–2378.
- [28] Lee, J. and Ryu, E. K. (2025). Why policy gradient algorithms work for undiscounted total-reward MDPs. Preprint arXiv:2510.18340.
- [29]
- [30] Li, Y., Lan, G., and Zhao, T. (2024). Homotopic policy mirror descent: Policy convergence, algorithmic regularization, and improved sample complexity. Mathematical Programming, 207(1):457–513.
- [31] Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. (2020). On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pages 6820–6829. PMLR.
- [32] Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
- [33] Ryu, E. K. and Yin, W. (2022). Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press.
- [34] Scherrer, B. (2014). Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322. PMLR.
- [35] Scherrer, B. (2016). Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 41(3):758–774.
- [36] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR.
- [37] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. Preprint arXiv:1707.06347.
- [38] Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675.
- [39] Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD-learning. In Conference on Learning Theory, pages 2803–2830.
- [40] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- [41] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pages 1057–1063.
- [42] Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.
- [43] Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
- [44] Xiao, L. (2022). On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282):1–36.
- [45] Ye, Y. (2011). The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603.
- [46] Yu, H. and Bertsekas, D. P. (2013a). On boundedness of Q-learning iterates for stochastic shortest path problems. Mathematics of Operations Research, 38(2):209–227.
- [47] Yu, H. and Bertsekas, D. P. (2013b). Q-learning and policy iteration algorithms for stochastic shortest path problems. Annals of Operations Research, 208(1):95–132.
- [48] Yuan, R., Du, S. S., Gower, R. M., Lazaric, A., and Xiao, L. (2023). Linear convergence of natural policy gradient methods with log-linear policies. In The Eleventh International Conference on Learning Representations.
- [49] Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. (2020). A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 8:58443–58469.
- [50] Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J. D., and Chi, Y. (2023). Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. SIAM Journal on Optimization, 33(2):1061–1091.
discussion (0)