pith. machine review for the scientific record.

arxiv: 2605.10671 · v1 · submitted 2026-05-11 · 💻 cs.LG · math.OC · stat.ML

Recognition: 2 theorem links · Lean Theorem

Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

Phalguni Nanda, Zaiwei Chen

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords natural policy gradient · policy iteration · Bellman operators · reinforcement learning · convergence analysis · policy dual averaging · smoothed operators

The pith

Natural policy gradient is exactly a doubly smoothed and averaged form of policy iteration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that natural policy gradient arises precisely as one instance of doubly smoothed policy iteration, where each new policy comes from a regularized greedy step applied to a weighted average of past Q-functions. This single Bellman-operator framework also recovers policy iteration, dual-averaged policy iteration, and other policy dual averaging methods as special cases. Using only the monotonicity and contraction properties of the smoothed operators, the authors prove that the procedure converges geometrically to an optimal policy in a distribution-free manner. A reader cares because the result supplies explicit iteration complexity bounds for these widely used reinforcement learning algorithms without MDP modifications, extra regularization, or trajectory-dependent step sizes.

Core claim

Natural policy gradient admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, doubly smoothed policy iteration obtains each policy by applying a regularized greedy step to a weighted average of past Q-functions. Using only monotonicity and contraction of smoothed Bellman operators, DSPI converges globally and geometrically. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) for computing an ε-optimal policy, without modifying the MDP or using adaptive stepsizes. The unregularized case terminates in finitely many steps, and the framework extends to discounted MDPs with linear function approximation and to stochastic shortest path problems.
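For orientation, here is the standard back-of-envelope step from a geometric rate to a bound of that shape; this is our generic sketch, not the paper's proof, and it assumes rewards in [0, 1] (so $\|V^{\pi_0} - V^*\|_\infty \le (1-\gamma)^{-1}$) and a per-iteration contraction $\|V^{\pi_k} - V^*\|_\infty \le \gamma^{k}\,\|V^{\pi_0} - V^*\|_\infty$:

    % generic contraction-to-complexity calculation (our sketch, not the paper's argument)
    \gamma^{k}\,\|V^{\pi_0} - V^*\|_\infty \le \epsilon
    \quad\Longleftrightarrow\quad
    k \,\log(1/\gamma) \;\ge\; \log\bigl(\|V^{\pi_0} - V^*\|_\infty \,\epsilon^{-1}\bigr),

and since $\log(1/\gamma) \ge 1-\gamma$, any $k \ge (1-\gamma)^{-1}\log\bigl((1-\gamma)^{-1}\epsilon^{-1}\bigr)$ suffices, which is exactly the stated $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ scaling.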

What carries the argument

Doubly smoothed policy iteration (DSPI): each policy is obtained by applying a regularized greedy step to a weighted average of past Q-functions, unifying natural policy gradient and related methods as special cases.
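Read literally, that recipe admits a short tabular sketch: keep a running weighted average of past Q-functions and take a softmax ("regularized greedy") step on it. The weight schedule beta, the temperature tau, and the function names below are our illustrative assumptions, not the paper's notation or its exact DSPI instance.

    import numpy as np

    def q_of_policy(P, R, gamma, pi):
        """Exact Q^pi in a tabular MDP: solve (I - gamma * P_pi) V = R_pi, then back out Q."""
        S, A = R.shape
        R_pi = (pi * R).sum(axis=1)                      # expected one-step reward under pi, shape (S,)
        P_pi = np.einsum("sab,sa->sb", P, pi)            # state-to-state transition matrix under pi, shape (S, S)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        return R + gamma * np.einsum("sab,b->sa", P, V)  # Q(s, a) = R(s, a) + gamma * E[V(s') | s, a]

    def dspi_sketch(P, R, gamma, iters=200, beta=0.5, tau=0.1):
        """Illustrative doubly smoothed policy iteration: softmax ('regularized greedy')
        step on a running weighted average of past Q-functions (weights are assumptions)."""
        S, A = R.shape
        pi = np.full((S, A), 1.0 / A)                    # start from the uniform policy
        q_avg = np.zeros((S, A))                         # weighted average of past Q-functions
        for _ in range(iters):
            q_avg = (1.0 - beta) * q_avg + beta * q_of_policy(P, R, gamma, pi)
            logits = q_avg / tau                         # smaller tau -> closer to a pure greedy step
            pi = np.exp(logits - logits.max(axis=1, keepdims=True))
            pi /= pi.sum(axis=1, keepdims=True)
        return pi

Here P is the (S, A, S) transition tensor and R the (S, A) reward table; a constant beta gives exponentially decaying weights on older Q-functions, one admissible choice of weighted average among many.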

If this is right

  • Natural policy gradient and policy dual averaging achieve O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) iterations to ε-optimality without MDP changes or adaptive stepsizes.
  • Dual-averaged policy iteration terminates after finitely many steps in the unregularized case.
  • The convergence result extends to discounted MDPs with linear function approximation.
  • The same framework applies to stochastic shortest path problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The Bellman-operator unification may let researchers transfer similar geometric rates to other mirror-descent-style policy updates not covered in the paper.
  • Averaging historical Q-functions could be tested as a practical stabilization technique inside existing actor-critic implementations.
  • The framework suggests that global convergence in policy optimization often stems from the combination of smoothing and averaging rather than from any single algorithm choice.

Load-bearing premise

Smoothed Bellman operators are monotone and contractive.
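Spelled out for a generic smoothed Bellman operator $T$ with modulus $\gamma$ (our notation, not the paper's), the premise asserts the two properties

    % monotonicity and gamma-contraction of a generic smoothed Bellman operator (our notation)
    Q_1 \le Q_2 \ \text{(componentwise)} \;\Longrightarrow\; T Q_1 \le T Q_2,
    \qquad\text{and}\qquad
    \|T Q_1 - T Q_2\|_\infty \;\le\; \gamma\,\|Q_1 - Q_2\|_\infty .

Contraction yields a unique fixed point and geometric convergence of the operator iterates to it; monotonicity is what allows those iterates to be compared against the optimal Q-function in the standard argument.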

What would settle it

In a small tabular MDP, run natural policy gradient until an ε-optimal policy is reached; an iteration count that substantially exceeds the O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) bound would falsify the claimed complexity.
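A minimal version of that check, as a sketch: build a small random tabular MDP, run textbook NPG with exact Q-evaluation (multiplicative-weights update, constant stepsize), and compare the iteration count with the shape of the bound. The names, the stepsize eta, and the omitted constants are our illustrative assumptions, not the paper's setup.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma, eps, eta = 5, 3, 0.9, 1e-3, 10.0          # small MDP; eta is an assumed constant stepsize

    P = rng.dirichlet(np.ones(S), size=(S, A))              # P[s, a, :] is a next-state distribution
    R = rng.random((S, A))                                  # rewards in [0, 1]

    def value(pi):                                          # exact V^pi via the Bellman linear system
        R_pi = (pi * R).sum(axis=1)
        P_pi = np.einsum("sab,sa->sb", P, pi)
        return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

    def q_values(pi):                                       # exact Q^pi
        return R + gamma * np.einsum("sab,b->sa", P, value(pi))

    V_star = np.zeros(S)                                    # optimal value via value iteration
    for _ in range(5000):
        V_star = (R + gamma * np.einsum("sab,b->sa", P, V_star)).max(axis=1)

    pi = np.full((S, A), 1.0 / A)                           # tabular NPG = multiplicative-weights update
    k = 0
    while np.max(V_star - value(pi)) > eps:
        w = pi * np.exp(eta * q_values(pi))
        pi = w / w.sum(axis=1, keepdims=True)
        k += 1

    scale = np.log(1.0 / ((1 - gamma) * eps)) / (1 - gamma)  # shape of the claimed bound, constants omitted
    print(f"iterations to eps-optimality: {k}; (1-gamma)^-1 log((1-gamma)^-1 eps^-1) = {scale:.1f}")

Because the bound hides constants, only a gross mismatch in scaling (for example, iteration counts growing like (1-γ)^{-2} as γ → 1) would count as evidence against the claim.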

read the original abstract

In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ for computing an $\epsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces doubly smoothed policy iteration (DSPI) as a Bellman-operator framework that exactly reformulates natural policy gradient (NPG) and policy dual averaging methods. DSPI obtains each policy via a regularized greedy step applied to a weighted average of historical Q-functions. The authors prove that, relying solely on the monotonicity and contraction properties of the associated smoothed Bellman operators, DSPI converges globally and geometrically in a distribution-free manner. This yields an iteration complexity of O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) for ε-optimal policies in discounted MDPs, with finite termination in the unregularized case. The framework is extended to linear function approximation and stochastic shortest path problems.

Significance. If the derivations hold, this work offers a significant unification of policy optimization algorithms under a single operator-theoretic lens, providing rigorous convergence guarantees without requiring MDP modifications, additional regularization, or adaptive stepsizes. The explicit use of monotonicity and contraction for distribution-free rates is a strength, and the reduction of standard NPG to this form clarifies its theoretical underpinnings. The extensions to function approximation and SSP further broaden its applicability.

minor comments (3)
  1. [§3.2] §3.2, Eq. (8): the recursive definition of the weighted Q-average could be accompanied by an explicit closed-form expression to make the link to dual averaging immediate for readers (a generic unrolled form is sketched after this list).
  2. [§4.3] §4.3: the extension to linear function approximation states that the same operator properties carry over, but the contraction modulus now depends on the approximation error; a short remark quantifying this dependence would strengthen the claim.
  3. [§4.2] The finite-termination argument for the unregularized case (dual-averaged policy iteration) is stated for finite MDPs; an explicit sentence noting that the argument relies on the finite policy space would prevent misreading.
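On the first minor comment: a generic unrolling of a recursively weighted Q-average (our notation and weight convention; the paper's Eq. (8) may differ) is

    % generic dual-averaging unrolling (our convention, not necessarily the paper's Eq. (8))
    \bar{Q}_k = (1-\beta_k)\,\bar{Q}_{k-1} + \beta_k\, Q^{\pi_k}
    \quad\Longrightarrow\quad
    \bar{Q}_k = \sum_{i=1}^{k} \beta_i \Bigl(\prod_{j=i+1}^{k} (1-\beta_j)\Bigr) Q^{\pi_i}
    \;+\; \Bigl(\prod_{j=1}^{k} (1-\beta_j)\Bigr) \bar{Q}_0 ,

so every past $Q^{\pi_i}$ carries an explicit nonnegative weight, the weights (including the one on $\bar{Q}_0$) sum to one, and for constant $\beta_k = \beta$ they decay geometrically in $k - i$; this is the dual-averaging reading the comment asks the authors to surface.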

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript and for the recommendation of minor revision. We are pleased that the Bellman-operator unification of natural policy gradient and related methods, along with the distribution-free convergence guarantees, was viewed as a significant contribution. Since the report raises no major concerns, we have no substantive points to rebut; in the revised version we will address the three minor comments by adding a closed-form expression for the weighted Q-average, quantifying how the approximation error enters the contraction argument in §4.3, and stating explicitly that the finite-termination result relies on the finite policy space.

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

full rationale

The manuscript derives the exact equivalence of natural policy gradient to doubly smoothed policy iteration (DSPI) via direct substitution of the mirror-map regularizer into the DSPI update, without any self-referential definition or fitted parameters. Convergence is established from the independent, standard properties of monotonicity and contraction (modulus γ) of the induced smoothed Bellman operators; these hold for the general operator framework and are not constructed from the target rates or algorithm-specific quantities. The resulting iteration complexity O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) follows by composing these operator bounds, and the unregularized case reduces to the classical policy-improvement argument. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs; the framework is self-contained, relying only on standard Bellman-operator theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard domain assumption that smoothed Bellman operators remain monotone and contractive; no free parameters or new invented entities are introduced beyond the definition of DSPI itself.

axioms (1)
  • domain assumption Smoothed Bellman operators are monotone and contractive
    Invoked directly to establish global geometric convergence of DSPI and its special cases.

pith-pipeline@v0.9.0 · 5519 in / 1332 out tokens · 57459 ms · 2026-05-12T03:56:20.487341+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 1 internal anchor

  1. [1] Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76.
  2. [2] Alfano, C. and Rebeschini, P. (2023). Linear convergence for natural policy gradient with log-linear policy parametrization. In International Conference on Learning Representations.
  3. [3] Alfano, C., Yuan, R., and Rebeschini, P. (2023). A novel framework for policy mirror descent with general parameterization and linear convergence. Advances in Neural Information Processing Systems, 36:30681–30725.
  4. [4] Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  5. [5] Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA.
  6. [6] Bertsekas, D. P. and Tsitsiklis, J. N. (1991). An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595.
  7. [7] Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
  8. [8] Bhandari, J. and Russo, D. (2021). On the linear convergence of policy gradient methods for finite MDPs. In Banerjee, A. and Fukumizu, K., editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2386–2394. PMLR.
  9. [9] Bhandari, J. and Russo, D. (2024). Global optimality guarantees for policy gradient methods. Operations Research, 72(5):1906–1927.
  10. [10] Bhandari, J., Russo, D., and Singal, R. (2018). A finite-time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692. PMLR.
  11. [11] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  12. [12] Cayci, S., He, N., and Srikant, R. (2024). Convergence of entropy-regularized natural policy gradient with linear function approximation. SIAM Journal on Optimization, 34(3):2729–2755.
  13. [13] Cen, S., Cheng, C., Chen, Y., Wei, Y., and Chi, Y. (2022). Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578.
  14. [14] Chen, L., Luo, H., and Rosenberg, A. (2022). Policy optimization for stochastic shortest path. In Conference on Learning Theory, pages 982–1046. PMLR.
  15. [15] Chen, Z. and Maguluri, S. T. (2025). An approximate policy iteration viewpoint of actor–critic algorithms. Automatica, 179:112395.
  16. [16] Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Operations Research, 72(4):1352–1367.
  17. [17] Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR.
  18. [18] Gheshlaghi Azar, M., Munos, R., and Kappen, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91:325–349.
  19. [19] Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE.
  20. [20] Gupta, A. and Mahajan, A. (2026). Operator-theoretic foundations and policy gradient methods for general MDPs with unbounded costs. Preprint arXiv:2603.17875.
  21. [21] Howard, R. (1964). Dynamic Programming and Markov Processes. Massachusetts Institute of Technology.
  22. [22] Ju, C. and Lan, G. (2022). Policy optimization over general state and action spaces. Preprint arXiv:2211.16715.
  23. [23] Ju, C. and Lan, G. (2026). Strongly-polynomial time and validation analysis of policy gradient methods. Mathematical Programming, pages 1–45.
  24. [24] Kakade, S. M. (2001). A natural policy gradient. Advances in Neural Information Processing Systems, 14.
  25. [25] Khodadadian, S., Jhunjhunwala, P. R., Varma, S. M., and Maguluri, S. T. (2022). On linear and super-linear convergence of natural policy gradient algorithm. Systems & Control Letters, 164:105214.
  26. [26] Lan, G. (2023). Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198(1):1059–1106.
  27. [27] Lan, G., Li, Y., and Zhao, T. (2023). Block policy mirror descent. SIAM Journal on Optimization, 33(3):2341–2378.
  28. [28] Lee, J. and Ryu, E. K. (2025). Why policy gradient algorithms work for undiscounted total-reward MDPs. Preprint arXiv:2510.18340.
  29. [29] Li, Y., Lan, G., and Zhao, T. (2022). First-order policy optimization for robust Markov decision process. Preprint arXiv:2209.10579.
  30. [30] Li, Y., Lan, G., and Zhao, T. (2024). Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity. Mathematical Programming, 207(1):457–513.
  31. [31] Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. (2020). On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pages 6820–6829. PMLR.
  32. [32] Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  33. [33] Ryu, E. K. and Yin, W. (2022). Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press.
  34. [34] Scherrer, B. (2014). Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322. PMLR.
  35. [35] Scherrer, B. (2016). Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 41(3):758–774.
  36. [36] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR.
  37. [37] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. Preprint arXiv:1707.06347.
  38. [38] Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675.
  39. [39] Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD-learning. In Conference on Learning Theory, pages 2803–2830.
  40. [40] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
  41. [41] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pages 1057–1063.
  42. [42] Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.
  43. [43] Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
  44. [44] Xiao, L. (2022). On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282):1–36.
  45. [45] Ye, Y. (2011). The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603.
  46. [46] Yu, H. and Bertsekas, D. P. (2013a). On boundedness of Q-learning iterates for stochastic shortest path problems. Mathematics of Operations Research, 38(2):209–227.
  47. [47] Yu, H. and Bertsekas, D. P. (2013b). Q-learning and policy iteration algorithms for stochastic shortest path problems. Annals of Operations Research, 208(1):95–132.
  48. [48] Yuan, R., Du, S. S., Gower, R. M., Lazaric, A., and Xiao, L. (2023). Linear convergence of natural policy gradient methods with log-linear policies. In The Eleventh International Conference on Learning Representations.
  49. [49] Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. (2020). A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 8:58443–58469.
  50. [50] Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J. D., and Chi, Y. (2023). Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. SIAM Journal on Optimization, 33(2):1061–1091.
  51. [51] Internal anchor — paper appendix, Lemma B.5: Let $\{\pi_k\}$ be the sequence of policies generated by Algorithm 3. Then, for every $k \ge 1$, $\|Q^* - H_{\pi_k}(Q^*)\|_\infty \le \frac{2\gamma}{1-\gamma} \prod_{i=1}^{k-1} \bigl(1-(1-\gamma)\beta_i\bigr)\, \|Q^* - H_{\pi_0}(Q^*)\|_\infty$. As a result, when choosing $\beta_0 = 1$ and $\beta_k = \beta \in (0,1]$ for every $k \ge 1$, we have …