pith. machine review for the scientific record.

arxiv: 2605.10671 · v1 · submitted 2026-05-11 · 💻 cs.LG · math.OC · stat.ML

Recognition: 2 theorem links · Lean Theorem

Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

Phalguni Nanda, Zaiwei Chen

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords natural policy gradient · policy iteration · Bellman operators · reinforcement learning · convergence analysis · policy dual averaging · smoothed operators

The pith

Natural policy gradient is exactly a doubly smoothed and averaged form of policy iteration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that natural policy gradient arises precisely as one instance of doubly smoothed policy iteration, where each new policy comes from a regularized greedy step applied to a weighted average of past Q-functions. This single Bellman-operator framework also recovers policy iteration, dual-averaged policy iteration, and other policy dual averaging methods as special cases. Using only the monotonicity and contraction properties of the smoothed operators, the authors prove that the procedure converges geometrically to an optimal policy in a distribution-free manner. A reader cares because the result supplies explicit iteration complexity bounds for these widely used reinforcement learning algorithms without MDP modifications, extra regularization, or trajectory-dependent step sizes.

Core claim

Natural policy gradient admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, doubly smoothed policy iteration obtains each policy by applying a regularized greedy step to a weighted average of past Q-functions. Using only monotonicity and contraction of smoothed Bellman operators, DSPI converges globally and geometrically. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) for computing an ε-optimal policy, without modifying the MDP or using adaptive stepsizes. The unregularized case terminates in finitely many steps, and the framework extends to discounted MDPs with linear function approximation and to stochastic shortest path problems.
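For orientation, here is the standard back-of-envelope step from a geometric rate to a bound of that shape; this is our generic sketch, not the paper's proof, and it assumes rewards in [0, 1] (so $\|V^{\pi_0} - V^*\|_\infty \le (1-\gamma)^{-1}$) and a per-iteration contraction $\|V^{\pi_k} - V^*\|_\infty \le \gamma^{k}\,\|V^{\pi_0} - V^*\|_\infty$:

    % generic contraction-to-complexity calculation (our sketch, not the paper's argument)
    \gamma^{k}\,\|V^{\pi_0} - V^*\|_\infty \le \epsilon
    \quad\Longleftrightarrow\quad
    k \,\log(1/\gamma) \;\ge\; \log\bigl(\|V^{\pi_0} - V^*\|_\infty \,\epsilon^{-1}\bigr),

and since $\log(1/\gamma) \ge 1-\gamma$, any $k \ge (1-\gamma)^{-1}\log\bigl((1-\gamma)^{-1}\epsilon^{-1}\bigr)$ suffices, which is exactly the stated $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ scaling.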

What carries the argument

Doubly smoothed policy iteration (DSPI): each policy is obtained by applying a regularized greedy step to a weighted average of past Q-functions, unifying natural policy gradient and related methods as special cases.
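Read literally, that recipe admits a short tabular sketch: keep a running weighted average of past Q-functions and take a softmax ("regularized greedy") step on it. The weight schedule beta, the temperature tau, and the function names below are our illustrative assumptions, not the paper's notation or its exact DSPI instance.

    import numpy as np

    def q_of_policy(P, R, gamma, pi):
        """Exact Q^pi in a tabular MDP: solve (I - gamma * P_pi) V = R_pi, then back out Q."""
        S, A = R.shape
        R_pi = (pi * R).sum(axis=1)                      # expected one-step reward under pi, shape (S,)
        P_pi = np.einsum("sab,sa->sb", P, pi)            # state-to-state transition matrix under pi, shape (S, S)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        return R + gamma * np.einsum("sab,b->sa", P, V)  # Q(s, a) = R(s, a) + gamma * E[V(s') | s, a]

    def dspi_sketch(P, R, gamma, iters=200, beta=0.5, tau=0.1):
        """Illustrative doubly smoothed policy iteration: softmax ('regularized greedy')
        step on a running weighted average of past Q-functions (weights are assumptions)."""
        S, A = R.shape
        pi = np.full((S, A), 1.0 / A)                    # start from the uniform policy
        q_avg = np.zeros((S, A))                         # weighted average of past Q-functions
        for _ in range(iters):
            q_avg = (1.0 - beta) * q_avg + beta * q_of_policy(P, R, gamma, pi)
            logits = q_avg / tau                         # smaller tau -> closer to a pure greedy step
            pi = np.exp(logits - logits.max(axis=1, keepdims=True))
            pi /= pi.sum(axis=1, keepdims=True)
        return pi

Here P is the (S, A, S) transition tensor and R the (S, A) reward table; a constant beta gives exponentially decaying weights on older Q-functions, one admissible choice of weighted average among many.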

If this is right

  • Natural policy gradient and policy dual averaging achieve O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) iterations to ε-optimality without MDP changes or adaptive stepsizes.
  • Dual-averaged policy iteration terminates after finitely many steps in the unregularized case.
  • The convergence result extends to discounted MDPs with linear function approximation.
  • The same framework applies to stochastic shortest path problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The Bellman-operator unification may let researchers transfer similar geometric rates to other mirror-descent-style policy updates not covered in the paper.
  • Averaging historical Q-functions could be tested as a practical stabilization technique inside existing actor-critic implementations.
  • The framework suggests that global convergence in policy optimization often stems from the combination of smoothing and averaging rather than from any single algorithm choice.

Load-bearing premise

Smoothed Bellman operators are monotone and contractive.
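Spelled out for a generic smoothed Bellman operator $T$ with modulus $\gamma$ (our notation, not the paper's), the premise asserts the two properties

    % monotonicity and gamma-contraction of a generic smoothed Bellman operator (our notation)
    Q_1 \le Q_2 \ \text{(componentwise)} \;\Longrightarrow\; T Q_1 \le T Q_2,
    \qquad\text{and}\qquad
    \|T Q_1 - T Q_2\|_\infty \;\le\; \gamma\,\|Q_1 - Q_2\|_\infty .

Contraction yields a unique fixed point and geometric convergence of the operator iterates to it; monotonicity is what allows those iterates to be compared against the optimal Q-function in the standard argument.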

What would settle it

In a small tabular MDP, run natural policy gradient until an ε-optimal policy is reached; an iteration count that substantially exceeds the O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) bound would falsify the claimed complexity.
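A minimal version of that check, as a sketch: build a small random tabular MDP, run textbook NPG with exact Q-evaluation (multiplicative-weights update, constant stepsize), and compare the iteration count with the shape of the bound. The names, the stepsize eta, and the omitted constants are our illustrative assumptions, not the paper's setup.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma, eps, eta = 5, 3, 0.9, 1e-3, 10.0          # small MDP; eta is an assumed constant stepsize

    P = rng.dirichlet(np.ones(S), size=(S, A))              # P[s, a, :] is a next-state distribution
    R = rng.random((S, A))                                  # rewards in [0, 1]

    def value(pi):                                          # exact V^pi via the Bellman linear system
        R_pi = (pi * R).sum(axis=1)
        P_pi = np.einsum("sab,sa->sb", P, pi)
        return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

    def q_values(pi):                                       # exact Q^pi
        return R + gamma * np.einsum("sab,b->sa", P, value(pi))

    V_star = np.zeros(S)                                    # optimal value via value iteration
    for _ in range(5000):
        V_star = (R + gamma * np.einsum("sab,b->sa", P, V_star)).max(axis=1)

    pi = np.full((S, A), 1.0 / A)                           # tabular NPG = multiplicative-weights update
    k = 0
    while np.max(V_star - value(pi)) > eps:
        w = pi * np.exp(eta * q_values(pi))
        pi = w / w.sum(axis=1, keepdims=True)
        k += 1

    scale = np.log(1.0 / ((1 - gamma) * eps)) / (1 - gamma)  # shape of the claimed bound, constants omitted
    print(f"iterations to eps-optimality: {k}; (1-gamma)^-1 log((1-gamma)^-1 eps^-1) = {scale:.1f}")

Because the bound hides constants, only a gross mismatch in scaling (for example, iteration counts growing like (1-γ)^{-2} as γ → 1) would count as evidence against the claim.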

read the original abstract

In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-\gamma)^{-1}\log((1-\gamma)^{-1}\epsilon^{-1}))$ for computing an $\epsilon$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces doubly smoothed policy iteration (DSPI) as a Bellman-operator framework that exactly reformulates natural policy gradient (NPG) and policy dual averaging methods. DSPI obtains each policy via a regularized greedy step applied to a weighted average of historical Q-functions. The authors prove that, relying solely on the monotonicity and contraction properties of the associated smoothed Bellman operators, DSPI converges globally and geometrically in a distribution-free manner. This yields an iteration complexity of O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) for ε-optimal policies in discounted MDPs, with finite termination in the unregularized case. The framework is extended to linear function approximation and stochastic shortest path problems.

Significance. If the derivations hold, this work offers a significant unification of policy optimization algorithms under a single operator-theoretic lens, providing rigorous convergence guarantees without requiring MDP modifications, additional regularization, or adaptive stepsizes. The explicit use of monotonicity and contraction for distribution-free rates is a strength, and the reduction of standard NPG to this form clarifies its theoretical underpinnings. The extensions to function approximation and SSP further broaden its applicability.

minor comments (3)
  1. [§3.2] §3.2, Eq. (8): the recursive definition of the weighted Q-average could be accompanied by an explicit closed-form expression to make the link to dual averaging immediate for readers (a generic unrolled form is sketched after this list).
  2. [§4.3] §4.3: the extension to linear function approximation states that the same operator properties carry over, but the contraction modulus now depends on the approximation error; a short remark quantifying this dependence would strengthen the claim.
  3. [§4.2] The finite-termination argument for the unregularized case (dual-averaged policy iteration) is stated for finite MDPs; an explicit sentence noting that the argument relies on the finite policy space would prevent misreading.
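On the first minor comment: a generic unrolling of a recursively weighted Q-average (our notation and weight convention; the paper's Eq. (8) may differ) is

    % generic dual-averaging unrolling (our convention, not necessarily the paper's Eq. (8))
    \bar{Q}_k = (1-\beta_k)\,\bar{Q}_{k-1} + \beta_k\, Q^{\pi_k}
    \quad\Longrightarrow\quad
    \bar{Q}_k = \sum_{i=1}^{k} \beta_i \Bigl(\prod_{j=i+1}^{k} (1-\beta_j)\Bigr) Q^{\pi_i}
    \;+\; \Bigl(\prod_{j=1}^{k} (1-\beta_j)\Bigr) \bar{Q}_0 ,

so every past $Q^{\pi_i}$ carries an explicit nonnegative weight, the weights (including the one on $\bar{Q}_0$) sum to one, and for constant $\beta_k = \beta$ they decay geometrically in $k - i$; this is the dual-averaging reading the comment asks the authors to surface.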

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our manuscript and for the recommendation of minor revision. We are pleased that the Bellman-operator unification of natural policy gradient and related methods, along with the distribution-free convergence guarantees, was viewed as a significant contribution. Since the report raises no major concerns, we have no substantive points to rebut; in the revised version we will address the three minor comments by adding a closed-form expression for the weighted Q-average, quantifying how the approximation error enters the contraction argument in §4.3, and stating explicitly that the finite-termination result relies on the finite policy space.

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

full rationale

The manuscript derives the exact equivalence of natural policy gradient to doubly smoothed policy iteration (DSPI) via direct substitution of the mirror-map regularizer into the DSPI update, without any self-referential definition or fitted parameters. Convergence is established from the independent, standard properties of monotonicity and contraction (modulus γ) of the induced smoothed Bellman operators; these hold for the general operator framework and are not constructed from the target rates or algorithm-specific quantities. The resulting iteration complexity O((1-γ)^{-1} log((1-γ)^{-1} ε^{-1})) follows by composing these operator bounds, and the unregularized case reduces to the classical policy-improvement argument. No load-bearing self-citation, ansatz smuggling, or renaming of known results occurs; the framework is self-contained, relying only on standard Bellman-operator theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard domain assumption that smoothed Bellman operators remain monotone and contractive; no free parameters or new invented entities are introduced beyond the definition of DSPI itself.

axioms (1)
  • domain assumption Smoothed Bellman operators are monotone and contractive
    Invoked directly to establish global geometric convergence of DSPI and its special cases.

pith-pipeline@v0.9.0 · 5519 in / 1332 out tokens · 57459 ms · 2026-05-12T03:56:20.487341+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 1 internal anchor

  1. [1] Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76.
  2. [2] Alfano, C. and Rebeschini, P. (2023). Linear convergence for natural policy gradient with log-linear policy parametrization. In International Conference on Learning Representations.
  3. [3] Alfano, C., Yuan, R., and Rebeschini, P. (2023). A novel framework for policy mirror descent with general parameterization and linear convergence. Advances in Neural Information Processing Systems, 36:30681–30725.
  4. [4] Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  5. [5] Bertsekas, D. P. (1995). Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA.
  6. [6] Bertsekas, D. P. and Tsitsiklis, J. N. (1991). An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595.
  7. [7] Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
  8. [8] Bhandari, J. and Russo, D. (2021). On the linear convergence of policy gradient methods for finite MDPs. In Banerjee, A. and Fukumizu, K., editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2386–2394. PMLR.
  9. [9] Bhandari, J. and Russo, D. (2024). Global optimality guarantees for policy gradient methods. Operations Research, 72(5):1906–1927.
  10. [10] Bhandari, J., Russo, D., and Singal, R. (2018). A finite-time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692. PMLR.
  11. [11] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  12. [12] Cayci, S., He, N., and Srikant, R. (2024). Convergence of entropy-regularized natural policy gradient with linear function approximation. SIAM Journal on Optimization, 34(3):2729–2755.
  13. [13] Cen, S., Cheng, C., Chen, Y., Wei, Y., and Chi, Y. (2022). Fast global convergence of natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–2578.
  14. [14] Chen, L., Luo, H., and Rosenberg, A. (2022). Policy optimization for stochastic shortest path. In Conference on Learning Theory, pages 982–1046. PMLR.
  15. [15] Chen, Z. and Maguluri, S. T. (2025). An approximate policy iteration viewpoint of actor–critic algorithms. Automatica, 179:112395.
  16. [16] Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2024). A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. Operations Research, 72(4):1352–1367.
  17. [17] Geist, M., Scherrer, B., and Pietquin, O. (2019). A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR.
  18. [18] Gheshlaghi Azar, M., Munos, R., and Kappen, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91:325–349.
  19. [19] Gu, S., Holly, E., Lillicrap, T., and Levine, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE.
  20. [20] Gupta, A. and Mahajan, A. (2026). Operator-theoretic foundations and policy gradient methods for general MDPs with unbounded costs. Preprint arXiv:2603.17875.
  21. [21] Howard, R. (1964). Dynamic Programming and Markov Processes. Massachusetts Institute of Technology.
  22. [22] Ju, C. and Lan, G. (2022). Policy optimization over general state and action spaces. Preprint arXiv:2211.16715.
  23. [23] Ju, C. and Lan, G. (2026). Strongly-polynomial time and validation analysis of policy gradient methods. Mathematical Programming, pages 1–45.
  24. [24] Kakade, S. M. (2001). A natural policy gradient. Advances in Neural Information Processing Systems, 14.
  25. [25] Khodadadian, S., Jhunjhunwala, P. R., Varma, S. M., and Maguluri, S. T. (2022). On linear and super-linear convergence of natural policy gradient algorithm. Systems & Control Letters, 164:105214.
  26. [26] Lan, G. (2023). Policy mirror descent for reinforcement learning: Linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198(1):1059–1106.
  27. [27] Lan, G., Li, Y., and Zhao, T. (2023). Block policy mirror descent. SIAM Journal on Optimization, 33(3):2341–2378.
  28. [28] Lee, J. and Ryu, E. K. (2025). Why policy gradient algorithms work for undiscounted total-reward MDPs. Preprint arXiv:2510.18340.
  29. [29] Li, Y., Lan, G., and Zhao, T. (2022). First-order policy optimization for robust Markov decision process. Preprint arXiv:2209.10579.
  30. [30] Li, Y., Lan, G., and Zhao, T. (2024). Homotopic policy mirror descent: policy convergence, algorithmic regularization, and improved sample complexity. Mathematical Programming, 207(1):457–513.
  31. [31] Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. (2020). On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, pages 6820–6829. PMLR.
  32. [32] Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  33. [33] Ryu, E. K. and Yin, W. (2022). Large-Scale Convex Optimization: Algorithms & Analyses via Monotone Operators. Cambridge University Press.
  34. [34] Scherrer, B. (2014). Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322. PMLR.
  35. [35] Scherrer, B. (2016). Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 41(3):758–774.
  36. [36] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897. PMLR.
  37. [37] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. Preprint arXiv:1707.06347.
  38. [38] Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675.
  39. [39] Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD-learning. In Conference on Learning Theory, pages 2803–2830.
  40. [40] Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
  41. [41] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, pages 1057–1063.
  42. [42] Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.
  43. [43] Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
  44. [44] Xiao, L. (2022). On the convergence rates of policy gradient methods. Journal of Machine Learning Research, 23(282):1–36.
  45. [45] Ye, Y. (2011). The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603.
  46. [46] Yu, H. and Bertsekas, D. P. (2013a). On boundedness of Q-learning iterates for stochastic shortest path problems. Mathematics of Operations Research, 38(2):209–227.
  47. [47] Yu, H. and Bertsekas, D. P. (2013b). Q-learning and policy iteration algorithms for stochastic shortest path problems. Annals of Operations Research, 208(1):95–132.
  48. [48] Yuan, R., Du, S. S., Gower, R. M., Lazaric, A., and Xiao, L. (2023). Linear convergence of natural policy gradient methods with log-linear policies. In The Eleventh International Conference on Learning Representations.
  49. [49] Yurtsever, E., Lambert, J., Carballo, A., and Takeda, K. (2020). A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 8:58443–58469.
  50. [50] Zhan, W., Cen, S., Huang, B., Chen, Y., Lee, J. D., and Chi, Y. (2023). Policy mirror descent for regularized reinforcement learning: A generalized framework with linear convergence. SIAM Journal on Optimization, 33(2):1061–1091.
  51. [51] Internal anchor — paper appendix, Lemma B.5: Let $\{\pi_k\}$ be the sequence of policies generated by Algorithm 3. Then, for every $k \ge 1$, $\|Q^* - H_{\pi_k}(Q^*)\|_\infty \le \frac{2\gamma}{1-\gamma} \prod_{i=1}^{k-1} \bigl(1-(1-\gamma)\beta_i\bigr)\, \|Q^* - H_{\pi_0}(Q^*)\|_\infty$. As a result, when choosing $\beta_0 = 1$ and $\beta_k = \beta \in (0,1]$ for every $k \ge 1$, we have …