Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Rui Gao; Shuang Li; Zhaoyu Zhu

arxiv: 2605.26078 · v3 · pith:GL3EF2IVnew · submitted 2026-05-25 · 💻 cs.LG

Pith reviewed 2026-06-29 22:17 UTC · model grok-4.3

classification 💻 cs.LG

keywords: Wasserstein policy gradient, entropy-regularized reinforcement learning, global convergence, log-Sobolev inequality, Polyak-Łojasiewicz condition, Bellman residual, continuous control, reinforcement learning

The pith

Wasserstein policy gradient converges globally for entropy-regularized reinforcement learning by establishing a distributional Polyak-Lojasiewicz condition from Bellman structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a global convergence theory for Wasserstein policy gradient in entropy-regularized RL. It replaces convexity with Bellman-based arguments: the soft Bellman residual has a statewise KL representation to a Gibbs policy, contraction links the residual to the optimality gap, and a resolvent identity connects value improvement to relative Fisher information. These combine with a uniform log-Sobolev inequality on the evolving Gibbs policies to produce a distributional PL condition. Regularity bounds then control discretization error to obtain geometric contraction up to bias. A reader would care because the result explains why this optimal-transport method succeeds for continuous-action problems despite the non-convex objective.

Core claim

The authors show that the Bellman recursion of entropy-regularized RL induces a distributional Polyak-Łojasiewicz geometry for Wasserstein gradient flow. The soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality for the evolving Gibbs family, these ingredients yield a distributional Polyak-Łojasiewicz condition. Regularity and uniform bounds then control the discretization error, which yields geometric contraction up to a discretization bias.
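The chain of ingredients named above can be written out schematically. This is a sketch under assumed notation, not the paper's statement: the temperature \tau, the normalizer Z_Q, the LSI constant \alpha, and the constant C are all placeholders introduced here for illustration. The first two displays are standard identities for Gibbs distributions; the last display only indicates the general shape a distributional PL condition could take, with the actual constants and state weighting left to the paper.

```latex
% Assumed notation (not taken from the paper): temperature \tau > 0,
% soft Q-function Q, LSI constant \alpha > 0, generic constant C > 0.

% Gibbs policy associated with a soft Q-function
\[
\pi_Q(a \mid s) = \frac{\exp\big(Q(s,a)/\tau\big)}{Z_Q(s)},
\qquad
Z_Q(s) = \int \exp\big(Q(s,a)/\tau\big)\, da .
\]

% Statewise KL representation of the entropy-regularized one-step value
\[
\int \pi(a \mid s)\,\big[\,Q(s,a) - \tau \log \pi(a \mid s)\,\big]\, da
= \tau \log Z_Q(s)
- \tau\, \mathrm{KL}\big(\pi(\cdot \mid s)\,\big\|\,\pi_Q(\cdot \mid s)\big).
\]

% Log-Sobolev inequality for the Gibbs policy (uniform if \alpha does not
% depend on the state or on the iterate)
\[
\mathrm{KL}\big(\pi \,\big\|\, \pi_Q\big)
\le \frac{1}{2\alpha}\, I\big(\pi \,\big\|\, \pi_Q\big),
\qquad
I\big(\pi \,\big\|\, \pi_Q\big)
= \int \pi(a \mid s)\,\Big|\nabla_a \log \frac{\pi(a \mid s)}{\pi_Q(a \mid s)}\Big|^2 da .
\]

% Schematic shape of a distributional PL condition (constants are placeholders)
\[
J^\star - J(\pi)
\;\le\; C\, \mathbb{E}_{s}\Big[\, I\big(\pi(\cdot \mid s)\,\big\|\,\pi_{Q^{\pi}}(\cdot \mid s)\big) \Big].
\]
```

Read this way, the KL identity turns the Bellman residual into a divergence from the Gibbs policy, and the LSI upgrades that divergence to a Fisher-information bound, which is the quantity that controls the rate of value improvement along the flow.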

What carries the argument

The distributional Polyak-Lojasiewicz condition derived from the soft Bellman residual, Bellman contraction, resolvent identity, and uniform log-Sobolev inequality on the evolving Gibbs policies.

If this is right

The optimality gap contracts geometrically along the continuous-time Wasserstein flow.
Discrete-time implementations converge globally up to a controllable discretization bias.
Global convergence holds without requiring the RL objective to be convex in the usual flat sense.
The Bellman structure alone suffices to replace convexity in the convergence argument.
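The phrase "geometric contraction up to a discretization bias" can be illustrated with a scalar recursion. The sketch below is not the paper's bound: `rate` and `bias` are hypothetical stand-ins for a per-step contraction factor and a per-step discretization error, and the inequality is iterated as an equality to show the worst case.

```python
def gap_recursion(gap0, rate, bias, steps):
    """Iterate gap_{k+1} = (1 - rate) * gap_k + bias and return all iterates.

    `rate` and `bias` are illustrative placeholders, not constants from the paper.
    """
    gaps = [gap0]
    for _ in range(steps):
        gaps.append((1.0 - rate) * gaps[-1] + bias)
    return gaps


def gap_closed_form(gap0, rate, bias, k):
    """Closed form of the recursion: geometric decay toward the floor bias / rate."""
    floor = bias / rate
    return (1.0 - rate) ** k * (gap0 - floor) + floor


if __name__ == "__main__":
    gaps = gap_recursion(gap0=10.0, rate=0.1, bias=0.01, steps=200)
    # The iterates decay geometrically and then stall near bias / rate.
    print(gaps[0], gaps[10], gaps[-1])
```

The gap does not go to zero: it settles at `bias / rate`, so shrinking the step-dependent bias faster than the rate is what makes the residual error controllable.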

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Bellman-induced geometry might support convergence proofs for other optimal-transport policy methods.
Checking whether the uniform LSI holds in common continuous-control benchmarks would test the theory's practical reach.
The approach suggests that Bellman recursion can create favorable optimization geometries in broader classes of non-convex RL problems.

Load-bearing premise

The uniform log-Sobolev inequality holds for the sequence of Gibbs policies generated by the optimization iterates.
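To make the premise concrete, here is a minimal check in the one case where everything is available in closed form. It assumes a one-dimensional quadratic soft Q-function Q(a) = -kappa * a^2 / 2 at temperature tau, so the Gibbs policy is the Gaussian N(0, tau / kappa) and its log-Sobolev constant is kappa / tau (the standard Gaussian case). The function names and the quadratic model are illustrative assumptions, not objects from the paper.

```python
import math


def kl_gauss(mu, s, sigma):
    """KL( N(mu, s^2) || N(0, sigma^2) ) in closed form."""
    return math.log(sigma / s) + (s**2 + mu**2) / (2.0 * sigma**2) - 0.5


def fisher_gauss(mu, s, sigma):
    """Relative Fisher information of N(mu, s^2) with respect to N(0, sigma^2)."""
    return mu**2 / sigma**4 + (s / sigma**2 - 1.0 / s) ** 2


def lsi_constant(kappa, tau):
    """LSI constant of the Gibbs policy for Q(a) = -kappa * a^2 / 2 (toy model)."""
    return kappa / tau


def lsi_holds(mu, s, kappa, tau):
    """Check KL <= Fisher / (2 * alpha) for a Gaussian policy N(mu, s^2)."""
    sigma = math.sqrt(tau / kappa)  # standard deviation of the Gibbs policy
    alpha = lsi_constant(kappa, tau)
    return kl_gauss(mu, s, sigma) <= fisher_gauss(mu, s, sigma) / (2.0 * alpha) + 1e-12


if __name__ == "__main__":
    print(lsi_holds(mu=2.0, s=0.3, kappa=0.5, tau=0.2))
    # A flatter Q (smaller kappa) gives a smaller LSI constant.
    print(lsi_constant(1.0, 1.0), lsi_constant(0.01, 1.0))
```

In this toy model the constant degrades as the curvature of Q in the action variable flattens, which is the kind of deterioration a uniform bound has to rule out along the iterates.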

What would settle it

An entropy-regularized RL instance where the log-Sobolev constant for the Gibbs policies deteriorates along the Wasserstein policy gradient iterates, so that the distributional PL inequality fails and geometric contraction does not hold.

Original abstract

Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates. In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak-Łojasiewicz condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable Polyak-Łojasiewicz-type (PL) geometry that supports global convergence of WPG.
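The update described in the abstract, a drift along the action gradient of the soft Q-function plus a Langevin-type diffusion, can be sketched on action particles for a single state. This is a toy illustration only: the soft Q-function is frozen to the quadratic Q(a) = -0.5 * (a - m)^2, whereas in WPG it changes with the policy through the Bellman recursion, which is exactly why static Langevin analyses fall short. The names `m`, `tau`, and `eta` (target mean, temperature, step size) are assumptions made for the example.

```python
import numpy as np


def soft_q_grad(a, m):
    """Action gradient of the frozen toy soft Q-function Q(a) = -0.5 * (a - m)**2."""
    return -(a - m)


def langevin_step(a, m, tau, eta, rng):
    """One Euler step of da = grad_a Q dt + sqrt(2 * tau) dW on all particles."""
    noise = rng.standard_normal(a.shape)
    return a + eta * soft_q_grad(a, m) + np.sqrt(2.0 * eta * tau) * noise


def run_particles(n_particles=20000, n_steps=400, m=1.5, tau=0.5, eta=0.1, seed=0):
    """Evolve action particles for one state; returns samples from the discretized flow."""
    rng = np.random.default_rng(seed)
    a = 3.0 * rng.standard_normal(n_particles)  # arbitrary initial policy
    for _ in range(n_steps):
        a = langevin_step(a, m, tau, eta, rng)
    return a


if __name__ == "__main__":
    samples = run_particles()
    # Gibbs policy for this Q is N(m, tau); the discretized chain has
    # stationary variance tau / (1 - eta / 2), slightly larger than tau.
    print(samples.mean(), samples.var())
```

For this linear drift the stationary variance of the discretized chain is tau / (1 - eta / 2) rather than tau, a small concrete instance of a discretization bias that vanishes as the step size goes to zero.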

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a global convergence proof for Wasserstein policy gradient by turning Bellman structure into a distributional PL condition, but the uniform LSI on the policy sequence is the step that still needs verification.

The main point is that the paper proves global convergence of Wasserstein policy gradient for entropy-regularized reinforcement learning. It does so by showing that the Bellman structure induces a distributional Polyak-Lojasiewicz condition through a KL representation of the soft Bellman residual, combined with contraction and a resolvent identity, plus a uniform log-Sobolev inequality on the Gibbs policies.

They handle the fact that the objective is not convex in the usual sense by using the RL recursion instead. This is a solid move and the steps in the abstract are coherent. They also claim to get the regularity needed for discretization error control.

The soft spot is the uniform LSI. The analysis needs the LSI constant to stay bounded along the iterates. If it does not, the contraction rate is no longer uniformly geometric. The abstract says they establish uniform bounds, but without seeing the details it is hard to tell whether the bound is truly independent of the policy sequence or whether it relies on additional assumptions such as bounded rewards and Lipschitz dynamics.

This is aimed at people doing theory in continuous-action RL. Someone studying policy optimization with transport geometry or Langevin methods in RL would find the technique useful.

It has enough new content and a clear argument to merit peer review. The thinking is serious and the claim is specific.

Recommendation: Send it for review.

Referee Report

2 major / 1 minor

Summary. The paper develops a global convergence theory for Wasserstein policy gradient (WPG) applied to entropy-regularized RL. It replaces standard convexity arguments with Bellman-based ingredients: a statewise KL representation of the soft Bellman residual, Bellman contraction relating the residual to the optimality gap, and a Bellman resolvent identity linking value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality on the sequence of Gibbs policies, these yield a distributional Polyak-Łojasiewicz inequality; regularity and uniform bounds are then established to control discretization error, producing geometric contraction up to a bias term.

Significance. If the uniform LSI holds along the generated policy sequence, the result supplies the first rigorous global convergence guarantee for WPG in continuous-action settings and demonstrates that the Bellman recursion induces a favorable PL geometry even though the RL objective is not convex in the usual sense. The use of optimal-transport geometry together with resolvent identities is a technically distinctive contribution.

major comments (2)

[Abstract] The distributional PL condition is obtained only after invoking a uniform LSI on the evolving Gibbs family {π_θ(·|s)}. The manuscript must either derive a uniform LSI constant from the problem data (bounded rewards, Lipschitz dynamics) or state it explicitly as an additional hypothesis; without this, the contraction rate becomes iteration-dependent and geometric convergence is lost.
[Discretization-error section, where uniform bounds are claimed] The control of discretization bias inherits the LSI constant; the dependence of the final rate on this constant should be stated explicitly so that the overall geometric rate remains verifiable when the LSI is only conditionally uniform.

minor comments (1)

[Abstract] Notation for the state-conditional policies and the soft Q-function should be introduced once and used consistently; the current abstract mixes π_θ and Gibbs policies without a single forward reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. We address each major comment below and will revise the manuscript to improve clarity on the role of the uniform LSI assumption.

Referee: [Abstract] Abstract: The distributional PL condition is obtained only after invoking a uniform LSI on the evolving Gibbs family {π_θ(·|s)}. The manuscript must either derive a uniform LSI constant from the problem data (bounded rewards, Lipschitz dynamics) or state it explicitly as an additional hypothesis; without this, the contraction rate becomes iteration-dependent and geometric convergence is lost.

Authors: We agree that the uniform LSI must be stated explicitly as a hypothesis. The current manuscript invokes it for the sequence of Gibbs policies but does not label it as such in the abstract or theorem statements. In the revision we will add an explicit assumption (Assumption X) stating that the family of Gibbs policies satisfies a uniform LSI with constant independent of the iteration, and we will restate the abstract and main theorems accordingly. While we discuss sufficient conditions (compact action space, bounded rewards, Lipschitz dynamics) under which such uniformity can hold, a general derivation of an explicit LSI constant solely from arbitrary problem data is not provided and would require additional technical work beyond the scope of the present analysis.
Revision: yes

Referee: [Discretization-error control] Discretization-error section (where uniform bounds are claimed): The control of discretization bias inherits the LSI constant; the dependence of the final rate on this constant should be stated explicitly so that the overall geometric rate remains verifiable when the LSI is only conditionally uniform.

Authors: We accept the point. The discretization analysis already uses the LSI constant to bound the bias term, but the dependence is not written out in the final rate statement. In the revised version we will insert an explicit factor of the LSI constant (denoted C_LSI) into the statement of the geometric contraction rate and into the discretization-error bound, making the overall rate verifiable under the uniform-LSI hypothesis.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on explicit external assumption

The paper's central steps (Bellman residual to KL representation, contraction to optimality gap, resolvent identity to Fisher information) are standard RL identities applied to the entropy-regularized objective. The distributional PL condition is obtained only after adjoining an explicit uniform LSI hypothesis on the evolving Gibbs family; this LSI is listed as an input rather than derived from the WPG iterates or fitted to data. No parameters are estimated from a subset and then relabeled as predictions, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz is smuggled via prior work. The discretization-error bounds are controlled by regularity assumptions stated separately. The chain therefore remains non-circular once the LSI hypothesis is granted.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The proof rests on three mathematical assumptions drawn from RL theory and analysis; no free parameters or new entities are introduced in the abstract.

axioms (3)

domain assumption: Uniform log-Sobolev inequality holds for the family of Gibbs policies generated along the iterates.
  Combined with the Bellman resolvent identity to obtain the distributional PL condition.
standard math: Bellman contraction property for the soft Bellman operator.
  Used to relate the statewise KL residual to the global optimality gap.
domain assumption: Regularity and uniform bounds on the soft Q-function along the policy iterates.
  Required to control discretization error of the continuous-time flow.

