Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs

David Šiška; Lukasz Szpruch; Ziyue Chen

arXiv: 2605.24939 · v1 · pith:C33AFYNRnew · submitted 2026-05-24 · cs.LG · math.OC

Pith reviewed 2026-06-30 11:58 UTC · model grok-4.3

classification: cs.LG · math.OC

keywords: policy gradient · entropy regularization · linear function approximation · global convergence · Markov decision processes · softmax policies · Polyak-Łojasiewicz inequality

The pith

Under realizability of the regularized Q-function, entropy-regularized softmax policy gradients achieve global linear convergence in two non-tabular feature regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that, for infinite-horizon entropy-regularized MDPs with continuous state and action spaces, log-linear softmax policies with linear function approximation converge globally at a linear rate along the gradient flow. The result holds when the regularized state-action value function is realizable and the features are either full-affine-span or simplex-valued; in both classes the smallest eigenvalue of the relevant matrix (the Fisher information matrix or the uncentered feature covariance matrix, respectively) stays bounded away from zero. The analysis removes the tabular restriction of earlier work by controlling the degeneracy of the Polyak-Łojasiewicz constant through radial-unboundedness arguments on the KL regularizer. A reader cares because the same policy class now comes with provable global optimality guarantees outside finite state-action tables.
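For orientation, the policy class and objective described above can be written in a common textbook form. The notation here is an assumption made for illustration and is not taken from the paper: $\phi$ is a feature map into $\mathbb{R}^d$, $\mu$ is a reference measure on the action space $A$, $\gamma \in (0,1)$ is a discount factor, and $\tau > 0$ is the regularization strength. The paper's own definitions may differ in detail.

```latex
% Log-linear softmax policy with linear features (assumed notation)
\pi_\theta(\mathrm{d}a \mid s)
  = \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}
         {\int_A \exp\!\big(\theta^\top \phi(s,a')\big)\,\mu(\mathrm{d}a')}\,
    \mu(\mathrm{d}a),
  \qquad \theta \in \mathbb{R}^d .

% Regularized value with a KL penalty relative to \mu (assumed form)
V^{\pi}_\tau(s)
  = \mathbb{E}^{\pi}_{s}\!\left[\sum_{t=0}^{\infty} \gamma^{t}
      \Big( r(s_t,a_t)
            - \tau\,\mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\mu\big) \Big)\right].
```

When the state and action sets are finite and $\phi$ is one-hot over state-action pairs, this parameterization reduces to the tabular softmax policy.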

Core claim

Under Q^π_τ-realizability, the regularized objective satisfies a non-uniform Polyak-Łojasiewicz inequality whose constant is controlled along the flow in the two feature regimes; this yields suboptimality decaying as O(e^{-Ct}) for the gradient flow of the entropy-regularized objective.
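The step from a PL inequality to an exponential rate is a standard argument and can be sketched as follows. The sketch assumes a uniform constant $c > 0$ (a hypothetical symbol), a maximization objective $J$ with optimal value $J^\star$, and the ascent flow $\dot\theta_t = \nabla J(\theta_t)$; the paper's constants and normalization may differ.

```latex
% PL inequality with a uniform constant c > 0 (assumed form)
J^\star - J(\theta) \le \frac{1}{2c}\,\|\nabla J(\theta)\|^2 .

% Along the flow \dot\theta_t = \nabla J(\theta_t), the chain rule gives
\frac{\mathrm{d}}{\mathrm{d}t}\big(J^\star - J(\theta_t)\big)
  = -\|\nabla J(\theta_t)\|^2
  \le -2c\,\big(J^\star - J(\theta_t)\big),

% and Gronwall's inequality yields the exponential decay
J^\star - J(\theta_t) \le e^{-2ct}\,\big(J^\star - J(\theta_0)\big).
```

If the constant depends on the parameter, $c = c(\theta)$, the same computation gives a rate only when $\inf_{t \ge 0} c(\theta_t) > 0$, which is why control along the flow matters.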

What carries the argument

Non-uniform Polyak-Łojasiewicz inequality for the regularized objective, made uniform by lower bounds on the smallest eigenvalue of the Fisher information matrix (full-affine-span features) or uncentered covariance matrix (simplex-valued features).
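The difference between the two matrices can be seen in a small numerical sketch. It assumes a single state and a finite action set, so that the Fisher information matrix of a log-linear softmax policy is the centered covariance of the features under the policy; the paper works with continuous spaces and may define these matrices differently. All function and variable names are hypothetical.

```python
import numpy as np

def softmax_policy(theta, features):
    """pi(a) proportional to exp(theta . phi(a)); rows of `features` are phi(a)."""
    logits = features @ theta
    weights = np.exp(logits - logits.max())  # shift for numerical stability
    return weights / weights.sum()

def fisher_matrix(theta, features):
    """Centered feature covariance under pi (Fisher matrix in this toy setting)."""
    pi = softmax_policy(theta, features)
    centered = features - pi @ features
    return centered.T @ (pi[:, None] * centered)

def uncentered_covariance(theta, features):
    """Second-moment matrix sum_a pi(a) phi(a) phi(a)^T."""
    pi = softmax_policy(theta, features)
    return features.T @ (pi[:, None] * features)

# Three non-collinear points: their affine span is all of R^2.
full_span = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# One-hot features: every row lies on the probability simplex in R^3.
one_hot = np.eye(3)

theta2 = np.array([0.3, -0.2])
theta3 = np.array([0.3, -0.2, 0.1])

# eigvalsh returns eigenvalues in ascending order, so index 0 is the smallest.
lam_fisher_full = np.linalg.eigvalsh(fisher_matrix(theta2, full_span))[0]
lam_fisher_onehot = np.linalg.eigvalsh(fisher_matrix(theta3, one_hot))[0]
lam_uncentered_onehot = np.linalg.eigvalsh(uncentered_covariance(theta3, one_hot))[0]

print("Fisher, full-affine-span features:", lam_fisher_full)
print("Fisher, one-hot features:         ", lam_fisher_onehot)
print("Uncentered, one-hot features:     ", lam_uncentered_onehot)
```

In this toy case the Fisher matrix for one-hot features is singular along the all-ones direction, while the uncentered covariance equals the diagonal matrix of action probabilities and stays positive definite. That is consistent with the summary's use of a different matrix for simplex-valued features.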

If this is right

The same linear rate extends to any log-linear softmax policy whose features satisfy the span or simplex condition.
Global linear convergence holds for the entropy-regularized objective even when state and action spaces are continuous.
The analysis recovers the known tabular rate as a special case when features are one-hot.
The KL regularizer is radially unbounded in the identified subspaces, preventing escape to infinity in parameter space.
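A toy experiment illustrates what a linear rate looks like in practice. The sketch below is not the paper's setting: it assumes a single-state problem with three actions, Shannon-entropy regularization, full-affine-span features in R^2, and a small-step Euler discretization standing in for the gradient flow. The reward vector, step size, and all names are hypothetical choices.

```python
import numpy as np

def softmax_policy(theta, features):
    logits = features @ theta
    weights = np.exp(logits - logits.max())  # shift for numerical stability
    return weights / weights.sum()

def objective(theta, features, rewards, tau):
    """Expected reward plus tau times the Shannon entropy of the policy."""
    pi = softmax_policy(theta, features)
    return float(pi @ (rewards - tau * np.log(pi)))

def gradient(theta, features, rewards, tau):
    """Gradient of the objective with respect to theta."""
    pi = softmax_policy(theta, features)
    soft_adv = rewards - tau * np.log(pi)
    soft_adv = soft_adv - pi @ soft_adv      # center under the policy
    return features.T @ (pi * soft_adv)

# Three non-collinear feature vectors, so any reward on three actions is
# an affine function of the features (realizability holds in this toy case).
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
rewards = np.array([1.0, 0.0, 0.5])
tau, step, n_steps = 0.5, 0.1, 3000

# Optimal regularized value over all policies: tau * log sum_a exp(r(a) / tau).
optimal_value = tau * np.log(np.sum(np.exp(rewards / tau)))

theta = np.zeros(2)
gaps = []
for _ in range(n_steps):
    gaps.append(optimal_value - objective(theta, features, rewards, tau))
    theta = theta + step * gradient(theta, features, rewards, tau)  # Euler step
gaps = np.array(gaps)

# Fit a line to log(gap) against time; the negated slope estimates the rate C.
times = step * np.arange(n_steps)
mask = gaps > 1e-12                          # skip values near round-off
rate = -np.polyfit(times[mask], np.log(gaps[mask]), 1)[0]
print(f"initial gap {gaps[0]:.3e}, final gap {gaps[-1]:.3e}, fitted rate {rate:.3f}")
```

A roughly straight line of log-suboptimality against time is the signature of the O(e^{-Ct}) behaviour; a curve that flattens would indicate a sublinear rate.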

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same bounding technique may apply to other regularizers whose geometry yields analogous eigenvalue control.
If the realizability assumption is relaxed to approximation error, the linear rate would degrade to a neighborhood whose size tracks that error.
The two feature regimes suggest a design principle: choose features whose convex hull or span keeps the covariance matrix well-conditioned along trajectories.
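The second extension can be made concrete with a generic perturbed Grönwall estimate. This is an illustration of the editorial conjecture and not a result from the paper: $\delta_t$ denotes the suboptimality at time $t$, and $\varepsilon \ge 0$ is a hypothetical constant standing in for the effect of approximation error.

```latex
% If the approximation error adds a constant term to the decay inequality,
\dot\delta_t \le -C\,\delta_t + \varepsilon
\quad\Longrightarrow\quad
\delta_t \le e^{-Ct}\,\delta_0 + \frac{\varepsilon}{C}\,\big(1 - e^{-Ct}\big)
         \le e^{-Ct}\,\delta_0 + \frac{\varepsilon}{C}.
```

Under this assumed form the iterates still contract at rate $C$, but only to a neighborhood of radius $\varepsilon / C$ around the optimum.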

Load-bearing premise

The regularized state-action value function must be exactly realizable by the linear features, and the features must belong to one of the two classes that prevent the smallest eigenvalue from approaching zero.

What would settle it

A continuous-state MDP with linear features outside both regimes where the convergence rate of the regularized objective along the gradient flow is observed to be sublinear or to stall.

Original abstract

We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation, which extend the tabular softmax parameterization while retaining a tractable policy class. Under $Q^\pi_\tau$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak-Łojasiewicz (PŁ) inequality. The non-uniformity arises through degeneracy of constants associated with the policy geometry, namely the Fisher information matrix or an uncentered feature covariance matrix. We then identify two feature regimes under which this non-uniform constant can be bounded along the gradient flow. For full-affine-span features, we prove radial unboundedness of the KL regularizer and show that the smallest eigenvalue of the Fisher information matrix remains bounded below by an initialization-dependent positive constant. For simplex-valued features, we prove an analogous radial unboundedness result in the subspace orthogonal to the all-ones vector and obtain a uniform lower bound for the smallest eigenvalue of the uncentered covariance matrix. These results imply global linear convergence of the regularized objective along the gradient flow, i.e., suboptimality decaying as $\mathcal{O}(e^{-Ct})$ for some $C>0$. Our analysis extends the global convergence theory of entropy-regularized softmax policy gradient beyond the tabular setting of Agarwal et al. (2020); Bhandari and Russo (2024); Mei et al. (2020).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This extends global linear convergence of entropy-regularized softmax PG to continuous MDPs with linear FA under Q-realizability plus two explicit feature regimes.

The main advance is showing that the non-uniform PL inequality for the regularized objective can be turned into a uniform one along the gradient flow. They do this by proving radial unboundedness of the KL regularizer under either full-affine-span features or simplex-valued features, which keeps the smallest eigenvalue of the Fisher information or covariance matrix bounded away from zero by an initialization-dependent constant. That yields the O(e^{-Ct}) decay on suboptimality.

The strategy follows the tabular proofs but adds the continuous-space geometry arguments needed to control degeneracy. The two regimes are handled separately yet with parallel reasoning, and the abstract makes the path from assumptions to rate explicit.

The realizability condition on the regularized Q is standard but strong; the feature conditions are narrow, which the paper states clearly rather than hiding. Everything is for the continuous-time flow, so discrete updates are left open. No circularity appears in the rate derivation.

The work is aimed at RL theorists who already know the tabular entropy-regularized results and want to see how the same ideas carry over when states and actions are continuous. Anyone tracking convergence rates with linear function approximation will find the eigenvalue control useful.

It is worth sending to peer review. The claim is precise, the approach is direct, and the gaps are technical rather than foundational.

Referee Report

0 major / 3 minor

Summary. The paper claims that for infinite-horizon entropy-regularized MDPs with continuous state-action spaces and log-linear softmax policies under linear function approximation, Q^π_τ-realizability implies a non-uniform PL inequality whose degeneracy (via the smallest eigenvalue of the Fisher information or uncentered covariance) can be controlled along the gradient flow. In the full-affine-span feature regime, radial unboundedness of the KL regularizer yields an initialization-dependent uniform lower bound on λ_min of the Fisher matrix; in the simplex-valued regime an analogous bound holds in the orthogonal complement of the all-ones vector for the covariance matrix. These uniform bounds deliver global linear convergence of the regularized objective, i.e., suboptimality O(e^{-Ct}). The analysis extends the tabular global-convergence results of Agarwal et al. (2020), Bhandari & Russo (2024) and Mei et al. (2020).

Significance. If the derivations hold, the work supplies the first global linear rate for entropy-regularized policy gradient outside the tabular setting, under explicit and checkable feature conditions. The technical device of converting a non-uniform PL inequality into a uniform one via radial unboundedness of the regularizer is a clean and potentially reusable idea. The result is falsifiable once the two feature regimes are instantiated and supplies a concrete rate constant C that depends only on initialization and problem primitives.

minor comments (3)

The abstract states that the smallest-eigenvalue lower bound is 'initialization-dependent' for the full-affine-span case; the main text should make explicit whether this dependence appears only in the transient or also affects the asymptotic rate C (cf. the O(e^{-Ct}) claim).
Notation for the regularized value function Q^π_τ is introduced without an explicit definition of the entropy coefficient τ; a one-line reminder of its scaling would improve readability in §2.
The two feature regimes are presented as sufficient conditions; a brief remark on whether they are also necessary (or on the existence of counter-examples outside these regimes) would clarify the scope of the result.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our contributions, as well as the favorable significance assessment. The recommendation for minor revision is noted. However, the report contains no specific major comments to address.

Circularity Check

0 steps flagged

Derivation is self-contained with no circular reductions

The paper establishes a non-uniform PL inequality from Q^π_τ-realizability, then uses two explicit feature regimes (full-affine-span or simplex-valued) to obtain uniform eigenvalue lower bounds via radial unboundedness of the KL regularizer, directly implying the O(e^{-Ct}) rate along the gradient flow. These steps are derived from the stated assumptions without any reduction of the convergence claim to fitted parameters, self-definitions, or load-bearing self-citations. The extension of tabular results cites independent prior work by other authors (Agarwal et al., Bhandari and Russo, Mei et al.), which serves as external context rather than circular justification. The analysis is self-contained against the given assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the Q-realizability assumption and the two feature regimes as domain assumptions in RL theory. No free parameters or invented entities are introduced. Standard optimization inequalities are invoked but not detailed in the abstract.

axioms (1)

standard math: Properties of gradient flows and Polyak-Łojasiewicz inequalities hold under the stated policy parameterization.
Invoked to obtain the non-uniform PL inequality and linear convergence rate.

pith-pipeline@v0.9.1-grok · 5818 in / 1244 out tokens · 38894 ms · 2026-06-30T11:58:10.108927+00:00 · methodology
