Global linear convergence of entropy-regularized softmax policy gradient beyond tabular MDPs
Pith reviewed 2026-06-30 11:58 UTC · model grok-4.3
The pith
Under realizability of the regularized Q-function, entropy-regularized softmax policy gradients achieve global linear convergence in two non-tabular feature regimes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under Q^π_τ-realizability, the regularized objective satisfies a non-uniform Polyak-Łojasiewicz inequality. In two feature regimes (full-affine-span features and simplex-valued features), the constant in that inequality stays bounded away from zero along the gradient flow, so the suboptimality of the entropy-regularized objective decays as O(e^{-Ct}) for some C > 0.
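The step from a Polyak-Łojasiewicz inequality to an exponential rate is short enough to sketch. The following is a generic version, assuming a gradient flow that ascends a regularized objective J_τ with optimal value J_τ^*, and a non-uniform PL constant c(θ) that is bounded below by some c̲ > 0 along the trajectory. These symbols and the normalization of the constant are illustrative assumptions, not the paper's notation.

```latex
% Illustrative notation (assumed, not taken from the paper):
% J_\tau: regularized objective, \theta_t: gradient-flow trajectory,
% c(\theta): non-uniform PL constant, \underline{c}: its lower bound along the flow.
\begin{align*}
\dot\theta_t &= \nabla J_\tau(\theta_t), \qquad \delta(t) := J_\tau^\ast - J_\tau(\theta_t), \\
\|\nabla J_\tau(\theta)\|^2 &\ge c(\theta)\,\bigl(J_\tau^\ast - J_\tau(\theta)\bigr) && \text{(non-uniform PL)}, \\
\dot\delta(t) &= -\|\nabla J_\tau(\theta_t)\|^2 \le -c(\theta_t)\,\delta(t) \le -\underline{c}\,\delta(t), \\
\delta(t) &\le \delta(0)\,e^{-\underline{c}\,t} && \text{(Gr\"onwall)}.
\end{align*}
```

In this reading the whole difficulty is the lower bound c̲: without it, c(θ_t) may decay to zero along the flow and the last line no longer follows.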
What carries the argument
A non-uniform Polyak-Łojasiewicz inequality for the regularized objective, made uniform by lower bounds on the smallest eigenvalue of the Fisher information matrix (full-affine-span features) or of the uncentered feature covariance matrix (simplex-valued features).
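To see why the two regimes are paired with different matrices, it helps to compute both for a toy softmax policy. The sketch below is a minimal NumPy illustration for a single state with three actions. It assumes that one-hot features are an instance of the simplex-valued regime and that three non-collinear points in R^2 are an instance of the full-affine-span regime. That interpretation, and every function and variable name, is an assumption made here for illustration.

```python
import numpy as np

def softmax_policy(theta, features):
    """Log-linear softmax policy pi(a) ∝ exp(<theta, phi(a)>) over finitely many actions."""
    logits = features @ theta
    logits = logits - logits.max()  # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()

def fisher_matrix(theta, features):
    """Feature covariance under pi, which is the Fisher information of this softmax."""
    pi = softmax_policy(theta, features)
    centered = features - pi @ features
    return centered.T @ (pi[:, None] * centered)

def uncentered_covariance(theta, features):
    """Second-moment matrix E_pi[phi phi^T], without subtracting the mean feature."""
    pi = softmax_policy(theta, features)
    return features.T @ (pi[:, None] * features)

def lambda_min(matrix):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return float(np.linalg.eigvalsh(matrix)[0])

# Hypothetical example 1: one-hot (tabular) features, each row lies in the simplex.
one_hot = np.eye(3)
theta_tab = np.array([0.5, -0.2, 0.1])
lam_fisher_one_hot = lambda_min(fisher_matrix(theta_tab, one_hot))
lam_uncentered_one_hot = lambda_min(uncentered_covariance(theta_tab, one_hot))

# Hypothetical example 2: three non-collinear feature vectors in R^2.
affine_features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
theta_aff = np.array([0.3, -0.4])
lam_fisher_affine = lambda_min(fisher_matrix(theta_aff, affine_features))

print(lam_fisher_one_hot, lam_uncentered_one_hot, lam_fisher_affine)
```

For one-hot features the Fisher matrix is singular along the all-ones direction, while the uncentered covariance is diag(π) with smallest eigenvalue min_a π(a). For the affinely spanning features the Fisher matrix itself is positive definite. This is consistent with the summary's pairing of each regime with its own matrix.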
If this is right
- The same linear rate extends to any log-linear softmax policy whose features satisfy the span or simplex condition.
- Global linear convergence holds for the entropy-regularized objective even when state and action spaces are continuous.
- The analysis recovers the known tabular rate as a special case when features are one-hot.
- The KL regularizer is radially unbounded in the identified subspaces, preventing escape to infinity in parameter space.
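The tabular special case in the list above can be checked numerically on a toy problem. The sketch below runs an explicit Euler discretization of the gradient flow on a single-state (bandit) problem with one-hot softmax parameters and Shannon-entropy regularization. The reward vector, the temperature `tau`, the step size, and all function names are illustrative assumptions; the paper's setting (infinite horizon, continuous spaces) is far more general.

```python
import numpy as np

def policy(theta):
    """Tabular softmax policy over actions."""
    shifted = theta - theta.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

def regularized_objective(theta, rewards, tau):
    """Expected reward plus tau times the Shannon entropy of the policy."""
    pi = policy(theta)
    return float(pi @ rewards - tau * (pi @ np.log(pi)))

def gradient(theta, rewards, tau):
    """Gradient of the regularized objective with respect to the softmax logits."""
    pi = policy(theta)
    soft_reward = rewards - tau * np.log(pi)
    return pi * (soft_reward - pi @ soft_reward)

def run_flow(rewards, tau, step=0.1, n_steps=2000):
    """Euler steps of the ascent flow; returns the suboptimality gap before each step."""
    theta = np.zeros_like(rewards)
    # Closed-form optimum of the entropy-regularized bandit: tau * logsumexp(r / tau).
    optimum = tau * np.log(np.sum(np.exp(rewards / tau)))
    gaps = []
    for _ in range(n_steps):
        gaps.append(optimum - regularized_objective(theta, rewards, tau))
        theta = theta + step * gradient(theta, rewards, tau)
    return np.array(gaps)

rewards = np.array([1.0, 0.5, 0.0])  # hypothetical rewards
gaps = run_flow(rewards, tau=0.5)
print(gaps[0], gaps[-1])
```

Plotting `np.log(gaps)` against the iteration index should give an approximately straight line after a short transient, which is the discrete-time signature of the O(e^{-Ct}) rate.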
Where Pith is reading between the lines
- The same bounding technique may apply to other regularizers whose geometry yields analogous eigenvalue control.
- If exact realizability is relaxed to a bounded approximation error, linear convergence would plausibly hold only up to a neighborhood of the optimum whose size tracks that error.
- The two feature regimes suggest a design principle: choose features whose convex hull or span keeps the covariance matrix well-conditioned along trajectories.
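The second conjecture in the list above can be made concrete with a standard perturbation argument. The sketch assumes, purely for illustration, that approximation error enters the differential inequality for the suboptimality δ(t) as an additive term ε ≥ 0, with the same rate constant C > 0. Neither the form of the perturbation nor these symbols come from the paper.

```latex
% Assumed perturbed inequality; \varepsilon stands in for the approximation error.
\begin{align*}
\dot\delta(t) &\le -C\,\delta(t) + \varepsilon, \\
\frac{d}{dt}\bigl(e^{Ct}\,\delta(t)\bigr) &\le \varepsilon\,e^{Ct}, \\
\delta(t) &\le e^{-Ct}\,\delta(0) + \frac{\varepsilon}{C}\,\bigl(1 - e^{-Ct}\bigr).
\end{align*}
```

Under this assumption the flow still contracts at a linear rate, but only to a neighborhood of radius ε/C around the optimal value.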
Load-bearing premise
The regularized state-action value function must be exactly realizable by the linear features, and the features must belong to one of the two classes that prevent the smallest eigenvalue from approaching zero.
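Exact realizability is a statement about all state-action pairs, but a finite-sample proxy is easy to write down. The sketch below fits a linear model to a vector of Q-values by least squares and reports the worst-case residual. The function name, the random features, and the sample Q-vectors are hypothetical, and a zero residual on a finite sample is only a necessary check, not a proof of realizability.

```python
import numpy as np

def realizability_residual(features, q_values):
    """Largest absolute error of the best least-squares linear fit to q_values."""
    weights, *_ = np.linalg.lstsq(features, q_values, rcond=None)
    return float(np.max(np.abs(features @ weights - q_values)))

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 2))  # 6 sampled state-action pairs, 2 features each

# A Q-vector that is linear in the features by construction.
q_realizable = features @ np.array([0.7, -1.2])
# A generic Q-vector, which a 2-dimensional feature map cannot represent exactly.
q_generic = rng.normal(size=6)

residual_realizable = realizability_residual(features, q_realizable)
residual_generic = realizability_residual(features, q_generic)
print(residual_realizable, residual_generic)
```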
What would settle it
A continuous-state MDP with linear features outside both regimes where the convergence rate of the regularized objective along the gradient flow is observed to be sublinear or to stall.
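Such an experiment needs a way to tell linear from sublinear decay in a recorded suboptimality curve. One simple diagnostic, sketched below, compares the slope of log-suboptimality against time on the early and late halves of the run. A linear rate gives a roughly constant negative slope, while a sublinear rate gives a slope that flattens toward zero. The function names and the two synthetic curves are assumptions for illustration; they are not data from the paper.

```python
import numpy as np

def log_gap_slope(times, gaps):
    """Least-squares slope of log(gap) against time."""
    slope, _intercept = np.polyfit(times, np.log(gaps), deg=1)
    return float(slope)

def early_and_late_slopes(times, gaps):
    """Slopes of log(gap) on the first and second halves of the run."""
    half = len(times) // 2
    early = log_gap_slope(times[:half], gaps[:half])
    late = log_gap_slope(times[half:], gaps[half:])
    return early, late

times = np.linspace(0.0, 50.0, 501)
linear_gaps = 2.0 * np.exp(-0.3 * times)   # synthetic curve with rate C = 0.3
sublinear_gaps = 1.0 / (1.0 + times)       # synthetic O(1/t) curve

linear_early, linear_late = early_and_late_slopes(times, linear_gaps)
sublinear_early, sublinear_late = early_and_late_slopes(times, sublinear_gaps)
print(linear_early, linear_late, sublinear_early, sublinear_late)
```

On real trajectories the gap must stay well above floating-point noise for the logarithm to be informative, so the fit should stop before the curve reaches machine precision.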
Original abstract
We study the global convergence of policy gradient for infinite-horizon entropy-regularized Markov decision processes (MDPs) with continuous state and action spaces. We consider log-linear softmax policies with linear function approximation, which extend the tabular softmax parameterization while retaining a tractable policy class. Under $Q^\pi_\tau$-realizability for the regularized state-action value function, we first establish a non-uniform Polyak--{\L}ojasiewicz (P\L) inequality. The non-uniformity arises through degeneracy of constants associated with the policy geometry, namely the Fisher information matrix or an uncentered feature covariance matrix. We then identify two feature regimes under which this non-uniform constant can be bounded along the gradient flow. For full-affine-span features, we prove radial unboundedness of the KL regularizer and show that the smallest eigenvalue of the Fisher information matrix remains bounded below by an initialization-dependent positive constant. For simplex-valued features, we prove an analogous radial unboundedness result in the subspace orthogonal to the all-ones vector and obtain a uniform lower bound for the smallest eigenvalue of the uncentered covariance matrix. These results imply global linear convergence of the regularized objective along the gradient flow, i.e. suboptimality decaying as $\mathcal{O}(e^{-Ct})$ for some $C>0$. Our analysis extends the global convergence theory of entropy-regularized softmax policy gradient beyond the tabular setting of Agarwal et al. (2020); Bhandari and Russo (2024); Mei et al. (2020).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that for infinite-horizon entropy-regularized MDPs with continuous state-action spaces and log-linear softmax policies under linear function approximation, Q^π_τ-realizability implies a non-uniform PL inequality whose degeneracy (via the smallest eigenvalue of the Fisher information or uncentered covariance) can be controlled along the gradient flow. In the full-affine-span feature regime, radial unboundedness of the KL regularizer yields an initialization-dependent uniform lower bound on λ_min of the Fisher matrix; in the simplex-valued regime an analogous bound holds in the orthogonal complement of the all-ones vector for the covariance matrix. These uniform bounds deliver global linear convergence of the regularized objective, i.e., suboptimality O(e^{-Ct}). The analysis extends the tabular global-convergence results of Agarwal et al. (2020), Bhandari & Russo (2024) and Mei et al. (2020).
Significance. If the derivations hold, the work supplies the first global linear rate for entropy-regularized policy gradient outside the tabular setting, under explicit and checkable feature conditions. The technical device of converting a non-uniform PL inequality into a uniform one via radial unboundedness of the regularizer is a clean and potentially reusable idea. The result is falsifiable once the two feature regimes are instantiated and supplies a concrete rate constant C that depends only on initialization and problem primitives.
minor comments (3)
- The abstract states that the smallest-eigenvalue lower bound is 'initialization-dependent' for the full-affine-span case; the main text should make explicit whether this dependence appears only in the transient or also affects the asymptotic rate C (cf. the O(e^{-Ct}) claim).
- Notation for the regularized value function Q^π_τ is introduced without an explicit definition of the entropy coefficient τ; a one-line reminder of its scaling would improve readability in §2.
- The two feature regimes are presented as sufficient conditions; a brief remark on whether they are also necessary (or on the existence of counter-examples outside these regimes) would clarify the scope of the result.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our contributions, as well as the favorable significance assessment. The recommendation for minor revision is noted. However, the report contains no specific major comments to address.
Circularity Check
Derivation is self-contained with no circular reductions
full rationale
The paper establishes a non-uniform PL inequality from Q^π_τ-realizability, then uses two explicit feature regimes (full-affine-span or simplex-valued) to obtain uniform eigenvalue lower bounds via radial unboundedness of the KL regularizer, directly implying the O(e^{-Ct}) rate along the gradient flow. These steps are derived from the stated assumptions without any reduction of the convergence claim to fitted parameters, self-definitions, or load-bearing self-citations. The extension of tabular results cites independent prior work by other authors (Agarwal et al., Bhandari and Russo, Mei et al.), which serves as external context rather than circular justification. The analysis is self-contained against the given assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Properties of gradient flows and Polyak-Łojasiewicz inequalities hold under the stated policy parameterization.