Recognition: 2 theorem links
· Lean TheoremLearning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret
Pith reviewed 2026-05-13 01:45 UTC · model grok-4.3
The pith
Strong duality holds for weakly communicating average-reward CMDPs over stationary policies by exploiting the geometric structure of occupation measures, enabling an algorithm with improved O(T^{2/3}) regret and constraint violation bounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that for weakly communicating average-reward CMDPs with finite states and actions, strong duality holds over stationary policies by exploiting the geometric structure of the occupation measure set, despite nonconvexity. Building on this duality, a primal-dual clipped value iteration algorithm is proposed for the linear setting; it achieves regret and constraint violation bounds of order tilde O(T^{2/3}) by decomposing the Lagrangian regret into separate components via strong duality and adapting clipped value iteration through finite-horizon approximation.
What carries the argument
The geometric structure of the occupation measure set, which establishes strong duality without requiring convexity under the weakly communicating assumption and supports the subsequent regret decomposition.
If this is right
- Strong duality permits decomposition of the composite Lagrangian regret into independent bounds on regret and constraint violation.
- The primal-dual algorithm extends clipped value iteration to the constrained average-reward setting through a finite-horizon approximation that stabilizes the dual variable.
- The method attains tilde O(T^{2/3}) bounds simultaneously on regret and constraint violation for linear CMDPs.
- The approach relies on stationary policies and works under the weakly communicating assumption with finite state-action spaces.
Where Pith is reading between the lines
- If the geometric argument extends beyond finite spaces, similar duality results could apply to countably infinite weakly communicating CMDPs.
- One could check the duality gap numerically on small hand-crafted CMDPs that satisfy weak communication but border on the boundary of the assumption.
- The finite-horizon approximation technique for stabilizing dual variables might transfer to other primal-dual reinforcement learning settings with average rewards.
Load-bearing premise
The weakly communicating assumption on the CMDP together with finite state and action spaces is needed to obtain strong duality and to carry out the regret analysis.
What would settle it
A concrete falsifier is an explicit weakly communicating average-reward CMDP with finite states and actions in which the primal and dual values differ by a positive amount, or an instance where the proposed algorithm exhibits regret growing faster than order T to the 2/3.
Figures
read the original abstract
We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal--dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces by exploiting the geometric structure of the occupation measure set, despite nonconvexity and lack of an LP formulation. Building on this, it proposes a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs, achieving regret and constraint violation bounds of Õ(T^{2/3}) via a finite-horizon approximation to stabilize the dual variable and a novel Lagrangian regret decomposition into separate regret and violation terms.
Significance. If the duality result holds under the stated assumptions, this provides improved regret bounds for constrained average-reward RL, advancing beyond prior work on communicating or unichain settings. The geometric approach to duality in nonconvex occupation-measure sets and the regret decomposition technique are technically useful contributions that could apply to other constrained RL problems. The finite-state/action and weak-communication assumptions are standard, but the extension to linear CMDPs adds practical value.
major comments (2)
- [Theorem 1 / Section 3] In the proof of strong duality (Theorem 1, Section 3): the geometric structure argument for closing the duality gap must be verified to hold for all weakly communicating CMDPs. Cases with multiple recurrent classes or particular transient structures permitted by the weak-communication definition may leave a positive gap, which would invalidate the Lagrangian regret decomposition used for the Õ(T^{2/3}) bounds in the subsequent algorithm analysis.
- [Section 4] In the regret analysis (Section 4): the decomposition of composite Lagrangian regret into separate regret and constraint-violation bounds explicitly invokes the strong-duality result. If the gap is not identically zero, both the primal-dual clipped value iteration analysis and the claimed improvement over prior bounds require revision; the finite-horizon approximation's role in dual stabilization should include explicit dependence on the weak-communication parameters.
minor comments (2)
- [Abstract] Clarify in the abstract and introduction whether 'linear CMDPs' refers to linear function approximation, linear constraints, or both, as the title refers to general CMDPs.
- [Notation / Preliminaries] Ensure all notation for occupation measures, dual variables, and the clipping operator is defined before first use and used consistently across sections.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments on our manuscript. We address the major comments point by point below, providing clarifications on the duality result and analysis.
read point-by-point responses
-
Referee: [Theorem 1 / Section 3] In the proof of strong duality (Theorem 1, Section 3): the geometric structure argument for closing the duality gap must be verified to hold for all weakly communicating CMDPs. Cases with multiple recurrent classes or particular transient structures permitted by the weak-communication definition may leave a positive gap, which would invalidate the Lagrangian regret decomposition used for the Õ(T^{2/3}) bounds in the subsequent algorithm analysis.
Authors: The proof of Theorem 1 exploits the geometric structure of the occupation measure set under the weakly communicating assumption to establish that the duality gap is zero. The weak communication condition ensures that any recurrent classes are connected via transient states in a manner that preserves the geometric properties needed to close the gap, without requiring convexity or an LP formulation. The argument is formulated to apply to the full class of weakly communicating CMDPs, including permitted transient structures. We will add a clarifying remark in the revised manuscript to explicitly address why the definition precludes gaps from multiple recurrent classes. revision: partial
-
Referee: [Section 4] In the regret analysis (Section 4): the decomposition of composite Lagrangian regret into separate regret and constraint-violation bounds explicitly invokes the strong-duality result. If the gap is not identically zero, both the primal-dual clipped value iteration analysis and the claimed improvement over prior bounds require revision; the finite-horizon approximation's role in dual stabilization should include explicit dependence on the weak-communication parameters.
Authors: Since Theorem 1 establishes that the duality gap is identically zero under the stated assumptions, the Lagrangian regret decomposition in Section 4 remains valid and underpins the Õ(T^{2/3}) bounds. The finite-horizon approximation stabilizes the dual variable, with its length selected to control approximation error relative to the weak communication parameters (e.g., effective diameter). We will revise Section 4 to make this parameter dependence explicit in the horizon choice and bound derivations. revision: partial
Circularity Check
No circularity: duality and regret bounds derived from geometric and algorithmic arguments without reduction to inputs.
full rationale
The paper's core contributions rest on establishing strong duality via the geometric structure of occupation measures under the weakly communicating assumption, followed by a primal-dual clipped value iteration algorithm whose regret analysis decomposes the Lagrangian using that duality. No quoted steps reduce by construction to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations; the derivations invoke finite state-action spaces and weak communication as external assumptions rather than circularly redefining them. The analysis is self-contained against the stated mathematical premises.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The CMDP is weakly communicating
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclearwe establish strong duality for weakly communicating average-reward CMDPs over stationary policies ... by carefully exploiting the geometric structure of the occupation measure set
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclearLemma 1. ... relint(P)⊆bP(s1), ext(P)⊆bP(s1), and bP(s1)⊆P
Reference graph
Works this paper leans on
-
[1]
Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, and Honghao Wei
URLhttps://openreview.net/forum?id=SdBApv9iT4. Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, and Honghao Wei. An optimistic algorithm for online CMDPS with anytime adversarial constraints. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=fFgiXamW8E. Matthew Zurek and Yudong Chen. Span-based optimal sample...
work page 2025
-
[2]
TX t=1 r(st, at)|s1 # = lim T→∞ 1 T EP,π
developed a computationally efficient bias-span-constrained algorithm. Model-free algorithms for average-reward MDPs were studied by Wei et al. [2020], Zhang and Xie [2023], and recent works further investigate planning and sample-complexity aspects of average-reward MDPs [Agrawal and Agrawal, 2024, Zurek and Chen, 2024, Lee and Ryu, 2023, Lee et al., 202...
work page 2020
-
[3]
There exists a unique stationary distributiondπ for Pπ such that dπ(s) > 0for s∈ S C and dπ(s) = 0fors∈ S T
-
[4]
Then we havelim n→∞ 1 n Pn t=1 µt(s) =d π(s)for alls∈ S
Let µt(s) = EPπ[1{st = s}|s1 ∼µ 1], where µ1 is the given initial state distribution. Then we havelim n→∞ 1 n Pn t=1 µt(s) =d π(s)for alls∈ S. Proof. For simplicity, with some abuse of notation, letPπ ∈R |S|×|S| denote the transition matrix, where Pπ,ij denotes the probability of transition from statei to state j. Also, we usedπ, µt ∈R |S| for 24 the vect...
work page 1994
-
[5]
implies that fors∈ S C, lim n→∞ 1 n−τ+ 1 nX t=τ ¯µt(s) =d π(s)a.s. This leads to lim n→∞ 1 n nX t=1 ¯µt(s) = lim n→∞ 1 n τ−1X t=1 ¯µt(s) + n−τ+ 1 n 1 n−τ+ 1 nX t=τ ¯µt(s) ! =d π(s)a.s. Moreover, we know that| 1 n Pn t=1 ¯µt,SC | ≤ 1. Then by the bounded convergence theorem (Theorem 1.5.3 in Durrett [2019]), it follows that fors∈ SC, lim n→∞ 1 n nX t=1 EPπ...
work page 2019
-
[7]
Note thatJ ∗ r , J∗ g are assumed to be constant acrosss∈ Sby Assumption 2
+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k 35 where the first inequality follows from(20), and the last inequality is because Lemma 4 implies that ϕ(sk h, ak h)⊤wk h+1 + mins′ V k h+1(s′) ≤P V k h+1(sk h, ak h) + β∥ϕ(sk h, ak h)∥Σ−1 k . Note thatJ ∗ r , J∗ g are assumed to be constant acrosss∈ Sby Assumption 2. Then,...
-
[8]
+λ k HJ ∗ g (s1)−V ∗ g,1(sk 1) +V ∗ r,1(sk
-
[9]
+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k ≤sp(v ∗ r) +λ ksp(v∗ g) +V ∗ r,1(sk
-
[10]
+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k ≤sp(v ∗ r) + 2 γ sp(v∗ g) +V ∗ r,1(sk
-
[11]
where the last inequality follows fromλk ≤ 2 γ for all k
+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) | {z } (III) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k . where the last inequality follows fromλk ≤ 2 γ for all k. Term (III) can be further bounded as follows. (III)=V ∗ r,1(sk
-
[14]
+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) =V ∗ r,1(sk
-
[15]
+λ kV ∗ g,1(sk 1)−V k 1 (sk 1) +V k 1 (sk 1)−Q k 1(sk 1, ak
-
[16]
+P V k 2 (sk 1, ak 1)−V k 2 (sk
- [17]
-
[19]
+ HX h=1 V k h (sk h)−Q k h(sk h, ak h) + HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) 36 This implies that for eachk∈[K], (I)≤sp(v ∗ r) + 2 γ sp(v∗ g) +V ∗ r,1(sk
-
[20]
+λ kV ∗ g,1(sk 1)−V k 1 (sk
-
[21]
(22) Note that taking sum overk = 1,
+ HX h=1 V k h (sk h)−Q k h(sk h, ak h) + HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k . (22) Note that taking sum overk = 1, . . . , Kto (22) yields the first term in(18). In addition to this, by applying (19), we have for anyλ∈[0, 2 γ ], TX t=1 (J ∗ r (s1)−r(s t, at)) +λ TX t=1 (b−g(s t, at)) ≤K sp(v∗ r) + 2 γ sp(v∗ g) + ...
-
[22]
Note that V k h (sk h) ≤ eV k h (sk h) = Qk h(sk h, ak h)
+λ kV ∗ g,1(sk 1)−V k 1 (sk 1) + KX k=1 HX h=1 V k h (sk h)−Q k h(sk h, ak h) + KX k=1 HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) + λ2 2η + ηKH 2 2 . Note that V k h (sk h) ≤ eV k h (sk h) = Qk h(sk h, ak h). It follows thatPK k=1 PH h=1 V k h (sk h)−Q k h(sk h, ak h) ≤ 0. This concludes the proof. 37 E.5 Proof of Lemma 6 Assume that the statement of Le...
-
[23]
+λ kV ∗ g,1(sk 1)−V k 1 (sk 1) | {z } (II) + KX k=1 HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) | {z } (III) + λ2 2η + ηKH 2 2 . (23) By Lemma 5, we have (II)≤0.(24) Moreover, since (III) is the sum of a martingale difference sequence, it can be bounded using the Azuma-Hoeffding inequality. Note thatsp(Vk h+1)≤2sp(v ∗ r) + 4 γ sp(v∗ g) = 2ζ. Then, with p...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.