arxiv: 2605.11586 · v1 · submitted 2026-05-12 · 💻 cs.LG · math.OC

Recognition: 2 theorem links

· Lean Theorem

Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

Beomhan Baek, Dabeen Lee, Kihyun Yu

Pith reviewed 2026-05-13 01:45 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords average-reward CMDPsstrong dualityweakly communicatingregret boundsprimal-dual algorithmclipped value iterationconstrained Markov decision processesoccupation measures

0 comments

The pith

Strong duality holds for weakly communicating average-reward CMDPs over stationary policies by exploiting the geometric structure of occupation measures, enabling an algorithm with improved O(T^{2/3}) regret and constraint violation bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that strong duality holds for average-reward constrained Markov decision processes under the weakly communicating assumption, even though the usual linear programming formulation is absent and the problem is nonconvex. The authors demonstrate this by analyzing the geometry of the set of occupation measures for finite state and action spaces. They then build a primal-dual clipped value iteration algorithm tailored to the linear case, using a finite-horizon approximation to stabilize the dual variable. This yields regret and constraint violation bounds of order T to the power 2/3, which improve on prior results. Readers would care because average-reward models capture long-run performance in ongoing tasks and constraints appear in many practical control problems.

Core claim

The central claim is that for weakly communicating average-reward CMDPs with finite states and actions, strong duality holds over stationary policies by exploiting the geometric structure of the occupation measure set, despite nonconvexity. Building on this duality, a primal-dual clipped value iteration algorithm is proposed for the linear setting; it achieves regret and constraint violation bounds of order tilde O(T^{2/3}) by decomposing the Lagrangian regret into separate components via strong duality and adapting clipped value iteration through finite-horizon approximation.

What carries the argument

The geometric structure of the occupation measure set, which establishes strong duality without requiring convexity under the weakly communicating assumption and supports the subsequent regret decomposition.

If this is right

Strong duality permits decomposition of the composite Lagrangian regret into independent bounds on regret and constraint violation.
The primal-dual algorithm extends clipped value iteration to the constrained average-reward setting through a finite-horizon approximation that stabilizes the dual variable.
The method attains tilde O(T^{2/3}) bounds simultaneously on regret and constraint violation for linear CMDPs.
The approach relies on stationary policies and works under the weakly communicating assumption with finite state-action spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the geometric argument extends beyond finite spaces, similar duality results could apply to countably infinite weakly communicating CMDPs.
One could check the duality gap numerically on small hand-crafted CMDPs that satisfy weak communication but border on the boundary of the assumption.
The finite-horizon approximation technique for stabilizing dual variables might transfer to other primal-dual reinforcement learning settings with average rewards.

Load-bearing premise

The weakly communicating assumption on the CMDP together with finite state and action spaces is needed to obtain strong duality and to carry out the regret analysis.

What would settle it

A concrete falsifier is an explicit weakly communicating average-reward CMDP with finite states and actions in which the primal and dual values differ by a positive amount, or an instance where the proposed algorithm exhibits regret growing faster than order T to the 2/3.

Figures

Figures reproduced from arXiv: 2605.11586 by Beomhan Baek, Dabeen Lee, Kihyun Yu.

**Figure 1.** Figure 1: Comparison of Pb(s1) and P under the weakly communicating assumption. As a consequence, (Occ-CMDP) admits the following equivalent LP reformulation, and this connection allows us to rely on LP strong duality under the unichain assumption: (Occ-CMDP) Pb(s1)=P = max q∈P r ⊤q s.t. g ⊤q ≥ b. (1) However, this is not applicable under the weakly communicating assumption, as Pb(s1) ̸= P, and Pb(s1) may be nonconv… view at source ↗

**Figure 2.** Figure 2: Illustration of an average-reward CMDP over the occupation measure space. The green line [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of random agent, Algorithm 2 in [PITH_FULL_IMAGE:figures/full_fig_p042_3.png] view at source ↗

read the original abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal--dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Strong duality for weakly communicating average-reward CMDPs via occupation measure geometry, plus a primal-dual clipped VI algorithm that reaches T^{2/3} regret and violation.

read the letter

The main point is that the paper establishes strong duality over stationary policies for these CMDPs by exploiting the geometry of the occupation measure set, even though the set is not convex and there is no LP. They then build a primal-dual algorithm that adapts clipped value iteration with a finite-horizon stabilization step on the dual variables, which lets them split the Lagrangian regret into separate reward and constraint terms and prove both are O(T^{2/3}). That is the concrete improvement over earlier bounds. The stabilization trick is the part that actually makes the analysis close. The duality result is load-bearing for the whole regret decomposition, so the geometric argument has to hold without extra conditions on recurrent classes. The paper walks through the supporting hyperplanes and shows the min-max gap is zero under the weak communication assumption, but the proof is technical and the edge cases with multiple recurrent classes are the place to check hardest. If that step is solid, the rest follows cleanly. This is for people working on theoretical constrained RL, especially average-reward settings where safety constraints matter. A reader who already knows the unconstrained average-reward literature will see the value in the duality closure and the algorithm tweak. It is worth sending to a serious referee because the claims are precise, the approach is reproducible in principle, and the improvement is stated clearly enough to check.

Referee Report

2 major / 2 minor

Summary. The paper claims to establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces by exploiting the geometric structure of the occupation measure set, despite nonconvexity and lack of an LP formulation. Building on this, it proposes a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs, achieving regret and constraint violation bounds of Õ(T^{2/3}) via a finite-horizon approximation to stabilize the dual variable and a novel Lagrangian regret decomposition into separate regret and violation terms.

Significance. If the duality result holds under the stated assumptions, this provides improved regret bounds for constrained average-reward RL, advancing beyond prior work on communicating or unichain settings. The geometric approach to duality in nonconvex occupation-measure sets and the regret decomposition technique are technically useful contributions that could apply to other constrained RL problems. The finite-state/action and weak-communication assumptions are standard, but the extension to linear CMDPs adds practical value.

major comments (2)

[Theorem 1 / Section 3] In the proof of strong duality (Theorem 1, Section 3): the geometric structure argument for closing the duality gap must be verified to hold for all weakly communicating CMDPs. Cases with multiple recurrent classes or particular transient structures permitted by the weak-communication definition may leave a positive gap, which would invalidate the Lagrangian regret decomposition used for the Õ(T^{2/3}) bounds in the subsequent algorithm analysis.
[Section 4] In the regret analysis (Section 4): the decomposition of composite Lagrangian regret into separate regret and constraint-violation bounds explicitly invokes the strong-duality result. If the gap is not identically zero, both the primal-dual clipped value iteration analysis and the claimed improvement over prior bounds require revision; the finite-horizon approximation's role in dual stabilization should include explicit dependence on the weak-communication parameters.

minor comments (2)

[Abstract] Clarify in the abstract and introduction whether 'linear CMDPs' refers to linear function approximation, linear constraints, or both, as the title refers to general CMDPs.
[Notation / Preliminaries] Ensure all notation for occupation measures, dual variables, and the clipping operator is defined before first use and used consistently across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address the major comments point by point below, providing clarifications on the duality result and analysis.

read point-by-point responses

Referee: [Theorem 1 / Section 3] In the proof of strong duality (Theorem 1, Section 3): the geometric structure argument for closing the duality gap must be verified to hold for all weakly communicating CMDPs. Cases with multiple recurrent classes or particular transient structures permitted by the weak-communication definition may leave a positive gap, which would invalidate the Lagrangian regret decomposition used for the Õ(T^{2/3}) bounds in the subsequent algorithm analysis.

Authors: The proof of Theorem 1 exploits the geometric structure of the occupation measure set under the weakly communicating assumption to establish that the duality gap is zero. The weak communication condition ensures that any recurrent classes are connected via transient states in a manner that preserves the geometric properties needed to close the gap, without requiring convexity or an LP formulation. The argument is formulated to apply to the full class of weakly communicating CMDPs, including permitted transient structures. We will add a clarifying remark in the revised manuscript to explicitly address why the definition precludes gaps from multiple recurrent classes. revision: partial
Referee: [Section 4] In the regret analysis (Section 4): the decomposition of composite Lagrangian regret into separate regret and constraint-violation bounds explicitly invokes the strong-duality result. If the gap is not identically zero, both the primal-dual clipped value iteration analysis and the claimed improvement over prior bounds require revision; the finite-horizon approximation's role in dual stabilization should include explicit dependence on the weak-communication parameters.

Authors: Since Theorem 1 establishes that the duality gap is identically zero under the stated assumptions, the Lagrangian regret decomposition in Section 4 remains valid and underpins the Õ(T^{2/3}) bounds. The finite-horizon approximation stabilizes the dual variable, with its length selected to control approximation error relative to the weak communication parameters (e.g., effective diameter). We will revise Section 4 to make this parameter dependence explicit in the horizon choice and bound derivations. revision: partial

Circularity Check

0 steps flagged

No circularity: duality and regret bounds derived from geometric and algorithmic arguments without reduction to inputs.

full rationale

The paper's core contributions rest on establishing strong duality via the geometric structure of occupation measures under the weakly communicating assumption, followed by a primal-dual clipped value iteration algorithm whose regret analysis decomposes the Lagrangian using that duality. No quoted steps reduce by construction to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations; the derivations invoke finite state-action spaces and weak communication as external assumptions rather than circularly redefining them. The analysis is self-contained against the stated mathematical premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the weakly communicating assumption and finite state-action spaces as domain assumptions; no free parameters or invented entities are indicated in the abstract.

axioms (1)

domain assumption The CMDP is weakly communicating
Invoked to establish strong duality despite absence of LP formulation and nonconvexity.

pith-pipeline@v0.9.0 · 5502 in / 1220 out tokens · 50440 ms · 2026-05-13T01:45:03.449942+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear
we establish strong duality for weakly communicating average-reward CMDPs over stationary policies ... by carefully exploiting the geometric structure of the occupation measure set
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear
Lemma 1. ... relint(P)⊆bP(s1), ext(P)⊆bP(s1), and bP(s1)⊆P

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, and Honghao Wei

URLhttps://openreview.net/forum?id=SdBApv9iT4. Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, and Honghao Wei. An optimistic algorithm for online CMDPS with anytime adversarial constraints. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=fFgiXamW8E. Matthew Zurek and Yudong Chen. Span-based optimal sample...

work page 2025
[2]

TX t=1 r(st, at)|s1 # = lim T→∞ 1 T EP,π

developed a computationally efficient bias-span-constrained algorithm. Model-free algorithms for average-reward MDPs were studied by Wei et al. [2020], Zhang and Xie [2023], and recent works further investigate planning and sample-complexity aspects of average-reward MDPs [Agrawal and Agrawal, 2024, Zurek and Chen, 2024, Lee and Ryu, 2023, Lee et al., 202...

work page 2020
[3]

There exists a unique stationary distributiondπ for Pπ such that dπ(s) > 0for s∈ S C and dπ(s) = 0fors∈ S T

work page
[4]

Then we havelim n→∞ 1 n Pn t=1 µt(s) =d π(s)for alls∈ S

Let µt(s) = EPπ[1{st = s}|s1 ∼µ 1], where µ1 is the given initial state distribution. Then we havelim n→∞ 1 n Pn t=1 µt(s) =d π(s)for alls∈ S. Proof. For simplicity, with some abuse of notation, letPπ ∈R |S|×|S| denote the transition matrix, where Pπ,ij denotes the probability of transition from statei to state j. Also, we usedπ, µt ∈R |S| for 24 the vect...

work page 1994
[5]

This leads to lim n→∞ 1 n nX t=1 ¯µt(s) = lim n→∞ 1 n τ−1X t=1 ¯µt(s) + n−τ+ 1 n 1 n−τ+ 1 nX t=τ ¯µt(s) ! =d π(s)a.s

implies that fors∈ S C, lim n→∞ 1 n−τ+ 1 nX t=τ ¯µt(s) =d π(s)a.s. This leads to lim n→∞ 1 n nX t=1 ¯µt(s) = lim n→∞ 1 n τ−1X t=1 ¯µt(s) + n−τ+ 1 n 1 n−τ+ 1 nX t=τ ¯µt(s) ! =d π(s)a.s. Moreover, we know that| 1 n Pn t=1 ¯µt,SC | ≤ 1. Then by the bounded convergence theorem (Theorem 1.5.3 in Durrett [2019]), it follows that fors∈ SC, lim n→∞ 1 n nX t=1 EPπ...

work page 2019
[7]

Note thatJ ∗ r , J∗ g are assumed to be constant acrosss∈ Sby Assumption 2

+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k 35 where the first inequality follows from(20), and the last inequality is because Lemma 4 implies that ϕ(sk h, ak h)⊤wk h+1 + mins′ V k h+1(s′) ≤P V k h+1(sk h, ak h) + β∥ϕ(sk h, ak h)∥Σ−1 k . Note thatJ ∗ r , J∗ g are assumed to be constant acrosss∈ Sby Assumption 2. Then,...

work page
[8]

+λ k HJ ∗ g (s1)−V ∗ g,1(sk 1) +V ∗ r,1(sk

work page
[9]

+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k ≤sp(v ∗ r) +λ ksp(v∗ g) +V ∗ r,1(sk

work page
[10]

+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k ≤sp(v ∗ r) + 2 γ sp(v∗ g) +V ∗ r,1(sk

work page
[11]

where the last inequality follows fromλk ≤ 2 γ for all k

+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) | {z } (III) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k . where the last inequality follows fromλk ≤ 2 γ for all k. Term (III) can be further bounded as follows. (III)=V ∗ r,1(sk

work page
[14]

+ HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) =V ∗ r,1(sk

work page
[15]

+λ kV ∗ g,1(sk 1)−V k 1 (sk 1) +V k 1 (sk 1)−Q k 1(sk 1, ak

work page
[16]

+P V k 2 (sk 1, ak 1)−V k 2 (sk

work page
[17]

=V ∗ r,1(sk

+ HX h=2 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) ... =V ∗ r,1(sk

work page
[19]

+ HX h=1 V k h (sk h)−Q k h(sk h, ak h) + HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) 36 This implies that for eachk∈[K], (I)≤sp(v ∗ r) + 2 γ sp(v∗ g) +V ∗ r,1(sk

work page
[20]

+λ kV ∗ g,1(sk 1)−V k 1 (sk

work page
[21]

(22) Note that taking sum overk = 1,

+ HX h=1 V k h (sk h)−Q k h(sk h, ak h) + HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k . (22) Note that taking sum overk = 1, . . . , Kto (22) yields the first term in(18). In addition to this, by applying (19), we have for anyλ∈[0, 2 γ ], TX t=1 (J ∗ r (s1)−r(s t, at)) +λ TX t=1 (b−g(s t, at)) ≤K sp(v∗ r) + 2 γ sp(v∗ g) + ...

work page
[22]

Note that V k h (sk h) ≤ eV k h (sk h) = Qk h(sk h, ak h)

+λ kV ∗ g,1(sk 1)−V k 1 (sk 1) + KX k=1 HX h=1 V k h (sk h)−Q k h(sk h, ak h) + KX k=1 HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) + λ2 2η + ηKH 2 2 . Note that V k h (sk h) ≤ eV k h (sk h) = Qk h(sk h, ak h). It follows thatPK k=1 PH h=1 V k h (sk h)−Q k h(sk h, ak h) ≤ 0. This concludes the proof. 37 E.5 Proof of Lemma 6 Assume that the statement of Le...

work page
[23]

(23) By Lemma 5, we have (II)≤0.(24) Moreover, since (III) is the sum of a martingale difference sequence, it can be bounded using the Azuma-Hoeffding inequality

+λ kV ∗ g,1(sk 1)−V k 1 (sk 1) | {z } (II) + KX k=1 HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) | {z } (III) + λ2 2η + ηKH 2 2 . (23) By Lemma 5, we have (II)≤0.(24) Moreover, since (III) is the sum of a martingale difference sequence, it can be bounded using the Azuma-Hoeffding inequality. Note thatsp(Vk h+1)≤2sp(v ∗ r) + 4 γ sp(v∗ g) = 2ζ. Then, with p...

work page 2021