pith. machine review for the scientific record. sign in

arxiv: 2605.11586 · v1 · submitted 2026-05-12 · 💻 cs.LG · math.OC

Recognition: 2 theorem links

· Lean Theorem

Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

Beomhan Baek, Dabeen Lee, Kihyun Yu

Pith reviewed 2026-05-13 01:45 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords average-reward CMDPsstrong dualityweakly communicatingregret boundsprimal-dual algorithmclipped value iterationconstrained Markov decision processesoccupation measures
0
0 comments X

The pith

Strong duality holds for weakly communicating average-reward CMDPs over stationary policies by exploiting the geometric structure of occupation measures, enabling an algorithm with improved O(T^{2/3}) regret and constraint violation bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that strong duality holds for average-reward constrained Markov decision processes under the weakly communicating assumption, even though the usual linear programming formulation is absent and the problem is nonconvex. The authors demonstrate this by analyzing the geometry of the set of occupation measures for finite state and action spaces. They then build a primal-dual clipped value iteration algorithm tailored to the linear case, using a finite-horizon approximation to stabilize the dual variable. This yields regret and constraint violation bounds of order T to the power 2/3, which improve on prior results. Readers would care because average-reward models capture long-run performance in ongoing tasks and constraints appear in many practical control problems.

Core claim

The central claim is that for weakly communicating average-reward CMDPs with finite states and actions, strong duality holds over stationary policies by exploiting the geometric structure of the occupation measure set, despite nonconvexity. Building on this duality, a primal-dual clipped value iteration algorithm is proposed for the linear setting; it achieves regret and constraint violation bounds of order tilde O(T^{2/3}) by decomposing the Lagrangian regret into separate components via strong duality and adapting clipped value iteration through finite-horizon approximation.

What carries the argument

The geometric structure of the occupation measure set, which establishes strong duality without requiring convexity under the weakly communicating assumption and supports the subsequent regret decomposition.

If this is right

  • Strong duality permits decomposition of the composite Lagrangian regret into independent bounds on regret and constraint violation.
  • The primal-dual algorithm extends clipped value iteration to the constrained average-reward setting through a finite-horizon approximation that stabilizes the dual variable.
  • The method attains tilde O(T^{2/3}) bounds simultaneously on regret and constraint violation for linear CMDPs.
  • The approach relies on stationary policies and works under the weakly communicating assumption with finite state-action spaces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the geometric argument extends beyond finite spaces, similar duality results could apply to countably infinite weakly communicating CMDPs.
  • One could check the duality gap numerically on small hand-crafted CMDPs that satisfy weak communication but border on the boundary of the assumption.
  • The finite-horizon approximation technique for stabilizing dual variables might transfer to other primal-dual reinforcement learning settings with average rewards.

Load-bearing premise

The weakly communicating assumption on the CMDP together with finite state and action spaces is needed to obtain strong duality and to carry out the regret analysis.

What would settle it

A concrete falsifier is an explicit weakly communicating average-reward CMDP with finite states and actions in which the primal and dual values differ by a positive amount, or an instance where the proposed algorithm exhibits regret growing faster than order T to the 2/3.

Figures

Figures reproduced from arXiv: 2605.11586 by Beomhan Baek, Dabeen Lee, Kihyun Yu.

Figure 1
Figure 1. Figure 1: Comparison of Pb(s1) and P under the weakly communicating assumption. As a consequence, (Occ-CMDP) admits the following equivalent LP reformulation, and this connection allows us to rely on LP strong duality under the unichain assumption: (Occ-CMDP) Pb(s1)=P = max q∈P r ⊤q s.t. g ⊤q ≥ b. (1) However, this is not applicable under the weakly communicating assumption, as Pb(s1) ̸= P, and Pb(s1) may be nonconv… view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of an average-reward CMDP over the occupation measure space. The green line [PITH_FULL_IMAGE:figures/full_fig_p019_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of random agent, Algorithm 2 in [PITH_FULL_IMAGE:figures/full_fig_p042_3.png] view at source ↗
read the original abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal--dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces by exploiting the geometric structure of the occupation measure set, despite nonconvexity and lack of an LP formulation. Building on this, it proposes a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs, achieving regret and constraint violation bounds of Õ(T^{2/3}) via a finite-horizon approximation to stabilize the dual variable and a novel Lagrangian regret decomposition into separate regret and violation terms.

Significance. If the duality result holds under the stated assumptions, this provides improved regret bounds for constrained average-reward RL, advancing beyond prior work on communicating or unichain settings. The geometric approach to duality in nonconvex occupation-measure sets and the regret decomposition technique are technically useful contributions that could apply to other constrained RL problems. The finite-state/action and weak-communication assumptions are standard, but the extension to linear CMDPs adds practical value.

major comments (2)
  1. [Theorem 1 / Section 3] In the proof of strong duality (Theorem 1, Section 3): the geometric structure argument for closing the duality gap must be verified to hold for all weakly communicating CMDPs. Cases with multiple recurrent classes or particular transient structures permitted by the weak-communication definition may leave a positive gap, which would invalidate the Lagrangian regret decomposition used for the Õ(T^{2/3}) bounds in the subsequent algorithm analysis.
  2. [Section 4] In the regret analysis (Section 4): the decomposition of composite Lagrangian regret into separate regret and constraint-violation bounds explicitly invokes the strong-duality result. If the gap is not identically zero, both the primal-dual clipped value iteration analysis and the claimed improvement over prior bounds require revision; the finite-horizon approximation's role in dual stabilization should include explicit dependence on the weak-communication parameters.
minor comments (2)
  1. [Abstract] Clarify in the abstract and introduction whether 'linear CMDPs' refers to linear function approximation, linear constraints, or both, as the title refers to general CMDPs.
  2. [Notation / Preliminaries] Ensure all notation for occupation measures, dual variables, and the clipping operator is defined before first use and used consistently across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on our manuscript. We address the major comments point by point below, providing clarifications on the duality result and analysis.

read point-by-point responses
  1. Referee: [Theorem 1 / Section 3] In the proof of strong duality (Theorem 1, Section 3): the geometric structure argument for closing the duality gap must be verified to hold for all weakly communicating CMDPs. Cases with multiple recurrent classes or particular transient structures permitted by the weak-communication definition may leave a positive gap, which would invalidate the Lagrangian regret decomposition used for the Õ(T^{2/3}) bounds in the subsequent algorithm analysis.

    Authors: The proof of Theorem 1 exploits the geometric structure of the occupation measure set under the weakly communicating assumption to establish that the duality gap is zero. The weak communication condition ensures that any recurrent classes are connected via transient states in a manner that preserves the geometric properties needed to close the gap, without requiring convexity or an LP formulation. The argument is formulated to apply to the full class of weakly communicating CMDPs, including permitted transient structures. We will add a clarifying remark in the revised manuscript to explicitly address why the definition precludes gaps from multiple recurrent classes. revision: partial

  2. Referee: [Section 4] In the regret analysis (Section 4): the decomposition of composite Lagrangian regret into separate regret and constraint-violation bounds explicitly invokes the strong-duality result. If the gap is not identically zero, both the primal-dual clipped value iteration analysis and the claimed improvement over prior bounds require revision; the finite-horizon approximation's role in dual stabilization should include explicit dependence on the weak-communication parameters.

    Authors: Since Theorem 1 establishes that the duality gap is identically zero under the stated assumptions, the Lagrangian regret decomposition in Section 4 remains valid and underpins the Õ(T^{2/3}) bounds. The finite-horizon approximation stabilizes the dual variable, with its length selected to control approximation error relative to the weak communication parameters (e.g., effective diameter). We will revise Section 4 to make this parameter dependence explicit in the horizon choice and bound derivations. revision: partial

Circularity Check

0 steps flagged

No circularity: duality and regret bounds derived from geometric and algorithmic arguments without reduction to inputs.

full rationale

The paper's core contributions rest on establishing strong duality via the geometric structure of occupation measures under the weakly communicating assumption, followed by a primal-dual clipped value iteration algorithm whose regret analysis decomposes the Lagrangian using that duality. No quoted steps reduce by construction to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations; the derivations invoke finite state-action spaces and weak communication as external assumptions rather than circularly redefining them. The analysis is self-contained against the stated mathematical premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the weakly communicating assumption and finite state-action spaces as domain assumptions; no free parameters or invented entities are indicated in the abstract.

axioms (1)
  • domain assumption The CMDP is weakly communicating
    Invoked to establish strong duality despite absence of LP formulation and nonconvexity.

pith-pipeline@v0.9.0 · 5502 in / 1220 out tokens · 50440 ms · 2026-05-13T01:45:03.449942+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, and Honghao Wei

    URLhttps://openreview.net/forum?id=SdBApv9iT4. Jiahui Zhu, Kihyun Yu, Dabeen Lee, Xin Liu, and Honghao Wei. An optimistic algorithm for online CMDPS with anytime adversarial constraints. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=fFgiXamW8E. Matthew Zurek and Yudong Chen. Span-based optimal sample...

  2. [2]

    TX t=1 r(st, at)|s1 # = lim T→∞ 1 T EP,π

    developed a computationally efficient bias-span-constrained algorithm. Model-free algorithms for average-reward MDPs were studied by Wei et al. [2020], Zhang and Xie [2023], and recent works further investigate planning and sample-complexity aspects of average-reward MDPs [Agrawal and Agrawal, 2024, Zurek and Chen, 2024, Lee and Ryu, 2023, Lee et al., 202...

  3. [3]

    There exists a unique stationary distributiondπ for Pπ such that dπ(s) > 0for s∈ S C and dπ(s) = 0fors∈ S T

  4. [4]

    Then we havelim n→∞ 1 n Pn t=1 µt(s) =d π(s)for alls∈ S

    Let µt(s) = EPπ[1{st = s}|s1 ∼µ 1], where µ1 is the given initial state distribution. Then we havelim n→∞ 1 n Pn t=1 µt(s) =d π(s)for alls∈ S. Proof. For simplicity, with some abuse of notation, letPπ ∈R |S|×|S| denote the transition matrix, where Pπ,ij denotes the probability of transition from statei to state j. Also, we usedπ, µt ∈R |S| for 24 the vect...

  5. [5]

    This leads to lim n→∞ 1 n nX t=1 ¯µt(s) = lim n→∞ 1 n τ−1X t=1 ¯µt(s) + n−τ+ 1 n 1 n−τ+ 1 nX t=τ ¯µt(s) ! =d π(s)a.s

    implies that fors∈ S C, lim n→∞ 1 n−τ+ 1 nX t=τ ¯µt(s) =d π(s)a.s. This leads to lim n→∞ 1 n nX t=1 ¯µt(s) = lim n→∞ 1 n τ−1X t=1 ¯µt(s) + n−τ+ 1 n 1 n−τ+ 1 nX t=τ ¯µt(s) ! =d π(s)a.s. Moreover, we know that| 1 n Pn t=1 ¯µt,SC | ≤ 1. Then by the bounded convergence theorem (Theorem 1.5.3 in Durrett [2019]), it follows that fors∈ SC, lim n→∞ 1 n nX t=1 EPπ...

  6. [7]

    Note thatJ ∗ r , J∗ g are assumed to be constant acrosss∈ Sby Assumption 2

    + HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k 35 where the first inequality follows from(20), and the last inequality is because Lemma 4 implies that ϕ(sk h, ak h)⊤wk h+1 + mins′ V k h+1(s′) ≤P V k h+1(sk h, ak h) + β∥ϕ(sk h, ak h)∥Σ−1 k . Note thatJ ∗ r , J∗ g are assumed to be constant acrosss∈ Sby Assumption 2. Then,...

  7. [8]

    +λ k HJ ∗ g (s1)−V ∗ g,1(sk 1) +V ∗ r,1(sk

  8. [9]

    + HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k ≤sp(v ∗ r) +λ ksp(v∗ g) +V ∗ r,1(sk

  9. [10]

    + HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k ≤sp(v ∗ r) + 2 γ sp(v∗ g) +V ∗ r,1(sk

  10. [11]

    where the last inequality follows fromλk ≤ 2 γ for all k

    + HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) | {z } (III) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k . where the last inequality follows fromλk ≤ 2 γ for all k. Term (III) can be further bounded as follows. (III)=V ∗ r,1(sk

  11. [14]

    + HX h=1 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) =V ∗ r,1(sk

  12. [15]

    +λ kV ∗ g,1(sk 1)−V k 1 (sk 1) +V k 1 (sk 1)−Q k 1(sk 1, ak

  13. [16]

    +P V k 2 (sk 1, ak 1)−V k 2 (sk

  14. [17]

    =V ∗ r,1(sk

    + HX h=2 −Qk h(sk h, ak h) +P V k h+1(sk h, ak h) ... =V ∗ r,1(sk

  15. [19]

    + HX h=1 V k h (sk h)−Q k h(sk h, ak h) + HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) 36 This implies that for eachk∈[K], (I)≤sp(v ∗ r) + 2 γ sp(v∗ g) +V ∗ r,1(sk

  16. [20]

    +λ kV ∗ g,1(sk 1)−V k 1 (sk

  17. [21]

    (22) Note that taking sum overk = 1,

    + HX h=1 V k h (sk h)−Q k h(sk h, ak h) + HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) + HX h=1 2β∥ϕ(sk h, ak h)∥Σ−1 k . (22) Note that taking sum overk = 1, . . . , Kto (22) yields the first term in(18). In addition to this, by applying (19), we have for anyλ∈[0, 2 γ ], TX t=1 (J ∗ r (s1)−r(s t, at)) +λ TX t=1 (b−g(s t, at)) ≤K sp(v∗ r) + 2 γ sp(v∗ g) + ...

  18. [22]

    Note that V k h (sk h) ≤ eV k h (sk h) = Qk h(sk h, ak h)

    +λ kV ∗ g,1(sk 1)−V k 1 (sk 1) + KX k=1 HX h=1 V k h (sk h)−Q k h(sk h, ak h) + KX k=1 HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) + λ2 2η + ηKH 2 2 . Note that V k h (sk h) ≤ eV k h (sk h) = Qk h(sk h, ak h). It follows thatPK k=1 PH h=1 V k h (sk h)−Q k h(sk h, ak h) ≤ 0. This concludes the proof. 37 E.5 Proof of Lemma 6 Assume that the statement of Le...

  19. [23]

    (23) By Lemma 5, we have (II)≤0.(24) Moreover, since (III) is the sum of a martingale difference sequence, it can be bounded using the Azuma-Hoeffding inequality

    +λ kV ∗ g,1(sk 1)−V k 1 (sk 1) | {z } (II) + KX k=1 HX h=1 P V k h+1(sk h, ak h)−V k h+1(sk h+1) | {z } (III) + λ2 2η + ηKH 2 2 . (23) By Lemma 5, we have (II)≤0.(24) Moreover, since (III) is the sum of a martingale difference sequence, it can be bounded using the Azuma-Hoeffding inequality. Note thatsp(Vk h+1)≤2sp(v ∗ r) + 4 γ sp(v∗ g) = 2ζ. Then, with p...