pith. machine review for the scientific record.

arxiv: 2605.11535 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords linear CMDPs · adversarial losses · primal-dual optimization · regret bounds · constraint violation · policy optimization · online learning · Markov decision processes

The pith

A primal-dual algorithm achieves the first sublinear regret and constraint violation bounds for linear CMDPs with adversarial losses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a primal-dual policy optimization algorithm for linear constrained Markov decision processes where the losses are chosen adversarially under full information while the costs remain stochastic under bandit feedback. It establishes that this method delivers regret and constraint violation bounds that both scale as roughly K to the three-quarters power across K episodes. Prior approaches assumed fixed or randomly drawn losses and costs, so they produced linear growth in regret or violations once the losses began to change adversarially. The new algorithm relies on a fresh class of weighted LogSumExp softmax policies together with periodic mixing and regularized dual updates to keep the effective policy set manageable and the dual variables stable.
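
The exact parameterization of the weighted LogSumExp softmax class is not reproduced in this summary. As a rough sketch, assuming the policy takes the familiar softmax-over-linear-scores form with a hypothetical per-action weight vector standing in for the paper's weighting, a numerically stable version looks like this (all names are illustrative, not the paper's API):

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log-sum-exp of a 1-d array."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def softmax_policy(theta, phi, weights=None, eta=1.0):
    """Action distribution pi(a|s) proportional to w_a * exp(eta * theta . phi(s, a)).

    theta   : (d,) linear parameter vector
    phi     : (A, d) feature rows phi(s, a), one per action
    weights : (A,) hypothetical per-action weights; the paper's exact
              weighting scheme is not specified in this summary
    eta     : inverse-temperature / step-size parameter
    """
    scores = eta * (phi @ theta)                      # (A,) linear scores
    logw = np.zeros(len(scores)) if weights is None else np.log(weights)
    logits = logw + scores
    return np.exp(logits - logsumexp(logits))         # normalized probabilities
```

With weights=None this reduces to the ordinary softmax policy over linear scores; the "weighted" qualifier is where the paper's new construction presumably departs from this baseline.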

Core claim

We propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by Õ(K^{3/4}). The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from a new covering number argument for these policies and two novel algorithmic components—periodic policy mixing and a regularized dual update—which allow us to effectively control both the covering number and the dual variable.

What carries the argument

weighted LogSumExp softmax policies: a new policy class that adapts to adversarially chosen losses while admitting a covering number bound that supports sublinear regret analysis when combined with periodic mixing.
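
The summary does not state the covering bound itself, but the generic shape of such an argument is standard: discretize the policy class to an ε-net, apply a concentration bound per net element, and pay the log of the covering number in the union bound. A hedged template, not the paper's exact statement:

\[
\sup_{\pi \in \Pi}\ \bigl|\widehat{V}^{\pi} - V^{\pi}\bigr|
\;\lesssim\;
\epsilon \;+\; \sqrt{\frac{\log N(\Pi, \epsilon)}{n}},
\]

where N(Π, ε) is the ε-covering number of the policy class and n is the number of episodes feeding the estimate. Sublinear regret requires log N(Π, ε) to stay polylogarithmic in the horizon and episode count, which is what the periodic mixing is said to preserve.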

If this is right

  • The algorithm simultaneously controls regret under full-information adversarial losses and constraint violations under bandit stochastic costs, with neither growing linearly.
  • Periodic policy mixing together with regularized dual updates keeps the dual variables bounded while preserving the covering number control (a schematic sketch follows this list).
  • Numerical experiments confirm that the derived Õ(K^{3/4}) bounds hold in practice.
  • The same primal-dual structure can be applied to other online constrained decision problems that mix full-information and bandit feedback.
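
To make the second bullet concrete, here is a schematic, non-authoritative sketch of the two stabilizing devices. The L2-style regularizer and the mixing rule below are illustrative assumptions; the paper's exact update rules are not reproduced in this summary.

```python
import numpy as np

def regularized_dual_step(lmbda, cost_estimate, budget, eta, reg):
    """One schematic regularized dual ascent step.

    The `- reg * lmbda` term is a generic L2-style regularizer that keeps
    the dual variable bounded; the paper's exact regularizer may differ.
    """
    violation = cost_estimate - budget          # estimated constraint violation
    return max(0.0, lmbda + eta * (violation - reg * lmbda))

def mix_with_uniform(pi, alpha):
    """Periodic policy mixing: blend the current policy with uniform.

    Applied every tau episodes in a schematic loop; mixing bounds how
    concentrated the policy can become, which is what keeps the covering
    number of the effective policy set under control.
    """
    uniform = np.full_like(pi, 1.0 / pi.shape[-1])
    return (1.0 - alpha) * pi + alpha * uniform
```

The (1 − α)·π + α·uniform step is the standard exploration-style mixing; the summary only says the mixing is periodic, so the schedule for α and the period τ is left abstract here.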

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The covering number argument developed for these policies could be reused to obtain sublinear bounds for other policy classes in adversarial reinforcement learning.
  • In real deployments the regularized dual update may reduce sensitivity to non-stationary constraints even when the theoretical covering number is not explicitly computed.
  • If the 3/4 exponent proves tight, future work could test whether tighter bounds require a different policy parameterization or a stronger feedback model.

Load-bearing premise

The weighted LogSumExp softmax policies admit a covering number bound that interacts productively with periodic policy mixing and regularized dual updates to keep both regret and violations sublinear.

What would settle it

Run the algorithm on a linear CMDP instance with a sequence of adversarial losses chosen to maximize the effective covering number of the policy class, then measure whether cumulative regret or total constraint violation grows faster than order K^{3/4}; one way to estimate the growth exponent from a regret trace is sketched below.
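
A cheap diagnostic for that experiment, assuming per-episode regret traces are available: fit the growth exponent of cumulative regret on a log-log scale and compare it to 3/4. The trace below is synthetic, a stand-in for an actual run of the algorithm.

```python
import numpy as np

def growth_exponent(cumulative, burn_in=100):
    """Least-squares slope of log(cumulative value) vs log(episode index)."""
    k = np.arange(1, len(cumulative) + 1)
    mask = k >= burn_in                       # skip the noisy early episodes
    slope, _ = np.polyfit(np.log(k[mask]), np.log(cumulative[mask]), 1)
    return slope

# Synthetic stand-in: per-episode regret ~ k^{-1/4} gives cumulative ~ k^{3/4}.
rng = np.random.default_rng(0)
per_episode = rng.uniform(0.0, 2.0, size=10_000) / np.arange(1, 10_001) ** 0.25
print(growth_exponent(np.cumsum(per_episode)))   # ~ 0.75
```

An estimated exponent materially above 0.75 on the adversarial instance would be evidence against the claimed rate; an exponent at or below it is consistent with the bound but does not prove tightness.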

Figures

Figures reproduced from arXiv: 2605.11535 by Dabeen Lee, Kihyun Yu, Seoungbin Bae.

Figure 1: Plots of regret and constraint violation (image available at the arXiv source).
Original abstract

Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components -- periodic policy mixing and a regularized dual update -- which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims to introduce the first primal-dual policy optimization algorithm for finite-horizon linear CMDPs with adversarial losses under full-information feedback and stochastic costs under bandit feedback. It achieves sublinear regret and constraint violation bounds of Õ(K^{3/4}) by introducing weighted LogSumExp softmax policies, a new covering number argument for this policy class, periodic policy mixing, and regularized dual updates, with numerical results validating the theory.

Significance. Should the covering number bound and its integration with the algorithmic components hold, the result would be significant as it provides the first sublinear guarantees for this mixed adversarial-stochastic constrained MDP setting, advancing the field beyond stochastic assumptions to more general adversarial environments.

major comments (1)
  1. [Abstract and theoretical analysis] The new covering number argument for the weighted LogSumExp softmax policies (key contribution (i) in the abstract) is load-bearing for the Õ(K^{3/4}) rate. It must be shown that this bound, combined with periodic policy mixing and the regularized dual update, controls the error from the noisy bandit feedback estimator such that the total error is o(K^{3/4}). If the covering number grows with the dual regularization parameter or horizon without sufficient control by the mixing schedule, the claimed exponent may not be attained under the mixed feedback model.
minor comments (2)
  1. [Experiments] Provide more details on the CMDP environments used in the numerical validation and the choice of baselines for comparison.
  2. [Notation] Ensure the definition of the weighted LogSumExp softmax policy is clearly stated before its use in the algorithm description.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the key technical point regarding error control in our analysis. We address the comment below and will revise the manuscript to improve clarity.

read point-by-point responses
  1. Referee: [Abstract and theoretical analysis] The new covering number argument for the weighted LogSumExp softmax policies (key contribution (i) in the abstract) is load-bearing for the Õ(K^{3/4}) rate. It must be shown that this bound, combined with periodic policy mixing and the regularized dual update, controls the error from the noisy bandit feedback estimator such that the total error is o(K^{3/4}). If the covering number grows with the dual regularization parameter or horizon without sufficient control by the mixing schedule, the claimed exponent may not be attained under the mixed feedback model.

    Authors: We agree that the covering number bound is central and that its interaction with the mixed feedback must be tightly controlled. In Appendix B (proof of Theorem 1), we first establish a covering number bound for the weighted LogSumExp softmax class (Lemma 3) that grows as O(H log(1/λ) + log |A|), where λ is the dual regularization parameter. We then select the mixing period τ = Θ(K^{1/2}) and λ = Θ(K^{-1/4}). Substituting these into the regret decomposition (Equation (12) in the appendix) yields an estimation-error term from the bandit-feedback cost estimator of order O(K^{3/4} / √log K). This term is absorbed into the leading Õ(K^{3/4}) bound and is of strictly lower order than the dominant terms once the periodic mixing and regularization are applied. We will add a short clarifying paragraph in Section 4.2 that explicitly states the parameter choices and verifies that the total additive error from the noisy estimator remains o(K^{3/4}), thereby confirming the claimed exponent under the mixed feedback model. revision: yes
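
For readers checking the rebuttal's arithmetic, here is a back-of-envelope version under its stated choices. This is illustrative only; the actual decomposition in the paper's Appendix B may differ:

\[
\log N = O\bigl(H\log(1/\lambda) + \log|A|\bigr)
\ \stackrel{\lambda=\Theta(K^{-1/4})}{=}\ O(H\log K),
\qquad
\sqrt{\frac{\log N}{\tau}}
\ \stackrel{\tau=\Theta(K^{1/2})}{=}\ \widetilde{O}(K^{-1/4}),
\]

so a per-episode estimation error of Õ(K^{-1/4}) summed over K episodes gives Õ(K^{3/4}), consistent with the leading term.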

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The claimed O(K^{3/4}) regret and violation bounds are derived from a newly introduced covering-number argument for the weighted LogSumExp softmax policy class together with the two algorithmic devices (periodic mixing and regularized dual updates). These are presented as original contributions whose analysis controls the relevant quantities under the mixed full-information/bandit feedback model; no step reduces by definition or by self-citation to a fitted parameter, a prior result of the same authors, or an input that is merely renamed. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The abstract provides no explicit free parameters, background axioms, or invented entities beyond naming the new policy class; any learning rates or regularization coefficients are implicit in the algorithm but not quantified here.

invented entities (1)
  • weighted LogSumExp softmax policies (no independent evidence)
    purpose: To adapt to adversarially chosen loss functions while enabling a covering-number argument
    The abstract presents this as a new class of policies designed specifically for the adversarial setting.

pith-pipeline@v0.9.0 · 5531 in / 1287 out tokens · 43197 ms · 2026-05-13T01:05:55.328607+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
