Recognition: 2 theorem links
· Lean TheoremPrimal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses
Pith reviewed 2026-05-13 01:05 UTC · model grok-4.3
The pith
A primal-dual algorithm achieves the first sublinear regret and constraint violation bounds for linear CMDPs with adversarial losses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by O(K^{3/4}). The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from a new covering number argument for these policies and two novel algorithmic components—periodic policy mixing and a regularized dual—up
What carries the argument
weighted LogSumExp softmax policies: a new policy class that adapts to adversarially chosen losses while admitting a covering number bound that supports sublinear regret analysis when combined with periodic mixing.
If this is right
- The algorithm simultaneously controls regret under full-information adversarial losses and constraint violations under bandit stochastic costs without linear growth.
- Periodic policy mixing together with regularized dual updates keeps the dual variables bounded while preserving the covering number control.
- Numerical experiments on the algorithm confirm that the derived O(K^{3/4}) bounds are observed in practice.
- The same primal-dual structure can be applied to other online constrained decision problems that mix full-information and bandit feedback.
Where Pith is reading between the lines
- The covering number argument developed for these policies could be reused to obtain sublinear bounds for other policy classes in adversarial reinforcement learning.
- In real deployments the regularized dual update may reduce sensitivity to non-stationary constraints even when the theoretical covering number is not explicitly computed.
- If the 3/4 exponent proves tight, future work could test whether tighter bounds require a different policy parameterization or a stronger feedback model.
Load-bearing premise
The weighted LogSumExp softmax policies admit a covering number bound that interacts productively with periodic policy mixing and regularized dual updates to keep both regret and violations sublinear.
What would settle it
Run the algorithm on a linear CMDP instance with a sequence of adversarial losses chosen to maximize the effective covering number of the policy class, then measure whether cumulative regret or total constraint violation grows faster than order K to the 3/4.
Figures
read the original abstract
Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon {adversarial} linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the \emph{first} to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components -- periodic policy mixing and a regularized dual update -- which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce the first primal-dual policy optimization algorithm for finite-horizon linear CMDPs with adversarial losses under full-information feedback and stochastic costs under bandit feedback. It achieves sublinear regret and constraint violation bounds of ~O(K^{3/4}) by introducing weighted LogSumExp softmax policies, a new covering number argument for this policy class, periodic policy mixing, and regularized dual updates, with numerical results validating the theory.
Significance. Should the covering number bound and its integration with the algorithmic components hold, the result would be significant as it provides the first sublinear guarantees for this mixed adversarial-stochastic constrained MDP setting, advancing the field beyond stochastic assumptions to more general adversarial environments.
major comments (1)
- [Abstract and theoretical analysis] The new covering number argument for the weighted LogSumExp softmax policies (key contribution (i) in the abstract) is load-bearing for the O(K^{3/4}) rate. It must be shown that this bound, combined with periodic policy mixing and the regularized dual update, controls the error from the noisy bandit feedback estimator such that the total error is o(K^{1/4}). If the covering number grows with the dual regularization parameter or horizon without sufficient control by the mixing schedule, the claimed exponent may not be attained under the mixed feedback model.
minor comments (2)
- [Experiments] Provide more details on the CMDP environments used in the numerical validation and the choice of baselines for comparison.
- [Notation] Ensure the definition of the weighted LogSumExp softmax policy is clearly stated before its use in the algorithm description.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the key technical point regarding error control in our analysis. We address the comment below and will revise the manuscript to improve clarity.
read point-by-point responses
-
Referee: [Abstract and theoretical analysis] The new covering number argument for the weighted LogSumExp softmax policies (key contribution (i) in the abstract) is load-bearing for the O(K^{3/4}) rate. It must be shown that this bound, combined with periodic policy mixing and the regularized dual update, controls the error from the noisy bandit feedback estimator such that the total error is o(K^{1/4}). If the covering number grows with the dual regularization parameter or horizon without sufficient control by the mixing schedule, the claimed exponent may not be attained under the mixed feedback model.
Authors: We agree that the covering number bound is central and that its interaction with the mixed feedback must be tightly controlled. In Appendix B (proof of Theorem 1), we first establish a covering number bound for the weighted LogSumExp softmax class (Lemma 3) that grows as O(H log(1/λ) + log |A|), where λ is the dual regularization parameter. We then select the mixing period τ = Θ(K^{1/2}) and λ = Θ(K^{-1/4}). Substituting these into the regret decomposition (Equation (12) in the appendix) yields an estimation-error term from the bandit-feedback cost estimator of order O(K^{3/4} / √log K). This term is absorbed into the leading O(K^{3/4}) bound and is strictly o(K^{1/4}) relative to the dominant terms once the periodic mixing and regularization are applied. We will add a short clarifying paragraph in Section 4.2 that explicitly states the parameter choices and verifies that the total additive error from the noisy estimator remains o(K^{1/4}), thereby confirming the claimed exponent under the mixed feedback model. revision: yes
Circularity Check
No significant circularity detected
full rationale
The claimed O(K^{3/4}) regret and violation bounds are derived from a newly introduced covering-number argument for the weighted LogSumExp softmax policy class together with the two algorithmic devices (periodic mixing and regularized dual updates). These are presented as original contributions whose analysis controls the relevant quantities under the mixed full-information/bandit feedback model; no step reduces by definition or by self-citation to a fitted parameter, a prior result of the same authors, or an input that is merely renamed. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
invented entities (1)
-
weighted LogSumExp softmax policies
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
a new covering number argument for the weighted LogSumExp softmax policies... log N_ε = Õ(n²d²)
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
periodic policy mixing and a regularized dual update... to effectively control both the covering number and the dual variable
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Advances in Neural Information Processing Systems , volume=
Provably efficient model-free constrained rl with linear function approximation , author=. Advances in Neural Information Processing Systems , volume=
-
[2]
Conference on Learning Theory , pages=
Nearly minimax optimal reinforcement learning for linear mixture markov decision processes , author=. Conference on Learning Theory , pages=. 2021 , organization=
work page 2021
-
[3]
Advances in neural information processing systems , volume=
Is Q-learning provably efficient? , author=. Advances in neural information processing systems , volume=
-
[4]
Advances in neural information processing systems , volume=
Improved algorithms for linear stochastic bandits , author=. Advances in neural information processing systems , volume=
-
[5]
Conference on learning theory , pages=
Provably efficient reinforcement learning with linear function approximation , author=. Conference on learning theory , pages=. 2020 , organization=
work page 2020
-
[6]
International Conference on Machine Learning , pages=
Nearly minimax optimal reinforcement learning for linear markov decision processes , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[7]
International Conference on Machine Learning , pages=
Provably efficient exploration in policy optimization , author=. International Conference on Machine Learning , pages=. 2020 , organization=
work page 2020
-
[8]
Advances in Neural Information Processing Systems , volume=
Provably efficient reinforcement learning with linear function approximation under adaptivity constraints , author=. Advances in Neural Information Processing Systems , volume=
-
[9]
Advances in Neural Information Processing Systems , volume=
Warm-up free policy optimization: Improved regret in linear Markov decision processes , author=. Advances in Neural Information Processing Systems , volume=
-
[10]
International Conference on Machine Learning , pages=
Rate-Optimal Policy Optimization for Linear Markov Decision Processes , author=. International Conference on Machine Learning , pages=. 2024 , organization=
work page 2024
-
[11]
Advances in Neural Information Processing Systems , volume=
Online convex optimization with stochastic constraints , author=. Advances in Neural Information Processing Systems , volume=
-
[12]
On the Properties of the Softmax Function with Application in Game Theory and Reinforcement Learning , author =. arXiv preprint , volume =
-
[13]
Optimal Approximation--Smoothness Trade-offs for Soft-max , author =. Proc.\ NeurIPS , year =
-
[14]
Proceedings of the ACM on Measurement and Analysis of Computing Systems , volume=
Online primal-dual mirror descent under stochastic constraints , author=. Proceedings of the ACM on Measurement and Analysis of Computing Systems , volume=. 2020 , publisher=
work page 2020
-
[15]
Advances in Neural Information Processing Systems , volume=
Upper confidence primal-dual reinforcement learning for cmdp with adversarial loss , author=. Advances in Neural Information Processing Systems , volume=
-
[16]
Mathematical Programming , volume=
Primal-dual first-order methods with iteration-complexity for cone programming , author=. Mathematical Programming , volume=. 2011 , publisher=
work page 2011
-
[17]
arXiv preprint arXiv:2502.10138 , year=
Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation , author=. arXiv preprint arXiv:2502.10138 , year=
-
[18]
A convex upper bound on the log-partition function for binary graphical models , author=. Proc. NIPS , year=
-
[19]
International Conference on Machine Learning , pages=
Optimistic policy optimization with bandit feedback , author=. International Conference on Machine Learning , pages=. 2020 , organization=
work page 2020
-
[20]
International Conference on Machine Learning , pages=
Near-optimal regret bounds for stochastic shortest path , author=. International Conference on Machine Learning , pages=. 2020 , organization=
work page 2020
-
[21]
Introduction to online convex optimization , author=. Foundations and Trends. 2016 , publisher=
work page 2016
- [22]
-
[23]
International Conference on Machine Learning , pages=
Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback , author=. International Conference on Machine Learning , pages=. 2024 , organization=
work page 2024
-
[24]
arXiv preprint arXiv:2003.02189 , year=
Exploration-exploitation in constrained mdps , author=. arXiv preprint arXiv:2003.02189 , year=
-
[25]
Advances in Neural Information Processing Systems , volume=
Learning policies with zero or bounded constraint violation for constrained mdps , author=. Advances in Neural Information Processing Systems , volume=
-
[26]
International conference on artificial intelligence and statistics , pages=
Provably efficient safe exploration via primal-dual policy optimization , author=. International conference on artificial intelligence and statistics , pages=. 2021 , organization=
work page 2021
-
[27]
Advances in Neural Information Processing Systems , volume=
A theoretical analysis of optimistic proximal policy optimization in linear markov decision processes , author=. Advances in Neural Information Processing Systems , volume=
-
[28]
International Conference on Artificial Intelligence and Statistics , pages=
Near-optimal policy optimization algorithms for learning adversarial linear mixture mdps , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=
work page 2022
-
[29]
International Conference on Machine Learning , pages=
Model-based reinforcement learning with value-targeted regression , author=. International Conference on Machine Learning , pages=. 2020 , organization=
work page 2020
-
[30]
An Optimistic Algorithm for online
Jiahui Zhu and Kihyun Yu and Dabeen Lee and Xin Liu and Honghao Wei , booktitle=. An Optimistic Algorithm for online. 2025 , url=
work page 2025
- [31]
-
[32]
2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=
Safe reinforcement learning on autonomous vehicles , author=. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages=. 2018 , organization=
work page 2018
-
[33]
International conference on machine learning , pages=
Constrained policy optimization , author=. International conference on machine learning , pages=. 2017 , organization=
work page 2017
-
[34]
Artificial intelligence in medicine , volume=
Reinforcement learning for intelligent healthcare applications: A survey , author=. Artificial intelligence in medicine , volume=. 2020 , publisher=
work page 2020
-
[35]
IEEE Transactions on Automatic Control , volume=
A general safety framework for learning-based control in uncertain robotic systems , author=. IEEE Transactions on Automatic Control , volume=. 2018 , publisher=
work page 2018
-
[36]
arXiv preprint arXiv:2101.10895 , year=
A primal-dual approach to constrained markov decision processes , author=. arXiv preprint arXiv:2101.10895 , year=
-
[37]
Reinforcement learning: An introduction , author=. 1998 , publisher=
work page 1998
-
[38]
International Conference on Artificial Intelligence and Statistics , pages=
Triple-q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=
work page 2022
-
[39]
Advances in neural information processing systems , volume=
DOPE: Doubly optimistic and pessimistic exploration for safe reinforcement learning , author=. Advances in neural information processing systems , volume=
-
[40]
Reinforcement Learning Journal , year=
Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism , author=. Reinforcement Learning Journal , year=
-
[41]
International Conference on Machine Learning , pages=
Truly No-Regret Learning in Constrained MDPs , author=. International Conference on Machine Learning , pages=. 2024 , organization=
work page 2024
-
[42]
International Conference on Artificial Intelligence and Statistics , pages=
Towards achieving sub-linear regret and hard constraint violation in model-free rl , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=
work page 2024
-
[43]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Safe reinforcement learning with instantaneous constraints: the role of aggressive exploration , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[44]
International Conference on Machine Learning , pages=
Safe reinforcement learning with linear function approximation , author=. International Conference on Machine Learning , pages=. 2021 , organization=
work page 2021
-
[45]
Conference on Learning Theory , pages=
Cautiously optimistic policy optimization and exploration with linear function approximation , author=. Conference on Learning Theory , pages=. 2021 , organization=
work page 2021
-
[46]
International Conference on Machine Learning , pages=
Refined regret for adversarial mdps with linear function approximation , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[47]
International Conference on Machine Learning , pages=
Improved regret for efficient online reinforcement learning with linear function approximation , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[48]
arXiv preprint arXiv:2401.17780 , year=
A policy gradient primal-dual algorithm for constrained mdps with uniform pac guarantees , author=. arXiv preprint arXiv:2401.17780 , year=
-
[49]
Building healthy recommendation sequences for everyone: A safe reinforcement learning approach , author=. FAccTRec Workshop , year=
-
[50]
Learning for Dynamics and Control , pages=
Constrained upper confidence reinforcement learning , author=. Learning for Dynamics and Control , pages=. 2020 , organization=
work page 2020
-
[51]
Optimal Strong Regret and Violation in Constrained
Francesco Emanuele Stradi and Matteo Castiglioni and Alberto Marchesi and Nicola Gatti , booktitle=. Optimal Strong Regret and Violation in Constrained. 2025 , url=
work page 2025
-
[52]
Forty-second International Conference on Machine Learning , year=
Policy Optimization for CMDPs with Bandit Feedback: Learning Stochastic and Adversarial Constraints , author=. Forty-second International Conference on Machine Learning , year=
-
[53]
Forty-first International Conference on Machine Learning , year=
Online learning in CMDPs: Handling stochastic and adversarial constraints , author=. Forty-first International Conference on Machine Learning , year=
-
[54]
Francesco Emanuele Stradi and Matteo Castiglioni and Alberto Marchesi and Nicola Gatti , booktitle=. Learning Adversarial. 2025 , url=
work page 2025
-
[55]
Towards Optimal Regret in Adversarial Linear
Haolin Liu and Chen-Yu Wei and Julian Zimmert , booktitle=. Towards Optimal Regret in Adversarial Linear. 2024 , url=
work page 2024
-
[56]
Improved Regret Bounds for Linear Adversarial
Fang Kong and XiangCheng Zhang and Baoxiang Wang and Shuai Li , journal=. Improved Regret Bounds for Linear Adversarial. 2024 , url=
work page 2024
- [57]
-
[58]
Advances in Neural Information Processing Systems , volume=
Policy optimization in adversarial mdps: Improved exploration via dilated bonuses , author=. Advances in Neural Information Processing Systems , volume=
-
[59]
Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model , author=. arXiv preprint arXiv:2507.02089 , year=
-
[60]
Advances in Neural Information Processing Systems , volume=
Confident Natural Policy Gradient for Local Planning in q_ -realizable Constrained MDPs , author=. Advances in Neural Information Processing Systems , volume=
-
[61]
International Conference on Machine Learning , pages=
Crpo: A new approach for safe reinforcement learning with convergence guarantee , author=. International Conference on Machine Learning , pages=. 2021 , organization=
work page 2021
-
[62]
Advances in Neural Information Processing Systems , volume=
Last-iterate convergent policy gradient primal-dual methods for constrained mdps , author=. Advances in Neural Information Processing Systems , volume=
-
[63]
International Conference on Machine Learning , pages=
A near-optimal algorithm for safe reinforcement learning under instantaneous hard constraints , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[64]
International Conference on Machine Learning , pages=
Learning infinite-horizon average-reward Markov decision process with constraints , author=. International Conference on Machine Learning , pages=. 2022 , organization=
work page 2022
-
[65]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Provably efficient primal-dual reinforcement learning for cmdps with non-stationary objectives and constraints , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[66]
International Conference on Artificial Intelligence and Statistics , pages=
Provably efficient model-free algorithms for non-stationary cmdps , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2023 , organization=
work page 2023
-
[67]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
A provably-efficient model-free algorithm for infinite-horizon average-reward constrained Markov decision processes , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[68]
Cancellation-free regret bounds for lagrangian approaches in constrained markov decision processes , author=. arXiv preprint arXiv:2306.07001 , year=
-
[69]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[70]
Amirhossein Roknilamouki and Arnob Ghosh and Ming Shi and Fatemeh Nourzad and Eylem Ekici and Ness Shroff , booktitle=. Provably Efficient. 2025 , url=
work page 2025
-
[71]
Achieving Sub-linear Regret in Infinite Horizon Average Reward Constrained
Arnob Ghosh and Xingyu Zhou and Ness Shroff , booktitle=. Achieving Sub-linear Regret in Infinite Horizon Average Reward Constrained. 2023 , url=
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.