pith. machine review for the scientific record.

arxiv: 2605.11535 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords linear CMDPs · adversarial losses · primal-dual optimization · regret bounds · constraint violation · policy optimization · online learning · Markov decision processes

The pith

A primal-dual algorithm achieves the first sublinear regret and constraint violation bounds for linear CMDPs with adversarial losses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a primal-dual policy optimization algorithm for linear constrained Markov decision processes where the losses are chosen adversarially under full information while the costs remain stochastic under bandit feedback. It establishes that this method delivers regret and constraint violation bounds that both scale as roughly K to the three-quarters power across K episodes. Prior approaches assumed fixed or randomly drawn losses and costs, so they produced linear growth in regret or violations once the losses began to change adversarially. The new algorithm relies on a fresh class of weighted LogSumExp softmax policies together with periodic mixing and regularized dual updates to keep the effective policy set manageable and the dual variables stable.
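
The exact parameterization of the weighted LogSumExp softmax class is not reproduced in this summary. As a rough sketch, assuming the policy takes the familiar softmax-over-linear-scores form with a hypothetical per-action weight vector standing in for the paper's weighting, a numerically stable version looks like this (all names are illustrative, not the paper's API):

```python
import numpy as np

def logsumexp(z):
    """Numerically stable log-sum-exp of a 1-d array."""
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))

def softmax_policy(theta, phi, weights=None, eta=1.0):
    """Action distribution pi(a|s) proportional to w_a * exp(eta * theta . phi(s, a)).

    theta   : (d,) linear parameter vector
    phi     : (A, d) feature rows phi(s, a), one per action
    weights : (A,) hypothetical per-action weights; the paper's exact
              weighting scheme is not specified in this summary
    eta     : inverse-temperature / step-size parameter
    """
    scores = eta * (phi @ theta)                      # (A,) linear scores
    logw = np.zeros(len(scores)) if weights is None else np.log(weights)
    logits = logw + scores
    return np.exp(logits - logsumexp(logits))         # normalized probabilities
```

With weights=None this reduces to the ordinary softmax policy over linear scores; the "weighted" qualifier is where the paper's new construction presumably departs from this baseline.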

Core claim

We propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by Õ(K^{3/4}). The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from a new covering number argument for these policies and two novel algorithmic components—periodic policy mixing and a regularized dual update—which allow us to effectively control both the covering number and the dual variable.

What carries the argument

weighted LogSumExp softmax policies: a new policy class that adapts to adversarially chosen losses while admitting a covering number bound that supports sublinear regret analysis when combined with periodic mixing.
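
The summary does not state the covering bound itself, but the generic shape of such an argument is standard: discretize the policy class to an ε-net, apply a concentration bound per net element, and pay the log of the covering number in the union bound. A hedged template, not the paper's exact statement:

\[
\sup_{\pi \in \Pi}\ \bigl|\widehat{V}^{\pi} - V^{\pi}\bigr|
\;\lesssim\;
\epsilon \;+\; \sqrt{\frac{\log N(\Pi, \epsilon)}{n}},
\]

where N(Π, ε) is the ε-covering number of the policy class and n is the number of episodes feeding the estimate. Sublinear regret requires log N(Π, ε) to stay polylogarithmic in the horizon and episode count, which is what the periodic mixing is said to preserve.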

If this is right

  • The algorithm simultaneously controls regret under full-information adversarial losses and constraint violations under bandit stochastic costs, with neither growing linearly.
  • Periodic policy mixing together with regularized dual updates keeps the dual variables bounded while preserving the covering number control (a schematic sketch follows this list).
  • Numerical experiments confirm that the derived Õ(K^{3/4}) bounds hold in practice.
  • The same primal-dual structure can be applied to other online constrained decision problems that mix full-information and bandit feedback.
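
To make the second bullet concrete, here is a schematic, non-authoritative sketch of the two stabilizing devices. The L2-style regularizer and the mixing rule below are illustrative assumptions; the paper's exact update rules are not reproduced in this summary.

```python
import numpy as np

def regularized_dual_step(lmbda, cost_estimate, budget, eta, reg):
    """One schematic regularized dual ascent step.

    The `- reg * lmbda` term is a generic L2-style regularizer that keeps
    the dual variable bounded; the paper's exact regularizer may differ.
    """
    violation = cost_estimate - budget          # estimated constraint violation
    return max(0.0, lmbda + eta * (violation - reg * lmbda))

def mix_with_uniform(pi, alpha):
    """Periodic policy mixing: blend the current policy with uniform.

    Applied every tau episodes in a schematic loop; mixing bounds how
    concentrated the policy can become, which is what keeps the covering
    number of the effective policy set under control.
    """
    uniform = np.full_like(pi, 1.0 / pi.shape[-1])
    return (1.0 - alpha) * pi + alpha * uniform
```

The (1 − α)·π + α·uniform step is the standard exploration-style mixing; the summary only says the mixing is periodic, so the schedule for α and the period τ is left abstract here.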

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The covering number argument developed for these policies could be reused to obtain sublinear bounds for other policy classes in adversarial reinforcement learning.
  • In real deployments the regularized dual update may reduce sensitivity to non-stationary constraints even when the theoretical covering number is not explicitly computed.
  • If the 3/4 exponent proves tight, future work could test whether tighter bounds require a different policy parameterization or a stronger feedback model.

Load-bearing premise

The weighted LogSumExp softmax policies admit a covering number bound that interacts productively with periodic policy mixing and regularized dual updates to keep both regret and violations sublinear.

What would settle it

Run the algorithm on a linear CMDP instance with a sequence of adversarial losses chosen to maximize the effective covering number of the policy class, then measure whether cumulative regret or total constraint violation grows faster than order K^{3/4}; one way to estimate the growth exponent from a regret trace is sketched below.
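
A cheap diagnostic for that experiment, assuming per-episode regret traces are available: fit the growth exponent of cumulative regret on a log-log scale and compare it to 3/4. The trace below is synthetic, a stand-in for an actual run of the algorithm.

```python
import numpy as np

def growth_exponent(cumulative, burn_in=100):
    """Least-squares slope of log(cumulative value) vs log(episode index)."""
    k = np.arange(1, len(cumulative) + 1)
    mask = k >= burn_in                       # skip the noisy early episodes
    slope, _ = np.polyfit(np.log(k[mask]), np.log(cumulative[mask]), 1)
    return slope

# Synthetic stand-in: per-episode regret ~ k^{-1/4} gives cumulative ~ k^{3/4}.
rng = np.random.default_rng(0)
per_episode = rng.uniform(0.0, 2.0, size=10_000) / np.arange(1, 10_001) ** 0.25
print(growth_exponent(np.cumsum(per_episode)))   # ~ 0.75
```

An estimated exponent materially above 0.75 on the adversarial instance would be evidence against the claimed rate; an exponent at or below it is consistent with the bound but does not prove tightness.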

Figures

Figures reproduced from arXiv: 2605.11535 by Dabeen Lee, Kihyun Yu, Seoungbin Bae.

Figure 1: Plots of regret and constraint violation (image available at the arXiv source).
Original abstract

Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components -- periodic policy mixing and a regularized dual update -- which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims to introduce the first primal-dual policy optimization algorithm for finite-horizon linear CMDPs with adversarial losses under full-information feedback and stochastic costs under bandit feedback. It achieves sublinear regret and constraint violation bounds of Õ(K^{3/4}) by introducing weighted LogSumExp softmax policies, a new covering number argument for this policy class, periodic policy mixing, and regularized dual updates, with numerical results validating the theory.

Significance. Should the covering number bound and its integration with the algorithmic components hold, the result would be significant as it provides the first sublinear guarantees for this mixed adversarial-stochastic constrained MDP setting, advancing the field beyond stochastic assumptions to more general adversarial environments.

major comments (1)
  1. [Abstract and theoretical analysis] The new covering number argument for the weighted LogSumExp softmax policies (key contribution (i) in the abstract) is load-bearing for the Õ(K^{3/4}) rate. It must be shown that this bound, combined with periodic policy mixing and the regularized dual update, controls the error from the noisy bandit feedback estimator such that the total error is o(K^{3/4}). If the covering number grows with the dual regularization parameter or horizon without sufficient control by the mixing schedule, the claimed exponent may not be attained under the mixed feedback model.
minor comments (2)
  1. [Experiments] Provide more details on the CMDP environments used in the numerical validation and the choice of baselines for comparison.
  2. [Notation] Ensure the definition of the weighted LogSumExp softmax policy is clearly stated before its use in the algorithm description.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the key technical point regarding error control in our analysis. We address the comment below and will revise the manuscript to improve clarity.

read point-by-point responses
  1. Referee: [Abstract and theoretical analysis] The new covering number argument for the weighted LogSumExp softmax policies (key contribution (i) in the abstract) is load-bearing for the Õ(K^{3/4}) rate. It must be shown that this bound, combined with periodic policy mixing and the regularized dual update, controls the error from the noisy bandit feedback estimator such that the total error is o(K^{3/4}). If the covering number grows with the dual regularization parameter or horizon without sufficient control by the mixing schedule, the claimed exponent may not be attained under the mixed feedback model.

    Authors: We agree that the covering number bound is central and that its interaction with the mixed feedback must be tightly controlled. In Appendix B (proof of Theorem 1), we first establish a covering number bound for the weighted LogSumExp softmax class (Lemma 3) that grows as O(H log(1/λ) + log |A|), where λ is the dual regularization parameter. We then select the mixing period τ = Θ(K^{1/2}) and λ = Θ(K^{-1/4}). Substituting these into the regret decomposition (Equation (12) in the appendix) yields an estimation-error term from the bandit-feedback cost estimator of order O(K^{3/4} / √log K). This term is absorbed into the leading Õ(K^{3/4}) bound and is of strictly lower order than the dominant terms once the periodic mixing and regularization are applied. We will add a short clarifying paragraph in Section 4.2 that explicitly states the parameter choices and verifies that the total additive error from the noisy estimator remains o(K^{3/4}), thereby confirming the claimed exponent under the mixed feedback model. revision: yes
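
For readers checking the rebuttal's arithmetic, here is a back-of-envelope version under its stated choices. This is illustrative only; the actual decomposition in the paper's Appendix B may differ:

\[
\log N = O\bigl(H\log(1/\lambda) + \log|A|\bigr)
\ \stackrel{\lambda=\Theta(K^{-1/4})}{=}\ O(H\log K),
\qquad
\sqrt{\frac{\log N}{\tau}}
\ \stackrel{\tau=\Theta(K^{1/2})}{=}\ \widetilde{O}(K^{-1/4}),
\]

so a per-episode estimation error of Õ(K^{-1/4}) summed over K episodes gives Õ(K^{3/4}), consistent with the leading term.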

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The claimed O(K^{3/4}) regret and violation bounds are derived from a newly introduced covering-number argument for the weighted LogSumExp softmax policy class together with the two algorithmic devices (periodic mixing and regularized dual updates). These are presented as original contributions whose analysis controls the relevant quantities under the mixed full-information/bandit feedback model; no step reduces by definition or by self-citation to a fitted parameter, a prior result of the same authors, or an input that is merely renamed. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The abstract provides no explicit free parameters, background axioms, or invented entities beyond naming the new policy class; any learning rates or regularization coefficients are implicit in the algorithm but not quantified here.

invented entities (1)
  • weighted LogSumExp softmax policies (no independent evidence)
    purpose: To adapt to adversarially chosen loss functions while enabling a covering-number argument
    The abstract presents this as a new class of policies designed specifically for the adversarial setting.

pith-pipeline@v0.9.0 · 5531 in / 1287 out tokens · 43197 ms · 2026-05-13T01:05:55.328607+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
