Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form
Pith reviewed 2026-05-23 21:56 UTC · model grok-4.3
The pith
The epigraph form of robust constrained MDPs lets a bisection search with policy gradient identify an ε-optimal policy using Õ(ε^{-4}) robust policy evaluations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that the epigraph form of the RCMDP converts the original constrained min-max problem into a form where a bisection search algorithm equipped with a policy gradient subroutine identifies an ε-optimal policy with Õ(ε^{-4}) robust policy evaluations, while the conventional Lagrangian max-min formulation can fail because its inner minimization encounters a sum of conflicting gradients.
What carries the argument
The epigraph form of the RCMDP problem, which introduces an auxiliary variable so that a single gradient is chosen from either the objective or one of the constraints at each step.
If this is right
- The algorithm succeeds on instances where the Lagrangian method is trapped by summed conflicting gradients.
- Near-optimal robust policies are obtained with a polynomial number of robust policy evaluations that scales as ε to the minus four.
- The method guarantees both near-optimality of the objective and satisfaction of the constraints in every environment inside the uncertainty set.
- Policy gradient updates can be performed safely without needing to solve a joint min-max at every iteration.
Where Pith is reading between the lines
- The same epigraph reformulation might be tried on other constrained robust problems that currently rely on Lagrangian relaxations.
- Replacing the inner policy gradient subroutine with a more sample-efficient estimator could reduce the overall evaluation count further.
- The approach supplies a template for turning other min-max constrained problems into searchable forms that avoid gradient cancellation.
Load-bearing premise
Reformulating the constrained problem in epigraph form still lets the algorithm recover an optimal policy for the original RCMDP.
What would settle it
An explicit RCMDP instance in which the bisection search returns a policy whose worst-case cost exceeds the claimed ε-optimal value after the stated number of robust policy evaluations.
Figures
read the original abstract
Designing a safe policy for uncertain environments is crucial in real-world control systems. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional policy gradient approach to the Lagrangian max-min formulation can become trapped in suboptimal solutions. This occurs when its inner minimization encounters a sum of conflicting gradients from the objective and constraint functions. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a bisection search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\tilde{\mathcal{O}}(\varepsilon^{-4})$ robust policy evaluations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce the first algorithm for identifying an ε-optimal policy in robust constrained MDPs (RCMDPs), where the policy minimizes cost subject to worst-case constraints over an uncertainty set. It proves that standard policy gradient on the Lagrangian formulation can trap in suboptimal points due to conflicting gradients from objective and constraints. To resolve this, the authors reformulate the RCMDP in epigraph form, which allows selecting a single gradient (either objective or constraint) at each step. They then propose a bisection search algorithm over the epigraph variable combined with a policy gradient subroutine, and prove that this identifies an ε-optimal policy using Õ(ε^{-4}) robust policy evaluations.
Significance. If the central equivalence between the epigraph reformulation and the original RCMDP optimum holds under robust evaluation, the result would be significant: it supplies the first polynomial-sample guarantee for near-optimal safe policy identification in RCMDPs, a setting that directly models real-world control under model uncertainty. The explicit proof of the Lagrangian trapping phenomenon and the reduction to a single-gradient subroutine are concrete technical contributions that could influence subsequent work on constrained robust RL.
major comments (2)
- [Abstract / epigraph reformulation section] Abstract (paragraph on epigraph form) and the claimed equivalence proof: the assertion that selecting a single gradient from either the objective or the constraints inside the bisection search recovers an ε-optimal policy for the original min-cost + worst-case constraint problem is load-bearing for both the algorithm and the Õ(ε^{-4}) bound. The inner minimization is over an uncertainty set; it is not immediate that a per-step single-gradient choice preserves joint feasibility across all environments in the set, even if the epigraph value matches at optimality. An explicit statement of the equivalence (e.g., showing that any ε-optimal epigraph solution maps back to an ε-optimal robust policy) is required.
- [Algorithm and complexity analysis] The sample-complexity claim of Õ(ε^{-4}) robust policy evaluations rests on the bisection + policy-gradient subroutine being correct for the epigraph problem. Without a concrete reduction showing that the robust policy evaluation oracle used inside the subroutine matches the original RCMDP definition (including how the worst-case is taken inside the gradient step), the bound cannot be verified as applying to the original problem.
minor comments (1)
- [Abstract] Notation for the uncertainty set and the robust policy evaluation operator should be introduced once and used consistently; the abstract refers to both “robust policy evaluations” and “worst-case scenario across a set of environments” without a single symbol.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback. We address each major comment below and will incorporate clarifications in the revised manuscript.
read point-by-point responses
-
Referee: [Abstract / epigraph reformulation section] Abstract (paragraph on epigraph form) and the claimed equivalence proof: the assertion that selecting a single gradient from either the objective or the constraints inside the bisection search recovers an ε-optimal policy for the original min-cost + worst-case constraint problem is load-bearing for both the algorithm and the Õ(ε^{-4}) bound. The inner minimization is over an uncertainty set; it is not immediate that a per-step single-gradient choice preserves joint feasibility across all environments in the set, even if the epigraph value matches at optimality. An explicit statement of the equivalence (e.g., showing that any ε-optimal epigraph solution maps back to an ε-optimal robust policy) is required.
Authors: We agree that an explicit mapping statement strengthens the presentation. Theorem 3.1 proves equivalence at optimality between the epigraph form and the original RCMDP. The proof of the main result (Theorem 4.3) further shows that any ε-optimal epigraph solution maps to an ε-optimal robust policy: the epigraph variable enforces a uniform bound on the worst-case cost, and the single-gradient selection (objective or constraint) preserves joint feasibility over the uncertainty set because the worst-case operator is applied identically in both formulations. We will add a dedicated corollary after Theorem 3.1 stating this approximate equivalence explicitly and revise the abstract paragraph for clarity. revision: yes
-
Referee: [Algorithm and complexity analysis] The sample-complexity claim of Õ(ε^{-4}) robust policy evaluations rests on the bisection + policy-gradient subroutine being correct for the epigraph problem. Without a concrete reduction showing that the robust policy evaluation oracle used inside the subroutine matches the original RCMDP definition (including how the worst-case is taken inside the gradient step), the bound cannot be verified as applying to the original problem.
Authors: The robust policy evaluation oracle is defined identically in the epigraph problem (Definition 3.2) and the original RCMDP: each call computes the worst-case value over the uncertainty set for the active function (cost or constraint). This is used directly in the policy-gradient subroutine (Algorithm 2) and the bisection analysis (Section 4), so the Õ(ε^{-4}) bound applies to the original problem. To make the reduction fully transparent, we will insert a remark in Section 4.2 explicitly tracing how the worst-case operator enters each gradient step and confirming that every oracle call matches the RCMDP definition. revision: yes
Circularity Check
No significant circularity; derivation self-contained from epigraph reformulation and standard primitives.
full rationale
The paper proposes a bisection search algorithm using a policy gradient subroutine on the epigraph form of the RCMDP, claiming an Õ(ε^{-4}) bound on robust policy evaluations after proving Lagrangian issues with conflicting gradients. No quoted step reduces the final bound or optimality guarantee to a fitted input renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain; the epigraph choice is presented as a reformulation enabling single-gradient selection, with the complexity bound derived from the algorithm's analysis rather than tautological equivalence to inputs. This is the common honest outcome for theoretical algorithm papers whose central claims rest on explicit proofs rather than data fits or imported uniqueness theorems.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Policy gradient methods converge to a stationary point of the inner minimization when a single gradient is supplied at each step.
- domain assumption The uncertainty set in the RCMDP admits a well-defined worst-case evaluation that can be approximated by robust policy evaluations.
Forward citations
Cited by 1 Pith paper
-
Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees
RHC-UCRL is the first algorithm for safety-constrained RL under explicit adversarial dynamics, providing sub-linear regret and constraint violation guarantees by maintaining optimism over both agent and adversary policies.
Reference graph
Works this paper leans on
-
[1]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
On Frank-Wolfe and Equilibrium Computation
Jacob D Abernethy and Jun-Kun Wang. On Frank-Wolfe and Equilibrium Computation . In Advances in Neural Information Processing Systems, 2017
work page 2017
-
[3]
Constrained Policy Optimization
Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained Policy Optimization . In International Conference on Machine Learning, 2017
work page 2017
-
[4]
On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift
Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021
work page 2021
-
[5]
Constrained Markov Decision Processes , volume 7
Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999
work page 1999
-
[6]
James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System Level Synthesis . Annual Reviews in Control, 47: 0 364--393, 2019
work page 2019
-
[7]
On the Generation of Markov Decision Processes
TW Archibald, KIM McKinnon, and LC Thomas. On the Generation of Markov Decision Processes . Journal of the Operational Research Society, 46 0 (3): 0 354--361, 1995
work page 1995
-
[8]
First-Order Methods in Optimization
Amir Beck. First-Order Methods in Optimization . SIAM, 2017
work page 2017
-
[9]
R. Bellman, R.E. Bellman, and Rand Corporation. Dynamic Programming. Princeton University Press, 1957
work page 1957
-
[10]
Robust Model Predictive Control: A Survey
Alberto Bemporad and Manfred Morari. Robust Model Predictive Control: A Survey . In Robustness in Identification and Control, pp.\ 207--226. Springer, 2007
work page 2007
-
[11]
Robust Optimization--a Comprehensive Survey
Hans-Georg Beyer and Bernhard Sendhoff. Robust Optimization--a Comprehensive Survey . Computer Methods in Applied Mechanics and Engineering, 196 0 (33-34): 0 3190--3218, 2007
work page 2007
-
[12]
Stephen P Boyd and Lieven Vandenberghe. Convex Optimization . Cambridge University Press, 2004
work page 2004
-
[13]
DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning
Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning . In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[14]
Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality
Hyeong Soo Chang. Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality . arXiv preprint arXiv:2308.03297, 2023
-
[15]
Dynamic Programming Equations for Discounted Constrained Stochastic Control
Richard C Chen and Gilmer L Blankenship. Dynamic Programming Equations for Discounted Constrained Stochastic Control . IEEE Transactions on Automatic Control, 49 0 (5): 0 699--709, 2004
work page 2004
-
[16]
Non-Randomized Control of Constrained Markov Decision Processes
Richard C Chen and Eugene A Feinberg. Non-Randomized Control of Constrained Markov Decision Processes . In American Control Conference, 2006
work page 2006
-
[17]
A Primal-Dual Approach to Constrained Markov Decision Processes
Yi Chen, Jing Dong, and Zhaoran Wang. A Primal-Dual Approach to Constrained Markov Decision Processes . arXiv preprint arXiv:2101.10895, 2021
-
[18]
Distributionally Robust Optimization for Sequential Decision-Making
Zhi Chen, Pengqian Yu, and William B Haskell. Distributionally Robust Optimization for Sequential Decision-Making . Optimization, 68 0 (12): 0 2397--2426, 2019
work page 2019
-
[19]
Nonsmooth analysis and Control Theory , volume 178
Francis H Clarke, Yuri S Ledyaev, Ronald J Stern, and Peter R Wolenski. Nonsmooth analysis and Control Theory , volume 178. Springer Science & Business Media, 2008
work page 2008
-
[20]
Optimization and Nonsmooth Analysis
Frank H Clarke. Optimization and Nonsmooth Analysis . SIAM, 1990
work page 1990
-
[21]
Towards minimax Optimality of Model-based Robust Reinforcement Learning
Pierre Clavier, Erwan Le Pennec, and Matthieu Geist. Towards minimax Optimality of Model-based Robust Reinforcement Learning . arXiv preprint arXiv:2302.05372, 2023
-
[22]
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning . In Advances in Neural Information Processing Systems, 2017
work page 2017
-
[23]
On Linear Programming in a Markov Decision Problem
Eric V Denardo. On Linear Programming in a Markov Decision Problem . Management Science, 16 0 (5): 0 281--288, 1970
work page 1970
-
[24]
Twice Regularized MDPs and the Equivalence Between Robustness and Regularization
Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[25]
Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes
Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes . In Advances in Neural Information Processing Systems, 2020
work page 2020
-
[26]
Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs
Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, and Alejandro Ribeiro. Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs . In Advances in Neural Information Processing Systems, 2024
work page 2024
-
[27]
Analysis of Feedback Systems with Structured Uncertainties
John Doyle. Analysis of Feedback Systems with Structured Uncertainties . In IEE Proceedings D-Control Theory and Applications, 1982
work page 1982
-
[28]
Bilinear Classes: A Structural Framework for Provable Generalization in RL
Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear Classes: A Structural Framework for Provable Generalization in RL . In International Conference on Machine Learning, 2021
work page 2021
-
[29]
Robust Nonparametric Regression under Huber's $\epsilon$-contamination Model
Simon S Du, Yining Wang, Sivaraman Balakrishnan, Pradeep Ravikumar, and Aarti Singh. Robust Nonparametric Regression under Huber's -contamination Model . arXiv preprint arXiv:1805.10406, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[30]
Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020
Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-Exploitation in Constrained MDPs . arXiv preprint arXiv:2003.02189, 2020
-
[31]
Arnob Ghosh. Sample Complexity for Obtaining Sub-optimality and Violation Bound for Distributionally Robust Constrained MDP . In Reinforcement Learning Safety Workshop, 2024
work page 2024
-
[32]
Robust Markov Decision Processes: Beyond Rectangularity
Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023
work page 2023
-
[33]
Scalable First-Order Methods for Robust Mdps
Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021
work page 2021
-
[34]
On the Convex Formulations of Robust Markov Decision Processes
Julien Grand-Cl \'e ment and Marek Petrik. On the Convex Formulations of Robust Markov Decision Processes . Mathematics of Operations Research, 0 0 (0), 2024
work page 2024
-
[35]
Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs
Aria HasanzadeZonuzy, Archana Bura, Dileep Kalathil, and Srinivas Shakkottai. Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs . In AAAI Conference on Artificial Intelligence, 2021
work page 2021
-
[36]
On Constrained Markov Decision Processes
Moshe Haviv. On Constrained Markov Decision Processes . Operations Research Letters, 19 0 (1): 0 25--28, 1996
work page 1996
-
[37]
Partial Policy Iteration for L1 -Robust Markov Decision Processes
Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial Policy Iteration for L1 -Robust Markov Decision Processes . Journal of Machine Learning Research, 22 0 (275): 0 1--46, 2021
work page 2021
-
[38]
Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005
work page 2005
-
[39]
PAC Reinforcement Learning with an Imperfect Model
Nan Jiang. PAC Reinforcement Learning with an Imperfect Model . In AAAI Conference on Artificial Intelligence, 2018
work page 2018
-
[40]
A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees
Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, and Yutaka Matsuo. A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees . arXiv preprint arXiv:2401.17780, 2024
-
[41]
A Ya Kruger. On Fr\'echet Subdifferentials . Journal of Mathematical Sciences, 116 0 (3): 0 3325--3358, 2003
work page 2003
-
[42]
Efficient Policy Iteration for Robust Markov Decision Processes via Regularization
Navdeep Kumar, Kfir Levy, Kaixin Wang, and Shie Mannor. Efficient Policy Iteration for Robust Markov Decision Processes via Regularization . arXiv preprint arXiv:2205.14327, 2022
-
[43]
Policy Gradient for Rectangular Robust Markov Decision Processes
Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024
work page 2024
-
[44]
Batch Policy Learning Under Constraints
Hoang Le, Cameron Voloshin, and Yisong Yue. Batch Policy Learning Under Constraints . In International Conference on Machine Learning, 2019
work page 2019
-
[45]
Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process
Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, and Guanghui Lan. Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process . Operations Research Letters, 54: 0 107107, 2024
work page 2024
-
[46]
First-Order Policy Optimization for Robust Markov Decision Process
Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022
-
[47]
A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes
Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes . arXiv preprint arXiv:2406.00274, 2024
-
[48]
Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs
Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs . In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[49]
Robust Entropy-Regularized Markov Decision Processes
Tien Mai and Patrick Jaillet. Robust Entropy-Regularized Markov Decision Processes . arXiv preprint arXiv:2112.15364, 2021
-
[50]
Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification
Daniel J Mankowitz, Dan A Calian, Rae Jeong, Cosmin Paduraru, Nicolas Heess, Sumanth Dathathri, Martin Riedmiller, and Timothy Mann. Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification . arXiv preprint arXiv:2010.10644, 2020
-
[51]
On the Global Convergence Rates of Softmax Policy Gradient Methods
Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020
work page 2020
-
[52]
Methods of Nonconvex Optimization
VS Mikhalevich, AM Gupal, and VI Norkin. Methods of Nonconvex Optimization . arXiv preprint arXiv:2406.10406, 2024
-
[53]
Reinforcement Learning with Convex Constraints
Sobhan Miryoosefi, Kiant \'e Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. Reinforcement Learning with Convex Constraints . In Advances in Neural Information Processing Systems, 2019
work page 2019
-
[54]
Human-level Control through Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level Control through Deep Reinforcement Learning . Nature, 518 0 (7540): 0 529--533, 2015
work page 2015
-
[55]
Truly No-Regret Learning in Constrained MDPs
Adrian M \"u ller, Pragnya Alatur, Volkan Cevher, Giorgia Ramponi, and Niao He. Truly No-Regret Learning in Constrained MDPs . arXiv preprint arXiv:2402.15776, 2024
-
[56]
Reinforcement Learning via Fenchel-Rockafellar Duality
Ofir Nachum and Bo Dai. Reinforcement Learning via Fenchel-Rockafellar Duality . arXiv preprint arXiv:2001.01866, 2020
-
[57]
Robust Control of Markov Decision Processes with Uncertain Transition Matrices
Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005
work page 2005
-
[58]
Sample Complexity of Robust Reinforcement Learning with a Generative Model
Kishan Panaganti and Dileep Kalathil. Sample Complexity of Robust Reinforcement Learning with a Generative Model . In International Conference on Artificial Intelligence and Statistics, 2022
work page 2022
-
[59]
Neal Parikh, Stephen Boyd, et al. Proximal algorithms . Foundations and Trends in Optimization , 1 0 (3): 0 127--239, 2014
work page 2014
-
[60]
Constrained Reinforcement Learning Has Zero Duality Gap
Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained Reinforcement Learning Has Zero Duality Gap . In Advances in Neural Information Processing Systems, 2019
work page 2019
-
[61]
Safe Policies for Reinforcement Learning via Primal-Dual Methods
Santiago Paternain, Miguel Calvo-Fullana, Luiz FO Chamon, and Alejandro Ribeiro. Safe Policies for Reinforcement Learning via Primal-Dual Methods . IEEE Transactions on Automatic Control, 68 0 (3): 0 1321--1336, 2022
work page 2022
-
[62]
Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe Policy Iteration . In International Conference on Machine Learning, 2013
work page 2013
-
[63]
Distributionally robust optimization: A review
Hamed Rahimian and Sanjay Mehrotra. Distributionally Robust Optimization: A Review . arXiv preprint arXiv:1908.05659, 2019
-
[64]
Variational Analysis , volume 317
R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009
work page 2009
-
[65]
Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty
Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020
-
[66]
Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958
work page 1958
-
[67]
Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning
Oswin So and Chuchu Fan. Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning . In Proceedings of Robotics: Science and Systems, 2023
work page 2023
-
[68]
Solving Minimum-Cost Reach Avoid using Reinforcement Learning
Oswin So, Cheng Ge, and Chuchu Fan. Solving Minimum-Cost Reach Avoid using Reinforcement Learning . In Advances in Neural Information Processing Systems, 2024
work page 2024
-
[69]
Constrained Reinforcement Learning Under Model Mismatch
Zhongchang Sun, Sihong He, Fei Miao, and Shaofeng Zou. Constrained Reinforcement Learning Under Model Mismatch . arXiv preprint arXiv:2405.01327, 2024
-
[70]
Reward Constrained Policy Optimization
Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward Constrained Policy Optimization . In International Conference on Learning Representations, 2018
work page 2018
-
[71]
Policy Gradient in Robust MDPs with Global Convergence Guarantee
Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023
work page 2023
-
[72]
Online Robust Reinforcement Learning with Model Uncertainty
Yue Wang and Shaofeng Zou. Online Robust Reinforcement Learning with Model Uncertainty . In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[73]
Policy Gradient Method for Robust Reinforcement Learning
Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022
work page 2022
-
[74]
Robust Constrained Reinforcement Learning
Yue Wang, Fei Miao, and Shaofeng Zou. Robust Constrained Reinforcement Learning . arXiv preprint arXiv:2209.06866, 2022
-
[75]
A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes
Honghao Wei, Xin Liu, and Lei Ying. A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes . arXiv preprint arXiv:2106.01577, 2021
-
[76]
Robust Markov Decision Processes
Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013
work page 2013
-
[77]
Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Toward Theoretical Understandings of Robust Markov Decision Processes: Sample Complexity and Asymptotics . The Annals of Statistics, 50 0 (6): 0 3223--3248, 2022
work page 2022
-
[78]
Robust Markov Decision Processes without Model Estimation
Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023
-
[79]
A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization
Donghao Ying, Yuhao Ding, and Javad Lavaei. A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization . In International Conference on Artificial Intelligence and Statistics, 2022
work page 2022
-
[80]
Reward is Enough for Convex MDPs
Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.