Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

Kazumi Kasaura; Kenta Hoshino; Masashi Hamaya; Paavo Parmas; Tadashi Kozuno; Toshinori Kitamura; Wataru Kumagai; Yohei Hosoe; Yutaka Matsuo

arxiv: 2408.16286 · v5 · submitted 2024-08-29 · 💻 cs.LG · math.OC

Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

Toshinori Kitamura , Tadashi Kozuno , Wataru Kumagai , Kenta Hoshino , Yohei Hosoe , Kazumi Kasaura , Masashi Hamaya , Paavo Parmas

show 1 more author

Yutaka Matsuo

This is my paper

Pith reviewed 2026-05-23 21:56 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords robust constrained Markov decision processepigraph formpolicy gradientbisection searchnear-optimal policysample complexityLagrangian formulationworst-case constraints

0 comments

The pith

The epigraph form of robust constrained MDPs lets a bisection search with policy gradient identify an ε-optimal policy using Õ(ε^{-4}) robust policy evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the usual Lagrangian formulation of robust constrained MDPs can trap a policy gradient method in suboptimal points because the inner minimization sums conflicting gradients from the objective and the constraints. Reformulating the problem in epigraph form removes the conflict by letting the algorithm pick a single gradient at each step from either the objective or a constraint. A bisection search then wraps a policy gradient subroutine to locate a policy that is near-optimal in the worst case over the uncertainty set. A reader would care because the result supplies the first algorithm with a concrete guarantee for learning safe policies when the environment belongs to an unknown set of possibilities.

Core claim

The paper claims that the epigraph form of the RCMDP converts the original constrained min-max problem into a form where a bisection search algorithm equipped with a policy gradient subroutine identifies an ε-optimal policy with Õ(ε^{-4}) robust policy evaluations, while the conventional Lagrangian max-min formulation can fail because its inner minimization encounters a sum of conflicting gradients.

What carries the argument

The epigraph form of the RCMDP problem, which introduces an auxiliary variable so that a single gradient is chosen from either the objective or one of the constraints at each step.

If this is right

The algorithm succeeds on instances where the Lagrangian method is trapped by summed conflicting gradients.
Near-optimal robust policies are obtained with a polynomial number of robust policy evaluations that scales as ε to the minus four.
The method guarantees both near-optimality of the objective and satisfaction of the constraints in every environment inside the uncertainty set.
Policy gradient updates can be performed safely without needing to solve a joint min-max at every iteration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same epigraph reformulation might be tried on other constrained robust problems that currently rely on Lagrangian relaxations.
Replacing the inner policy gradient subroutine with a more sample-efficient estimator could reduce the overall evaluation count further.
The approach supplies a template for turning other min-max constrained problems into searchable forms that avoid gradient cancellation.

Load-bearing premise

Reformulating the constrained problem in epigraph form still lets the algorithm recover an optimal policy for the original RCMDP.

What would settle it

An explicit RCMDP instance in which the bisection search returns a policy whose worst-case cost exceeds the claimed ε-optimal value after the stated number of robust policy evaluations.

Figures

Figures reproduced from arXiv: 2408.16286 by Kazumi Kasaura, Kenta Hoshino, Masashi Hamaya, Paavo Parmas, Tadashi Kozuno, Toshinori Kitamura, Wataru Kumagai, Yohei Hosoe, Yutaka Matsuo.

**Figure 1.** Figure 1: (a): An RCMDP example illustrating the gradient conflict challenge. Action labels are omitted when transitions are action-independent. (b): Policy gradients in the example with (γ, δ, b1) = (0.4, 0.09, 0). Arrows represent the gradient to decrease L1(π). π2 attracts policy gradients but is a local minimum since L1(π2) > L1(π1), where π1(·, a1) = 1 and π2(·, a2) = 1. c ′ := c0 + λ(c1 − b1/H1). Since Jc ′ ,P… view at source ↗

**Figure 2.** Figure 2: Algorithmic idea to find b0 = J ⋆ in Example 1 with (γ, δ, b1) = (0.1, 0, 2/3). This section describes the outer loop of EpiRC-PGS to identify b0 = J ⋆ . The outer loop utilizes the following monotonicity of ∆⋆ b0 for the identification. The proof is deferred to Appendix H.1: Lemma 2. ∆⋆ b0 is monotonically decreasing in b0 and ∆⋆ J⋆ = 0. Thanks to this monotonicity of ∆⋆ b0 , if ∆⋆ b0 can be efficiently c… view at source ↗

**Figure 3.** Figure 3: Comparison of the algorithms in different settings [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Designing a safe policy for uncertain environments is crucial in real-world control systems. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional policy gradient approach to the Lagrangian max-min formulation can become trapped in suboptimal solutions. This occurs when its inner minimization encounters a sum of conflicting gradients from the objective and constraint functions. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a bisection search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\tilde{\mathcal{O}}(\varepsilon^{-4})$ robust policy evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first algorithm with sample-complexity guarantees for near-optimal policy identification in RCMDPs by switching to epigraph form to avoid Lagrangian gradient traps.

read the letter

The core advance here is a bisection search over an epigraph reformulation of the RCMDP, paired with a policy-gradient subroutine, that returns an ε-optimal policy after Õ(ε^{-4}) robust policy evaluations. They also prove that the usual Lagrangian max-min formulation can trap at suboptimal points when the inner min produces opposing gradients from cost and constraint. That diagnosis and the switch to epigraph are the genuinely new pieces relative to prior robust RL work cited in the abstract.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce the first algorithm for identifying an ε-optimal policy in robust constrained MDPs (RCMDPs), where the policy minimizes cost subject to worst-case constraints over an uncertainty set. It proves that standard policy gradient on the Lagrangian formulation can trap in suboptimal points due to conflicting gradients from objective and constraints. To resolve this, the authors reformulate the RCMDP in epigraph form, which allows selecting a single gradient (either objective or constraint) at each step. They then propose a bisection search algorithm over the epigraph variable combined with a policy gradient subroutine, and prove that this identifies an ε-optimal policy using Õ(ε^{-4}) robust policy evaluations.

Significance. If the central equivalence between the epigraph reformulation and the original RCMDP optimum holds under robust evaluation, the result would be significant: it supplies the first polynomial-sample guarantee for near-optimal safe policy identification in RCMDPs, a setting that directly models real-world control under model uncertainty. The explicit proof of the Lagrangian trapping phenomenon and the reduction to a single-gradient subroutine are concrete technical contributions that could influence subsequent work on constrained robust RL.

major comments (2)

[Abstract / epigraph reformulation section] Abstract (paragraph on epigraph form) and the claimed equivalence proof: the assertion that selecting a single gradient from either the objective or the constraints inside the bisection search recovers an ε-optimal policy for the original min-cost + worst-case constraint problem is load-bearing for both the algorithm and the Õ(ε^{-4}) bound. The inner minimization is over an uncertainty set; it is not immediate that a per-step single-gradient choice preserves joint feasibility across all environments in the set, even if the epigraph value matches at optimality. An explicit statement of the equivalence (e.g., showing that any ε-optimal epigraph solution maps back to an ε-optimal robust policy) is required.
[Algorithm and complexity analysis] The sample-complexity claim of Õ(ε^{-4}) robust policy evaluations rests on the bisection + policy-gradient subroutine being correct for the epigraph problem. Without a concrete reduction showing that the robust policy evaluation oracle used inside the subroutine matches the original RCMDP definition (including how the worst-case is taken inside the gradient step), the bound cannot be verified as applying to the original problem.

minor comments (1)

[Abstract] Notation for the uncertainty set and the robust policy evaluation operator should be introduced once and used consistently; the abstract refers to both “robust policy evaluations” and “worst-case scenario across a set of environments” without a single symbol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback. We address each major comment below and will incorporate clarifications in the revised manuscript.

read point-by-point responses

Referee: [Abstract / epigraph reformulation section] Abstract (paragraph on epigraph form) and the claimed equivalence proof: the assertion that selecting a single gradient from either the objective or the constraints inside the bisection search recovers an ε-optimal policy for the original min-cost + worst-case constraint problem is load-bearing for both the algorithm and the Õ(ε^{-4}) bound. The inner minimization is over an uncertainty set; it is not immediate that a per-step single-gradient choice preserves joint feasibility across all environments in the set, even if the epigraph value matches at optimality. An explicit statement of the equivalence (e.g., showing that any ε-optimal epigraph solution maps back to an ε-optimal robust policy) is required.

Authors: We agree that an explicit mapping statement strengthens the presentation. Theorem 3.1 proves equivalence at optimality between the epigraph form and the original RCMDP. The proof of the main result (Theorem 4.3) further shows that any ε-optimal epigraph solution maps to an ε-optimal robust policy: the epigraph variable enforces a uniform bound on the worst-case cost, and the single-gradient selection (objective or constraint) preserves joint feasibility over the uncertainty set because the worst-case operator is applied identically in both formulations. We will add a dedicated corollary after Theorem 3.1 stating this approximate equivalence explicitly and revise the abstract paragraph for clarity. revision: yes
Referee: [Algorithm and complexity analysis] The sample-complexity claim of Õ(ε^{-4}) robust policy evaluations rests on the bisection + policy-gradient subroutine being correct for the epigraph problem. Without a concrete reduction showing that the robust policy evaluation oracle used inside the subroutine matches the original RCMDP definition (including how the worst-case is taken inside the gradient step), the bound cannot be verified as applying to the original problem.

Authors: The robust policy evaluation oracle is defined identically in the epigraph problem (Definition 3.2) and the original RCMDP: each call computes the worst-case value over the uncertainty set for the active function (cost or constraint). This is used directly in the policy-gradient subroutine (Algorithm 2) and the bisection analysis (Section 4), so the Õ(ε^{-4}) bound applies to the original problem. To make the reduction fully transparent, we will insert a remark in Section 4.2 explicitly tracing how the worst-case operator enters each gradient step and confirming that every oracle call matches the RCMDP definition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from epigraph reformulation and standard primitives.

full rationale

The paper proposes a bisection search algorithm using a policy gradient subroutine on the epigraph form of the RCMDP, claiming an Õ(ε^{-4}) bound on robust policy evaluations after proving Lagrangian issues with conflicting gradients. No quoted step reduces the final bound or optimality guarantee to a fitted input renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain; the epigraph choice is presented as a reformulation enabling single-gradient selection, with the complexity bound derived from the algorithm's analysis rather than tautological equivalence to inputs. This is the common honest outcome for theoretical algorithm papers whose central claims rest on explicit proofs rather than data fits or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the mathematical validity of the epigraph reformulation for RCMDPs and on standard convergence properties of policy gradient methods inside the bisection loop; no free parameters, invented entities, or ad-hoc axioms are mentioned.

axioms (2)

domain assumption Policy gradient methods converge to a stationary point of the inner minimization when a single gradient is supplied at each step.
Invoked when the epigraph form is said to resolve conflicting gradients (abstract).
domain assumption The uncertainty set in the RCMDP admits a well-defined worst-case evaluation that can be approximated by robust policy evaluations.
Required for the sample complexity statement to be meaningful.

pith-pipeline@v0.9.0 · 5751 in / 1416 out tokens · 52256 ms · 2026-05-23T21:56:36.453365+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees
cs.LG 2026-04 unverdicted novelty 8.0

RHC-UCRL is the first algorithm for safety-constrained RL under explicit adversarial dynamics, providing sub-linear regret and constraint violation guarantees by maintaining optimism over both agent and adversary policies.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

On Frank-Wolfe and Equilibrium Computation

Jacob D Abernethy and Jun-Kun Wang. On Frank-Wolfe and Equilibrium Computation . In Advances in Neural Information Processing Systems, 2017

work page 2017
[3]

Constrained Policy Optimization

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained Policy Optimization . In International Conference on Machine Learning, 2017

work page 2017
[4]

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021

work page 2021
[5]

Constrained Markov Decision Processes , volume 7

Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999

work page 1999
[6]

System Level Synthesis

James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System Level Synthesis . Annual Reviews in Control, 47: 0 364--393, 2019

work page 2019
[7]

On the Generation of Markov Decision Processes

TW Archibald, KIM McKinnon, and LC Thomas. On the Generation of Markov Decision Processes . Journal of the Operational Research Society, 46 0 (3): 0 354--361, 1995

work page 1995
[8]

First-Order Methods in Optimization

Amir Beck. First-Order Methods in Optimization . SIAM, 2017

work page 2017
[9]

Bellman, R.E

R. Bellman, R.E. Bellman, and Rand Corporation. Dynamic Programming. Princeton University Press, 1957

work page 1957
[10]

Robust Model Predictive Control: A Survey

Alberto Bemporad and Manfred Morari. Robust Model Predictive Control: A Survey . In Robustness in Identification and Control, pp.\ 207--226. Springer, 2007

work page 2007
[11]

Robust Optimization--a Comprehensive Survey

Hans-Georg Beyer and Bernhard Sendhoff. Robust Optimization--a Comprehensive Survey . Computer Methods in Applied Mechanics and Engineering, 196 0 (33-34): 0 3190--3218, 2007

work page 2007
[12]

Convex Optimization

Stephen P Boyd and Lieven Vandenberghe. Convex Optimization . Cambridge University Press, 2004

work page 2004
[13]

DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning

Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning . In Advances in Neural Information Processing Systems, 2022

work page 2022
[14]

Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality

Hyeong Soo Chang. Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality . arXiv preprint arXiv:2308.03297, 2023

work page arXiv 2023
[15]

Dynamic Programming Equations for Discounted Constrained Stochastic Control

Richard C Chen and Gilmer L Blankenship. Dynamic Programming Equations for Discounted Constrained Stochastic Control . IEEE Transactions on Automatic Control, 49 0 (5): 0 699--709, 2004

work page 2004
[16]

Non-Randomized Control of Constrained Markov Decision Processes

Richard C Chen and Eugene A Feinberg. Non-Randomized Control of Constrained Markov Decision Processes . In American Control Conference, 2006

work page 2006
[17]

A Primal-Dual Approach to Constrained Markov Decision Processes

Yi Chen, Jing Dong, and Zhaoran Wang. A Primal-Dual Approach to Constrained Markov Decision Processes . arXiv preprint arXiv:2101.10895, 2021

work page arXiv 2021
[18]

Distributionally Robust Optimization for Sequential Decision-Making

Zhi Chen, Pengqian Yu, and William B Haskell. Distributionally Robust Optimization for Sequential Decision-Making . Optimization, 68 0 (12): 0 2397--2426, 2019

work page 2019
[19]

Nonsmooth analysis and Control Theory , volume 178

Francis H Clarke, Yuri S Ledyaev, Ronald J Stern, and Peter R Wolenski. Nonsmooth analysis and Control Theory , volume 178. Springer Science & Business Media, 2008

work page 2008
[20]

Optimization and Nonsmooth Analysis

Frank H Clarke. Optimization and Nonsmooth Analysis . SIAM, 1990

work page 1990
[21]

Towards minimax Optimality of Model-based Robust Reinforcement Learning

Pierre Clavier, Erwan Le Pennec, and Matthieu Geist. Towards minimax Optimality of Model-based Robust Reinforcement Learning . arXiv preprint arXiv:2302.05372, 2023

work page arXiv 2023
[22]

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning . In Advances in Neural Information Processing Systems, 2017

work page 2017
[23]

On Linear Programming in a Markov Decision Problem

Eric V Denardo. On Linear Programming in a Markov Decision Problem . Management Science, 16 0 (5): 0 281--288, 1970

work page 1970
[24]

Twice Regularized MDPs and the Equivalence Between Robustness and Regularization

Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021

work page 2021
[25]

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes . In Advances in Neural Information Processing Systems, 2020

work page 2020
[26]

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, and Alejandro Ribeiro. Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs . In Advances in Neural Information Processing Systems, 2024

work page 2024
[27]

Analysis of Feedback Systems with Structured Uncertainties

John Doyle. Analysis of Feedback Systems with Structured Uncertainties . In IEE Proceedings D-Control Theory and Applications, 1982

work page 1982
[28]

Bilinear Classes: A Structural Framework for Provable Generalization in RL

Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear Classes: A Structural Framework for Provable Generalization in RL . In International Conference on Machine Learning, 2021

work page 2021
[29]

Robust Nonparametric Regression under Huber's $\epsilon$-contamination Model

Simon S Du, Yining Wang, Sivaraman Balakrishnan, Pradeep Ravikumar, and Aarti Singh. Robust Nonparametric Regression under Huber's -contamination Model . arXiv preprint arXiv:1805.10406, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[30]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-Exploitation in Constrained MDPs . arXiv preprint arXiv:2003.02189, 2020

work page arXiv 2003
[31]

Sample Complexity for Obtaining Sub-optimality and Violation Bound for Distributionally Robust Constrained MDP

Arnob Ghosh. Sample Complexity for Obtaining Sub-optimality and Violation Bound for Distributionally Robust Constrained MDP . In Reinforcement Learning Safety Workshop, 2024

work page 2024
[32]

Robust Markov Decision Processes: Beyond Rectangularity

Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023

work page 2023
[33]

Scalable First-Order Methods for Robust Mdps

Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021

work page 2021
[34]

On the Convex Formulations of Robust Markov Decision Processes

Julien Grand-Cl \'e ment and Marek Petrik. On the Convex Formulations of Robust Markov Decision Processes . Mathematics of Operations Research, 0 0 (0), 2024

work page 2024
[35]

Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs

Aria HasanzadeZonuzy, Archana Bura, Dileep Kalathil, and Srinivas Shakkottai. Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs . In AAAI Conference on Artificial Intelligence, 2021

work page 2021
[36]

On Constrained Markov Decision Processes

Moshe Haviv. On Constrained Markov Decision Processes . Operations Research Letters, 19 0 (1): 0 25--28, 1996

work page 1996
[37]

Partial Policy Iteration for L1 -Robust Markov Decision Processes

Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial Policy Iteration for L1 -Robust Markov Decision Processes . Journal of Machine Learning Research, 22 0 (275): 0 1--46, 2021

work page 2021
[38]

Robust Dynamic Programming

Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005

work page 2005
[39]

PAC Reinforcement Learning with an Imperfect Model

Nan Jiang. PAC Reinforcement Learning with an Imperfect Model . In AAAI Conference on Artificial Intelligence, 2018

work page 2018
[40]

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, and Yutaka Matsuo. A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees . arXiv preprint arXiv:2401.17780, 2024

work page arXiv 2024
[41]

On Fr\'echet Subdifferentials

A Ya Kruger. On Fr\'echet Subdifferentials . Journal of Mathematical Sciences, 116 0 (3): 0 3325--3358, 2003

work page 2003
[42]

Efficient Policy Iteration for Robust Markov Decision Processes via Regularization

Navdeep Kumar, Kfir Levy, Kaixin Wang, and Shie Mannor. Efficient Policy Iteration for Robust Markov Decision Processes via Regularization . arXiv preprint arXiv:2205.14327, 2022

work page arXiv 2022
[43]

Policy Gradient for Rectangular Robust Markov Decision Processes

Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024

work page 2024
[44]

Batch Policy Learning Under Constraints

Hoang Le, Cameron Voloshin, and Yisong Yue. Batch Policy Learning Under Constraints . In International Conference on Machine Learning, 2019

work page 2019
[45]

Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, and Guanghui Lan. Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process . Operations Research Letters, 54: 0 107107, 2024

work page 2024
[46]

First-Order Policy Optimization for Robust Markov Decision Process

Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022

work page arXiv 2022
[47]

A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes

Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes . arXiv preprint arXiv:2406.00274, 2024

work page arXiv 2024
[48]

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs . In Advances in Neural Information Processing Systems, 2021

work page 2021
[49]

Robust Entropy-Regularized Markov Decision Processes

Tien Mai and Patrick Jaillet. Robust Entropy-Regularized Markov Decision Processes . arXiv preprint arXiv:2112.15364, 2021

work page arXiv 2021
[50]

Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification

Daniel J Mankowitz, Dan A Calian, Rae Jeong, Cosmin Paduraru, Nicolas Heess, Sumanth Dathathri, Martin Riedmiller, and Timothy Mann. Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification . arXiv preprint arXiv:2010.10644, 2020

work page arXiv 2010
[51]

On the Global Convergence Rates of Softmax Policy Gradient Methods

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020

work page 2020
[52]

Methods of Nonconvex Optimization

VS Mikhalevich, AM Gupal, and VI Norkin. Methods of Nonconvex Optimization . arXiv preprint arXiv:2406.10406, 2024

work page arXiv 2024
[53]

Reinforcement Learning with Convex Constraints

Sobhan Miryoosefi, Kiant \'e Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. Reinforcement Learning with Convex Constraints . In Advances in Neural Information Processing Systems, 2019

work page 2019
[54]

Human-level Control through Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level Control through Deep Reinforcement Learning . Nature, 518 0 (7540): 0 529--533, 2015

work page 2015
[55]

Truly No-Regret Learning in Constrained MDPs

Adrian M \"u ller, Pragnya Alatur, Volkan Cevher, Giorgia Ramponi, and Niao He. Truly No-Regret Learning in Constrained MDPs . arXiv preprint arXiv:2402.15776, 2024

work page arXiv 2024
[56]

Reinforcement Learning via Fenchel-Rockafellar Duality

Ofir Nachum and Bo Dai. Reinforcement Learning via Fenchel-Rockafellar Duality . arXiv preprint arXiv:2001.01866, 2020

work page arXiv 2001
[57]

Robust Control of Markov Decision Processes with Uncertain Transition Matrices

Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005

work page 2005
[58]

Sample Complexity of Robust Reinforcement Learning with a Generative Model

Kishan Panaganti and Dileep Kalathil. Sample Complexity of Robust Reinforcement Learning with a Generative Model . In International Conference on Artificial Intelligence and Statistics, 2022

work page 2022
[59]

Proximal algorithms

Neal Parikh, Stephen Boyd, et al. Proximal algorithms . Foundations and Trends in Optimization , 1 0 (3): 0 127--239, 2014

work page 2014
[60]

Constrained Reinforcement Learning Has Zero Duality Gap

Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained Reinforcement Learning Has Zero Duality Gap . In Advances in Neural Information Processing Systems, 2019

work page 2019
[61]

Safe Policies for Reinforcement Learning via Primal-Dual Methods

Santiago Paternain, Miguel Calvo-Fullana, Luiz FO Chamon, and Alejandro Ribeiro. Safe Policies for Reinforcement Learning via Primal-Dual Methods . IEEE Transactions on Automatic Control, 68 0 (3): 0 1321--1336, 2022

work page 2022
[62]

Safe Policy Iteration

Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe Policy Iteration . In International Conference on Machine Learning, 2013

work page 2013
[63]

Distributionally robust optimization: A review

Hamed Rahimian and Sanjay Mehrotra. Distributionally Robust Optimization: A Review . arXiv preprint arXiv:1908.05659, 2019

work page arXiv 1908
[64]

Variational Analysis , volume 317

R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009

work page 2009
[65]

Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020

work page arXiv 2010
[66]

On General Minimax Theorems

Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958

work page 1958
[67]

Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning

Oswin So and Chuchu Fan. Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning . In Proceedings of Robotics: Science and Systems, 2023

work page 2023
[68]

Solving Minimum-Cost Reach Avoid using Reinforcement Learning

Oswin So, Cheng Ge, and Chuchu Fan. Solving Minimum-Cost Reach Avoid using Reinforcement Learning . In Advances in Neural Information Processing Systems, 2024

work page 2024
[69]

Constrained Reinforcement Learning Under Model Mismatch

Zhongchang Sun, Sihong He, Fei Miao, and Shaofeng Zou. Constrained Reinforcement Learning Under Model Mismatch . arXiv preprint arXiv:2405.01327, 2024

work page arXiv 2024
[70]

Reward Constrained Policy Optimization

Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward Constrained Policy Optimization . In International Conference on Learning Representations, 2018

work page 2018
[71]

Policy Gradient in Robust MDPs with Global Convergence Guarantee

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023

work page 2023
[72]

Online Robust Reinforcement Learning with Model Uncertainty

Yue Wang and Shaofeng Zou. Online Robust Reinforcement Learning with Model Uncertainty . In Advances in Neural Information Processing Systems, 2021

work page 2021
[73]

Policy Gradient Method for Robust Reinforcement Learning

Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022

work page 2022
[74]

Robust Constrained Reinforcement Learning

Yue Wang, Fei Miao, and Shaofeng Zou. Robust Constrained Reinforcement Learning . arXiv preprint arXiv:2209.06866, 2022

work page arXiv 2022
[75]

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

Honghao Wei, Xin Liu, and Lei Ying. A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes . arXiv preprint arXiv:2106.01577, 2021

work page arXiv 2021
[76]

Robust Markov Decision Processes

Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013

work page 2013
[77]

Toward Theoretical Understandings of Robust Markov Decision Processes: Sample Complexity and Asymptotics

Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Toward Theoretical Understandings of Robust Markov Decision Processes: Sample Complexity and Asymptotics . The Annals of Statistics, 50 0 (6): 0 3223--3248, 2022

work page 2022
[78]

Robust Markov Decision Processes without Model Estimation

Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023

work page arXiv 2023
[79]

A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization

Donghao Ying, Yuhao Ding, and Javad Lavaei. A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization . In International Conference on Artificial Intelligence and Statistics, 2022

work page 2022
[80]

Reward is Enough for Convex MDPs

Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021

work page 2021

Showing first 80 references.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

On Frank-Wolfe and Equilibrium Computation

Jacob D Abernethy and Jun-Kun Wang. On Frank-Wolfe and Equilibrium Computation . In Advances in Neural Information Processing Systems, 2017

work page 2017

[3] [3]

Constrained Policy Optimization

Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained Policy Optimization . In International Conference on Machine Learning, 2017

work page 2017

[4] [4]

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021

work page 2021

[5] [5]

Constrained Markov Decision Processes , volume 7

Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999

work page 1999

[6] [6]

System Level Synthesis

James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System Level Synthesis . Annual Reviews in Control, 47: 0 364--393, 2019

work page 2019

[7] [7]

On the Generation of Markov Decision Processes

TW Archibald, KIM McKinnon, and LC Thomas. On the Generation of Markov Decision Processes . Journal of the Operational Research Society, 46 0 (3): 0 354--361, 1995

work page 1995

[8] [8]

First-Order Methods in Optimization

Amir Beck. First-Order Methods in Optimization . SIAM, 2017

work page 2017

[9] [9]

Bellman, R.E

R. Bellman, R.E. Bellman, and Rand Corporation. Dynamic Programming. Princeton University Press, 1957

work page 1957

[10] [10]

Robust Model Predictive Control: A Survey

Alberto Bemporad and Manfred Morari. Robust Model Predictive Control: A Survey . In Robustness in Identification and Control, pp.\ 207--226. Springer, 2007

work page 2007

[11] [11]

Robust Optimization--a Comprehensive Survey

Hans-Georg Beyer and Bernhard Sendhoff. Robust Optimization--a Comprehensive Survey . Computer Methods in Applied Mechanics and Engineering, 196 0 (33-34): 0 3190--3218, 2007

work page 2007

[12] [12]

Convex Optimization

Stephen P Boyd and Lieven Vandenberghe. Convex Optimization . Cambridge University Press, 2004

work page 2004

[13] [13]

DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning

Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning . In Advances in Neural Information Processing Systems, 2022

work page 2022

[14] [14]

Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality

Hyeong Soo Chang. Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality . arXiv preprint arXiv:2308.03297, 2023

work page arXiv 2023

[15] [15]

Dynamic Programming Equations for Discounted Constrained Stochastic Control

Richard C Chen and Gilmer L Blankenship. Dynamic Programming Equations for Discounted Constrained Stochastic Control . IEEE Transactions on Automatic Control, 49 0 (5): 0 699--709, 2004

work page 2004

[16] [16]

Non-Randomized Control of Constrained Markov Decision Processes

Richard C Chen and Eugene A Feinberg. Non-Randomized Control of Constrained Markov Decision Processes . In American Control Conference, 2006

work page 2006

[17] [17]

A Primal-Dual Approach to Constrained Markov Decision Processes

Yi Chen, Jing Dong, and Zhaoran Wang. A Primal-Dual Approach to Constrained Markov Decision Processes . arXiv preprint arXiv:2101.10895, 2021

work page arXiv 2021

[18] [18]

Distributionally Robust Optimization for Sequential Decision-Making

Zhi Chen, Pengqian Yu, and William B Haskell. Distributionally Robust Optimization for Sequential Decision-Making . Optimization, 68 0 (12): 0 2397--2426, 2019

work page 2019

[19] [19]

Nonsmooth analysis and Control Theory , volume 178

Francis H Clarke, Yuri S Ledyaev, Ronald J Stern, and Peter R Wolenski. Nonsmooth analysis and Control Theory , volume 178. Springer Science & Business Media, 2008

work page 2008

[20] [20]

Optimization and Nonsmooth Analysis

Frank H Clarke. Optimization and Nonsmooth Analysis . SIAM, 1990

work page 1990

[21] [21]

Towards minimax Optimality of Model-based Robust Reinforcement Learning

Pierre Clavier, Erwan Le Pennec, and Matthieu Geist. Towards minimax Optimality of Model-based Robust Reinforcement Learning . arXiv preprint arXiv:2302.05372, 2023

work page arXiv 2023

[22] [22]

Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning . In Advances in Neural Information Processing Systems, 2017

work page 2017

[23] [23]

On Linear Programming in a Markov Decision Problem

Eric V Denardo. On Linear Programming in a Markov Decision Problem . Management Science, 16 0 (5): 0 281--288, 1970

work page 1970

[24] [24]

Twice Regularized MDPs and the Equivalence Between Robustness and Regularization

Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021

work page 2021

[25] [25]

Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes . In Advances in Neural Information Processing Systems, 2020

work page 2020

[26] [26]

Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, and Alejandro Ribeiro. Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs . In Advances in Neural Information Processing Systems, 2024

work page 2024

[27] [27]

Analysis of Feedback Systems with Structured Uncertainties

John Doyle. Analysis of Feedback Systems with Structured Uncertainties . In IEE Proceedings D-Control Theory and Applications, 1982

work page 1982

[28] [28]

Bilinear Classes: A Structural Framework for Provable Generalization in RL

Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear Classes: A Structural Framework for Provable Generalization in RL . In International Conference on Machine Learning, 2021

work page 2021

[29] [29]

Robust Nonparametric Regression under Huber's $\epsilon$-contamination Model

Simon S Du, Yining Wang, Sivaraman Balakrishnan, Pradeep Ravikumar, and Aarti Singh. Robust Nonparametric Regression under Huber's -contamination Model . arXiv preprint arXiv:1805.10406, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[30] [30]

Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-Exploitation in Constrained MDPs . arXiv preprint arXiv:2003.02189, 2020

work page arXiv 2003

[31] [31]

Sample Complexity for Obtaining Sub-optimality and Violation Bound for Distributionally Robust Constrained MDP

Arnob Ghosh. Sample Complexity for Obtaining Sub-optimality and Violation Bound for Distributionally Robust Constrained MDP . In Reinforcement Learning Safety Workshop, 2024

work page 2024

[32] [32]

Robust Markov Decision Processes: Beyond Rectangularity

Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023

work page 2023

[33] [33]

Scalable First-Order Methods for Robust Mdps

Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021

work page 2021

[34] [34]

On the Convex Formulations of Robust Markov Decision Processes

Julien Grand-Cl \'e ment and Marek Petrik. On the Convex Formulations of Robust Markov Decision Processes . Mathematics of Operations Research, 0 0 (0), 2024

work page 2024

[35] [35]

Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs

Aria HasanzadeZonuzy, Archana Bura, Dileep Kalathil, and Srinivas Shakkottai. Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs . In AAAI Conference on Artificial Intelligence, 2021

work page 2021

[36] [36]

On Constrained Markov Decision Processes

Moshe Haviv. On Constrained Markov Decision Processes . Operations Research Letters, 19 0 (1): 0 25--28, 1996

work page 1996

[37] [37]

Partial Policy Iteration for L1 -Robust Markov Decision Processes

Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial Policy Iteration for L1 -Robust Markov Decision Processes . Journal of Machine Learning Research, 22 0 (275): 0 1--46, 2021

work page 2021

[38] [38]

Robust Dynamic Programming

Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005

work page 2005

[39] [39]

PAC Reinforcement Learning with an Imperfect Model

Nan Jiang. PAC Reinforcement Learning with an Imperfect Model . In AAAI Conference on Artificial Intelligence, 2018

work page 2018

[40] [40]

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, and Yutaka Matsuo. A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees . arXiv preprint arXiv:2401.17780, 2024

work page arXiv 2024

[41] [41]

On Fr\'echet Subdifferentials

A Ya Kruger. On Fr\'echet Subdifferentials . Journal of Mathematical Sciences, 116 0 (3): 0 3325--3358, 2003

work page 2003

[42] [42]

Efficient Policy Iteration for Robust Markov Decision Processes via Regularization

Navdeep Kumar, Kfir Levy, Kaixin Wang, and Shie Mannor. Efficient Policy Iteration for Robust Markov Decision Processes via Regularization . arXiv preprint arXiv:2205.14327, 2022

work page arXiv 2022

[43] [43]

Policy Gradient for Rectangular Robust Markov Decision Processes

Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024

work page 2024

[44] [44]

Batch Policy Learning Under Constraints

Hoang Le, Cameron Voloshin, and Yisong Yue. Batch Policy Learning Under Constraints . In International Conference on Machine Learning, 2019

work page 2019

[45] [45]

Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, and Guanghui Lan. Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process . Operations Research Letters, 54: 0 107107, 2024

work page 2024

[46] [46]

First-Order Policy Optimization for Robust Markov Decision Process

Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022

work page arXiv 2022

[47] [47]

A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes

Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes . arXiv preprint arXiv:2406.00274, 2024

work page arXiv 2024

[48] [48]

Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs . In Advances in Neural Information Processing Systems, 2021

work page 2021

[49] [49]

Robust Entropy-Regularized Markov Decision Processes

Tien Mai and Patrick Jaillet. Robust Entropy-Regularized Markov Decision Processes . arXiv preprint arXiv:2112.15364, 2021

work page arXiv 2021

[50] [50]

Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification

Daniel J Mankowitz, Dan A Calian, Rae Jeong, Cosmin Paduraru, Nicolas Heess, Sumanth Dathathri, Martin Riedmiller, and Timothy Mann. Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification . arXiv preprint arXiv:2010.10644, 2020

work page arXiv 2010

[51] [51]

On the Global Convergence Rates of Softmax Policy Gradient Methods

Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020

work page 2020

[52] [52]

Methods of Nonconvex Optimization

VS Mikhalevich, AM Gupal, and VI Norkin. Methods of Nonconvex Optimization . arXiv preprint arXiv:2406.10406, 2024

work page arXiv 2024

[53] [53]

Reinforcement Learning with Convex Constraints

Sobhan Miryoosefi, Kiant \'e Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. Reinforcement Learning with Convex Constraints . In Advances in Neural Information Processing Systems, 2019

work page 2019

[54] [54]

Human-level Control through Deep Reinforcement Learning

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level Control through Deep Reinforcement Learning . Nature, 518 0 (7540): 0 529--533, 2015

work page 2015

[55] [55]

Truly No-Regret Learning in Constrained MDPs

Adrian M \"u ller, Pragnya Alatur, Volkan Cevher, Giorgia Ramponi, and Niao He. Truly No-Regret Learning in Constrained MDPs . arXiv preprint arXiv:2402.15776, 2024

work page arXiv 2024

[56] [56]

Reinforcement Learning via Fenchel-Rockafellar Duality

Ofir Nachum and Bo Dai. Reinforcement Learning via Fenchel-Rockafellar Duality . arXiv preprint arXiv:2001.01866, 2020

work page arXiv 2001

[57] [57]

Robust Control of Markov Decision Processes with Uncertain Transition Matrices

Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005

work page 2005

[58] [58]

Sample Complexity of Robust Reinforcement Learning with a Generative Model

Kishan Panaganti and Dileep Kalathil. Sample Complexity of Robust Reinforcement Learning with a Generative Model . In International Conference on Artificial Intelligence and Statistics, 2022

work page 2022

[59] [59]

Proximal algorithms

Neal Parikh, Stephen Boyd, et al. Proximal algorithms . Foundations and Trends in Optimization , 1 0 (3): 0 127--239, 2014

work page 2014

[60] [60]

Constrained Reinforcement Learning Has Zero Duality Gap

Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained Reinforcement Learning Has Zero Duality Gap . In Advances in Neural Information Processing Systems, 2019

work page 2019

[61] [61]

Safe Policies for Reinforcement Learning via Primal-Dual Methods

Santiago Paternain, Miguel Calvo-Fullana, Luiz FO Chamon, and Alejandro Ribeiro. Safe Policies for Reinforcement Learning via Primal-Dual Methods . IEEE Transactions on Automatic Control, 68 0 (3): 0 1321--1336, 2022

work page 2022

[62] [62]

Safe Policy Iteration

Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe Policy Iteration . In International Conference on Machine Learning, 2013

work page 2013

[63] [63]

Distributionally robust optimization: A review

Hamed Rahimian and Sanjay Mehrotra. Distributionally Robust Optimization: A Review . arXiv preprint arXiv:1908.05659, 2019

work page arXiv 1908

[64] [64]

Variational Analysis , volume 317

R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009

work page 2009

[65] [65]

Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020

work page arXiv 2010

[66] [66]

On General Minimax Theorems

Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958

work page 1958

[67] [67]

Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning

Oswin So and Chuchu Fan. Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning . In Proceedings of Robotics: Science and Systems, 2023

work page 2023

[68] [68]

Solving Minimum-Cost Reach Avoid using Reinforcement Learning

Oswin So, Cheng Ge, and Chuchu Fan. Solving Minimum-Cost Reach Avoid using Reinforcement Learning . In Advances in Neural Information Processing Systems, 2024

work page 2024

[69] [69]

Constrained Reinforcement Learning Under Model Mismatch

Zhongchang Sun, Sihong He, Fei Miao, and Shaofeng Zou. Constrained Reinforcement Learning Under Model Mismatch . arXiv preprint arXiv:2405.01327, 2024

work page arXiv 2024

[70] [70]

Reward Constrained Policy Optimization

Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward Constrained Policy Optimization . In International Conference on Learning Representations, 2018

work page 2018

[71] [71]

Policy Gradient in Robust MDPs with Global Convergence Guarantee

Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023

work page 2023

[72] [72]

Online Robust Reinforcement Learning with Model Uncertainty

Yue Wang and Shaofeng Zou. Online Robust Reinforcement Learning with Model Uncertainty . In Advances in Neural Information Processing Systems, 2021

work page 2021

[73] [73]

Policy Gradient Method for Robust Reinforcement Learning

Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022

work page 2022

[74] [74]

Robust Constrained Reinforcement Learning

Yue Wang, Fei Miao, and Shaofeng Zou. Robust Constrained Reinforcement Learning . arXiv preprint arXiv:2209.06866, 2022

work page arXiv 2022

[75] [75]

A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

Honghao Wei, Xin Liu, and Lei Ying. A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes . arXiv preprint arXiv:2106.01577, 2021

work page arXiv 2021

[76] [76]

Robust Markov Decision Processes

Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013

work page 2013

[77] [77]

Toward Theoretical Understandings of Robust Markov Decision Processes: Sample Complexity and Asymptotics

Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Toward Theoretical Understandings of Robust Markov Decision Processes: Sample Complexity and Asymptotics . The Annals of Statistics, 50 0 (6): 0 3223--3248, 2022

work page 2022

[78] [78]

Robust Markov Decision Processes without Model Estimation

Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023

work page arXiv 2023

[79] [79]

A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization

Donghao Ying, Yuhao Ding, and Javad Lavaei. A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization . In International Conference on Artificial Intelligence and Statistics, 2022

work page 2022

[80] [80]

Reward is Enough for Convex MDPs

Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021

work page 2021