pith. sign in

arxiv: 2408.16286 · v5 · submitted 2024-08-29 · 💻 cs.LG · math.OC

Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

Pith reviewed 2026-05-23 21:56 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords robust constrained Markov decision processepigraph formpolicy gradientbisection searchnear-optimal policysample complexityLagrangian formulationworst-case constraints
0
0 comments X

The pith

The epigraph form of robust constrained MDPs lets a bisection search with policy gradient identify an ε-optimal policy using Õ(ε^{-4}) robust policy evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the usual Lagrangian formulation of robust constrained MDPs can trap a policy gradient method in suboptimal points because the inner minimization sums conflicting gradients from the objective and the constraints. Reformulating the problem in epigraph form removes the conflict by letting the algorithm pick a single gradient at each step from either the objective or a constraint. A bisection search then wraps a policy gradient subroutine to locate a policy that is near-optimal in the worst case over the uncertainty set. A reader would care because the result supplies the first algorithm with a concrete guarantee for learning safe policies when the environment belongs to an unknown set of possibilities.

Core claim

The paper claims that the epigraph form of the RCMDP converts the original constrained min-max problem into a form where a bisection search algorithm equipped with a policy gradient subroutine identifies an ε-optimal policy with Õ(ε^{-4}) robust policy evaluations, while the conventional Lagrangian max-min formulation can fail because its inner minimization encounters a sum of conflicting gradients.

What carries the argument

The epigraph form of the RCMDP problem, which introduces an auxiliary variable so that a single gradient is chosen from either the objective or one of the constraints at each step.

If this is right

  • The algorithm succeeds on instances where the Lagrangian method is trapped by summed conflicting gradients.
  • Near-optimal robust policies are obtained with a polynomial number of robust policy evaluations that scales as ε to the minus four.
  • The method guarantees both near-optimality of the objective and satisfaction of the constraints in every environment inside the uncertainty set.
  • Policy gradient updates can be performed safely without needing to solve a joint min-max at every iteration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same epigraph reformulation might be tried on other constrained robust problems that currently rely on Lagrangian relaxations.
  • Replacing the inner policy gradient subroutine with a more sample-efficient estimator could reduce the overall evaluation count further.
  • The approach supplies a template for turning other min-max constrained problems into searchable forms that avoid gradient cancellation.

Load-bearing premise

Reformulating the constrained problem in epigraph form still lets the algorithm recover an optimal policy for the original RCMDP.

What would settle it

An explicit RCMDP instance in which the bisection search returns a policy whose worst-case cost exceeds the claimed ε-optimal value after the stated number of robust policy evaluations.

Figures

Figures reproduced from arXiv: 2408.16286 by Kazumi Kasaura, Kenta Hoshino, Masashi Hamaya, Paavo Parmas, Tadashi Kozuno, Toshinori Kitamura, Wataru Kumagai, Yohei Hosoe, Yutaka Matsuo.

Figure 1
Figure 1. Figure 1: (a): An RCMDP example illustrating the gradient conflict challenge. Action labels are omitted when transitions are action-independent. (b): Policy gradients in the example with (γ, δ, b1) = (0.4, 0.09, 0). Arrows represent the gradient to decrease L1(π). π2 attracts policy gradients but is a local minimum since L1(π2) > L1(π1), where π1(·, a1) = 1 and π2(·, a2) = 1. c ′ := c0 + λ(c1 − b1/H1). Since Jc ′ ,P… view at source ↗
Figure 2
Figure 2. Figure 2: Algorithmic idea to find b0 = J ⋆ in Example 1 with (γ, δ, b1) = (0.1, 0, 2/3). This section describes the outer loop of EpiRC-PGS to identify b0 = J ⋆ . The outer loop utilizes the following monotonicity of ∆⋆ b0 for the identification. The proof is deferred to Appendix H.1: Lemma 2. ∆⋆ b0 is monotonically decreasing in b0 and ∆⋆ J⋆ = 0. Thanks to this monotonicity of ∆⋆ b0 , if ∆⋆ b0 can be efficiently c… view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of the algorithms in different settings [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

Designing a safe policy for uncertain environments is crucial in real-world control systems. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional policy gradient approach to the Lagrangian max-min formulation can become trapped in suboptimal solutions. This occurs when its inner minimization encounters a sum of conflicting gradients from the objective and constraint functions. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a bisection search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\tilde{\mathcal{O}}(\varepsilon^{-4})$ robust policy evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce the first algorithm for identifying an ε-optimal policy in robust constrained MDPs (RCMDPs), where the policy minimizes cost subject to worst-case constraints over an uncertainty set. It proves that standard policy gradient on the Lagrangian formulation can trap in suboptimal points due to conflicting gradients from objective and constraints. To resolve this, the authors reformulate the RCMDP in epigraph form, which allows selecting a single gradient (either objective or constraint) at each step. They then propose a bisection search algorithm over the epigraph variable combined with a policy gradient subroutine, and prove that this identifies an ε-optimal policy using Õ(ε^{-4}) robust policy evaluations.

Significance. If the central equivalence between the epigraph reformulation and the original RCMDP optimum holds under robust evaluation, the result would be significant: it supplies the first polynomial-sample guarantee for near-optimal safe policy identification in RCMDPs, a setting that directly models real-world control under model uncertainty. The explicit proof of the Lagrangian trapping phenomenon and the reduction to a single-gradient subroutine are concrete technical contributions that could influence subsequent work on constrained robust RL.

major comments (2)
  1. [Abstract / epigraph reformulation section] Abstract (paragraph on epigraph form) and the claimed equivalence proof: the assertion that selecting a single gradient from either the objective or the constraints inside the bisection search recovers an ε-optimal policy for the original min-cost + worst-case constraint problem is load-bearing for both the algorithm and the Õ(ε^{-4}) bound. The inner minimization is over an uncertainty set; it is not immediate that a per-step single-gradient choice preserves joint feasibility across all environments in the set, even if the epigraph value matches at optimality. An explicit statement of the equivalence (e.g., showing that any ε-optimal epigraph solution maps back to an ε-optimal robust policy) is required.
  2. [Algorithm and complexity analysis] The sample-complexity claim of Õ(ε^{-4}) robust policy evaluations rests on the bisection + policy-gradient subroutine being correct for the epigraph problem. Without a concrete reduction showing that the robust policy evaluation oracle used inside the subroutine matches the original RCMDP definition (including how the worst-case is taken inside the gradient step), the bound cannot be verified as applying to the original problem.
minor comments (1)
  1. [Abstract] Notation for the uncertainty set and the robust policy evaluation operator should be introduced once and used consistently; the abstract refers to both “robust policy evaluations” and “worst-case scenario across a set of environments” without a single symbol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback. We address each major comment below and will incorporate clarifications in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract / epigraph reformulation section] Abstract (paragraph on epigraph form) and the claimed equivalence proof: the assertion that selecting a single gradient from either the objective or the constraints inside the bisection search recovers an ε-optimal policy for the original min-cost + worst-case constraint problem is load-bearing for both the algorithm and the Õ(ε^{-4}) bound. The inner minimization is over an uncertainty set; it is not immediate that a per-step single-gradient choice preserves joint feasibility across all environments in the set, even if the epigraph value matches at optimality. An explicit statement of the equivalence (e.g., showing that any ε-optimal epigraph solution maps back to an ε-optimal robust policy) is required.

    Authors: We agree that an explicit mapping statement strengthens the presentation. Theorem 3.1 proves equivalence at optimality between the epigraph form and the original RCMDP. The proof of the main result (Theorem 4.3) further shows that any ε-optimal epigraph solution maps to an ε-optimal robust policy: the epigraph variable enforces a uniform bound on the worst-case cost, and the single-gradient selection (objective or constraint) preserves joint feasibility over the uncertainty set because the worst-case operator is applied identically in both formulations. We will add a dedicated corollary after Theorem 3.1 stating this approximate equivalence explicitly and revise the abstract paragraph for clarity. revision: yes

  2. Referee: [Algorithm and complexity analysis] The sample-complexity claim of Õ(ε^{-4}) robust policy evaluations rests on the bisection + policy-gradient subroutine being correct for the epigraph problem. Without a concrete reduction showing that the robust policy evaluation oracle used inside the subroutine matches the original RCMDP definition (including how the worst-case is taken inside the gradient step), the bound cannot be verified as applying to the original problem.

    Authors: The robust policy evaluation oracle is defined identically in the epigraph problem (Definition 3.2) and the original RCMDP: each call computes the worst-case value over the uncertainty set for the active function (cost or constraint). This is used directly in the policy-gradient subroutine (Algorithm 2) and the bisection analysis (Section 4), so the Õ(ε^{-4}) bound applies to the original problem. To make the reduction fully transparent, we will insert a remark in Section 4.2 explicitly tracing how the worst-case operator enters each gradient step and confirming that every oracle call matches the RCMDP definition. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from epigraph reformulation and standard primitives.

full rationale

The paper proposes a bisection search algorithm using a policy gradient subroutine on the epigraph form of the RCMDP, claiming an Õ(ε^{-4}) bound on robust policy evaluations after proving Lagrangian issues with conflicting gradients. No quoted step reduces the final bound or optimality guarantee to a fitted input renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain; the epigraph choice is presented as a reformulation enabling single-gradient selection, with the complexity bound derived from the algorithm's analysis rather than tautological equivalence to inputs. This is the common honest outcome for theoretical algorithm papers whose central claims rest on explicit proofs rather than data fits or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the mathematical validity of the epigraph reformulation for RCMDPs and on standard convergence properties of policy gradient methods inside the bisection loop; no free parameters, invented entities, or ad-hoc axioms are mentioned.

axioms (2)
  • domain assumption Policy gradient methods converge to a stationary point of the inner minimization when a single gradient is supplied at each step.
    Invoked when the epigraph form is said to resolve conflicting gradients (abstract).
  • domain assumption The uncertainty set in the RCMDP admits a well-defined worst-case evaluation that can be approximated by robust policy evaluations.
    Required for the sample complexity statement to be meaningful.

pith-pipeline@v0.9.0 · 5751 in / 1416 out tokens · 52256 ms · 2026-05-23T21:56:36.453365+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

    cs.LG 2026-04 unverdicted novelty 8.0

    RHC-UCRL is the first algorithm for safety-constrained RL under explicit adversarial dynamics, providing sub-linear regret and constraint violation guarantees by maintaining optimism over both agent and adversary policies.

Reference graph

Works this paper leans on

86 extracted references · 86 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    On Frank-Wolfe and Equilibrium Computation

    Jacob D Abernethy and Jun-Kun Wang. On Frank-Wolfe and Equilibrium Computation . In Advances in Neural Information Processing Systems, 2017

  3. [3]

    Constrained Policy Optimization

    Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained Policy Optimization . In International Conference on Machine Learning, 2017

  4. [4]

    On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

    Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift . Journal of Machine Learning Research, 22 0 (98): 0 1--76, 2021

  5. [5]

    Constrained Markov Decision Processes , volume 7

    Eitan Altman. Constrained Markov Decision Processes , volume 7. CRC Press, 1999

  6. [6]

    System Level Synthesis

    James Anderson, John C Doyle, Steven H Low, and Nikolai Matni. System Level Synthesis . Annual Reviews in Control, 47: 0 364--393, 2019

  7. [7]

    On the Generation of Markov Decision Processes

    TW Archibald, KIM McKinnon, and LC Thomas. On the Generation of Markov Decision Processes . Journal of the Operational Research Society, 46 0 (3): 0 354--361, 1995

  8. [8]

    First-Order Methods in Optimization

    Amir Beck. First-Order Methods in Optimization . SIAM, 2017

  9. [9]

    Bellman, R.E

    R. Bellman, R.E. Bellman, and Rand Corporation. Dynamic Programming. Princeton University Press, 1957

  10. [10]

    Robust Model Predictive Control: A Survey

    Alberto Bemporad and Manfred Morari. Robust Model Predictive Control: A Survey . In Robustness in Identification and Control, pp.\ 207--226. Springer, 2007

  11. [11]

    Robust Optimization--a Comprehensive Survey

    Hans-Georg Beyer and Bernhard Sendhoff. Robust Optimization--a Comprehensive Survey . Computer Methods in Applied Mechanics and Engineering, 196 0 (33-34): 0 3190--3218, 2007

  12. [12]

    Convex Optimization

    Stephen P Boyd and Lieven Vandenberghe. Convex Optimization . Cambridge University Press, 2004

  13. [13]

    DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning

    Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning . In Advances in Neural Information Processing Systems, 2022

  14. [14]

    Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality

    Hyeong Soo Chang. Approximate Constrained Discounted Dynamic Programming with Uniform Feasibility and Optimality . arXiv preprint arXiv:2308.03297, 2023

  15. [15]

    Dynamic Programming Equations for Discounted Constrained Stochastic Control

    Richard C Chen and Gilmer L Blankenship. Dynamic Programming Equations for Discounted Constrained Stochastic Control . IEEE Transactions on Automatic Control, 49 0 (5): 0 699--709, 2004

  16. [16]

    Non-Randomized Control of Constrained Markov Decision Processes

    Richard C Chen and Eugene A Feinberg. Non-Randomized Control of Constrained Markov Decision Processes . In American Control Conference, 2006

  17. [17]

    A Primal-Dual Approach to Constrained Markov Decision Processes

    Yi Chen, Jing Dong, and Zhaoran Wang. A Primal-Dual Approach to Constrained Markov Decision Processes . arXiv preprint arXiv:2101.10895, 2021

  18. [18]

    Distributionally Robust Optimization for Sequential Decision-Making

    Zhi Chen, Pengqian Yu, and William B Haskell. Distributionally Robust Optimization for Sequential Decision-Making . Optimization, 68 0 (12): 0 2397--2426, 2019

  19. [19]

    Nonsmooth analysis and Control Theory , volume 178

    Francis H Clarke, Yuri S Ledyaev, Ronald J Stern, and Peter R Wolenski. Nonsmooth analysis and Control Theory , volume 178. Springer Science & Business Media, 2008

  20. [20]

    Optimization and Nonsmooth Analysis

    Frank H Clarke. Optimization and Nonsmooth Analysis . SIAM, 1990

  21. [21]

    Towards minimax Optimality of Model-based Robust Reinforcement Learning

    Pierre Clavier, Erwan Le Pennec, and Matthieu Geist. Towards minimax Optimality of Model-based Robust Reinforcement Learning . arXiv preprint arXiv:2302.05372, 2023

  22. [22]

    Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning

    Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning . In Advances in Neural Information Processing Systems, 2017

  23. [23]

    On Linear Programming in a Markov Decision Problem

    Eric V Denardo. On Linear Programming in a Markov Decision Problem . Management Science, 16 0 (5): 0 281--288, 1970

  24. [24]

    Twice Regularized MDPs and the Equivalence Between Robustness and Regularization

    Esther Derman, Matthieu Geist, and Shie Mannor. Twice Regularized MDPs and the Equivalence Between Robustness and Regularization . In Advances in Neural Information Processing Systems, 2021

  25. [25]

    Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes

    Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes . In Advances in Neural Information Processing Systems, 2020

  26. [26]

    Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

    Dongsheng Ding, Chen-Yu Wei, Kaiqing Zhang, and Alejandro Ribeiro. Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs . In Advances in Neural Information Processing Systems, 2024

  27. [27]

    Analysis of Feedback Systems with Structured Uncertainties

    John Doyle. Analysis of Feedback Systems with Structured Uncertainties . In IEE Proceedings D-Control Theory and Applications, 1982

  28. [28]

    Bilinear Classes: A Structural Framework for Provable Generalization in RL

    Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear Classes: A Structural Framework for Provable Generalization in RL . In International Conference on Machine Learning, 2021

  29. [29]

    Robust Nonparametric Regression under Huber's $\epsilon$-contamination Model

    Simon S Du, Yining Wang, Sivaraman Balakrishnan, Pradeep Ravikumar, and Aarti Singh. Robust Nonparametric Regression under Huber's -contamination Model . arXiv preprint arXiv:1805.10406, 2018

  30. [30]

    Exploration-exploitation in constrained mdps.arXiv preprint arXiv:2003.02189, 2020

    Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-Exploitation in Constrained MDPs . arXiv preprint arXiv:2003.02189, 2020

  31. [31]

    Sample Complexity for Obtaining Sub-optimality and Violation Bound for Distributionally Robust Constrained MDP

    Arnob Ghosh. Sample Complexity for Obtaining Sub-optimality and Violation Bound for Distributionally Robust Constrained MDP . In Reinforcement Learning Safety Workshop, 2024

  32. [32]

    Robust Markov Decision Processes: Beyond Rectangularity

    Vineet Goyal and Julien Grand-Clement. Robust Markov Decision Processes: Beyond Rectangularity . Mathematics of Operations Research, 48 0 (1): 0 203--226, 2023

  33. [33]

    Scalable First-Order Methods for Robust Mdps

    Julien Grand-Cl \'e ment and Christian Kroer. Scalable First-Order Methods for Robust Mdps . In AAAI Conference on Artificial Intelligence, 2021

  34. [34]

    On the Convex Formulations of Robust Markov Decision Processes

    Julien Grand-Cl \'e ment and Marek Petrik. On the Convex Formulations of Robust Markov Decision Processes . Mathematics of Operations Research, 0 0 (0), 2024

  35. [35]

    Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs

    Aria HasanzadeZonuzy, Archana Bura, Dileep Kalathil, and Srinivas Shakkottai. Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs . In AAAI Conference on Artificial Intelligence, 2021

  36. [36]

    On Constrained Markov Decision Processes

    Moshe Haviv. On Constrained Markov Decision Processes . Operations Research Letters, 19 0 (1): 0 25--28, 1996

  37. [37]

    Partial Policy Iteration for L1 -Robust Markov Decision Processes

    Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial Policy Iteration for L1 -Robust Markov Decision Processes . Journal of Machine Learning Research, 22 0 (275): 0 1--46, 2021

  38. [38]

    Robust Dynamic Programming

    Garud N Iyengar. Robust Dynamic Programming . Mathematics of Operations Research, 30 0 (2): 0 257--280, 2005

  39. [39]

    PAC Reinforcement Learning with an Imperfect Model

    Nan Jiang. PAC Reinforcement Learning with an Imperfect Model . In AAAI Conference on Artificial Intelligence, 2018

  40. [40]

    A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

    Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, and Yutaka Matsuo. A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees . arXiv preprint arXiv:2401.17780, 2024

  41. [41]

    On Fr\'echet Subdifferentials

    A Ya Kruger. On Fr\'echet Subdifferentials . Journal of Mathematical Sciences, 116 0 (3): 0 3325--3358, 2003

  42. [42]

    Efficient Policy Iteration for Robust Markov Decision Processes via Regularization

    Navdeep Kumar, Kfir Levy, Kaixin Wang, and Shie Mannor. Efficient Policy Iteration for Robust Markov Decision Processes via Regularization . arXiv preprint arXiv:2205.14327, 2022

  43. [43]

    Policy Gradient for Rectangular Robust Markov Decision Processes

    Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Y Levy, and Shie Mannor. Policy Gradient for Rectangular Robust Markov Decision Processes . In Advances in Neural Information Processing Systems, 2024

  44. [44]

    Batch Policy Learning Under Constraints

    Hoang Le, Cameron Voloshin, and Yisong Yue. Batch Policy Learning Under Constraints . In International Conference on Machine Learning, 2019

  45. [45]

    Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process

    Tianjiao Li, Ziwei Guan, Shaofeng Zou, Tengyu Xu, Yingbin Liang, and Guanghui Lan. Faster Algorithm and Sharper Analysis for Constrained Markov Decision Process . Operations Research Letters, 54: 0 107107, 2024

  46. [46]

    First-Order Policy Optimization for Robust Markov Decision Process

    Yan Li, Guanghui Lan, and Tuo Zhao. First-Order Policy Optimization for Robust Markov Decision Process . arXiv preprint arXiv:2209.10579, 2022

  47. [47]

    A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes

    Zhenwei Lin, Chenyu Xue, Qi Deng, and Yinyu Ye. A Single-Loop Robust Policy Gradient Method for Robust Markov Decision Processes . arXiv preprint arXiv:2406.00274, 2024

  48. [48]

    Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs

    Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs . In Advances in Neural Information Processing Systems, 2021

  49. [49]

    Robust Entropy-Regularized Markov Decision Processes

    Tien Mai and Patrick Jaillet. Robust Entropy-Regularized Markov Decision Processes . arXiv preprint arXiv:2112.15364, 2021

  50. [50]

    Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification

    Daniel J Mankowitz, Dan A Calian, Rae Jeong, Cosmin Paduraru, Nicolas Heess, Sumanth Dathathri, Martin Riedmiller, and Timothy Mann. Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification . arXiv preprint arXiv:2010.10644, 2020

  51. [51]

    On the Global Convergence Rates of Softmax Policy Gradient Methods

    Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the Global Convergence Rates of Softmax Policy Gradient Methods . In International Conference on Machine Learning, 2020

  52. [52]

    Methods of Nonconvex Optimization

    VS Mikhalevich, AM Gupal, and VI Norkin. Methods of Nonconvex Optimization . arXiv preprint arXiv:2406.10406, 2024

  53. [53]

    Reinforcement Learning with Convex Constraints

    Sobhan Miryoosefi, Kiant \'e Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. Reinforcement Learning with Convex Constraints . In Advances in Neural Information Processing Systems, 2019

  54. [54]

    Human-level Control through Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level Control through Deep Reinforcement Learning . Nature, 518 0 (7540): 0 529--533, 2015

  55. [55]

    Truly No-Regret Learning in Constrained MDPs

    Adrian M \"u ller, Pragnya Alatur, Volkan Cevher, Giorgia Ramponi, and Niao He. Truly No-Regret Learning in Constrained MDPs . arXiv preprint arXiv:2402.15776, 2024

  56. [56]

    Reinforcement Learning via Fenchel-Rockafellar Duality

    Ofir Nachum and Bo Dai. Reinforcement Learning via Fenchel-Rockafellar Duality . arXiv preprint arXiv:2001.01866, 2020

  57. [57]

    Robust Control of Markov Decision Processes with Uncertain Transition Matrices

    Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices . Operations Research, 53 0 (5): 0 780--798, 2005

  58. [58]

    Sample Complexity of Robust Reinforcement Learning with a Generative Model

    Kishan Panaganti and Dileep Kalathil. Sample Complexity of Robust Reinforcement Learning with a Generative Model . In International Conference on Artificial Intelligence and Statistics, 2022

  59. [59]

    Proximal algorithms

    Neal Parikh, Stephen Boyd, et al. Proximal algorithms . Foundations and Trends in Optimization , 1 0 (3): 0 127--239, 2014

  60. [60]

    Constrained Reinforcement Learning Has Zero Duality Gap

    Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained Reinforcement Learning Has Zero Duality Gap . In Advances in Neural Information Processing Systems, 2019

  61. [61]

    Safe Policies for Reinforcement Learning via Primal-Dual Methods

    Santiago Paternain, Miguel Calvo-Fullana, Luiz FO Chamon, and Alejandro Ribeiro. Safe Policies for Reinforcement Learning via Primal-Dual Methods . IEEE Transactions on Automatic Control, 68 0 (3): 0 1321--1336, 2022

  62. [62]

    Safe Policy Iteration

    Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe Policy Iteration . In International Conference on Machine Learning, 2013

  63. [63]

    Distributionally robust optimization: A review

    Hamed Rahimian and Sanjay Mehrotra. Distributionally Robust Optimization: A Review . arXiv preprint arXiv:1908.05659, 2019

  64. [64]

    Variational Analysis , volume 317

    R Tyrrell Rockafellar and Roger J-B Wets. Variational Analysis , volume 317. Springer Science & Business Media, 2009

  65. [65]

    Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty

    Reazul Hasan Russel, Mouhacine Benosman, and Jeroen Van Baar. Robust Constrained-MDPs: Soft-Constrained Robust Policy Optimization under Model Uncertainty . arXiv preprint arXiv:2010.04870, 2020

  66. [66]

    On General Minimax Theorems

    Maurice Sion. On General Minimax Theorems . Pacific Journal of Mathematics, 8 0 (1): 0 171 -- 176, 1958

  67. [67]

    Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning

    Oswin So and Chuchu Fan. Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning . In Proceedings of Robotics: Science and Systems, 2023

  68. [68]

    Solving Minimum-Cost Reach Avoid using Reinforcement Learning

    Oswin So, Cheng Ge, and Chuchu Fan. Solving Minimum-Cost Reach Avoid using Reinforcement Learning . In Advances in Neural Information Processing Systems, 2024

  69. [69]

    Constrained Reinforcement Learning Under Model Mismatch

    Zhongchang Sun, Sihong He, Fei Miao, and Shaofeng Zou. Constrained Reinforcement Learning Under Model Mismatch . arXiv preprint arXiv:2405.01327, 2024

  70. [70]

    Reward Constrained Policy Optimization

    Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward Constrained Policy Optimization . In International Conference on Learning Representations, 2018

  71. [71]

    Policy Gradient in Robust MDPs with Global Convergence Guarantee

    Qiuhao Wang, Chin Pang Ho, and Marek Petrik. Policy Gradient in Robust MDPs with Global Convergence Guarantee . In International Conference on Machine Learning, 2023

  72. [72]

    Online Robust Reinforcement Learning with Model Uncertainty

    Yue Wang and Shaofeng Zou. Online Robust Reinforcement Learning with Model Uncertainty . In Advances in Neural Information Processing Systems, 2021

  73. [73]

    Policy Gradient Method for Robust Reinforcement Learning

    Yue Wang and Shaofeng Zou. Policy Gradient Method for Robust Reinforcement Learning . In International Conference on Machine Learning, 2022

  74. [74]

    Robust Constrained Reinforcement Learning

    Yue Wang, Fei Miao, and Shaofeng Zou. Robust Constrained Reinforcement Learning . arXiv preprint arXiv:2209.06866, 2022

  75. [75]

    A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

    Honghao Wei, Xin Liu, and Lei Ying. A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes . arXiv preprint arXiv:2106.01577, 2021

  76. [76]

    Robust Markov Decision Processes

    Wolfram Wiesemann, Daniel Kuhn, and Ber c Rustem. Robust Markov Decision Processes . Mathematics of Operations Research, 38 0 (1): 0 153--183, 2013

  77. [77]

    Toward Theoretical Understandings of Robust Markov Decision Processes: Sample Complexity and Asymptotics

    Wenhao Yang, Liangyu Zhang, and Zhihua Zhang. Toward Theoretical Understandings of Robust Markov Decision Processes: Sample Complexity and Asymptotics . The Annals of Statistics, 50 0 (6): 0 3223--3248, 2022

  78. [78]

    Robust Markov Decision Processes without Model Estimation

    Wenhao Yang, Han Wang, Tadashi Kozuno, Scott M Jordan, and Zhihua Zhang. Robust Markov Decision Processes without Model Estimation . arXiv preprint arXiv:2302.01248, 2023

  79. [79]

    A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization

    Donghao Ying, Yuhao Ding, and Javad Lavaei. A Dual Approach to Constrained Markov Decision Processes with Entropy Regularization . In International Conference on Artificial Intelligence and Statistics, 2022

  80. [80]

    Reward is Enough for Convex MDPs

    Tom Zahavy, Brendan O'Donoghue, Guillaume Desjardins, and Satinder Singh. Reward is Enough for Convex MDPs . In Advances in Neural Information Processing Systems, 2021

Showing first 80 references.