A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety
Pith reviewed 2026-05-22 12:48 UTC · model grok-4.3
The pith
Safe reinforcement learning can be grounded in constrained Markov decision processes that enforce safety while optimizing rewards, with extensions to multi-agent settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that SafeRL admits a unified treatment through Constrained Markov Decision Processes (CMDPs), which augment standard MDPs with cost constraints and permit constrained optimization that balances expected return against safety violations. This formulation supports both single-agent policy gradient methods with formal guarantees and recent extensions to cooperative and competitive multi-agent settings. Five specific open problems, three of them in SafeMARL, identify concrete directions for further progress.
What carries the argument
Constrained Markov Decision Processes (CMDPs), which extend ordinary MDPs by adding one or more expected-cost constraints that must be satisfied by the learned policy.
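For orientation, the constrained problem can be written out as below. This is a sketch of the standard discounted form, not a transcription from the paper: the discount factor \gamma, reward r, cost functions c_i, and the number of constraints m are assumed notation, while J, J_c, d_i, and L(\pi, \lambda) are the symbols quoted elsewhere on this page.

```latex
\begin{aligned}
\max_{\pi}\quad & J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
\text{s.t.}\quad & J_{c_i}(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big] \le d_i,
  \qquad i = 1, \dots, m, \\[4pt]
L(\pi, \lambda) \;=\;& J(\pi) - \sum_{i=1}^{m} \lambda_i \big(J_{c_i}(\pi) - d_i\big),
  \qquad \lambda_i \ge 0 .
\end{aligned}
```

Maximizing L over \pi while minimizing over \lambda \ge 0 is the saddle-point view that Lagrangian-based SafeRL methods typically build on.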
If this is right
- Policy gradient methods equipped with CMDP constraints can converge to policies that satisfy safety requirements with high probability.
- Safe exploration strategies derived from CMDP analysis reduce the number of unsafe actions taken during training.
- The same CMDP machinery extends directly to cooperative multi-agent tasks, enabling joint policies that respect shared safety limits.
- Competitive multi-agent settings can incorporate CMDP-style constraints to bound worst-case safety violations.
- Solving the five listed open problems would yield more scalable algorithms for high-dimensional or partially observable SafeMARL.
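The first consequence above can be made concrete with a toy primal-dual loop. The following is a minimal sketch under stated assumptions, not an algorithm taken from the survey: the one-state CMDP, its reward and cost numbers, the names `REWARDS`, `COSTS`, `COST_LIMIT`, `ENTROPY_COEF`, and `primal_dual` are all hypothetical, and the entropy bonus is an added assumption used only to damp the oscillation that plain gradient descent-ascent tends to show on such a small problem.

```python
import math

# Hypothetical one-state CMDP (a constrained bandit), invented for illustration.
REWARDS = (1.0, 0.5)   # action 0 pays more ...
COSTS = (1.0, 0.0)     # ... but is the only action that incurs safety cost
COST_LIMIT = 0.3       # constraint: expected cost must stay <= 0.3
ENTROPY_COEF = 0.5     # assumed regularizer that damps primal-dual oscillation


def prob_risky(theta):
    """Probability of the risky action 0 under a sigmoid (two-way softmax) policy."""
    return 1.0 / (1.0 + math.exp(-theta))


def expected_return(theta):
    p = prob_risky(theta)
    return p * REWARDS[0] + (1.0 - p) * REWARDS[1]


def expected_cost(theta):
    p = prob_risky(theta)
    return p * COSTS[0] + (1.0 - p) * COSTS[1]


def primal_dual(steps=5000, lr=0.1):
    """Gradient ascent on the policy logit, projected gradient ascent on the multiplier."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        p = prob_risky(theta)
        dp_dtheta = p * (1.0 - p)
        # Derivative of J_r - lam * (J_c - d) with respect to p.
        grad_p = (REWARDS[0] - REWARDS[1]) - lam * (COSTS[0] - COSTS[1])
        # Entropy H(p) has dH/dp = log((1 - p) / p) = -theta for a sigmoid policy.
        grad_theta = (grad_p - ENTROPY_COEF * theta) * dp_dtheta
        violation = expected_cost(theta) - COST_LIMIT
        theta += lr * grad_theta               # primal step: improve the Lagrangian
        lam = max(0.0, lam + lr * violation)   # dual step: raise lam while unsafe
    return theta, lam


if __name__ == "__main__":
    theta, lam = primal_dual()
    print("expected cost:", expected_cost(theta))
    print("expected return:", expected_return(theta))
    print("multiplier:", lam)
```

With these made-up numbers the cost constraint is active at the optimum, so the probability of the risky action is expected to settle near the limit of 0.3 while the multiplier stays positive. Trust-region methods such as CPO handle the constraint differently, so this loop should be read only as an illustration of the Lagrangian idea.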
Where Pith is reading between the lines
- The survey's CMDP-centric view could be used to create standardized safety benchmarks that compare algorithms across single-agent and multi-agent domains.
- Linking the reviewed methods to real-time verification tools might allow online monitoring of constraint satisfaction in deployed systems.
- The open problems on multi-agent safety suggest natural connections to differential games and robust control that remain largely unexplored.
- Practitioners could adopt the summarized safe exploration techniques to reduce risk in simulation-to-real transfer for robotics or autonomous driving.
Load-bearing premise
The reviewed theoretical foundations, algorithms, and selected open problems together give a representative and essentially complete picture of the current SafeRL and SafeMARL literature.
What would settle it
Locate any major recent paper that formulates SafeRL via CMDPs yet is omitted or described inaccurately in the survey; such an omission would show the coverage claim does not hold.
Original abstract
Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview of SafeRL formulations based on Constrained Markov Decision Processes (CMDPs) and extensions to Multi-Agent Safe RL (SafeMARL). We review theoretical foundations of CMDPs, covering definitions, constrained optimization techniques, and fundamental theorems. We then summarize state-of-the-art algorithms in SafeRL for single agents, including policy gradient methods with safety guarantees and safe exploration strategies, as well as recent advances in SafeMARL for cooperative and competitive settings. Additionally, we propose five open research problems to advance the field, with three focusing on SafeMARL. Each problem is described with motivation, key challenges, and related prior work. This survey is intended as a technical guide for researchers interested in SafeRL and SafeMARL, highlighting key concepts, methods, and open future research directions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to provide a mathematically rigorous overview of Safe Reinforcement Learning (SafeRL) formulations based on Constrained Markov Decision Processes (CMDPs), covering definitions, constrained optimization techniques, and fundamental theorems. It summarizes state-of-the-art algorithms for single-agent SafeRL (including policy gradient methods with safety guarantees and safe exploration strategies) as well as extensions to SafeMARL in cooperative and competitive settings, and proposes five open research problems (three focused on SafeMARL), each with motivation, key challenges, and related prior work. The survey positions itself as a technical guide highlighting key concepts, methods, and future directions.
Significance. If the coverage proves comprehensive and accurate, the survey would hold moderate significance as a consolidated technical reference for researchers entering or working in SafeRL and SafeMARL, particularly through its explicit identification of open problems that could guide subsequent work. The emphasis on mathematical rigor in CMDP foundations and the inclusion of both single- and multi-agent settings adds potential utility, though this is conditional on verifiable completeness of the reviewed literature.
major comments (1)
- Abstract and Introduction: The central claim of delivering a comprehensive, mathematically rigorous overview and SOTA summary (including the selection of the five open problems) is undermined by the complete absence of any literature search methodology, such as databases queried, keywords or search strings, date cutoffs, or inclusion/exclusion criteria. This directly affects the reliability of the coverage of CMDP foundations, single-agent algorithms, SafeMARL extensions, and the proposed open problems, as it leaves the 'without major omissions' assumption untestable.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for highlighting an important aspect of survey transparency. We address the single major comment below and commit to revisions that strengthen the manuscript without altering its core technical content.
Point-by-point responses
- Referee: Abstract and Introduction: The central claim of delivering a comprehensive, mathematically rigorous overview and SOTA summary (including the selection of the five open problems) is undermined by the complete absence of any literature search methodology, such as databases queried, keywords or search strings, date cutoffs, or inclusion/exclusion criteria. This directly affects the reliability of the coverage of CMDP foundations, single-agent algorithms, SafeMARL extensions, and the proposed open problems, as it leaves the 'without major omissions' assumption untestable.
Authors: We agree that documenting the literature selection process improves verifiability. While the survey was compiled through iterative expert curation of the field rather than a formal PRISMA-style protocol (common in many technical RL surveys), we acknowledge that this leaves completeness assumptions harder to evaluate. In the revised version we will insert a concise 'Literature Review Methodology' subsection immediately after the Introduction. It will specify: primary sources (arXiv, Google Scholar, NeurIPS/ICML/ICLR proceedings), core search strings (e.g., 'constrained Markov decision process', 'safe reinforcement learning', 'SafeMARL', 'constrained policy optimization'), temporal focus (foundational works 2000–2015 plus post-2016 advances through early 2025), and inclusion criteria (mathematically rigorous CMDP formulations, algorithms with theoretical guarantees, and recent multi-agent extensions). This addition will make the rationale for the five open problems and the overall coverage explicit while preserving the survey's technical emphasis.
Revision: yes
Circularity Check
No circularity: survey aggregates external literature without self-referential derivations
This survey reviews CMDP foundations, single-agent SafeRL algorithms, SafeMARL extensions, and proposes open problems by citing prior external work. No equations, predictions, or central claims reduce to the paper's own fitted parameters, self-definitions, or unverified self-citation chains. The structure explicitly positions the content as a summary of state-of-the-art from the broader literature rather than an internal derivation. Lack of explicit search methodology is a potential completeness issue but does not create circularity in any derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard MDP assumptions, including finite or countable state and action spaces and the Markov property.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Abstract and Sec. 3–4: CMDP formulation max_π J(π) s.t. J_{c_i}(π) ≤ d_i, Lagrangian L(π, λ), CPO trust-region updates, MACPO for SafeMARL.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Sec. 3.2–3.4: occupancy LP, strong duality for CMDPs, policy-gradient bounds (Achiam et al.).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning
  AdaFair-MARL enforces workload fairness as an explicit second-order cone constraint in cooperative MARL via adaptive primal-dual optimization, achieving near-perfect constraint satisfaction while preserving team performance.
- Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
  Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.