arxiv: 2505.17342 · v2 · submitted 2025-05-22 · 💻 cs.LG

A Survey of Safe Reinforcement Learning and Constrained MDPs: A Technical Survey on Single-Agent and Multi-Agent Safety

Pith reviewed 2026-05-22 12:48 UTC · model grok-4.3

classification 💻 cs.LG
keywords Safe Reinforcement Learning · Constrained Markov Decision Processes · Multi-Agent Safe RL · Policy Gradient Methods · Safe Exploration · Safety Constraints · Open Research Problems

The pith

Safe reinforcement learning can be grounded in constrained Markov decision processes that enforce safety while optimizing rewards, with extensions to multi-agent settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper provides a mathematically rigorous overview of SafeRL by centering on CMDP formulations that incorporate explicit safety constraints into the standard MDP framework. It reviews core definitions, constrained optimization methods, and key theorems before summarizing algorithms such as policy gradient approaches with guarantees and safe exploration techniques for single agents. The survey then extends the same lens to SafeMARL in both cooperative and competitive environments and closes by defining five open research problems, three of which target multi-agent challenges. A reader would care because these formulations turn safety from an afterthought into a first-class, optimizable quantity that matters for any deployed learning agent.

Core claim

The central claim is that SafeRL admits a unified treatment through Constrained Markov Decision Processes, which augment standard MDPs with cost constraints and permit constrained optimization that balances expected return against safety violations. This formulation supports both single-agent policy gradient methods with formal guarantees and recent extensions to multi-agent cooperative and competitive settings. Five specific open problems, three of them in SafeMARL, identify concrete directions for further progress.
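To make "balancing expected return against safety violations" concrete, the sketch below evaluates two fixed policies on a tiny two-state problem and checks each against a cost budget. Everything in it is an illustrative assumption rather than material from the survey: the toy transition table, the reward and cost values, the budget of 1.0, the discount of 0.9, and the names `discounted_value` and `summarize`.

```python
def discounted_value(P, signal, policy, gamma=0.9, start=0, sweeps=2000):
    """Iterative policy evaluation of one per-step signal (reward or cost).

    P[s][a][t] is the probability of moving from state s to state t under
    action a, signal[s][a] is the per-step reward or cost, and policy[s][a]
    is the probability of choosing action a in state s.
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(sweeps):
        # Synchronous Bellman backup: the new list is built from the old v.
        v = [
            sum(
                policy[s][a]
                * (signal[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n)))
                for a in range(len(policy[s]))
            )
            for s in range(n)
        ]
    return v[start]


# Hypothetical toy problem: state 0 is normal operation, state 1 is a hazard.
# Action 0 is "cautious", action 1 is "fast".
P = [
    [[1.0, 0.0], [0.5, 0.5]],  # state 0: cautious stays put, fast may hit the hazard
    [[1.0, 0.0], [1.0, 0.0]],  # state 1: both actions return to state 0
]
REWARD = [[1.0, 2.0], [0.0, 0.0]]
COST = [[0.0, 0.0], [1.0, 1.0]]  # cost is charged only inside the hazard state
BUDGET = 1.0  # assumed upper limit on expected discounted cost

ALWAYS_CAUTIOUS = [[1.0, 0.0], [1.0, 0.0]]
ALWAYS_FAST = [[0.0, 1.0], [0.0, 1.0]]


def summarize(policy):
    """Return (expected return, expected cost, feasibility) for a fixed policy."""
    ret = discounted_value(P, REWARD, policy)
    cost = discounted_value(P, COST, policy)
    return ret, cost, cost <= BUDGET
```

Calling `summarize(ALWAYS_FAST)` and `summarize(ALWAYS_CAUTIOUS)` shows the trade-off in this toy setting: the fast policy earns a higher return but exceeds the budget, while the cautious policy is feasible with a lower return.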

What carries the argument

Constrained Markov Decision Processes (CMDPs), which extend ordinary MDPs by adding one or more expected-cost constraints that must be satisfied by the learned policy.
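For orientation, one common way to write the discounted CMDP problem is sketched below. The notation is an assumption made for this note and may differ from the survey's own: state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, reward $r$, costs $c_i$ with budgets $d_i$, discount $\gamma$, and initial distribution $\mu$. Other criteria, such as average cost, are also used in the literature.

```latex
% Expected discounted return and costs of a policy \pi
J_r(\pi) = \mathbb{E}_{\pi,\mu}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
\qquad
J_{c_i}(\pi) = \mathbb{E}_{\pi,\mu}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\Big],
\quad i = 1, \dots, m.

% Constrained problem
\max_{\pi}\; J_r(\pi)
\quad \text{subject to} \quad
J_{c_i}(\pi) \le d_i, \quad i = 1, \dots, m.

% Lagrangian relaxation with multipliers \lambda_i \ge 0
L(\pi, \lambda) = J_r(\pi) - \sum_{i=1}^{m} \lambda_i \big(J_{c_i}(\pi) - d_i\big),
\qquad
\max_{\pi}\, \min_{\lambda \ge 0}\; L(\pi, \lambda).
```

Setting $m = 0$ recovers an ordinary MDP, which matches the description of CMDPs as an extension that adds expected-cost constraints.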

If this is right

  • Policy gradient methods equipped with CMDP constraints can converge to policies that satisfy safety requirements with high probability.
  • Safe exploration strategies derived from CMDP analysis reduce the number of unsafe actions taken during training.
  • The same CMDP machinery extends directly to cooperative multi-agent tasks, enabling joint policies that respect shared safety limits.
  • Competitive multi-agent settings can incorporate CMDP-style constraints to bound worst-case safety violations.
  • Solving the five listed open problems would yield more scalable algorithms for high-dimensional or partially observable SafeMARL.
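As a rough illustration of how a constrained policy gradient update can work, the sketch below runs a Lagrangian primal-dual scheme on a one-state, two-action problem with exact gradients. It is a generic textbook-style scheme under stated assumptions, not an algorithm taken from the survey. The problem (a risky action with cost 1 and a reward advantage of 0.8 over a safe action with cost 0), the budget of 0.3, the step sizes, and the names `primal_dual_step` and `run` are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def primal_dual_step(theta, lam, budget, reward_gap, step_theta, step_lam):
    """One gradient ascent step on the policy and one dual ascent step on lam.

    The policy picks the risky action with probability p = sigmoid(theta).
    Expected cost is p, and expected reward is a constant plus reward_gap * p,
    so the Lagrangian is  const + reward_gap * p - lam * (p - budget).
    """
    p = sigmoid(theta)
    grad_theta = (reward_gap - lam) * p * (1.0 - p)  # exact dL/dtheta
    new_theta = theta + step_theta * grad_theta
    # The multiplier grows while the constraint is violated and is kept >= 0.
    new_lam = max(0.0, lam + step_lam * (p - budget))
    return new_theta, new_lam, p


def run(budget=0.3, reward_gap=0.8, steps=2000, step=0.05):
    """Return the average expected cost over the run and the final multiplier."""
    theta, lam = 0.0, 0.0
    costs = []
    for _ in range(steps):
        theta, lam, p = primal_dual_step(theta, lam, budget, reward_gap, step, step)
        costs.append(p)  # expected cost of the policy used at this step
    return sum(costs) / len(costs), lam
```

The run reports the average cost over all iterations rather than the final value because the last iterate of a plain primal-dual scheme may oscillate around the constraint boundary. Algorithms that carry formal guarantees typically add further machinery, such as trust regions or averaging, which this sketch omits.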

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The survey's CMDP-centric view could be used to create standardized safety benchmarks that compare algorithms across single-agent and multi-agent domains.
  • Linking the reviewed methods to real-time verification tools might allow online monitoring of constraint satisfaction in deployed systems.
  • The open problems on multi-agent safety suggest natural connections to differential games and robust control that remain largely unexplored.
  • Practitioners could adopt the summarized safe exploration techniques to reduce risk in simulation-to-real transfer for robotics or autonomous driving.

Load-bearing premise

The reviewed theoretical foundations, algorithms, and selected open problems together give a representative and essentially complete picture of the current SafeRL and SafeMARL literature.

What would settle it

Locate any major recent paper that formulates SafeRL via CMDPs yet is omitted or described inaccurately in the survey; such an omission would show the coverage claim does not hold.

Original abstract

Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview of SafeRL formulations based on Constrained Markov Decision Processes (CMDPs) and extensions to Multi-Agent Safe RL (SafeMARL). We review theoretical foundations of CMDPs, covering definitions, constrained optimization techniques, and fundamental theorems. We then summarize state-of-the-art algorithms in SafeRL for single agents, including policy gradient methods with safety guarantees and safe exploration strategies, as well as recent advances in SafeMARL for cooperative and competitive settings. Additionally, we propose five open research problems to advance the field, with three focusing on SafeMARL. Each problem is described with motivation, key challenges, and related prior work. This survey is intended as a technical guide for researchers interested in SafeRL and SafeMARL, highlighting key concepts, methods, and open future research directions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to provide a mathematically rigorous overview of Safe Reinforcement Learning (SafeRL) formulations based on Constrained Markov Decision Processes (CMDPs), covering definitions, constrained optimization techniques, and fundamental theorems. It summarizes state-of-the-art algorithms for single-agent SafeRL (including policy gradient methods with safety guarantees and safe exploration strategies) as well as extensions to SafeMARL in cooperative and competitive settings, and proposes five open research problems (three focused on SafeMARL), each with motivation, key challenges, and related prior work. The survey positions itself as a technical guide highlighting key concepts, methods, and future directions.

Significance. If the coverage proves comprehensive and accurate, the survey would hold moderate significance as a consolidated technical reference for researchers entering or working in SafeRL and SafeMARL, particularly through its explicit identification of open problems that could guide subsequent work. The emphasis on mathematical rigor in CMDP foundations and the inclusion of both single- and multi-agent settings adds potential utility, though this is conditional on verifiable completeness of the reviewed literature.

major comments (1)
  1. Abstract and Introduction: The central claim of delivering a comprehensive, mathematically rigorous overview and SOTA summary (including the selection of the five open problems) is undermined by the complete absence of any literature search methodology, such as databases queried, keywords or search strings, date cutoffs, or inclusion/exclusion criteria. This directly affects the reliability of the coverage of CMDP foundations, single-agent algorithms, SafeMARL extensions, and the proposed open problems, as it leaves the 'without major omissions' assumption untestable.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for highlighting an important aspect of survey transparency. We address the single major comment below and commit to revisions that strengthen the manuscript without altering its core technical content.

Point-by-point responses
  1. Referee: Abstract and Introduction: The central claim of delivering a comprehensive, mathematically rigorous overview and SOTA summary (including the selection of the five open problems) is undermined by the complete absence of any literature search methodology, such as databases queried, keywords or search strings, date cutoffs, or inclusion/exclusion criteria. This directly affects the reliability of the coverage of CMDP foundations, single-agent algorithms, SafeMARL extensions, and the proposed open problems, as it leaves the 'without major omissions' assumption untestable.

    Authors: We agree that documenting the literature selection process improves verifiability. While the survey was compiled through iterative expert curation of the field rather than a formal PRISMA-style protocol (common in many technical RL surveys), we acknowledge that this leaves completeness assumptions harder to evaluate. In the revised version we will insert a concise 'Literature Review Methodology' subsection immediately after the Introduction. It will specify: primary sources (arXiv, Google Scholar, NeurIPS/ICML/ICLR proceedings), core search strings (e.g., 'constrained Markov decision process', 'safe reinforcement learning', 'SafeMARL', 'constrained policy optimization'), temporal focus (foundational works 2000–2015 plus post-2016 advances through early 2025), and inclusion criteria (mathematically rigorous CMDP formulations, algorithms with theoretical guarantees, and recent multi-agent extensions). This addition will make the rationale for the five open problems and the overall coverage explicit while preserving the survey's technical emphasis.

    revision: yes

Circularity Check

0 steps flagged

No circularity: survey aggregates external literature without self-referential derivations

Full rationale

This survey reviews CMDP foundations, single-agent SafeRL algorithms, SafeMARL extensions, and proposes open problems by citing prior external work. No equations, predictions, or central claims reduce to the paper's own fitted parameters, self-definitions, or unverified self-citation chains. The structure explicitly positions the content as a summary of state-of-the-art from the broader literature rather than an internal derivation. Lack of explicit search methodology is a potential completeness issue but does not create circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey relies on standard definitions from prior RL and CMDP literature without introducing new fitted parameters or invented entities.

axioms (1)
  • [standard math] Standard MDP assumptions, including finite or countable state and action spaces and the Markov property.
    Invoked when defining CMDPs and constrained optimization in the theoretical foundations section.

pith-pipeline@v0.9.0 · 5711 in / 1080 out tokens · 27775 ms · 2026-05-22T12:48:53.172366+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AdaFair-MARL: Enforcing Adaptive Fairness Constraints in Multi-Agent Reinforcement Learning

    cs.LG · 2025-11 · unverdicted · novelty 6.0

    AdaFair-MARL enforces workload fairness as an explicit second-order cone constraint in cooperative MARL via adaptive primal-dual optimization, achieving near-perfect constraint satisfaction while preserving team performance.

  2. Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

    cs.MA · 2026-05 · unverdicted · novelty 5.0

    Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.