Lyapunov-based Safe Policy Optimization for Continuous Control

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found at the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.
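
The action projection mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single linearized constraint of the form grad · (a − a_baseline) ≤ eps at the current state, for which the Euclidean projection has a closed form. The function name `project_action` and the parameters `a_baseline`, `grad`, and `eps` are hypothetical and chosen for illustration only; the abstract does not specify the exact form of the Lyapunov constraint, so this is not the paper's implementation.

```python
import numpy as np


def project_action(a_policy, a_baseline, grad, eps):
    """Euclidean projection of a_policy onto the half-space
    {a : grad . (a - a_baseline) <= eps}.

    All argument names are illustrative assumptions, not taken from the paper.
    """
    a_policy = np.asarray(a_policy, dtype=float)
    a_baseline = np.asarray(a_baseline, dtype=float)
    grad = np.asarray(grad, dtype=float)

    # Amount by which the proposed action violates the linear constraint.
    violation = grad @ (a_policy - a_baseline) - eps
    sq_norm = grad @ grad

    # Already feasible, or the constraint does not depend on the action
    # (zero gradient): return the proposed action unchanged.
    if violation <= 0.0 or sq_norm == 0.0:
        return a_policy

    # Move the minimal distance along grad to reach the constraint boundary.
    return a_policy - (violation / sq_norm) * grad


# Usage: the proposed action overshoots the constraint along the first axis.
safe_action = project_action([1.0, 0.0], [0.0, 0.0], [1.0, 0.0], eps=0.5)
```

Because the correction is a differentiable function of the proposed action wherever the constraint is active, a projection of this kind can in principle sit as a final layer after a policy network, which is one way to read the abstract's remark about end-to-end training.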

citation-role summary

background 2

citation-polarity summary

verdicts

UNVERDICTED 7

roles

background 2

polarities

background 2

representative citing papers

Robust Shielding for Safe Reinforcement Learning

cs.AI · 2026-05-29 · unverdicted · novelty 6.0

A sound and optimal shielding method for robust MDPs ensures LTL safety under worst-case transitions and combines with PAC sampling to produce minimally restrictive shields for learned models.

Safe-Support Q-Learning: Learning without Unsafe Exploration

cs.LG · 2026-04-28 · unverdicted · novelty 5.0

Safe-Support Q-Learning trains Q-functions and policies in reinforcement learning without ever visiting unsafe states by constraining the behavior policy to a safe set and using KL-regularized Bellman targets in a two-stage framework.
