Joint Chance Constrained Safe-Optimal Control

Jared Miller; John Lygeros; Marta Fochesato; Niklas Schmid; Tobias Sutter; Tristan Zeller

arxiv: 2606.30829 · v1 · pith:5YY6GCEFnew · submitted 2026-06-29 · 🧮 math.OC · cs.SY· eess.SY

Joint Chance Constrained Safe-Optimal Control

Niklas Schmid , Jared Miller , Tristan Zeller , Marta Fochesato , Tobias Sutter , John Lygeros This is my paper

Pith reviewed 2026-07-01 01:36 UTC · model grok-4.3

classification 🧮 math.OC cs.SYeess.SY

keywords joint chance constraintssafe optimal controlconstrained Markov decision processdynamic programmingstochastic systemsreach-avoidgridding approximationsunicycle system

0 comments

The pith

Minimizing expected cost only over safe trajectories removes the incentive for joint chance constraint violations in stochastic optimal control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard joint chance constrained control minimizes expected cost across all trajectories, which can encourage policies to exploit low-cost unsafe paths that violate safety requirements. The paper shows that restricting the objective to the expected cost of safe trajectories alone overcomes this incentive while preserving relevance for applications where unsafe costs do not matter. The reformulated problem is expressed as a constrained Markov decision process on an augmented state space that tracks cumulative safety information. Dynamic programming then solves for the safe-optimal policy, and explicit bounds quantify how state-space gridding affects the achieved safety level. Simulations on a 2D unicycle in cluttered reach-avoid settings illustrate the distinction between the two formulations.

Core claim

Restricting the cost objective to safe trajectories only yields policies that respect the joint chance constraint without inviting violations; the resulting problem admits an exact dynamic-programming solution once the state is augmented to encode the chance constraint as a constrained Markov decision process.

What carries the argument

Augmented state space that records whether the joint chance constraint has been satisfied so far, converting the original problem into a constrained Markov decision process solvable by dynamic programming.

If this is right

The safe-only objective produces policies whose safety level can be bounded under gridding approximations of continuous state spaces.
Dynamic programming on the augmented state yields the exact optimum for the safe-cost problem in finite discrete settings.
The formulation applies directly to finite-horizon reach-avoid tasks in environments with probabilistic obstacles.
Reinforcement-learning approximations remain feasible once the constrained-MDP structure is recognized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same augmentation trick could be combined with model-predictive control to obtain receding-horizon versions that retain the safety guarantee.
In multi-agent settings the approach might generalize by treating joint safety across agents as an additional dimension in the augmented state.
When unsafe trajectories carry domain-specific costs that are still worth modeling, a weighted combination of the two objectives could be explored as a direct extension.

Load-bearing premise

That the cost of unsafe trajectories can be ignored without changing what counts as optimal for the safe trajectories, and that the state augmentation keeps the problem Markovian enough for dynamic programming to recover the desired solution.

What would settle it

On a stochastic system whose unsafe trajectories have measurably lower cost than safe ones, compare the empirical violation rate of the policy obtained by minimizing total expected cost versus the policy obtained by minimizing expected cost only on safe trajectories; a statistically higher violation rate for the first policy would confirm the claimed incentive difference.

read the original abstract

We consider the finite-time optimal control of stochastic systems subject to a probabilistic constraint on the trajectories' safety. Such formulations are known as joint chance constrained optimal control problems. The common practice is to jointly minimise the expected cost of all trajectories, safe and unsafe. This leads to policies which invite constraint violations to exploit low-cost unsafe trajectories. When constraints represent states of critical failure, such behaviour is undesirable. We demonstrate that this behaviour can be overcome by only minimizing the expected cost of safe trajectories. The underlying rationale follows a practical intuition: In many applications, the cost incurred by unsafe trajectories is irrelevant (e.g., the battery usage of a crashed quadcopter), and one is usually interested in minimizing the cost of trajectories that are safe. We show that this problem can be cast as a constrained Markov Decision Process over an augmented state space. This allows solving it via dynamic programming. We derive bounds on the policies' safety under errors resulting from gridding approximations when the system's state space is continuous. Finally, we empirically compare dynamic programming as well as reinforcement learning solutions on a simulated 2D unicycle system in cluttered reach-avoid environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

By minimizing expected cost only over safe trajectories and casting the problem as a constrained MDP on an augmented state, the paper provides a practical fix for incentive issues in joint chance-constrained control.

read the letter

The paper's key move is to change the objective from minimizing expected cost over all trajectories to only the safe ones. This stops the optimizer from using unsafe low-cost paths, which is a real issue when violations mean failure like a crashed vehicle. They augment the state with a binary flag that stays true only if no violation occurred up to that point, turning the joint chance constraint into a constraint on the probability of ending in the safe absorbing state. This fits the constrained MDP framework and lets them use dynamic programming to find the optimal policy.

The gridding bounds for continuous states are a practical addition, giving guarantees on how much safety is lost due to discretization errors. The 2D unicycle example in cluttered environments illustrates that both DP and RL can solve the resulting problem, with the DP version presumably giving exact solutions on the grid.

The work is limited to finite-horizon discrete-time reach-avoid problems, so it doesn't claim broader applicability to infinite horizons or continuous-time systems. The bounds are derived but the paper doesn't show how loose they are in the example or provide much analysis on their conservatism. The RL comparison is secondary and doesn't introduce new algorithms. Since it's all simulation, there's no hardware validation, though that's typical for this area.

This is for specialists in safe control and stochastic optimization who deal with joint chance constraints. It has enough new structure in the objective and the augmentation trick, plus the bounds, to warrant peer review, even if revisions will be needed on tightening the analysis and adding more baselines.

Referee Report

2 major / 2 minor

Summary. The paper considers finite-horizon joint chance-constrained optimal control of stochastic systems. It argues that the standard objective of minimizing expected cost over all trajectories incentivizes unsafe behavior when unsafe paths have low cost. Instead, it proposes minimizing the expected cost conditional on safety (i.e., E[cost · 1_safe]). This objective is shown to be equivalent to a constrained Markov decision process on an augmented state that tracks a binary safety flag; the joint chance constraint becomes a terminal constraint on the probability of the safe flag. The resulting CMDP is solved by dynamic programming. Bounds are derived on the safety violation probability induced by state-space gridding approximations. The approach is illustrated on a 2D unicycle reach-avoid task using both exact DP and RL.

Significance. If the reformulation and gridding bounds are correct, the work supplies a principled objective that aligns optimization with the cost of safe operation only, which is practically relevant when unsafe trajectories (e.g., crashes) incur irrelevant costs. Casting the problem as a standard CMDP permits reuse of existing DP and RL algorithms, while the explicit safety bounds address a central implementation issue for continuous-state systems. The unicycle experiments provide concrete evidence of applicability. These elements together strengthen the case for the proposed objective over conventional joint-chance formulations.

major comments (2)

[Section deriving the CMDP formulation] The central claim that the augmented-state CMDP exactly recovers the original joint-chance problem rests on the construction of the safety flag and the terminal constraint E[1_safe] ≥ 1-δ. The manuscript should explicitly verify (in the section deriving the CMDP) that the flag update preserves the Markov property for arbitrary stochastic dynamics and that no additional history dependence is introduced.
[Section on gridding bounds] The safety bounds under gridding (mentioned in the abstract) are load-bearing for any continuous-state guarantee. The proof must state the precise dependence on grid resolution, the Lipschitz constants or other regularity assumptions on the transition kernel, and whether the bound holds uniformly over policies or only for the optimal policy.

minor comments (2)

[Empirical evaluation section] The unicycle experiments would be strengthened by reporting the exact grid resolution used for DP, the number of Monte-Carlo rollouts for safety estimation, and a direct comparison of achieved cost versus achieved safety probability against a standard (non-augmented) chance-constrained baseline.
[Notation and preliminaries] Notation for the augmented state and the indicator 1_safe should be introduced once, early, and used consistently; currently the abstract and later sections appear to employ slightly different symbols for the same quantity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments identify opportunities to strengthen the clarity of the CMDP derivation and the gridding analysis. We address each major comment below.

read point-by-point responses

Referee: [Section deriving the CMDP formulation] The central claim that the augmented-state CMDP exactly recovers the original joint-chance problem rests on the construction of the safety flag and the terminal constraint E[1_safe] ≥ 1-δ. The manuscript should explicitly verify (in the section deriving the CMDP) that the flag update preserves the Markov property for arbitrary stochastic dynamics and that no additional history dependence is introduced.

Authors: We agree that an explicit verification paragraph will improve readability. The safety flag is updated deterministically from the current augmented state (original state plus flag) and the next state according to whether the next state satisfies the per-step safety predicate; because this update is a fixed function of the current augmented state and the realized next state, the transition kernel on the augmented space remains Markovian for any stochastic dynamics of the original system and introduces no additional history dependence. In the revised manuscript we will insert a short dedicated paragraph immediately after the flag-update definition that states this argument formally. revision: yes
Referee: [Section on gridding bounds] The safety bounds under gridding (mentioned in the abstract) are load-bearing for any continuous-state guarantee. The proof must state the precise dependence on grid resolution, the Lipschitz constants or other regularity assumptions on the transition kernel, and whether the bound holds uniformly over policies or only for the optimal policy.

Authors: The existing proof assumes the transition kernel is Lipschitz continuous with constant L and derives a safety-violation bound that scales linearly with grid resolution Δ and horizon length T (specifically O(Δ L T)). The argument proceeds via a uniform bound on the approximation error of the value-function iteration and therefore holds for every policy, not merely the optimal one. We will revise the gridding section to state these assumptions, the explicit functional dependence on Δ and L, and the uniformity over policies at the beginning of the proof. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper redefines the objective as the expected cost only over safe trajectories (removing unsafe contributions by construction), augments the state space with a binary safety indicator to convert the joint chance constraint into a standard terminal probability constraint E[1_safe] ≥ 1-δ, and applies the known constrained-MDP dynamic programming template. This reduction is explicit and does not rely on fitted parameters renamed as predictions, self-citation chains for uniqueness, or ansatzes smuggled from prior work. The gridding error bounds are derived separately from the DP solution and the unicycle reach-avoid example serves as an independent numerical check. No load-bearing step reduces to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard stochastic control modeling assumptions and the new objective definition; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Stochastic system dynamics admit a Markov decision process representation.
Invoked when casting the problem as a constrained MDP.
domain assumption Joint chance constraint can be encoded via state augmentation without loss of optimality for the safe-cost objective.
Central to the reduction to dynamic programming.

pith-pipeline@v0.9.1-grok · 5745 in / 1263 out tokens · 40475 ms · 2026-07-01T01:36:42.945012+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 3 canonical work pages

[1]

A survey of autonomous driving: Common practices and emerging technologies,

E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, “A survey of autonomous driving: Common practices and emerging technologies,” IEEE access, vol. 8, pp. 58 443–58 469, 2020

2020
[2]

Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,

M. H. Rahman, M. M. Gulzar, T. S. Haque, S. Habib, A. Shakoor, and A. F. Murtaza, “Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,”Engineering Science and Technology, an International Journal, vol. 64, p. 101950, 2025

2025
[3]

Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,

C. Shang, W.-H. Chen, A. D. Stroock, and F. You, “Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,”Transactions on Control Systems Technology, vol. 28, no. 4, pp. 1493–1504, 2019

2019
[4]

V olume control of low-cost ventilator with automatic set-point adaptation,

L. Hewing, M. Menner, N. Tachatos, M. S. Daners, C. Du Pasquier, T. S. Lumpe, K. Shea, A. Carron, and M. N. Zeilinger, “V olume control of low-cost ventilator with automatic set-point adaptation,” inEuropean Control Conference, 2021, pp. 781–786

2021
[5]

Stochastic linear model predictive control with chance constraints–a review,

M. Farina, L. Giulioni, and R. Scattolini, “Stochastic linear model predictive control with chance constraints–a review,”Journal of Process Control, vol. 44, pp. 53–67, 2016

2016
[6]

Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,

K. Wang and S. Gros, “Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,” inConference on Decision and Control, 2021. 8

2021
[7]

Stochastic nonlinear model predictive control with joint chance constraints,

V . A. Bavdekar and A. Mesbah, “Stochastic nonlinear model predictive control with joint chance constraints,”IF AC-PapersOnLine, 2016

2016
[8]

Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,

V . Raghuraman and J. P. Koeln, “Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,”Control Systems Letters, vol. 7, pp. 865–870, 2022

2022
[9]

Stochastic model predictive control with joint chance constraints,

J. A. Paulson, E. A. Buehler, R. D. Braatz, and A. Mesbah, “Stochastic model predictive control with joint chance constraints,”International Journal of Control, vol. 93, no. 1, pp. 126–139, 2020

2020
[10]

Joint chance-constrained model predictive control with prob- abilistic resolvability,

M. Ono, “Joint chance-constrained model predictive control with prob- abilistic resolvability,” inAmerican Control Conference, 2012

2012
[11]

Comput- ing optimal joint chance constrained control policies,

N. Schmid, M. Fochesato, S. H. Li, T. Sutter, and J. Lygeros, “Comput- ing optimal joint chance constrained control policies,”Transactions on Automatic Control, vol. 70, no. 7, pp. 4904–4911, 2025

2025
[12]

Chance-constrained dynamic programming with application to risk-aware robotic space exploration,

Y . K. M. Ono, M. Pavone and J. Balaram, “Chance-constrained dynamic programming with application to risk-aware robotic space exploration,” Autonomous Robots, vol. 39, pp. 555–571, 2015

2015
[13]

Solving mission-wide chance-constrained opti- mal control using dynamic programming,

K. Wang and S. Gros, “Solving mission-wide chance-constrained opti- mal control using dynamic programming,” inConference on Decision and Control, 2022, pp. 2947–2952

2022
[14]

Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,

E. M. Hahn, V . Hashemi, H. Hermanns, M. Lahijanian, and A. Turrini, “Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,”Transactions on Modeling and Computer Simulation, vol. 29, no. 4, pp. 1–31, 2019

2019
[15]

Joint chance constrained optimal control via linear programming,

N. Schmid, M. Fochesato, T. Sutter, and J. Lygeros, “Joint chance constrained optimal control via linear programming,”Control Systems Letters, vol. 8, pp. 736–741, 2024

2024
[16]

Formal multi-objective synthe- sis of continuous-state MDPs,

S. Haesaert, P. Nilsson, and S. Soudjani, “Formal multi-objective synthe- sis of continuous-state MDPs,” inAmerican Control Conference, 2021, pp. 3428–3433

2021
[17]

Multi- objective model checking of Markov decision processes,

K. Etessami, M. Kwiatkowska, M. Y . Vardi, and M. Yannakakis, “Multi- objective model checking of Markov decision processes,”Tools and Algorithms for the Construction and Analysis of Systems, vol. 4, pp. 50–65, 2008

2008
[18]

Operator splitting for convex constrained Markov decision processes,

P. D. Grontas, A. Tsiamis, and J. Lygeros, “Operator splitting for convex constrained Markov decision processes,”arXiv preprint arXiv:2412.14002, 2024

work page arXiv 2024
[19]

Policy gradients for probabilistic constrained reinforcement learning,

W. Chen, D. Subramanian, and S. Paternain, “Policy gradients for probabilistic constrained reinforcement learning,” inConference on Information Sciences and Systems, 2023, pp. 1–6

2023
[20]

Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,

M. Xu, Z. Liu, P. Huang, W. Ding, Z. Cen, B. Li, and D. Zhao, “Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,”arXiv:2209.08025, 2022

work page arXiv 2022
[21]

First order constrained optimization in policy space,

Y . Zhang, Q. Vuong, and K. Ross, “First order constrained optimization in policy space,”Conference on Neural Information Processing Systems, vol. 33, pp. 15 338–15 349, 2020

2020
[22]

Natural policy gradient primal-dual method for constrained Markov decision processes,

D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained Markov decision processes,”Confer- ence on Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020

2020
[23]

Reward constrained policy optimization,

C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” inInternational Conference on Learning Representations, 2018

2018
[24]

A learning-based approach to stochastic op- timal control under reach-avoid constraint,

T. Ni and M. Kamgarpour, “A learning-based approach to stochastic op- timal control under reach-avoid constraint,” inInternational Conference on Hybrid Systems: Computation and Control, 2025, pp. 1–8

2025
[25]

Probabilistic control barrier functions: Safety in probability for discrete-time stochas- tic systems,

P. Mestres, B. Werner, R. K. Cosner, and A. D. Ames, “Probabilistic control barrier functions: Safety in probability for discrete-time stochas- tic systems,”arXiv preprint arXiv:2510.01501, 2025

work page arXiv 2025
[26]

Hernández-Lerma and J

O. Hernández-Lerma and J. B. Lasserre,Discrete-time Markov control processes: basic optimality criteria, ser. Applications of Mathematics. Springer Science & Business Media, 2012, vol. 30

2012
[27]

Altman,Constrained Markov decision processes

E. Altman,Constrained Markov decision processes. Routledge, 2021

2021
[28]

Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,

A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008

2008
[29]

Approximate model checking of stochastic hybrid systems,

A. Abate, J.-P. Katoen, J. Lygeros, and M. Prandini, “Approximate model checking of stochastic hybrid systems,”European Journal of Control, vol. 16, no. 6, pp. 624–641, 2010

2010
[30]

Self-punishment and reward backfill for deep q-learning,

M. R. Bonyadi, R. Wang, and M. Ziaei, “Self-punishment and reward backfill for deep q-learning,”Transactions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 8086–8093, 2022

2022
[31]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational Conference on Machine Learning, 2018, pp. 1861–1870

2018
[32]

all-rl-algorithms,

F. Khan, “all-rl-algorithms,” https://github.com/fareedkhan-dev/ all-rl-algorithms, 2024, accessed: 26.06.2026

2024
[33]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepelet al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,”Science, vol. 362, no. 6419, pp. 1140–1144, 2018

2018
[34]

Learning robust perceptive locomotion for quadrupedal robots in the wild,

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science robotics, vol. 7, no. 62, p. eabk2822, 2022

2022
[35]

Nonatomic total rewards Markov decision processes with multiple criteria,

E. A. Feinberg and A. B. Piunovskiy, “Nonatomic total rewards Markov decision processes with multiple criteria,”Journal of Mathematical Analysis and Applications, vol. 273, no. 1, pp. 93–111, 2002. APPENDIX The following sections contain the proofs of our main The- orems III.1 and III.2, as well as Proposition III.3. Throughout, we will integrate over his...

2002
[36]

Ifπ z is optimal for (5), then its adaptation(π, a)to (2) optimally solves (2)
[37]

gN+1(zN+1) + NX k=0 gk(zk, uk) # (25a) s.t.E π z0

If(π, a)is optimal for (2), then its adaptationπ z to (5) optimally solves (5). Proof.Letp z 0:N+1 be the occupation measure onH z 0:N+1 in- duced byπ z onM z, andp 0:N the occupation measure induced onH 0:N byπonM. For notational simplicity, we write L(hN) =ℓ N(xN) +PN−1 k=0 ℓk(xk, uk). As a consequence of 10 Lemma A.2, by a change of variablesh z N =ψ(h...

[1] [1]

A survey of autonomous driving: Common practices and emerging technologies,

E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, “A survey of autonomous driving: Common practices and emerging technologies,” IEEE access, vol. 8, pp. 58 443–58 469, 2020

2020

[2] [2]

Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,

M. H. Rahman, M. M. Gulzar, T. S. Haque, S. Habib, A. Shakoor, and A. F. Murtaza, “Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,”Engineering Science and Technology, an International Journal, vol. 64, p. 101950, 2025

2025

[3] [3]

Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,

C. Shang, W.-H. Chen, A. D. Stroock, and F. You, “Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,”Transactions on Control Systems Technology, vol. 28, no. 4, pp. 1493–1504, 2019

2019

[4] [4]

V olume control of low-cost ventilator with automatic set-point adaptation,

L. Hewing, M. Menner, N. Tachatos, M. S. Daners, C. Du Pasquier, T. S. Lumpe, K. Shea, A. Carron, and M. N. Zeilinger, “V olume control of low-cost ventilator with automatic set-point adaptation,” inEuropean Control Conference, 2021, pp. 781–786

2021

[5] [5]

Stochastic linear model predictive control with chance constraints–a review,

M. Farina, L. Giulioni, and R. Scattolini, “Stochastic linear model predictive control with chance constraints–a review,”Journal of Process Control, vol. 44, pp. 53–67, 2016

2016

[6] [6]

Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,

K. Wang and S. Gros, “Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,” inConference on Decision and Control, 2021. 8

2021

[7] [7]

Stochastic nonlinear model predictive control with joint chance constraints,

V . A. Bavdekar and A. Mesbah, “Stochastic nonlinear model predictive control with joint chance constraints,”IF AC-PapersOnLine, 2016

2016

[8] [8]

Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,

V . Raghuraman and J. P. Koeln, “Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,”Control Systems Letters, vol. 7, pp. 865–870, 2022

2022

[9] [9]

Stochastic model predictive control with joint chance constraints,

J. A. Paulson, E. A. Buehler, R. D. Braatz, and A. Mesbah, “Stochastic model predictive control with joint chance constraints,”International Journal of Control, vol. 93, no. 1, pp. 126–139, 2020

2020

[10] [10]

Joint chance-constrained model predictive control with prob- abilistic resolvability,

M. Ono, “Joint chance-constrained model predictive control with prob- abilistic resolvability,” inAmerican Control Conference, 2012

2012

[11] [11]

Comput- ing optimal joint chance constrained control policies,

N. Schmid, M. Fochesato, S. H. Li, T. Sutter, and J. Lygeros, “Comput- ing optimal joint chance constrained control policies,”Transactions on Automatic Control, vol. 70, no. 7, pp. 4904–4911, 2025

2025

[12] [12]

Chance-constrained dynamic programming with application to risk-aware robotic space exploration,

Y . K. M. Ono, M. Pavone and J. Balaram, “Chance-constrained dynamic programming with application to risk-aware robotic space exploration,” Autonomous Robots, vol. 39, pp. 555–571, 2015

2015

[13] [13]

Solving mission-wide chance-constrained opti- mal control using dynamic programming,

K. Wang and S. Gros, “Solving mission-wide chance-constrained opti- mal control using dynamic programming,” inConference on Decision and Control, 2022, pp. 2947–2952

2022

[14] [14]

Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,

E. M. Hahn, V . Hashemi, H. Hermanns, M. Lahijanian, and A. Turrini, “Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,”Transactions on Modeling and Computer Simulation, vol. 29, no. 4, pp. 1–31, 2019

2019

[15] [15]

Joint chance constrained optimal control via linear programming,

N. Schmid, M. Fochesato, T. Sutter, and J. Lygeros, “Joint chance constrained optimal control via linear programming,”Control Systems Letters, vol. 8, pp. 736–741, 2024

2024

[16] [16]

Formal multi-objective synthe- sis of continuous-state MDPs,

S. Haesaert, P. Nilsson, and S. Soudjani, “Formal multi-objective synthe- sis of continuous-state MDPs,” inAmerican Control Conference, 2021, pp. 3428–3433

2021

[17] [17]

Multi- objective model checking of Markov decision processes,

K. Etessami, M. Kwiatkowska, M. Y . Vardi, and M. Yannakakis, “Multi- objective model checking of Markov decision processes,”Tools and Algorithms for the Construction and Analysis of Systems, vol. 4, pp. 50–65, 2008

2008

[18] [18]

Operator splitting for convex constrained Markov decision processes,

P. D. Grontas, A. Tsiamis, and J. Lygeros, “Operator splitting for convex constrained Markov decision processes,”arXiv preprint arXiv:2412.14002, 2024

work page arXiv 2024

[19] [19]

Policy gradients for probabilistic constrained reinforcement learning,

W. Chen, D. Subramanian, and S. Paternain, “Policy gradients for probabilistic constrained reinforcement learning,” inConference on Information Sciences and Systems, 2023, pp. 1–6

2023

[20] [20]

Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,

M. Xu, Z. Liu, P. Huang, W. Ding, Z. Cen, B. Li, and D. Zhao, “Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,”arXiv:2209.08025, 2022

work page arXiv 2022

[21] [21]

First order constrained optimization in policy space,

Y . Zhang, Q. Vuong, and K. Ross, “First order constrained optimization in policy space,”Conference on Neural Information Processing Systems, vol. 33, pp. 15 338–15 349, 2020

2020

[22] [22]

Natural policy gradient primal-dual method for constrained Markov decision processes,

D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained Markov decision processes,”Confer- ence on Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020

2020

[23] [23]

Reward constrained policy optimization,

C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” inInternational Conference on Learning Representations, 2018

2018

[24] [24]

A learning-based approach to stochastic op- timal control under reach-avoid constraint,

T. Ni and M. Kamgarpour, “A learning-based approach to stochastic op- timal control under reach-avoid constraint,” inInternational Conference on Hybrid Systems: Computation and Control, 2025, pp. 1–8

2025

[25] [25]

Probabilistic control barrier functions: Safety in probability for discrete-time stochas- tic systems,

P. Mestres, B. Werner, R. K. Cosner, and A. D. Ames, “Probabilistic control barrier functions: Safety in probability for discrete-time stochas- tic systems,”arXiv preprint arXiv:2510.01501, 2025

work page arXiv 2025

[26] [26]

Hernández-Lerma and J

O. Hernández-Lerma and J. B. Lasserre,Discrete-time Markov control processes: basic optimality criteria, ser. Applications of Mathematics. Springer Science & Business Media, 2012, vol. 30

2012

[27] [27]

Altman,Constrained Markov decision processes

E. Altman,Constrained Markov decision processes. Routledge, 2021

2021

[28] [28]

Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,

A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008

2008

[29] [29]

Approximate model checking of stochastic hybrid systems,

A. Abate, J.-P. Katoen, J. Lygeros, and M. Prandini, “Approximate model checking of stochastic hybrid systems,”European Journal of Control, vol. 16, no. 6, pp. 624–641, 2010

2010

[30] [30]

Self-punishment and reward backfill for deep q-learning,

M. R. Bonyadi, R. Wang, and M. Ziaei, “Self-punishment and reward backfill for deep q-learning,”Transactions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 8086–8093, 2022

2022

[31] [31]

Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational Conference on Machine Learning, 2018, pp. 1861–1870

2018

[32] [32]

all-rl-algorithms,

F. Khan, “all-rl-algorithms,” https://github.com/fareedkhan-dev/ all-rl-algorithms, 2024, accessed: 26.06.2026

2024

[33] [33]

A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepelet al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,”Science, vol. 362, no. 6419, pp. 1140–1144, 2018

2018

[34] [34]

Learning robust perceptive locomotion for quadrupedal robots in the wild,

T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science robotics, vol. 7, no. 62, p. eabk2822, 2022

2022

[35] [35]

Nonatomic total rewards Markov decision processes with multiple criteria,

E. A. Feinberg and A. B. Piunovskiy, “Nonatomic total rewards Markov decision processes with multiple criteria,”Journal of Mathematical Analysis and Applications, vol. 273, no. 1, pp. 93–111, 2002. APPENDIX The following sections contain the proofs of our main The- orems III.1 and III.2, as well as Proposition III.3. Throughout, we will integrate over his...

2002

[36] [36]

Ifπ z is optimal for (5), then its adaptation(π, a)to (2) optimally solves (2)

[37] [37]

gN+1(zN+1) + NX k=0 gk(zk, uk) # (25a) s.t.E π z0

If(π, a)is optimal for (2), then its adaptationπ z to (5) optimally solves (5). Proof.Letp z 0:N+1 be the occupation measure onH z 0:N+1 in- duced byπ z onM z, andp 0:N the occupation measure induced onH 0:N byπonM. For notational simplicity, we write L(hN) =ℓ N(xN) +PN−1 k=0 ℓk(xk, uk). As a consequence of 10 Lemma A.2, by a change of variablesh z N =ψ(h...