Joint Chance Constrained Safe-Optimal Control
Pith reviewed 2026-07-01 01:36 UTC · model grok-4.3
The pith
Minimizing expected cost only over safe trajectories removes the incentive for joint chance constraint violations in stochastic optimal control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Restricting the cost objective to safe trajectories only yields policies that respect the joint chance constraint without inviting violations; the resulting problem admits an exact dynamic-programming solution once the state is augmented to encode the chance constraint as a constrained Markov decision process.
What carries the argument
Augmented state space that records whether the joint chance constraint has been satisfied so far, converting the original problem into a constrained Markov decision process solvable by dynamic programming.
If this is right
- The safe-only objective produces policies whose safety level can be bounded under gridding approximations of continuous state spaces.
- Dynamic programming on the augmented state yields the exact optimum for the safe-cost problem in finite discrete settings.
- The formulation applies directly to finite-horizon reach-avoid tasks in environments with probabilistic obstacles.
- Reinforcement-learning approximations remain feasible once the constrained-MDP structure is recognized.
Where Pith is reading between the lines
- The same augmentation trick could be combined with model-predictive control to obtain receding-horizon versions that retain the safety guarantee.
- In multi-agent settings the approach might generalize by treating joint safety across agents as an additional dimension in the augmented state.
- When unsafe trajectories carry domain-specific costs that are still worth modeling, a weighted combination of the two objectives could be explored as a direct extension.
Load-bearing premise
That the cost of unsafe trajectories can be ignored without changing what counts as optimal for the safe trajectories, and that the state augmentation keeps the problem Markovian enough for dynamic programming to recover the desired solution.
What would settle it
On a stochastic system whose unsafe trajectories have measurably lower cost than safe ones, compare the empirical violation rate of the policy obtained by minimizing total expected cost versus the policy obtained by minimizing expected cost only on safe trajectories; a statistically higher violation rate for the first policy would confirm the claimed incentive difference.
read the original abstract
We consider the finite-time optimal control of stochastic systems subject to a probabilistic constraint on the trajectories' safety. Such formulations are known as joint chance constrained optimal control problems. The common practice is to jointly minimise the expected cost of all trajectories, safe and unsafe. This leads to policies which invite constraint violations to exploit low-cost unsafe trajectories. When constraints represent states of critical failure, such behaviour is undesirable. We demonstrate that this behaviour can be overcome by only minimizing the expected cost of safe trajectories. The underlying rationale follows a practical intuition: In many applications, the cost incurred by unsafe trajectories is irrelevant (e.g., the battery usage of a crashed quadcopter), and one is usually interested in minimizing the cost of trajectories that are safe. We show that this problem can be cast as a constrained Markov Decision Process over an augmented state space. This allows solving it via dynamic programming. We derive bounds on the policies' safety under errors resulting from gridding approximations when the system's state space is continuous. Finally, we empirically compare dynamic programming as well as reinforcement learning solutions on a simulated 2D unicycle system in cluttered reach-avoid environments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper considers finite-horizon joint chance-constrained optimal control of stochastic systems. It argues that the standard objective of minimizing expected cost over all trajectories incentivizes unsafe behavior when unsafe paths have low cost. Instead, it proposes minimizing the expected cost conditional on safety (i.e., E[cost · 1_safe]). This objective is shown to be equivalent to a constrained Markov decision process on an augmented state that tracks a binary safety flag; the joint chance constraint becomes a terminal constraint on the probability of the safe flag. The resulting CMDP is solved by dynamic programming. Bounds are derived on the safety violation probability induced by state-space gridding approximations. The approach is illustrated on a 2D unicycle reach-avoid task using both exact DP and RL.
Significance. If the reformulation and gridding bounds are correct, the work supplies a principled objective that aligns optimization with the cost of safe operation only, which is practically relevant when unsafe trajectories (e.g., crashes) incur irrelevant costs. Casting the problem as a standard CMDP permits reuse of existing DP and RL algorithms, while the explicit safety bounds address a central implementation issue for continuous-state systems. The unicycle experiments provide concrete evidence of applicability. These elements together strengthen the case for the proposed objective over conventional joint-chance formulations.
major comments (2)
- [Section deriving the CMDP formulation] The central claim that the augmented-state CMDP exactly recovers the original joint-chance problem rests on the construction of the safety flag and the terminal constraint E[1_safe] ≥ 1-δ. The manuscript should explicitly verify (in the section deriving the CMDP) that the flag update preserves the Markov property for arbitrary stochastic dynamics and that no additional history dependence is introduced.
- [Section on gridding bounds] The safety bounds under gridding (mentioned in the abstract) are load-bearing for any continuous-state guarantee. The proof must state the precise dependence on grid resolution, the Lipschitz constants or other regularity assumptions on the transition kernel, and whether the bound holds uniformly over policies or only for the optimal policy.
minor comments (2)
- [Empirical evaluation section] The unicycle experiments would be strengthened by reporting the exact grid resolution used for DP, the number of Monte-Carlo rollouts for safety estimation, and a direct comparison of achieved cost versus achieved safety probability against a standard (non-augmented) chance-constrained baseline.
- [Notation and preliminaries] Notation for the augmented state and the indicator 1_safe should be introduced once, early, and used consistently; currently the abstract and later sections appear to employ slightly different symbols for the same quantity.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The comments identify opportunities to strengthen the clarity of the CMDP derivation and the gridding analysis. We address each major comment below.
read point-by-point responses
-
Referee: [Section deriving the CMDP formulation] The central claim that the augmented-state CMDP exactly recovers the original joint-chance problem rests on the construction of the safety flag and the terminal constraint E[1_safe] ≥ 1-δ. The manuscript should explicitly verify (in the section deriving the CMDP) that the flag update preserves the Markov property for arbitrary stochastic dynamics and that no additional history dependence is introduced.
Authors: We agree that an explicit verification paragraph will improve readability. The safety flag is updated deterministically from the current augmented state (original state plus flag) and the next state according to whether the next state satisfies the per-step safety predicate; because this update is a fixed function of the current augmented state and the realized next state, the transition kernel on the augmented space remains Markovian for any stochastic dynamics of the original system and introduces no additional history dependence. In the revised manuscript we will insert a short dedicated paragraph immediately after the flag-update definition that states this argument formally. revision: yes
-
Referee: [Section on gridding bounds] The safety bounds under gridding (mentioned in the abstract) are load-bearing for any continuous-state guarantee. The proof must state the precise dependence on grid resolution, the Lipschitz constants or other regularity assumptions on the transition kernel, and whether the bound holds uniformly over policies or only for the optimal policy.
Authors: The existing proof assumes the transition kernel is Lipschitz continuous with constant L and derives a safety-violation bound that scales linearly with grid resolution Δ and horizon length T (specifically O(Δ L T)). The argument proceeds via a uniform bound on the approximation error of the value-function iteration and therefore holds for every policy, not merely the optimal one. We will revise the gridding section to state these assumptions, the explicit functional dependence on Δ and L, and the uniformity over policies at the beginning of the proof. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper redefines the objective as the expected cost only over safe trajectories (removing unsafe contributions by construction), augments the state space with a binary safety indicator to convert the joint chance constraint into a standard terminal probability constraint E[1_safe] ≥ 1-δ, and applies the known constrained-MDP dynamic programming template. This reduction is explicit and does not rely on fitted parameters renamed as predictions, self-citation chains for uniqueness, or ansatzes smuggled from prior work. The gridding error bounds are derived separately from the DP solution and the unicycle reach-avoid example serves as an independent numerical check. No load-bearing step reduces to its own inputs by definition.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Stochastic system dynamics admit a Markov decision process representation.
- domain assumption Joint chance constraint can be encoded via state augmentation without loss of optimality for the safe-cost objective.
Reference graph
Works this paper leans on
-
[1]
A survey of autonomous driving: Common practices and emerging technologies,
E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, “A survey of autonomous driving: Common practices and emerging technologies,” IEEE access, vol. 8, pp. 58 443–58 469, 2020
2020
-
[2]
Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,
M. H. Rahman, M. M. Gulzar, T. S. Haque, S. Habib, A. Shakoor, and A. F. Murtaza, “Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,”Engineering Science and Technology, an International Journal, vol. 64, p. 101950, 2025
2025
-
[3]
Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,
C. Shang, W.-H. Chen, A. D. Stroock, and F. You, “Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,”Transactions on Control Systems Technology, vol. 28, no. 4, pp. 1493–1504, 2019
2019
-
[4]
V olume control of low-cost ventilator with automatic set-point adaptation,
L. Hewing, M. Menner, N. Tachatos, M. S. Daners, C. Du Pasquier, T. S. Lumpe, K. Shea, A. Carron, and M. N. Zeilinger, “V olume control of low-cost ventilator with automatic set-point adaptation,” inEuropean Control Conference, 2021, pp. 781–786
2021
-
[5]
Stochastic linear model predictive control with chance constraints–a review,
M. Farina, L. Giulioni, and R. Scattolini, “Stochastic linear model predictive control with chance constraints–a review,”Journal of Process Control, vol. 44, pp. 53–67, 2016
2016
-
[6]
Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,
K. Wang and S. Gros, “Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,” inConference on Decision and Control, 2021. 8
2021
-
[7]
Stochastic nonlinear model predictive control with joint chance constraints,
V . A. Bavdekar and A. Mesbah, “Stochastic nonlinear model predictive control with joint chance constraints,”IF AC-PapersOnLine, 2016
2016
-
[8]
Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,
V . Raghuraman and J. P. Koeln, “Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,”Control Systems Letters, vol. 7, pp. 865–870, 2022
2022
-
[9]
Stochastic model predictive control with joint chance constraints,
J. A. Paulson, E. A. Buehler, R. D. Braatz, and A. Mesbah, “Stochastic model predictive control with joint chance constraints,”International Journal of Control, vol. 93, no. 1, pp. 126–139, 2020
2020
-
[10]
Joint chance-constrained model predictive control with prob- abilistic resolvability,
M. Ono, “Joint chance-constrained model predictive control with prob- abilistic resolvability,” inAmerican Control Conference, 2012
2012
-
[11]
Comput- ing optimal joint chance constrained control policies,
N. Schmid, M. Fochesato, S. H. Li, T. Sutter, and J. Lygeros, “Comput- ing optimal joint chance constrained control policies,”Transactions on Automatic Control, vol. 70, no. 7, pp. 4904–4911, 2025
2025
-
[12]
Chance-constrained dynamic programming with application to risk-aware robotic space exploration,
Y . K. M. Ono, M. Pavone and J. Balaram, “Chance-constrained dynamic programming with application to risk-aware robotic space exploration,” Autonomous Robots, vol. 39, pp. 555–571, 2015
2015
-
[13]
Solving mission-wide chance-constrained opti- mal control using dynamic programming,
K. Wang and S. Gros, “Solving mission-wide chance-constrained opti- mal control using dynamic programming,” inConference on Decision and Control, 2022, pp. 2947–2952
2022
-
[14]
Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,
E. M. Hahn, V . Hashemi, H. Hermanns, M. Lahijanian, and A. Turrini, “Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,”Transactions on Modeling and Computer Simulation, vol. 29, no. 4, pp. 1–31, 2019
2019
-
[15]
Joint chance constrained optimal control via linear programming,
N. Schmid, M. Fochesato, T. Sutter, and J. Lygeros, “Joint chance constrained optimal control via linear programming,”Control Systems Letters, vol. 8, pp. 736–741, 2024
2024
-
[16]
Formal multi-objective synthe- sis of continuous-state MDPs,
S. Haesaert, P. Nilsson, and S. Soudjani, “Formal multi-objective synthe- sis of continuous-state MDPs,” inAmerican Control Conference, 2021, pp. 3428–3433
2021
-
[17]
Multi- objective model checking of Markov decision processes,
K. Etessami, M. Kwiatkowska, M. Y . Vardi, and M. Yannakakis, “Multi- objective model checking of Markov decision processes,”Tools and Algorithms for the Construction and Analysis of Systems, vol. 4, pp. 50–65, 2008
2008
-
[18]
Operator splitting for convex constrained Markov decision processes,
P. D. Grontas, A. Tsiamis, and J. Lygeros, “Operator splitting for convex constrained Markov decision processes,”arXiv preprint arXiv:2412.14002, 2024
-
[19]
Policy gradients for probabilistic constrained reinforcement learning,
W. Chen, D. Subramanian, and S. Paternain, “Policy gradients for probabilistic constrained reinforcement learning,” inConference on Information Sciences and Systems, 2023, pp. 1–6
2023
-
[20]
M. Xu, Z. Liu, P. Huang, W. Ding, Z. Cen, B. Li, and D. Zhao, “Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,”arXiv:2209.08025, 2022
-
[21]
First order constrained optimization in policy space,
Y . Zhang, Q. Vuong, and K. Ross, “First order constrained optimization in policy space,”Conference on Neural Information Processing Systems, vol. 33, pp. 15 338–15 349, 2020
2020
-
[22]
Natural policy gradient primal-dual method for constrained Markov decision processes,
D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained Markov decision processes,”Confer- ence on Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020
2020
-
[23]
Reward constrained policy optimization,
C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” inInternational Conference on Learning Representations, 2018
2018
-
[24]
A learning-based approach to stochastic op- timal control under reach-avoid constraint,
T. Ni and M. Kamgarpour, “A learning-based approach to stochastic op- timal control under reach-avoid constraint,” inInternational Conference on Hybrid Systems: Computation and Control, 2025, pp. 1–8
2025
-
[25]
P. Mestres, B. Werner, R. K. Cosner, and A. D. Ames, “Probabilistic control barrier functions: Safety in probability for discrete-time stochas- tic systems,”arXiv preprint arXiv:2510.01501, 2025
-
[26]
Hernández-Lerma and J
O. Hernández-Lerma and J. B. Lasserre,Discrete-time Markov control processes: basic optimality criteria, ser. Applications of Mathematics. Springer Science & Business Media, 2012, vol. 30
2012
-
[27]
Altman,Constrained Markov decision processes
E. Altman,Constrained Markov decision processes. Routledge, 2021
2021
-
[28]
Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,
A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008
2008
-
[29]
Approximate model checking of stochastic hybrid systems,
A. Abate, J.-P. Katoen, J. Lygeros, and M. Prandini, “Approximate model checking of stochastic hybrid systems,”European Journal of Control, vol. 16, no. 6, pp. 624–641, 2010
2010
-
[30]
Self-punishment and reward backfill for deep q-learning,
M. R. Bonyadi, R. Wang, and M. Ziaei, “Self-punishment and reward backfill for deep q-learning,”Transactions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 8086–8093, 2022
2022
-
[31]
Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,
T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational Conference on Machine Learning, 2018, pp. 1861–1870
2018
-
[32]
all-rl-algorithms,
F. Khan, “all-rl-algorithms,” https://github.com/fareedkhan-dev/ all-rl-algorithms, 2024, accessed: 26.06.2026
2024
-
[33]
A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepelet al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,”Science, vol. 362, no. 6419, pp. 1140–1144, 2018
2018
-
[34]
Learning robust perceptive locomotion for quadrupedal robots in the wild,
T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science robotics, vol. 7, no. 62, p. eabk2822, 2022
2022
-
[35]
Nonatomic total rewards Markov decision processes with multiple criteria,
E. A. Feinberg and A. B. Piunovskiy, “Nonatomic total rewards Markov decision processes with multiple criteria,”Journal of Mathematical Analysis and Applications, vol. 273, no. 1, pp. 93–111, 2002. APPENDIX The following sections contain the proofs of our main The- orems III.1 and III.2, as well as Proposition III.3. Throughout, we will integrate over his...
2002
-
[36]
Ifπ z is optimal for (5), then its adaptation(π, a)to (2) optimally solves (2)
-
[37]
gN+1(zN+1) + NX k=0 gk(zk, uk) # (25a) s.t.E π z0
If(π, a)is optimal for (2), then its adaptationπ z to (5) optimally solves (5). Proof.Letp z 0:N+1 be the occupation measure onH z 0:N+1 in- duced byπ z onM z, andp 0:N the occupation measure induced onH 0:N byπonM. For notational simplicity, we write L(hN) =ℓ N(xN) +PN−1 k=0 ℓk(xk, uk). As a consequence of 10 Lemma A.2, by a change of variablesh z N =ψ(h...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.