pith. sign in

arxiv: 2606.30829 · v1 · pith:5YY6GCEFnew · submitted 2026-06-29 · 🧮 math.OC · cs.SY· eess.SY

Joint Chance Constrained Safe-Optimal Control

Pith reviewed 2026-07-01 01:36 UTC · model grok-4.3

classification 🧮 math.OC cs.SYeess.SY
keywords joint chance constraintssafe optimal controlconstrained Markov decision processdynamic programmingstochastic systemsreach-avoidgridding approximationsunicycle system
0
0 comments X

The pith

Minimizing expected cost only over safe trajectories removes the incentive for joint chance constraint violations in stochastic optimal control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard joint chance constrained control minimizes expected cost across all trajectories, which can encourage policies to exploit low-cost unsafe paths that violate safety requirements. The paper shows that restricting the objective to the expected cost of safe trajectories alone overcomes this incentive while preserving relevance for applications where unsafe costs do not matter. The reformulated problem is expressed as a constrained Markov decision process on an augmented state space that tracks cumulative safety information. Dynamic programming then solves for the safe-optimal policy, and explicit bounds quantify how state-space gridding affects the achieved safety level. Simulations on a 2D unicycle in cluttered reach-avoid settings illustrate the distinction between the two formulations.

Core claim

Restricting the cost objective to safe trajectories only yields policies that respect the joint chance constraint without inviting violations; the resulting problem admits an exact dynamic-programming solution once the state is augmented to encode the chance constraint as a constrained Markov decision process.

What carries the argument

Augmented state space that records whether the joint chance constraint has been satisfied so far, converting the original problem into a constrained Markov decision process solvable by dynamic programming.

If this is right

  • The safe-only objective produces policies whose safety level can be bounded under gridding approximations of continuous state spaces.
  • Dynamic programming on the augmented state yields the exact optimum for the safe-cost problem in finite discrete settings.
  • The formulation applies directly to finite-horizon reach-avoid tasks in environments with probabilistic obstacles.
  • Reinforcement-learning approximations remain feasible once the constrained-MDP structure is recognized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation trick could be combined with model-predictive control to obtain receding-horizon versions that retain the safety guarantee.
  • In multi-agent settings the approach might generalize by treating joint safety across agents as an additional dimension in the augmented state.
  • When unsafe trajectories carry domain-specific costs that are still worth modeling, a weighted combination of the two objectives could be explored as a direct extension.

Load-bearing premise

That the cost of unsafe trajectories can be ignored without changing what counts as optimal for the safe trajectories, and that the state augmentation keeps the problem Markovian enough for dynamic programming to recover the desired solution.

What would settle it

On a stochastic system whose unsafe trajectories have measurably lower cost than safe ones, compare the empirical violation rate of the policy obtained by minimizing total expected cost versus the policy obtained by minimizing expected cost only on safe trajectories; a statistically higher violation rate for the first policy would confirm the claimed incentive difference.

read the original abstract

We consider the finite-time optimal control of stochastic systems subject to a probabilistic constraint on the trajectories' safety. Such formulations are known as joint chance constrained optimal control problems. The common practice is to jointly minimise the expected cost of all trajectories, safe and unsafe. This leads to policies which invite constraint violations to exploit low-cost unsafe trajectories. When constraints represent states of critical failure, such behaviour is undesirable. We demonstrate that this behaviour can be overcome by only minimizing the expected cost of safe trajectories. The underlying rationale follows a practical intuition: In many applications, the cost incurred by unsafe trajectories is irrelevant (e.g., the battery usage of a crashed quadcopter), and one is usually interested in minimizing the cost of trajectories that are safe. We show that this problem can be cast as a constrained Markov Decision Process over an augmented state space. This allows solving it via dynamic programming. We derive bounds on the policies' safety under errors resulting from gridding approximations when the system's state space is continuous. Finally, we empirically compare dynamic programming as well as reinforcement learning solutions on a simulated 2D unicycle system in cluttered reach-avoid environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper considers finite-horizon joint chance-constrained optimal control of stochastic systems. It argues that the standard objective of minimizing expected cost over all trajectories incentivizes unsafe behavior when unsafe paths have low cost. Instead, it proposes minimizing the expected cost conditional on safety (i.e., E[cost · 1_safe]). This objective is shown to be equivalent to a constrained Markov decision process on an augmented state that tracks a binary safety flag; the joint chance constraint becomes a terminal constraint on the probability of the safe flag. The resulting CMDP is solved by dynamic programming. Bounds are derived on the safety violation probability induced by state-space gridding approximations. The approach is illustrated on a 2D unicycle reach-avoid task using both exact DP and RL.

Significance. If the reformulation and gridding bounds are correct, the work supplies a principled objective that aligns optimization with the cost of safe operation only, which is practically relevant when unsafe trajectories (e.g., crashes) incur irrelevant costs. Casting the problem as a standard CMDP permits reuse of existing DP and RL algorithms, while the explicit safety bounds address a central implementation issue for continuous-state systems. The unicycle experiments provide concrete evidence of applicability. These elements together strengthen the case for the proposed objective over conventional joint-chance formulations.

major comments (2)
  1. [Section deriving the CMDP formulation] The central claim that the augmented-state CMDP exactly recovers the original joint-chance problem rests on the construction of the safety flag and the terminal constraint E[1_safe] ≥ 1-δ. The manuscript should explicitly verify (in the section deriving the CMDP) that the flag update preserves the Markov property for arbitrary stochastic dynamics and that no additional history dependence is introduced.
  2. [Section on gridding bounds] The safety bounds under gridding (mentioned in the abstract) are load-bearing for any continuous-state guarantee. The proof must state the precise dependence on grid resolution, the Lipschitz constants or other regularity assumptions on the transition kernel, and whether the bound holds uniformly over policies or only for the optimal policy.
minor comments (2)
  1. [Empirical evaluation section] The unicycle experiments would be strengthened by reporting the exact grid resolution used for DP, the number of Monte-Carlo rollouts for safety estimation, and a direct comparison of achieved cost versus achieved safety probability against a standard (non-augmented) chance-constrained baseline.
  2. [Notation and preliminaries] Notation for the augmented state and the indicator 1_safe should be introduced once, early, and used consistently; currently the abstract and later sections appear to employ slightly different symbols for the same quantity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments identify opportunities to strengthen the clarity of the CMDP derivation and the gridding analysis. We address each major comment below.

read point-by-point responses
  1. Referee: [Section deriving the CMDP formulation] The central claim that the augmented-state CMDP exactly recovers the original joint-chance problem rests on the construction of the safety flag and the terminal constraint E[1_safe] ≥ 1-δ. The manuscript should explicitly verify (in the section deriving the CMDP) that the flag update preserves the Markov property for arbitrary stochastic dynamics and that no additional history dependence is introduced.

    Authors: We agree that an explicit verification paragraph will improve readability. The safety flag is updated deterministically from the current augmented state (original state plus flag) and the next state according to whether the next state satisfies the per-step safety predicate; because this update is a fixed function of the current augmented state and the realized next state, the transition kernel on the augmented space remains Markovian for any stochastic dynamics of the original system and introduces no additional history dependence. In the revised manuscript we will insert a short dedicated paragraph immediately after the flag-update definition that states this argument formally. revision: yes

  2. Referee: [Section on gridding bounds] The safety bounds under gridding (mentioned in the abstract) are load-bearing for any continuous-state guarantee. The proof must state the precise dependence on grid resolution, the Lipschitz constants or other regularity assumptions on the transition kernel, and whether the bound holds uniformly over policies or only for the optimal policy.

    Authors: The existing proof assumes the transition kernel is Lipschitz continuous with constant L and derives a safety-violation bound that scales linearly with grid resolution Δ and horizon length T (specifically O(Δ L T)). The argument proceeds via a uniform bound on the approximation error of the value-function iteration and therefore holds for every policy, not merely the optimal one. We will revise the gridding section to state these assumptions, the explicit functional dependence on Δ and L, and the uniformity over policies at the beginning of the proof. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper redefines the objective as the expected cost only over safe trajectories (removing unsafe contributions by construction), augments the state space with a binary safety indicator to convert the joint chance constraint into a standard terminal probability constraint E[1_safe] ≥ 1-δ, and applies the known constrained-MDP dynamic programming template. This reduction is explicit and does not rely on fitted parameters renamed as predictions, self-citation chains for uniqueness, or ansatzes smuggled from prior work. The gridding error bounds are derived separately from the DP solution and the unicycle reach-avoid example serves as an independent numerical check. No load-bearing step reduces to its own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard stochastic control modeling assumptions and the new objective definition; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Stochastic system dynamics admit a Markov decision process representation.
    Invoked when casting the problem as a constrained MDP.
  • domain assumption Joint chance constraint can be encoded via state augmentation without loss of optimality for the safe-cost objective.
    Central to the reduction to dynamic programming.

pith-pipeline@v0.9.1-grok · 5745 in / 1263 out tokens · 40475 ms · 2026-07-01T01:36:42.945012+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 3 canonical work pages

  1. [1]

    A survey of autonomous driving: Common practices and emerging technologies,

    E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, “A survey of autonomous driving: Common practices and emerging technologies,” IEEE access, vol. 8, pp. 58 443–58 469, 2020

  2. [2]

    Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,

    M. H. Rahman, M. M. Gulzar, T. S. Haque, S. Habib, A. Shakoor, and A. F. Murtaza, “Trajectory planning and tracking control in autonomous driving system: Leveraging machine learning and advanced control algo- rithms,”Engineering Science and Technology, an International Journal, vol. 64, p. 101950, 2025

  3. [3]

    Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,

    C. Shang, W.-H. Chen, A. D. Stroock, and F. You, “Robust model pre- dictive control of irrigation systems with active uncertainty learning and data analytics,”Transactions on Control Systems Technology, vol. 28, no. 4, pp. 1493–1504, 2019

  4. [4]

    V olume control of low-cost ventilator with automatic set-point adaptation,

    L. Hewing, M. Menner, N. Tachatos, M. S. Daners, C. Du Pasquier, T. S. Lumpe, K. Shea, A. Carron, and M. N. Zeilinger, “V olume control of low-cost ventilator with automatic set-point adaptation,” inEuropean Control Conference, 2021, pp. 781–786

  5. [5]

    Stochastic linear model predictive control with chance constraints–a review,

    M. Farina, L. Giulioni, and R. Scattolini, “Stochastic linear model predictive control with chance constraints–a review,”Journal of Process Control, vol. 44, pp. 53–67, 2016

  6. [6]

    Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,

    K. Wang and S. Gros, “Recursive feasibility of stochastic model predic- tive control with mission-wide probabilistic constraints,” inConference on Decision and Control, 2021. 8

  7. [7]

    Stochastic nonlinear model predictive control with joint chance constraints,

    V . A. Bavdekar and A. Mesbah, “Stochastic nonlinear model predictive control with joint chance constraints,”IF AC-PapersOnLine, 2016

  8. [8]

    Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,

    V . Raghuraman and J. P. Koeln, “Long duration stochastic MPC with mission-wide probabilistic constraints using waysets,”Control Systems Letters, vol. 7, pp. 865–870, 2022

  9. [9]

    Stochastic model predictive control with joint chance constraints,

    J. A. Paulson, E. A. Buehler, R. D. Braatz, and A. Mesbah, “Stochastic model predictive control with joint chance constraints,”International Journal of Control, vol. 93, no. 1, pp. 126–139, 2020

  10. [10]

    Joint chance-constrained model predictive control with prob- abilistic resolvability,

    M. Ono, “Joint chance-constrained model predictive control with prob- abilistic resolvability,” inAmerican Control Conference, 2012

  11. [11]

    Comput- ing optimal joint chance constrained control policies,

    N. Schmid, M. Fochesato, S. H. Li, T. Sutter, and J. Lygeros, “Comput- ing optimal joint chance constrained control policies,”Transactions on Automatic Control, vol. 70, no. 7, pp. 4904–4911, 2025

  12. [12]

    Chance-constrained dynamic programming with application to risk-aware robotic space exploration,

    Y . K. M. Ono, M. Pavone and J. Balaram, “Chance-constrained dynamic programming with application to risk-aware robotic space exploration,” Autonomous Robots, vol. 39, pp. 555–571, 2015

  13. [13]

    Solving mission-wide chance-constrained opti- mal control using dynamic programming,

    K. Wang and S. Gros, “Solving mission-wide chance-constrained opti- mal control using dynamic programming,” inConference on Decision and Control, 2022, pp. 2947–2952

  14. [14]

    Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,

    E. M. Hahn, V . Hashemi, H. Hermanns, M. Lahijanian, and A. Turrini, “Interval Markov decision processes with multiple objectives: From robust strategies to Pareto curves,”Transactions on Modeling and Computer Simulation, vol. 29, no. 4, pp. 1–31, 2019

  15. [15]

    Joint chance constrained optimal control via linear programming,

    N. Schmid, M. Fochesato, T. Sutter, and J. Lygeros, “Joint chance constrained optimal control via linear programming,”Control Systems Letters, vol. 8, pp. 736–741, 2024

  16. [16]

    Formal multi-objective synthe- sis of continuous-state MDPs,

    S. Haesaert, P. Nilsson, and S. Soudjani, “Formal multi-objective synthe- sis of continuous-state MDPs,” inAmerican Control Conference, 2021, pp. 3428–3433

  17. [17]

    Multi- objective model checking of Markov decision processes,

    K. Etessami, M. Kwiatkowska, M. Y . Vardi, and M. Yannakakis, “Multi- objective model checking of Markov decision processes,”Tools and Algorithms for the Construction and Analysis of Systems, vol. 4, pp. 50–65, 2008

  18. [18]

    Operator splitting for convex constrained Markov decision processes,

    P. D. Grontas, A. Tsiamis, and J. Lygeros, “Operator splitting for convex constrained Markov decision processes,”arXiv preprint arXiv:2412.14002, 2024

  19. [19]

    Policy gradients for probabilistic constrained reinforcement learning,

    W. Chen, D. Subramanian, and S. Paternain, “Policy gradients for probabilistic constrained reinforcement learning,” inConference on Information Sciences and Systems, 2023, pp. 1–6

  20. [20]

    Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,

    M. Xu, Z. Liu, P. Huang, W. Ding, Z. Cen, B. Li, and D. Zhao, “Trustworthy reinforcement learning against intrinsic vulnerabilities: Robustness, safety, and generalizability,”arXiv:2209.08025, 2022

  21. [21]

    First order constrained optimization in policy space,

    Y . Zhang, Q. Vuong, and K. Ross, “First order constrained optimization in policy space,”Conference on Neural Information Processing Systems, vol. 33, pp. 15 338–15 349, 2020

  22. [22]

    Natural policy gradient primal-dual method for constrained Markov decision processes,

    D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained Markov decision processes,”Confer- ence on Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020

  23. [23]

    Reward constrained policy optimization,

    C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” inInternational Conference on Learning Representations, 2018

  24. [24]

    A learning-based approach to stochastic op- timal control under reach-avoid constraint,

    T. Ni and M. Kamgarpour, “A learning-based approach to stochastic op- timal control under reach-avoid constraint,” inInternational Conference on Hybrid Systems: Computation and Control, 2025, pp. 1–8

  25. [25]

    Probabilistic control barrier functions: Safety in probability for discrete-time stochas- tic systems,

    P. Mestres, B. Werner, R. K. Cosner, and A. D. Ames, “Probabilistic control barrier functions: Safety in probability for discrete-time stochas- tic systems,”arXiv preprint arXiv:2510.01501, 2025

  26. [26]

    Hernández-Lerma and J

    O. Hernández-Lerma and J. B. Lasserre,Discrete-time Markov control processes: basic optimality criteria, ser. Applications of Mathematics. Springer Science & Business Media, 2012, vol. 30

  27. [27]

    Altman,Constrained Markov decision processes

    E. Altman,Constrained Markov decision processes. Routledge, 2021

  28. [28]

    Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,

    A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reacha- bility and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008

  29. [29]

    Approximate model checking of stochastic hybrid systems,

    A. Abate, J.-P. Katoen, J. Lygeros, and M. Prandini, “Approximate model checking of stochastic hybrid systems,”European Journal of Control, vol. 16, no. 6, pp. 624–641, 2010

  30. [30]

    Self-punishment and reward backfill for deep q-learning,

    M. R. Bonyadi, R. Wang, and M. Ziaei, “Self-punishment and reward backfill for deep q-learning,”Transactions on Neural Networks and Learning Systems, vol. 34, no. 10, pp. 8086–8093, 2022

  31. [31]

    Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,

    T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off- policy maximum entropy deep reinforcement learning with a stochastic actor,” inInternational Conference on Machine Learning, 2018, pp. 1861–1870

  32. [32]

    all-rl-algorithms,

    F. Khan, “all-rl-algorithms,” https://github.com/fareedkhan-dev/ all-rl-algorithms, 2024, accessed: 26.06.2026

  33. [33]

    A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,

    D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepelet al., “A general reinforcement learning algorithm that masters chess, shogi, and go through self-play,”Science, vol. 362, no. 6419, pp. 1140–1144, 2018

  34. [34]

    Learning robust perceptive locomotion for quadrupedal robots in the wild,

    T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,”Science robotics, vol. 7, no. 62, p. eabk2822, 2022

  35. [35]

    Nonatomic total rewards Markov decision processes with multiple criteria,

    E. A. Feinberg and A. B. Piunovskiy, “Nonatomic total rewards Markov decision processes with multiple criteria,”Journal of Mathematical Analysis and Applications, vol. 273, no. 1, pp. 93–111, 2002. APPENDIX The following sections contain the proofs of our main The- orems III.1 and III.2, as well as Proposition III.3. Throughout, we will integrate over his...

  36. [36]

    Ifπ z is optimal for (5), then its adaptation(π, a)to (2) optimally solves (2)

  37. [37]

    gN+1(zN+1) + NX k=0 gk(zk, uk) # (25a) s.t.E π z0

    If(π, a)is optimal for (2), then its adaptationπ z to (5) optimally solves (5). Proof.Letp z 0:N+1 be the occupation measure onH z 0:N+1 in- duced byπ z onM z, andp 0:N the occupation measure induced onH 0:N byπonM. For notational simplicity, we write L(hN) =ℓ N(xN) +PN−1 k=0 ℓk(xk, uk). As a consequence of 10 Lemma A.2, by a change of variablesh z N =ψ(h...