pith. machine review for the scientific record.

arxiv: 2605.05989 · v1 · submitted 2026-05-07 · 🧮 math.OC

Verifiable Model-Free Safety Filters via Reinforcement Learning

Pith reviewed 2026-05-08 08:29 UTC · model grok-4.3

classification 🧮 math.OC
keywords safety filters · reinforcement learning · quadratic programming · model-free control · formal certificates · control theory · verifiable safety

The pith

Learning quadratic programming parameters via reinforcement learning yields a model-free safety filter with formal persistent safety certificates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a safety filter that does not require an accurate model of the controlled system. Instead of deriving the filter parameters from a model, it trains them using reinforcement learning on interaction data. The filter is structured as a quadratic programming problem solved by an unrolled network, which preserves the ability to prove that safety constraints are never violated over time. A reader would care because this removes a major practical barrier to using safety filters in complex or poorly modeled environments while retaining mathematical guarantees that plain neural network controllers lack. Tests indicate the approach uses less computation and intervenes less often than alternatives while improving safety metrics.

Core claim

By casting the safety filter as an unrolled quadratic programming solver and learning its parameters end-to-end with deep reinforcement learning, the method obtains a controller that operates without a system model yet still admits a formal certificate that the closed-loop trajectory satisfies safety constraints at every step.
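
What such a certificate looks like in the standard discrete-time control barrier function setting, sketched here in generic notation rather than quoted from the paper:

```latex
% Generic discrete-time CBF certificate (standard form; notation ours,
% not quoted from the paper). Safe set: S = {x : h(x) >= 0}.
% If the filtered input enforces, at every step k and for some gamma in (0,1],
\[
  h(x_{k+1}) \;\ge\; (1-\gamma)\, h(x_k),
\]
% then induction gives
\[
  h(x_k) \;\ge\; (1-\gamma)^k\, h(x_0) \;\ge\; 0
  \quad \text{whenever } h(x_0) \ge 0,
\]
% so S is forward invariant: safety holds at every step of the closed-loop
% trajectory. The learned QP's job is to guarantee the per-step inequality;
% induction then delivers persistence.
```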

What carries the argument

The unrolled quadratic programming solver network, whose parameters are tuned by reinforcement learning to encode safety constraints directly from data.
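
A minimal sketch of what such an unrolled QP solver network can look like, assuming a PyTorch-style module, Chambolle-Pock (primal-dual hybrid gradient) iterations, and a minimal-intervention objective; the class name, shapes, and solver choice are our illustration, not the paper's exact architecture:

```python
import torch

class UnrolledQPFilter(torch.nn.Module):
    """Unrolled QP safety filter (illustrative sketch, not the paper's code).

    Solves  min_u 0.5*(u - u_nom)' H (u - u_nom)  s.t.  A u <= b
    by a fixed number of primal-dual iterations, so that H, A, b stay
    exposed as learnable parameters for end-to-end RL training.
    """

    def __init__(self, m: int, n_con: int, iters: int = 30,
                 tau: float = 0.1, sigma: float = 0.1):
        super().__init__()
        # H = L L' + eps*I keeps the Hessian positive definite by construction.
        self.L = torch.nn.Parameter(torch.eye(m))
        self.A = torch.nn.Parameter(0.1 * torch.randn(n_con, m))
        self.b = torch.nn.Parameter(torch.ones(n_con))
        self.iters, self.tau, self.sigma = iters, tau, sigma

    def forward(self, u_nom: torch.Tensor) -> torch.Tensor:
        m = u_nom.shape[-1]
        H = self.L @ self.L.T + 1e-3 * torch.eye(m)
        u, u_bar = u_nom.clone(), u_nom.clone()
        y = torch.zeros(self.A.shape[0])          # multipliers for A u <= b
        prox_mat = torch.eye(m) + self.tau * H    # for the exact primal prox
        for _ in range(self.iters):               # each iteration is one "layer"
            # Dual ascent, projected onto y >= 0.
            y = torch.clamp(y + self.sigma * (self.A @ u_bar - self.b), min=0.0)
            # Exact prox step on the quadratic deviation cost.
            u_next = torch.linalg.solve(
                prox_mat, u - self.tau * (self.A.T @ y) + self.tau * (H @ u_nom))
            u_bar = 2.0 * u_next - u              # extrapolation
            u = u_next
        return u                                  # filtered (safe) control

# Hypothetical usage: filter a 2-dimensional nominal action.
filt = UnrolledQPFilter(m=2, n_con=4)
u_safe = filt(torch.tensor([1.0, -0.5]))
```

Because every layer is one solver iteration, the output remains an (approximate) solution of an explicit QP, which is what keeps certificate machinery available after training; the factorized parameterization of H enforces positive definiteness by construction, while constraint regularity still has to be checked post hoc.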

If this is right

  • The filter can be applied to systems where obtaining an accurate dynamic model is impractical or costly.
  • Formal safety proofs remain available after the learning process completes.
  • Per-step computation stays low because the quadratic program structure is retained rather than replaced by a generic neural net.
  • Minimal intervention is achieved by optimizing the filter to alter the nominal control action only when necessary for safety.
  • Overall performance exceeds both traditional model-based safety filters and standard reinforcement learning controllers in the reported metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This learning approach could be combined with other optimization layers to create verifiable policies for tasks beyond safety enforcement.
  • Training in simulation followed by direct transfer might be feasible if the certificate depends only on the learned parameters and not on model specifics.
  • Extending the method to continuous-time or hybrid systems would require analogous unrolling of the corresponding optimization problems.

Load-bearing premise

The safety certificate, originally derived under the assumption that the quadratic programming parameters come from a known system model, remains valid when those parameters are instead produced by reinforcement learning.

What would settle it

Finding a concrete system and learned parameter set where the quadratic programming certificate is satisfied yet the actual closed-loop behavior violates a safety constraint.

Figures

Figures reproduced from arXiv: 2605.05989 by Bihui Yin, Yilin Mo, Yiwen Lu, Yuchen Jiang.

Figure 1. Safety filter integrated in control architecture.
Figure 2. Proposed safety filter policy architecture (…).
Figure 3. Performance of the proposed RL+LQP safety filter on a double integrator tracking task under high-noise (…).
Original abstract

This paper presents a reinforcement learning approach of a model-free safety filter, drawing inspiration from the framework of model-based Predictive Safety Filters (PSFs). Similar to conventional PSFs, our method adopts a Quadratic Programming (QP) formulation by representing the filter as an unrolled QP solver network. However, unlike existing PSFs that derive QP parameters explicitly from system models, we learn these parameters directly through Deep Reinforcement Learning (DRL), thereby eliminating the dependency on accurate system identification. Furthermore, compared to traditional neural network-based methods, this QP structure allows us to furnish a formal certificate for the persistent safety of the learned filter. Numerical results demonstrate that our method outperforms both conventional model-based PSFs and RL-trained Multi-Layer Perceptron (MLP) baselines in terms of safety guarantees, minimal intervention, and per-step computational load.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a model-free safety filter by representing a predictive safety filter (PSF) as an unrolled quadratic programming (QP) solver whose parameters are learned end-to-end via deep reinforcement learning rather than derived from an explicit system model. It claims that the retained QP structure nevertheless permits a formal certificate of persistent safety, and reports numerical results showing reduced intervention, stronger safety, and lower per-step compute compared with both classical model-based PSFs and plain MLP-based RL baselines.

Significance. If the formal certificate can be shown to survive the replacement of model-derived QP parameters by RL-optimized ones, the approach would constitute a concrete step toward verifiable, model-free safety filters. The structural choice of an unrolled QP network is a strength that could aid both interpretability and certification; the reported computational and safety gains are practically relevant for real-time control.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (safety-certificate claim): the manuscript asserts that the QP structure 'allows us to furnish a formal certificate for the persistent safety of the learned filter,' yet supplies neither the explicit algebraic conditions that the learned parameters must satisfy nor a proof that RL optimization preserves those conditions. Standard PSF certificates rely on model-derived quantities (e.g., barrier gradients or constraint matrices) that guarantee CBF invariance; without an independent verification step or a derivation showing that the RL objective enforces the same algebraic relations, the certificate claim is unsupported.
  2. [§3.2] §3.2 (unrolled QP network and RL training): the reward function balances safety and performance, but the paper does not demonstrate that the resulting parameters continue to satisfy the linear independence or positive-definiteness conditions required for the QP to recover a valid safety filter. If these conditions are violated post-training, the formal certificate cannot be invoked. (The conditions in question are stated concretely after the minor comments below.)
minor comments (2)
  1. [Figure 3] Figure 3 (comparison plots): axis labels and legend entries are too small for print; enlarge or split into separate panels.
  2. [§3.1, Appendix] Notation: the symbol for the learned QP matrix is introduced inconsistently between §3.1 and the appendix; adopt a single definition.
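
For reference, the regularity conditions invoked in the major comments take the following standard CBF-QP form; the notation is ours, and the paper's exact statement is not quoted.

```latex
% Standard QP safety-filter form (notation ours, not the paper's):
\[
  u^\star \;=\; \arg\min_{u}\ \tfrac{1}{2}\,(u - u_{\mathrm{nom}})^\top H\,(u - u_{\mathrm{nom}})
  \quad \text{s.t.} \quad A u \le b .
\]
% Certificate-relevant regularity conditions:
\[
  H \succ 0 \qquad \text{(strictly convex objective, unique minimizer)},
\]
\[
  A_{\mathcal{I}(u^\star)} \ \text{has full row rank}
  \qquad \text{(rows active at } u^\star \text{ linearly independent, LICQ)}.
\]
```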

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments highlight important points regarding the rigor of our safety-certificate claim. We address each major comment below, agreeing where the manuscript is incomplete and outlining specific revisions that will strengthen the presentation without altering the core contribution.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (safety-certificate claim): the manuscript asserts that the QP structure 'allows us to furnish a formal certificate for the persistent safety of the learned filter,' yet supplies neither the explicit algebraic conditions that the learned parameters must satisfy nor a proof that RL optimization preserves those conditions. Standard PSF certificates rely on model-derived quantities (e.g., barrier gradients or constraint matrices) that guarantee CBF invariance; without an independent verification step or a derivation showing that the RL objective enforces the same algebraic relations, the certificate claim is unsupported.

    Authors: We agree that the current manuscript does not explicitly state the algebraic conditions on the learned QP parameters or provide a derivation showing that the RL objective preserves them. The formal certificate is inherited from standard PSF theory: persistent safety follows from CBF invariance provided the QP is well-posed (positive-definite Hessian and linearly independent active constraints). Our approach retains this structure, so the certificate applies whenever the learned parameters satisfy those conditions; the RL reward penalizes violations and thereby encourages feasible, safe behavior. However, we did not include an explicit derivation or post-training verification step. In the revised manuscript we will add a dedicated paragraph in §4 that (i) recalls the precise algebraic conditions required for the QP to define a valid CBF-based safety filter and (ii) describes a lightweight post-training check (eigenvalue test for positive-definiteness and rank test for constraint independence) that can be performed on the learned parameters. This clarification will make the certificate claim fully supported while preserving the model-free training procedure. revision: yes

  2. Referee: [§3.2] §3.2 (unrolled QP network and RL training): the reward function balances safety and performance, but the paper does not demonstrate that the resulting parameters continue to satisfy the linear independence or positive-definiteness conditions required for the QP to recover a valid safety filter. If these conditions are violated post-training, the formal certificate cannot be invoked.

    Authors: The referee is correct that §3.2 does not explicitly verify satisfaction of the QP regularity conditions after training. While the unrolled QP architecture guarantees that any output is the exact solution of the parameterized QP, the safety-filter interpretation requires the learned parameters to meet positive-definiteness and linear-independence requirements. In the reported experiments the learned filters achieved the stated safety performance, which is consistent with the conditions holding, yet we did not report explicit checks. We will revise §3.2 to state the conditions mathematically and add a short table (or paragraph) in the numerical-results section showing that, for every trained instance across the reported trials, the Hessian eigenvalues were positive and the constraint matrix had full row rank. These additions will allow readers to invoke the formal certificate with the same rigor as in model-based PSFs. revision: yes
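
The lightweight post-training check proposed in both responses could look like the following sketch (NumPy; the function name, inputs, and tolerance are hypothetical and illustrative, not the paper's):

```python
import numpy as np

def verify_learned_qp(H: np.ndarray, A_active: np.ndarray,
                      tol: float = 1e-8) -> bool:
    """Post-training regularity check for a learned QP safety filter.

    Illustrative sketch: an eigenvalue test for a positive-definite Hessian
    and a rank test for linear independence of the constraint rows active
    at the QP solution (LICQ).
    """
    H_sym = 0.5 * (H + H.T)                               # symmetrize first
    pd_ok = np.linalg.eigvalsh(H_sym).min() > tol         # H > 0
    rank_ok = (np.linalg.matrix_rank(A_active, tol=tol)
               == A_active.shape[0])                      # full row rank
    return bool(pd_ok and rank_ok)

# Hypothetical learned parameters: 2 inputs, one active constraint row.
H_learned = np.array([[2.0, 0.3],
                      [0.3, 1.5]])
A_active = np.array([[1.0, 1.0]])
print(verify_learned_qp(H_learned, A_active))             # True if both hold
```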

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and description present a model-free extension of PSFs that learns QP parameters via DRL rather than deriving them from an explicit model, with the formal safety certificate attributed to the QP structure itself. No equations, derivation steps, or self-citations in the text show the claimed certificate reducing to the RL fitting process by construction (e.g., no case where a safety condition holds tautologically because it was optimized into the reward or parameters). The approach is described as preserving the QP form for verifiability while removing model dependency, which is an independent methodological choice rather than a self-referential loop. Numerical results are presented separately as empirical support. Per the rules, without a quotable specific reduction, no circularity is flagged.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger records the minimal assumptions implied by the text.

axioms (2)
  • domain assumption The QP formulation supplies a formal safety certificate when its parameters satisfy certain conditions.
    Inherited from conventional model-based PSFs and invoked to justify the certificate for the learned filter.
  • ad hoc to paper Reinforcement learning can discover parameters that preserve the safety properties of the QP.
    Central modeling choice that converts the model-based guarantee into a model-free one.

pith-pipeline@v0.9.0 · 5438 in / 1244 out tokens · 51848 ms · 2026-05-08T08:29:29.251836+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1]

    Agrawal, A. and Sreenath, K. (2017). Discrete control barrier functions for safety-critical control of discrete systems with application to bipedal robot navigation. doi:10.15607/RSS.2017.XIII.073

  2. [2]

    Ames, A.D., Xu, X., Grizzle, J.W., and Tabuada, P. (2017). Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8), 3861--3876

  3. [3]

    Bansal, S. and Tomlin, C.J. (2021). DeepReach: A deep learning approach to high-dimensional reachability. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 1817--1824

  4. [4]

    Bastani, O. (2021). Safe reinforcement learning with nonlinear dynamics via model predictive shielding. In 2021 American Control Conference (ACC), 3488--3494

  5. [5]

    Borrelli, F., Bemporad, A., and Morari, M. (2017). Predictive Control for Linear and Hybrid Systems. Cambridge University Press

  6. [6]

    Chambolle, A. and Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1), 120--145. doi:10.1007/s10851-010-0251-1

  7. [7]

    Choi, J.J., Castañeda, F., Jung, W., Zhang, B., Tomlin, C.J., and Sreenath, K. (2025). Constraint-guided online data selection for scalable data-driven safety filters in uncertain robotic systems. IEEE Transactions on Robotics, 41, 3779--3798. doi:10.1109/TRO.2025.3577022

  8. [8]

    Cosner, R.K., Rodriguez, I.D.J., Molnar, T.G., Ubellacker, W., Yue, Y., Ames, A.D., and Bouman, K.L. (2022). Self-supervised online learning for safety-critical control using stereo vision. In 2022 International Conference on Robotics and Automation (ICRA), 11487--11493

  9. [9]

    Dawson, C., Qin, Z., Gao, S., and Fan, C. (2022). Safe nonlinear control using robust neural Lyapunov-barrier functions. In A. Faust, D. Hsu, and G. Neumann (eds.), Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, 1724--1735. PMLR

  10. [10]

    Geva, S. and Sitte, J. (1993). A cartpole experiment benchmark for trainable controllers. IEEE Control Systems Magazine, 13(5), 40--51. doi:10.1109/37.236324

  11. [11]

    Herbert, S., Choi, J.J., Sanjeev, S., Gibson, M., Sreenath, K., and Tomlin, C.J. (2021). Scalable learning of safety guarantees for autonomous systems using Hamilton-Jacobi reachability. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 5914--5920. IEEE Press

  12. [12]

    Hsu, K.C., Hu, H., and Fisac, J.F. (2023). The safety filter: A unified view of safety-critical control in autonomous systems

  13. [13]

    Johansson, K. (2000). The quadruple-tank process: a multivariable laboratory process with an adjustable zero. IEEE Transactions on Control Systems Technology, 8(3), 456--465. doi:10.1109/87.845876

  14. [14]

    Lasserre, J.B. (2001). Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3), 796--817. doi:10.1137/S1052623400366802

  15. [15]

    Lavanakul, W., Choi, J.J., Sreenath, K., and Tomlin, C.J. (2024). Safety filters for black-box dynamical systems by learning discriminating hyperplanes. In Conference on Learning for Dynamics & Control

  16. [16]

    Li, Z., Yang, B., Li, J., Yan, J., and Mo, Y. (2023). Linear model predictive control under continuous path constraints via parallelized primal-dual hybrid gradient algorithm. 2023 62nd IEEE Conference on Decision and Control (CDC), 159--164

  17. [17]

    Long, K., Yi, Y., Dai, Z., Herbert, S., Cortés, J., and Atanasov, N. (2024). Sensor-based distributionally robust control for safe robot navigation in dynamic environments. CoRR, abs/2405.18251

  18. [18]

    Lu, Y., Li, Z., Zhou, Y., Li, N., and Mo, Y. (2023). MPC-inspired reinforcement learning for verifiable model-free control. arXiv preprint arXiv:2312.05332

  19. [19]

    Margellos, K. and Lygeros, J. (2011). Hamilton–Jacobi formulation for reach–avoid differential games. IEEE Transactions on Automatic Control, 56(8), 1849--1861

  20. [20]

    Mestres, P., Chen, Y., Dall'anese, E., and Cortés, J. (2025). Control barrier function-based safety filters: Characterization of undesired equilibria, unbounded trajectories, and limit cycles. https://arxiv.org/abs/2501.09289

  21. [21]

    Monga, V., Li, Y., and Eldar, Y.C. (2021). Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2), 18--44. doi:10.1109/MSP.2020.3018525

  22. [22]

    Parrilo, P.A. (2003). Semidefinite programming relaxations for semialgebraic problems. Mathematical programming, 96(2), 293--320

  23. [23]

    Robey, A., Hu, H., Lindemann, L., Zhang, H., Dimarogonas, D.V., Tu, S., and Matni, N. (2020). Learning control barrier functions from expert demonstrations. In 2020 59th IEEE Conference on Decision and Control (CDC), 3717--3724

  24. [24]

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR, abs/1707.06347

  25. [25]

    So, O., Serlin, Z., Mann, M., Gonzales, J., Rutledge, K., Roy, N., and Fan, C. (2024). How to train your neural control barrier function: Learning safety filters for complex input-constrained systems. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 11532--11539. doi:10.1109/ICRA57147.2024.10610418

  26. [26]

    Sutton, R.S. and Barto, A.G. (2018). Reinforcement learning: An introduction. MIT press

  27. [27]

    Tang, Y., Chu, X., Huang, J., and Samuel Au, K.W. (2024). Learning-based MPC with safety filter for constrained deformable linear object manipulation. IEEE Robotics and Automation Letters, 9(3), 2877--2884. doi:10.1109/LRA.2024.3362643

  28. [28]

    Viljoen, J., Shaw-Cortez, W., Drgoňa, J., East, S., Tomizuka, M., and Vrabie, D.L. (2024). Differentiable predictive control for robotics: A data-driven predictive safety filter approach. ArXiv, abs/2409.13817

  29. [29]

    Wabersich, K.P. and Zeilinger, M.N. (2018). Linear model predictive safety certification for learning-based control. In 2018 IEEE Conference on Decision and Control (CDC), 7130--7135. doi:10.1109/CDC.2018.8619829

  30. [30]

    Wabersich, K.P. and Zeilinger, M.N. (2021). A predictive safety filter for learning-based control of constrained nonlinear dynamical systems. Automatica, 129, 109597

  31. [31]

    Wieland, P. and Allgöwer, F. (2007). Constructive safety using control barrier functions. IFAC Proceedings Volumes, 40(12), 462--467. 7th IFAC Symposium on Nonlinear Control Systems