Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Arvind Easwaran; Mohit Prashant

arxiv: 2606.04812 · v2 · pith:VPA6R3OOnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 07:05 UTC · model grok-4.3

keywords: reinforcement learning, safety, barrier certificates, variational autoencoder, probabilistic guarantees, scenario generation, risk-aware learning

The pith

Variational autoencoders model state distributions to build dual upper and lower probabilistic barrier certificates that tighten safety guarantees for reinforcement learning policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method to provide probably approximately safe guarantees for RL agents by approximating encountered states with a variational autoencoder and deriving upper and lower barrier certificates from latent features. This addresses the problem of policies failing under transition perturbations that lead to unexplored unsafe states. The dual certificates allow conservative and optimistic estimates of safe regions, with sampling in their difference used to refine bounds during training. A reader would care because it offers a concrete way to demarcate known safe behavior from unknown behavior with quantifiable probabilistic confidence before real-world deployment.

Core claim

We approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety.

What carries the argument

A variational autoencoder that approximates the state distribution, combined with dual upper and lower barrier certificates constructed from latent state characteristics for conservative and optimistic safety bounds.
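
To make the set-difference idea concrete, here is a minimal sketch of sampling the non-robust region in latent space. Everything specific in it is an assumption made for illustration and is not taken from the paper: the two quadratic barrier functions `lower_barrier` and `upper_barrier`, the convention that a latent point counts as safe when the barrier value is at most zero, and the use of a standard normal prior in place of a trained VAE.

```python
import random

def lower_barrier(z):
    # Hypothetical conservative certificate: certified safe iff ||z||^2 <= 1.
    return sum(x * x for x in z) - 1.0

def upper_barrier(z):
    # Hypothetical optimistic certificate: certified safe iff ||z||^2 <= 4.
    return sum(x * x for x in z) - 4.0

def sample_non_robust(n_samples, latent_dim, seed=0):
    """Draw latents from N(0, I) and keep those inside the upper-bound
    safe set but outside the lower-bound safe set (the set difference)."""
    rng = random.Random(seed)
    non_robust, n_lower, n_upper = [], 0, 0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        in_lower = lower_barrier(z) <= 0.0
        in_upper = upper_barrier(z) <= 0.0
        n_lower += in_lower
        n_upper += in_upper
        if in_upper and not in_lower:
            non_robust.append(z)
    return non_robust, n_lower / n_samples, n_upper / n_samples

non_robust, frac_lower, frac_upper = sample_non_robust(5000, 2)
print(len(non_robust), frac_lower, frac_upper)
```

In a full pipeline the retained latents would presumably be decoded back to states and rolled out under the policy, so that new safety labels shrink the gap between the two certificates; that step is omitted here.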

If this is right

The dual optimization yields a conservative lower bound and an optimistic upper bound on the safe region.
Sampling states in the non-robust region between bounds during training sharpens the probabilistic safety guarantees.
The resulting certificates demarcate known safe behavior from unknown behavior with high confidence.
Experimental results demonstrate the tightness of the derived upper and lower bounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The latent-space approach might extend to other generative models for state approximation if VAEs prove insufficient for certain dynamics.
This scenario-generation technique could connect to robust optimization methods that sample worst-case perturbations explicitly.
A testable extension would involve applying the dual-certificate tightening to continuous control tasks with physical sensor noise.

Load-bearing premise

The variational autoencoder trained on policy trajectories accurately represents the true distribution of states the agent will encounter.

What would settle it

Run the trained policy in an environment where it reaches states outside the VAE-modeled distribution and observe whether safety violations occur at rates exceeding the claimed probabilistic bounds.
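
One way to make this test operational is a concentration bound on the observed violation rate. The sketch below uses Hoeffding's inequality and assumes independent evaluation episodes with a binary violated/not-violated outcome. The function names, the episode counts, and the claimed bound of 0.01 are hypothetical and do not come from the paper.

```python
import math

def hoeffding_radius(episodes, delta):
    # One-sided Hoeffding deviation for the mean of `episodes` i.i.d. values in [0, 1].
    return math.sqrt(math.log(1.0 / delta) / (2.0 * episodes))

def contradicts_claim(violations, episodes, claimed_eps, delta):
    """Return True if, with confidence at least 1 - delta, the true
    violation probability exceeds the claimed bound `claimed_eps`."""
    p_hat = violations / episodes
    lower = max(0.0, p_hat - hoeffding_radius(episodes, delta))
    return lower > claimed_eps

# Hypothetical numbers: 1000 out-of-distribution episodes, claimed bound 0.01.
print(contradicts_claim(100, 1000, 0.01, 0.05))  # 10% observed violations
print(contradicts_claim(30, 1000, 0.01, 0.05))   # 3% observed violations
```

Hoeffding is loose for rare events; an exact binomial interval would be sharper, but the decision rule has the same shape.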

Figures

Figures reproduced from arXiv: 2606.04812 by Arvind Easwaran, Mohit Prashant.

Original abstract

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper tries to tighten probabilistic safety bounds in RL by fitting a VAE to trajectories and using dual barrier certificates optimized over latent samples, but the abstract supplies no derivation linking the VAE error to the claimed guarantees.

The core move is to model encountered states with a VAE, then build upper and lower barrier certificates from latent features and tighten them by sampling the set difference between the two bounds. This dual-optimization framing is the main technical step beyond standard barrier-certificate sampling.

It does address a practical need: getting sharper probably-approximately-safe statements when policies face transition noise. Using the VAE latent space to define the non-robust region and then sampling there is a direct way to focus effort on the uncertain parts.

The weak point is the missing link between VAE approximation quality and the probabilistic bounds. The abstract does not show how reconstruction error, KL gap, or coverage failure is folded into the certificate construction, so it is not clear whether the final guarantees remain valid outside the training trajectories. The circularity risk noted in the reader report is also real if the same trajectories train both the policy and the VAE.
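
For orientation, one standard way such a link could be written down is through total variation distance and Pinsker's inequality. This is a generic sketch, not the authors' derivation. The symbols are assumptions introduced here: P is the true distribution of states reached by the deployed policy, Q is the distribution induced by the trained VAE, and U is the set of states the certificate treats as unsafe.

```latex
\left| P(U) - Q(U) \right| \;\le\; \mathrm{TV}(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)}
\quad\Longrightarrow\quad
P(U) \;\le\; Q(U) + \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)} .
```

Note that KL(P || Q) here is the divergence between the true and modeled state distributions, which is not the KL regularizer in the VAE training objective and is not directly observable. Bounding or estimating it is exactly the step the abstract leaves open.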

The work is aimed at researchers already working on safe RL verification. Someone looking for new combinations of density estimation and certificates might find the framing useful, but only if the full paper supplies the missing error analysis and quantitative results.

The paper should go to peer review. The topic matters and the combination has not been ruled out by prior work, but referees will need to check whether the bounds actually hold once the VAE approximation is accounted for.

Referee Report

2 major / 1 minor

Summary. The paper claims that approximating the state distribution encountered by an RL policy via a variational autoencoder (VAE) enables construction of upper- and lower-bound probabilistic barrier certificates from latent-space features; these bounds are tightened by framing safety verification as a dual optimization problem whose non-robust region (set difference) is sampled during training, yielding probably-approximately-safe guarantees whose tightness is demonstrated experimentally.

Significance. If the VAE approximation error can be rigorously folded into the certificate bounds, the approach would supply a concrete mechanism for scenario generation that produces falsifiable, high-confidence safety regions for RL policies under transition uncertainty, an area where most existing barrier-certificate methods either assume known dynamics or lack explicit density-estimation error control.

major comments (2)

[Abstract] Abstract: the central claim that latent characteristics of the VAE yield 'upper and lower-bound barrier-certificates' with 'probably approximately safe guarantees' is load-bearing, yet the text supplies no derivation showing how reconstruction error, KL divergence gap, or posterior collapse are propagated into the violation-probability bounds; without this step the dual-optimization tightening cannot inherit validity outside the training trajectories.
[Abstract (and the implied construction in §3–4)] The dual-optimization construction (sampling the set difference between upper- and lower-bound certificates) presupposes that the VAE density estimate covers the relevant state space with quantifiable error; the manuscript provides neither a coverage argument nor an explicit error term that would make the resulting probabilistic bounds valid under the 'probably approximately' qualifier.

minor comments (1)

[Abstract] The abstract sentence 'we describe the guarantees placed and demonstrate the tightness of our bounds experimentally' is vague about which quantitative metric (e.g., empirical violation rate vs. theoretical bound gap) is reported; a table or figure reference would clarify the experimental claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and for identifying the need for explicit derivations and error terms to support the probabilistic claims. We address the two major comments below.

Referee: [Abstract] Abstract: the central claim that latent characteristics of the VAE yield 'upper and lower-bound barrier-certificates' with 'probably approximately safe guarantees' is load-bearing, yet the text supplies no derivation showing how reconstruction error, KL divergence gap, or posterior collapse are propagated into the violation-probability bounds; without this step the dual-optimization tightening cannot inherit validity outside the training trajectories.

Authors: We agree that the manuscript does not contain an explicit derivation propagating VAE reconstruction error, KL divergence, or posterior collapse into the violation-probability bounds. The current text presents the dual-optimization construction and experimental results but omits this propagation step. In the revised version we will add a formal derivation (new subsection in §3) that folds the variational bound and reconstruction error directly into the certificate violation probabilities, thereby ensuring the dual-optimization tightening inherits validity beyond the training trajectories.
revision: yes

Referee: [Abstract (and the implied construction in §3–4)] The dual-optimization construction (sampling the set difference between upper- and lower-bound certificates) presupposes that the VAE density estimate covers the relevant state space with quantifiable error; the manuscript provides neither a coverage argument nor an explicit error term that would make the resulting probabilistic bounds valid under the 'probably approximately' qualifier.

Authors: The referee is correct that the manuscript supplies neither a coverage argument nor an explicit error term for the VAE density estimate. We will revise §3–4 to include (i) a coverage guarantee based on the number of sampled trajectories and the VAE latent-space density, and (ii) an explicit additive error term that is propagated through the dual optimization, making the 'probably approximately safe' qualifier formally justified.
revision: yes
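
As a point of reference for what a sample-count coverage argument usually looks like, the sketch below evaluates the classical bound from the scenario-approach literature for convex scenario programs: with N sampled scenarios and d decision variables, the probability that the solution's violation probability exceeds eps is at most the binomial tail sum over i = 0..d-1 of C(N, i) eps^i (1 - eps)^(N - i). This is not the guarantee promised in the rebuttal. It assumes a convex program with i.i.d. scenarios, which a neural barrier certificate generally does not satisfy, and the function names are hypothetical.

```python
import math

def scenario_confidence(n_scenarios, n_decision_vars, eps):
    """Binomial tail sum_{i=0}^{d-1} C(N, i) eps^i (1 - eps)^(N - i)."""
    return sum(
        math.comb(n_scenarios, i) * eps**i * (1.0 - eps) ** (n_scenarios - i)
        for i in range(n_decision_vars)
    )

def min_scenarios(n_decision_vars, eps, beta, n_max=100_000):
    """Smallest N (by linear search) whose tail bound is at most beta.
    Intended for small illustrative problems only."""
    for n in range(n_decision_vars, n_max + 1):
        if scenario_confidence(n, n_decision_vars, eps) <= beta:
            return n
    raise ValueError("no N up to n_max satisfies the bound")

# Illustrative values: violation level eps = 0.1, confidence parameter beta = 0.01.
print(min_scenarios(1, 0.1, 0.01))
print(min_scenarios(5, 0.1, 0.01))
```

The required sample count grows with the number of decision variables, which is one reason a low-dimensional latent parameterization is attractive for this kind of argument.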

Circularity Check

0 steps flagged

No circularity: the derivation uses the VAE approximation as a modeling step and does not reduce to a self-definition or to a fitted prediction by construction.

The paper's chain proceeds by training a VAE on policy trajectories to approximate the state distribution, then constructing barrier certificates from latent features via dual optimization on the set difference. No equation or claim reduces the certificates to the VAE fit by definition, nor renames a fitted quantity as an independent prediction. No self-citations are invoked as load-bearing uniqueness theorems. The approach is a standard generative modeling pipeline for bounding; any dependence between training data and evaluation is a validity concern, not a circular reduction of the claimed guarantees to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.1-grok · 5758 in / 1107 out tokens · 23869 ms · 2026-06-28T07:05:54.072426+00:00 · methodology
