CRAX: Fast Safe Reinforcement Learning Benchmarking

Mickey Beurskens; Mourad Boustani; Thiago D. Simão; Tristan Tomilin

arxiv: 2606.20376 · v2 · pith:VUSLDYAFnew · submitted 2026-06-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-26 18:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords safe reinforcement learning, benchmarking, JAX, MuJoCo, constrained RL, hardware acceleration, vectorization

The pith

CRAX delivers up to 100x faster safe RL benchmarking using vectorized MJX environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRAX as a benchmark for constrained reinforcement learning built on the MuJoCo XLA physics engine and JAX. It addresses the slow computation of existing high-fidelity 3D safety benchmarks by applying vectorized operations and hardware acceleration. Evaluations across six environment suites and multiple safe RL methods demonstrate that no single method dominates all tasks while also showing that curriculum learning and safety transfer can improve results in harder settings.

Core claim

CRAX provides six environment suites and three agent-specific tasks, each at three difficulty levels, all running on MJX. It reports speedups of up to roughly 100x over CPU-based safety benchmarks while preserving realistic 3D dynamics. Testing six popular safe RL methods reveals performance-safety trade-offs with no universal winner, and shows that curriculum learning across difficulty levels and safety transfer can each improve performance over direct training in harder settings.
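For readers new to the setting, constrained RL is commonly formalized as a constrained Markov decision process. The statement below is the standard textbook form, not a formula quoted from the paper; the symbols (policy \pi, reward r, cost c, discount \gamma, cost threshold d) are notation assumed here for illustration.

```latex
\max_{\pi} \; J_r(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{subject to} \quad
J_c(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t) \right] \le d
```

The performance-safety trade-off described above corresponds to how much of J_r a method gives up in order to keep J_c at or below d.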

What carries the argument

The CRAX benchmark suite, which applies vectorized operations on the MuJoCo XLA engine to accelerate constrained RL training and evaluation.
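To make "vectorized operations" concrete, here is a minimal JAX sketch of batching one environment step across many environments with `jax.vmap` and compiling it with `jax.jit`. The point-mass dynamics and the unit-radius cost region are toy assumptions standing in for a physics step; this is not the MJX or CRAX API.

```python
import jax
import jax.numpy as jnp


def step(state, action):
    """Toy point-mass step (a stand-in for a physics step, not MJX)."""
    pos, vel = state
    vel = vel + 0.1 * action          # integrate acceleration
    pos = pos + 0.1 * vel             # integrate velocity
    # Hypothetical safety cost: 1.0 when the agent leaves the unit disk.
    cost = (jnp.linalg.norm(pos) > 1.0).astype(jnp.float32)
    return (pos, vel), cost


# vmap maps the single-environment step over a leading batch axis;
# jit compiles the batched function for the available accelerator.
batched_step = jax.jit(jax.vmap(step))

num_envs = 4096
states = (jnp.zeros((num_envs, 2)), jnp.zeros((num_envs, 2)))
actions = jnp.ones((num_envs, 2))
states, costs = batched_step(states, actions)
```

A single call advances all 4096 toy environments at once, which is the general pattern that lets accelerator-based simulators reach high throughput.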

If this is right

Large-scale experimentation with safe RL methods becomes feasible in realistic 3D physics settings.
Rapid prototyping and comparison of constrained RL algorithms can occur without prohibitive compute costs.
Observed trade-offs between performance and safety can inform method selection for specific tasks.
Curriculum learning and safety transfer become practical strategies for scaling to harder difficulty levels.
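As a rough illustration of the curriculum idea in the last point, the sketch below trains on difficulty levels in order and warm-starts each level from the previous one. `train_fn`, `toy_train_fn`, and the equal per-level step budget are hypothetical placeholders; the paper's actual schedule is not given in this summary.

```python
def run_curriculum(train_fn, init_params, levels=(1, 2, 3), steps_per_level=100):
    """Train on each level in order, warm-starting from the previous level."""
    params = init_params
    history = []
    for level in levels:
        params = train_fn(params, level, steps_per_level)
        history.append((level, params))
    return params, history


def toy_train_fn(params, level, num_steps):
    """Hypothetical stand-in for a safe RL trainer: records the levels seen."""
    return params + [(level, num_steps)]


# Curriculum: Level 1 -> 2 -> 3, carrying parameters forward.
final_params, history = run_curriculum(toy_train_fn, init_params=[])

# Direct training on the hardest level is the same loop with a single level.
direct_params, _ = run_curriculum(toy_train_fn, init_params=[], levels=(3,))
```

Safety transfer would follow the same shape, with `init_params` taken from a previously trained policy instead of a fresh initialization.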

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same vectorization approach could accelerate other physics-based RL benchmarks beyond safety.
Extensive hyperparameter tuning for safe RL methods that was previously impractical may now be routine.
Faster iteration could shorten the gap between simulated safety research and real-world robotics deployment.

Load-bearing premise

The MuJoCo XLA environments provide equivalent fidelity and safety constraint modeling to the slower CPU-based benchmarks they are compared against.

What would settle it

Running the same trained policies on both CRAX environments and the original CPU benchmarks and verifying that safety violation rates and task rewards match within a small margin.
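A minimal sketch of such a check, assuming per-episode rewards and violation rates have already been collected from both backends. The function names, the dictionary layout, and the 5% relative tolerance are illustrative assumptions, not a procedure from the paper.

```python
from statistics import mean


def within_margin(crax_values, cpu_values, rel_tol=0.05):
    """True if the sample means differ by at most rel_tol of the CPU mean."""
    crax_mean, cpu_mean = mean(crax_values), mean(cpu_values)
    scale = max(abs(cpu_mean), 1e-8)  # avoid division by zero
    return abs(crax_mean - cpu_mean) / scale <= rel_tol


def backends_agree(crax_runs, cpu_runs, rel_tol=0.05):
    """Compare task reward and safety violation rate for the same policy."""
    return all(
        within_margin(crax_runs[key], cpu_runs[key], rel_tol)
        for key in ("reward", "violation_rate")
    )


# Made-up numbers purely to exercise the functions.
cpu = {"reward": [10.0, 10.0], "violation_rate": [0.10, 0.10]}
crax_close = {"reward": [10.0, 10.2], "violation_rate": [0.10, 0.10]}
crax_far = {"reward": [20.0, 20.0], "violation_rate": [0.10, 0.10]}
```

A comparison of means is the simplest possible test; a fuller validation would also compare distributions, for example state-transition histograms.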

Figures

Figures reproduced from arXiv: 2606.20376 by Mickey Beurskens, Mourad Boustani, Thiago D. Simão, Tristan Tomilin.

**Figure 2.** Higher difficulty levels of Safe Goal (1 to 3, from left to right) decrease the size of the goals and increase the number and variety of hazards. Adjacent text from the paper (Section 4.1, Environment Suites and Tasks): Environment suites define families of configurable tasks in simulated 3D environments. In each suite, an agent seeks to maximize reward while adhering to a predefined cost threshold.

**Figure 3.** Rewards and costs of baseline methods on Level 1 tasks after 500M environment steps used…

**Figure 4.** Curriculum learning and safety transfer in CRAX environments. We compare direct training (Normal), curriculum learning across difficulty levels (Curriculum), and transfer from an unconstrained PPO policy (Transfer) on Level 3 tasks. Adjacent text from the paper (Experimental Setup): We run each experiment for 500 million environment steps, repeated over 5 seeds. All experiments are conducted on a dedicated compute node with a 72-core 3.…

**Figure 5.** Throughput comparison between CRAX and Safety-Gymnasium.

**Figure 6.** PPOLag increasingly sacrifices rewards to adhere to tighter safety bounds in all tasks.

**Figure 7.** In levels 2 and 3 of Push, the goal that the agent must push the block into moves at a fixed velocity. To succeed, the agent ought to anticipate the goal’s trajectory while avoiding hazard zones.

**Figure 8.** The Pathway agent incurs a per-step cost when its foot contacts a hazard, scaled by the penetration depth into the hazard region. Adjacent text fragment from the paper: … low ceiling constraint. Cost increases smoothly as height exceeds the threshold, encouraging the agent to crouch. Based on the HumanoidStandup environment with added forward locomotion reward. Difficulty levels lower the maximum height requirement. Pathway. Agents: Walker2d, Hal…

**Figure 9.** Visual comparison of difficulty levels across environments. Increasing difficulty generally…

**Figure 10.** Level 1 training curves.

**Figure 11.** Level 2 training curves.

**Figure 12.** Level 3 training curves.
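The behavior reported for PPOLag in Figure 6 is what one would expect from a Lagrangian method in general: a multiplier grows while the measured cost exceeds the limit, which weights the cost term more heavily in the policy objective. The sketch below shows the generic projected gradient-ascent update on the multiplier. The function names and learning rate are assumptions for illustration, and the paper's exact PPOLag implementation may differ.

```python
def update_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """One projected gradient-ascent step on the Lagrange multiplier.

    The multiplier rises when cost exceeds the limit and is clipped at zero.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def penalized_objective(reward_return, cost_return, lam):
    """Scalar objective the policy maximizes for a fixed multiplier."""
    return reward_return - lam * cost_return


# Cost above the limit: the multiplier increases.
lam_up = update_multiplier(0.0, episode_cost=30.0, cost_limit=25.0, lr=0.1)
# Cost well below the limit: the multiplier is clipped at zero.
lam_down = update_multiplier(0.2, episode_cost=10.0, cost_limit=25.0, lr=0.1)
```

With a tighter `cost_limit`, the same measured cost produces a larger multiplier, so the penalized objective trades away more reward.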

Original abstract

Safety is a core concern for deploying reinforcement learning (RL) agents in real-world domains such as robotics and autonomous driving. While benchmarks have been central to progress in RL, existing safety benchmarks with high-fidelity 3D physics remain computationally slow, limiting large-scale experimentation and rapid prototyping. To address this gap, we propose CRAX (Constrained RL Accelerated with JAX). Built on top of the MuJoCo XLA (MJX) physics engine with realistic 3D dynamics, CRAX leverages vectorized operations and hardware acceleration, yielding up to ~100x speedups over comparable CPU-based safety benchmarks. The benchmark features six environment suites and three agent-specific tasks, each spanning three difficulty levels. Evaluating six popular safe RL methods shows that no single approach dominates across all tasks, and reveals the trade-offs between performance and safety. We find that curriculum learning across difficulty levels and safety transfer can improve performance over direct training in harder settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CRAX offers a JAX-based safe RL benchmark with claimed speedups, but its equivalence to prior CPU environments has not been shown, so the comparisons remain provisional.

CRAX ports safe RL tasks to MJX for faster runs, reporting up to 100x speedups and evaluating six standard methods across suites and difficulty levels, along with curriculum learning and safety transfer. The main new piece is the concrete benchmark collection itself, built on vectorized physics, plus the observation that no method wins everywhere and that curriculum learning can help on harder tasks.

The work addresses a clear practical limit: existing high-fidelity safety benchmarks are too slow for large-scale testing. If the new environments are faithful, the speed gain would let people run more thorough experiments in robotics and control settings.

The soft spot is the missing check that MJX versions match the original CPU environments on rewards, constraint violations, contact dynamics, and termination rules. The abstract states the speedups and the method rankings but gives no side-by-side statistics or measurement details, so it is not yet possible to tell whether the reported trade-offs are general or tied to small differences in the physics engine. That gap makes the central claims hard to assess from the text alone.

The paper is aimed at researchers who need to run many safe RL trials quickly. A reader who already works with JAX or MuJoCo and wants faster iteration would get direct value once the fidelity question is settled.

It deserves a serious referee because the benchmark direction is useful and the empirical scope is reasonable; the review would mainly need to confirm environment equivalence and add basic reproducibility details.

Referee Report

2 major / 1 minor

Summary. The paper introduces CRAX, a benchmark suite for constrained/safe RL built on MuJoCo XLA (MJX) that uses vectorization and hardware acceleration to achieve up to ~100x speedups over existing CPU-based safety benchmarks. It includes six environment suites and three agent-specific tasks, each at three difficulty levels, evaluates six popular safe RL algorithms, reports that no single method dominates across tasks, and claims that curriculum learning and safety transfer improve performance in harder settings.

Significance. If the MJX environments are shown to be dynamically equivalent to the CPU baselines, CRAX would meaningfully lower the computational barrier to large-scale safe RL experimentation and prototyping. The empirical finding that method rankings are task-dependent and the suggestion of curriculum/safety-transfer benefits would be useful for guiding future algorithm design in a domain where real-world deployment requires both performance and constraint satisfaction.

major comments (2)

[Environments / Experiments] Environments and Experiments sections: the ~100x speedup claim and the conclusion that 'no single approach dominates' both rest on the assumption that the MJX-based environments preserve identical dynamics, contact modeling, constraint definitions, termination logic, and reward structure as the original CPU MuJoCo safety suites. No side-by-side validation (identical state-transition distributions, constraint-violation statistics, or reward curves) is reported, so the speedups and cross-method rankings cannot be directly interpreted as general results.
[Experiments] Experiments section: the abstract and available description provide no details on measurement methodology (wall-clock timing procedure, hardware configuration, number of independent seeds, error bars, or statistical tests), making the quantitative speedup and performance-safety trade-off claims unverifiable from the presented evidence.
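The kind of reporting the second comment asks for can be sketched in a few lines: time a fixed number of steps with a monotonic clock, convert to environment steps per second, and summarize repeated runs as a mean with a standard error. All names here are hypothetical, and this is not the authors' measurement procedure.

```python
import time
from statistics import mean, stdev


def measure_throughput(step_fn, num_steps, num_envs=1):
    """Environment steps per second for num_steps calls of step_fn."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against zero
    return num_steps * num_envs / elapsed


def mean_and_stderr(values):
    """Sample mean and standard error of the mean across seeds or runs."""
    return mean(values), stdev(values) / len(values) ** 0.5


def speedup(fast_steps_per_sec, slow_steps_per_sec):
    """Ratio of two throughputs measured with the same procedure."""
    return fast_steps_per_sec / slow_steps_per_sec


# Placeholder workload purely to exercise the timer.
throughput = measure_throughput(lambda: None, num_steps=100, num_envs=4)
avg, stderr = mean_and_stderr([1.0, 2.0, 3.0])
```

On accelerators with asynchronous dispatch, as in JAX, the timed function should block until results are ready (for example with `jax.block_until_ready`) and compilation time should be excluded by a warm-up call; otherwise the measured throughput is misleading.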

minor comments (1)

[Abstract] Clarify in the abstract and introduction whether the reported speedups are measured on identical hardware or include GPU/TPU acceleration versus CPU baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the validation and reporting of our results.

Referee: [Environments / Experiments] Environments and Experiments sections: the ~100x speedup claim and the conclusion that 'no single approach dominates' both rest on the assumption that the MJX-based environments preserve identical dynamics, contact modeling, constraint definitions, termination logic, and reward structure as the original CPU MuJoCo safety suites. No side-by-side validation (identical state-transition distributions, constraint-violation statistics, or reward curves) is reported, so the speedups and cross-method rankings cannot be directly interpreted as general results.

Authors: We agree that explicit side-by-side validation was not included in the submitted manuscript. Although MJX is intended to replicate MuJoCo's physics, we will add a dedicated validation subsection (with matching state-transition histograms, constraint-violation rates, and reward curves on representative tasks) to confirm dynamical equivalence. This will allow readers to interpret the reported speedups and algorithm rankings with greater confidence. revision: yes
Referee: [Experiments] Experiments section: the abstract and available description provide no details on measurement methodology (wall-clock timing procedure, hardware configuration, number of independent seeds, error bars, or statistical tests), making the quantitative speedup and performance-safety trade-off claims unverifiable from the presented evidence.

Authors: We acknowledge the lack of methodological detail. In the revised manuscript we will expand the Experiments section to specify: (i) the exact wall-clock timing procedure, (ii) hardware configuration (CPU/GPU models and memory), (iii) number of independent random seeds, (iv) how error bars are computed, and (v) any statistical tests applied. These additions will make the speedup and trade-off claims fully reproducible and verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark proposal with empirical evaluation only

The paper proposes CRAX as a vectorized MJX-based safety RL benchmark and reports empirical speedups plus method comparisons. It contains no derivation chain, no first-principles predictions, no fitted parameters renamed as outputs, and no self-citation load-bearing steps that reduce the central claims to inputs by construction. The ~100x speedup and method-ranking results rest on direct timing measurements and environment runs rather than any self-referential mathematical reduction. Environment equivalence to CPU baselines is an external validity question, not a circularity issue within the paper's own logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is a software benchmark built on existing JAX and MJX frameworks.

pith-pipeline@v0.9.1-grok · 5699 in / 1041 out tokens · 30380 ms · 2026-06-26T18:20:42.725668+00:00 · methodology


[1] [1]

Routledge, 1st edition, 1999

Eitan Altman.Constrained Markov Decision Processes. Routledge, 1st edition, 1999. doi: 10.1201/9781315140223

work page doi:10.1201/9781315140223 1999

[2] [2]

Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents.J. Artif. Intell. Res., 47:253–279, 2013

2013

[3] [3]

Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling

Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents (extended abstract). InIJCAI, pages 4148–4152. AAAI Press, 2015

2015

[4] [4]

Curriculum learning

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InICML, pages 41–48, 2009

2009

[5] [5]

VMAS: A vectorized multi-agent simulator for collective robot learning

Matteo Bettini, Ryan Kortvelesy, Jan Blumenkamp, and Amanda Prorok. VMAS: A vectorized multi-agent simulator for collective robot learning. InDARS, pages 42–56, 2022

2022

[6] [6]

Smit, Nathan Grinsztajn, Raphaël Boige, Cemlyn N

Clément Bonnet, Daniel Luo, Donal Byrne, Shikha Surana, Sasha Abramowitz, Paul Duckworth, Vincent Coyette, Laurence Illing Midgley, Elshadai Tegegn, Tristan Kalloniatis, Omayma Mahjoub, Matthew Macfarlane, Andries P. Smit, Nathan Grinsztajn, Raphaël Boige, Cemlyn N. Waters, Mohamed A. Mimouni, Ulrich A. Mbou Sob, Ruan de Kock, Siddarth Singh, Daniel Furel...

2024

[7] [7]

Accelerating goal-conditioned reinforcement learning algorithms and research

Michal Bortkiewicz, Wladyslaw Palucki, Vivek Myers, Tadeusz Dziarmaga, Tomasz Arczewski, Lukasz Kucinski, and Benjamin Eysenbach. Accelerating goal-conditioned reinforcement learning algorithms and research. InICLR, 2025

2025

[8] [8]

OpenAI Gym

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

Pith/arXiv arXiv 2016

[9] [9]

Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan K. Terry. Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. In NeurIPS, 2023

2023

[10] [10]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009

2009

[11] [11]

CARLA: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InCoRL, pages 1–16, 2017

2017

[12] [12]

Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem

C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - A differentiable physics engine for large scale rigid body simulation. In NeurIPS Datasets and Benchmarks, 2021

2021

[13] [13]

A comprehensive survey on safe reinforcement learning

Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16:1437–1480, 2015

2015

[14] [14]

Bullet-Safety-Gym: A framework for constrained reinforcement learning

Sven Gronauer. Bullet-Safety-Gym: A framework for constrained reinforcement learning. Technical report, mediaTUM, 2022

2022

[15] [15]

A clean slate for offline reinforcement learning

Matthew Thomas Jackson, Uljad Berdica, Jarek Luca Liesen, Shimon Whiteson, and Jakob Nico- laus Foerster. A clean slate for offline reinforcement learning. InNeurIPS, 2025

2025

[16] [16]

Safety gymnasium: A unified safe reinforcement learning benchmark

Jiaming Ji, Borong Zhang, Jiayi Zhou, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng, Yifan Zhong, Josef Dai, and Yaodong Yang. Safety gymnasium: A unified safe reinforcement learning benchmark. InNeurIPS, 2023

2023

[17] Danial Kamran, Thiago D Simão, Qisong Yang, Canmanie T Ponnambalam, Johannes Fischer, Matthijs TJ Spaan, and Martin Lauer. A modern perspective on safe automated driving for different traffic dynamics using constrained reinforcement learning. In ITSC, pages 4017–4023, 2022

[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012

[19] Robert Tjarko Lange. gymnax: A JAX-based reinforcement learning environment library, 2022. URL http://github.com/RobertTLange/gymnax

[20] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv preprint arXiv:1711.09883, 2017

[21] Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell., 45(3):3461–3475, 2022

[22] Michael T. Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Thomas Jackson, Samuel Coward, and Jakob Nicolaus Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. In ICML, 2024

[23] Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie Guo, M. G... arXiv, 2025

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen... 2015

[25] Sharada P. Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam Chakraborty, Grazvydas Semetulskis, João Schapke, Jonas Kubilius, Jurgis Pasukonis, Linas Klimas, Matthew J. Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc... 2020

[26] MuJoCo XLA Authors. MuJoCo XLA (MJX) - MuJoCo documentation. https://mujoco.readthedocs.io/en/stable/mjx.html, 2023. [Accessed 28-01-2026]

[27] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. J. Mach. Learn. Res., 21(181):1–50, 2020

[28] Alexander Nikulin, Vladislav Kurenkov, Ilya Zisman, Artem Agarkov, Viacheslav Sinii, and Sergey Kolesnikov. XLand-MiniGrid: Scalable meta-reinforcement learning environments in JAX. In NeurIPS, 2024

[29] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1 edition, 1994

[30] Asha Ramanujam, Adam Elyoumi, Hao Chen, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, Shraman Pal, Dimitri J. Papageorgiou, and Can Li. SafeOR-Gym: A benchmark suite for safe reinforcement learning algorithms on practical operations research problems. arXiv preprint arXiv:2506.02255, 2025

[31] Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708, 2019

[32] Julien Roy, Roger Girgis, Joshua Romoff, Pierre-Luc Bacon, and Christopher J. Pal. Direct behavior specification via constrained reinforcement learning. In ICML, pages 18828–18843, 2022

[33] Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Garðar Ingvarsson, Timon Willi, Ravi Hammond, Akbir Khan, Christian Schröder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert T. Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktäschel, Chris Lu, and Jakob N. Foerster. Ja... 2024

[34] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[35] Aivar Sootla, Alexander I. Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David Henry Mguni, Jun Wang, and Haitham Ammar. Saute RL: almost surely safe reinforcement learning using state augmentation. In ICML, pages 20423–20443, 2022

[36] Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. In ICML, pages 9133–9143, 2020

[37] Richard S. Sutton and Andrew G. Barto. Reinforcement learning - an introduction, 2nd Edition. MIT Press, 2018

[38] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018

[39] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res., 10:1633–1685, 2009

[40] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026–5033, 2012

[41] Tristan Tomilin, Meng Fang, and Mykola Pechenizkiy. HASARD: A benchmark for vision-based safe reinforcement learning in embodied agents. In ICLR, 2025

[42] Akifumi Wachi, Xun Shen, and Yanan Sui. A survey of constraint formulations in safe reinforcement learning. In IJCAI, pages 8262–8271, 2024

[43] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In CoRL, volume 100 of Proceedings of Machine Learning Research, pages 1094–

[44] Zhaocong Yuan, Adam W. Hall, Siqi Zhou, Lukas Brunke, Melissa Greeff, Jacopo Panerati, and Angela P. Schoellig. Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics. IEEE Robotics Autom. Lett., 7(4):11142–11149, 2022

[45] Kevin Zakka, Baruch Tabanpour, Qiayuan Liao, Mustafa Haiderbhai, Samuel Holt, Jing Yuan Luo, Arthur Allshire, Erik Frey, Koushil Sreenath, Lueder A Kahrs, et al. MuJoCo playground. arXiv preprint arXiv:2502.08844, 2025

[46] Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, and Dacheng Tao. Penalized proximal policy optimization for safe reinforcement learning. arXiv preprint arXiv:2205.11814, 2022

[47] Yiming Zhang, Quan Vuong, and Keith Ross. First order constrained optimization in policy space. In NeurIPS, pages 15338–15349, 2020

[48] Markel Zubia, Thiago D Simão, and Nils Jansen. Robust transfer of safety-constrained reinforcement learning agents. In ICLR, 2025

A Environment Descriptions

A.1 Overview

Task | Compatible Agents | Constraint Type | Cost Mechanism | Levels
Goal | Point, Ant, Humanoid, Spider | Spatial avoidance | Contact / Proximity | 3
Button | Point, Ant, Humanoid, Spider | Spatial + S...