pith. sign in

arxiv: 2606.20376 · v2 · pith:VUSLDYAFnew · submitted 2026-06-18 · 💻 cs.LG · cs.AI

CRAX: Fast Safe Reinforcement Learning Benchmarking

Pith reviewed 2026-06-26 18:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords safe reinforcement learningbenchmarkingJAXMuJoCoconstrained RLhardware accelerationvectorization
0
0 comments X

The pith

CRAX delivers up to 100x faster safe RL benchmarking using vectorized MJX environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRAX as a benchmark for constrained reinforcement learning built on the MuJoCo XLA physics engine and JAX. It addresses the slow computation of existing high-fidelity 3D safety benchmarks by applying vectorized operations and hardware acceleration. Evaluations across six environment suites and multiple safe RL methods demonstrate that no single method dominates all tasks while also showing that curriculum learning and safety transfer can improve results in harder settings.

Core claim

CRAX provides six environment suites and three agent-specific tasks each at three difficulty levels, all running on MJX. It reports up to 100x speedups compared to CPU-based safety benchmarks while preserving realistic 3D dynamics. Testing six popular safe RL methods reveals performance-safety trade-offs with no universal winner, and shows that curriculum learning across difficulty levels plus safety transfer can improve performance over direct training in harder settings.

What carries the argument

The CRAX benchmark suite, which applies vectorized operations on the MuJoCo XLA engine to accelerate constrained RL training and evaluation.

If this is right

  • Large-scale experimentation with safe RL methods becomes feasible in realistic 3D physics settings.
  • Rapid prototyping and comparison of constrained RL algorithms can occur without prohibitive compute costs.
  • Observed trade-offs between performance and safety can inform method selection for specific tasks.
  • Curriculum learning and safety transfer become practical strategies for scaling to harder difficulty levels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same vectorization approach could accelerate other physics-based RL benchmarks beyond safety.
  • Extensive hyperparameter tuning for safe RL methods that was previously impractical may now be routine.
  • Faster iteration could shorten the gap between simulated safety research and real-world robotics deployment.

Load-bearing premise

The MuJoCo XLA environments provide equivalent fidelity and safety constraint modeling to the slower CPU-based benchmarks they are compared against.

What would settle it

Running the same trained policies on both CRAX environments and the original CPU benchmarks and verifying that safety violation rates and task rewards match within a small margin.

Figures

Figures reproduced from arXiv: 2606.20376 by Mickey Beurskens, Mourad Boustani, Thiago D. Sim\~ao, Tristan Tomilin.

Figure 1
Figure 1. Figure 1: Overview of the CRAX benchmark environment suites. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Higher difficulty levels of Safe Goal (1 to 3, from left to right) decrease the size of the goals, and increase the number and variety of hazards. 4.1 Environment Suites And Tasks Environment suites define families of configurable tasks in simulated 3D environments. In each suite, an agent seeks to maximize reward while adhering to a predefined cost threshold [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Rewards and costs of baseline methods on Level 1 tasks after 500M environment steps used [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Curriculum learning and safety transfer in CRAX environments. We compare direct training (Normal), curriculum learning across difficulty levels (Curriculum), and transfer from an unconstrained PPO policy (Transfer) on Level 3 tasks. Experimental Setup. We run each experiment for 500 million environment steps, repeated over 5 seeds. All experiments are conducted on a dedicated compute node with a 72-core 3.… view at source ↗
Figure 5
Figure 5. Figure 5: Throughput comparison between CRAX and Safety-Gymnasium. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: PPOLag increasingly sacrifices rewards to adhere to tighter safety bounds in all tasks. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: In levels 2 and 3 of Push, the goal that the agent must push the block into moves at a fixed velocity. To succeed, the agent ought to anticipate the goal’s trajectory while avoiding hazard zones [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The Pathway agent incurs a per-step cost when its foot contacts a hazard, scaled by the penetration depth into the hazard region. low ceiling constraint. Cost increases smoothly as height exceeds the threshold, encouraging the agent to crouch. Based on the HumanoidStandup environment with added forward locomotion reward. Difficulty levels lower the maximum height requirement. Pathway. Agents: Walker2d, Hal… view at source ↗
Figure 9
Figure 9. Figure 9: Visual comparison of difficulty levels across environments. Increasing difficulty generally [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Level 1 training curves. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Level 2 training curves. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Level 3 training curves. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
read the original abstract

Safety is a core concern for deploying reinforcement learning (RL) agents in real-world domains such as robotics and autonomous driving. While benchmarks have been central to progress in RL, existing safety benchmarks with high-fidelity 3D physics remain computationally slow, limiting large-scale experimentation and rapid prototyping. To address this gap, we propose CRAX (Constrained RL Accelerated with JAX). Built on top of the MuJoCo XLA (MJX) physics engine with realistic 3D dynamics, CRAX leverages vectorized operations and hardware acceleration, yielding up to ~100x speedups over comparable CPU-based safety benchmarks. The benchmark features six environment suites and three agent-specific tasks, each spanning three difficulty levels. Evaluating six popular safe RL methods shows that no single approach dominates across all tasks, and reveals the trade-offs between performance and safety. We find that curriculum learning across difficulty levels and safety transfer can improve performance over direct training in harder settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CRAX, a benchmark suite for constrained/safe RL built on MuJoCo XLA (MJX) that uses vectorization and hardware acceleration to achieve up to ~100x speedups over existing CPU-based safety benchmarks. It includes six environment suites with three agent-specific tasks at three difficulty levels each, evaluates six popular safe RL algorithms, reports that no single method dominates across tasks, and claims that curriculum learning and safety transfer improve performance in harder settings.

Significance. If the MJX environments are shown to be dynamically equivalent to the CPU baselines, CRAX would meaningfully lower the computational barrier to large-scale safe RL experimentation and prototyping. The empirical finding that method rankings are task-dependent and the suggestion of curriculum/safety-transfer benefits would be useful for guiding future algorithm design in a domain where real-world deployment requires both performance and constraint satisfaction.

major comments (2)
  1. [Environments / Experiments] Environments and Experiments sections: the ~100x speedup claim and the conclusion that 'no single approach dominates' both rest on the assumption that the MJX-based environments preserve identical dynamics, contact modeling, constraint definitions, termination logic, and reward structure as the original CPU MuJoCo safety suites. No side-by-side validation (identical state-transition distributions, constraint-violation statistics, or reward curves) is reported, so the speedups and cross-method rankings cannot be directly interpreted as general results.
  2. [Experiments] Experiments section: the abstract and available description provide no details on measurement methodology (wall-clock timing procedure, hardware configuration, number of independent seeds, error bars, or statistical tests), making the quantitative speedup and performance-safety trade-off claims unverifiable from the presented evidence.
minor comments (1)
  1. [Abstract] Clarify in the abstract and introduction whether the reported speedups are measured on identical hardware or include GPU/TPU acceleration versus CPU baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the validation and reporting of our results.

read point-by-point responses
  1. Referee: [Environments / Experiments] Environments and Experiments sections: the ~100x speedup claim and the conclusion that 'no single approach dominates' both rest on the assumption that the MJX-based environments preserve identical dynamics, contact modeling, constraint definitions, termination logic, and reward structure as the original CPU MuJoCo safety suites. No side-by-side validation (identical state-transition distributions, constraint-violation statistics, or reward curves) is reported, so the speedups and cross-method rankings cannot be directly interpreted as general results.

    Authors: We agree that explicit side-by-side validation was not included in the submitted manuscript. Although MJX is intended to replicate MuJoCo's physics, we will add a dedicated validation subsection (with matching state-transition histograms, constraint-violation rates, and reward curves on representative tasks) to confirm dynamical equivalence. This will allow readers to interpret the reported speedups and algorithm rankings with greater confidence. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract and available description provide no details on measurement methodology (wall-clock timing procedure, hardware configuration, number of independent seeds, error bars, or statistical tests), making the quantitative speedup and performance-safety trade-off claims unverifiable from the presented evidence.

    Authors: We acknowledge the lack of methodological detail. In the revised manuscript we will expand the Experiments section to specify: (i) the exact wall-clock timing procedure, (ii) hardware configuration (CPU/GPU models and memory), (iii) number of independent random seeds, (iv) how error bars are computed, and (v) any statistical tests applied. These additions will make the speedup and trade-off claims fully reproducible and verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark proposal with empirical evaluation only

full rationale

The paper proposes CRAX as a vectorized MJX-based safety RL benchmark and reports empirical speedups plus method comparisons. It contains no derivation chain, no first-principles predictions, no fitted parameters renamed as outputs, and no self-citation load-bearing steps that reduce the central claims to inputs by construction. The ~100x speedup and method-ranking results rest on direct timing measurements and environment runs rather than any self-referential mathematical reduction. Environment equivalence to CPU baselines is an external validity question, not a circularity issue within the paper's own logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the contribution is a software benchmark built on existing JAX and MJX frameworks.

pith-pipeline@v0.9.1-grok · 5699 in / 1041 out tokens · 30380 ms · 2026-06-26T18:20:42.725668+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 1 canonical work pages

  1. [1]

    Routledge, 1st edition, 1999

    Eitan Altman.Constrained Markov Decision Processes. Routledge, 1st edition, 1999. doi: 10.1201/9781315140223

  2. [2]

    Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling

    Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents.J. Artif. Intell. Res., 47:253–279, 2013

  3. [3]

    Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling

    Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents (extended abstract). InIJCAI, pages 4148–4152. AAAI Press, 2015

  4. [4]

    Curriculum learning

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InICML, pages 41–48, 2009

  5. [5]

    VMAS: A vectorized multi-agent simulator for collective robot learning

    Matteo Bettini, Ryan Kortvelesy, Jan Blumenkamp, and Amanda Prorok. VMAS: A vectorized multi-agent simulator for collective robot learning. InDARS, pages 42–56, 2022

  6. [6]

    Smit, Nathan Grinsztajn, Raphaël Boige, Cemlyn N

    Clément Bonnet, Daniel Luo, Donal Byrne, Shikha Surana, Sasha Abramowitz, Paul Duckworth, Vincent Coyette, Laurence Illing Midgley, Elshadai Tegegn, Tristan Kalloniatis, Omayma Mahjoub, Matthew Macfarlane, Andries P. Smit, Nathan Grinsztajn, Raphaël Boige, Cemlyn N. Waters, Mohamed A. Mimouni, Ulrich A. Mbou Sob, Ruan de Kock, Siddarth Singh, Daniel Furel...

  7. [7]

    Accelerating goal-conditioned reinforcement learning algorithms and research

    Michal Bortkiewicz, Wladyslaw Palucki, Vivek Myers, Tadeusz Dziarmaga, Tomasz Arczewski, Lukasz Kucinski, and Benjamin Eysenbach. Accelerating goal-conditioned reinforcement learning algorithms and research. InICLR, 2025

  8. [8]

    OpenAI Gym

    Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016

  9. [9]

    Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan K. Terry. Minigrid & Miniworld: Modular & customizable reinforcement learning environments for goal-oriented tasks. In NeurIPS, 2023

  10. [10]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, pages 248–255, 2009

  11. [11]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. InCoRL, pages 1–16, 2017

  12. [12]

    Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem

    C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax - A differentiable physics engine for large scale rigid body simulation. In NeurIPS Datasets and Benchmarks, 2021

  13. [13]

    A comprehensive survey on safe reinforcement learning

    Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16:1437–1480, 2015

  14. [14]

    Bullet-Safety-Gym: A framework for constrained reinforcement learning

    Sven Gronauer. Bullet-Safety-Gym: A framework for constrained reinforcement learning. Technical report, mediaTUM, 2022

  15. [15]

    A clean slate for offline reinforcement learning

    Matthew Thomas Jackson, Uljad Berdica, Jarek Luca Liesen, Shimon Whiteson, and Jakob Nico- laus Foerster. A clean slate for offline reinforcement learning. InNeurIPS, 2025

  16. [16]

    Safety gymnasium: A unified safe reinforcement learning benchmark

    Jiaming Ji, Borong Zhang, Jiayi Zhou, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng, Yifan Zhong, Josef Dai, and Yaodong Yang. Safety gymnasium: A unified safe reinforcement learning benchmark. InNeurIPS, 2023

  17. [17]

    A modern perspective on safe automated driving for different traffic dynamics using constrained reinforcement learning

    Danial Kamran, Thiago D Simão, Qisong Yang, Canmanie T Ponnambalam, Johannes Fischer, Matthijs TJ Spaan, and Martin Lauer. A modern perspective on safe automated driving for different traffic dynamics using constrained reinforcement learning. InITSC, pages 4017–4023, 2022. 10

  18. [18]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. InNIPS, pages 1106–1114, 2012

  19. [19]

    gymnax: A JAX-based reinforcement learning environment library, 2022

    Robert Tjarko Lange. gymnax: A JAX-based reinforcement learning environment library, 2022. URLhttp://github.com/RobertTLange/gymnax

  20. [20]

    AI safety gridworlds.arXiv preprint arXiv:1711.09883, 2017

    Jan Leike, Miljan Martic, Victoria Krakovna, Pedro A Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds.arXiv preprint arXiv:1711.09883, 2017

  21. [21]

    MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning

    Quanyi Li, Zhenghao Peng, Lan Feng, Qihang Zhang, Zhenghai Xue, and Bolei Zhou. MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Trans. Pattern Anal. Mach. Intell., 45(3):3461–3475, 2022

  22. [22]

    Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Thomas Jackson, Samuel Coward, and Jakob Nicolaus Foerster

    Michael T. Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Thomas Jackson, Samuel Coward, and Jakob Nicolaus Foerster. Craftax: A lightning-fast benchmark for open-ended reinforcement learning. InICML, 2024

  23. [23]

    Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie Guo, M

    Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie Guo, M. G...

  24. [24]

    Rusu, Joel Veness, Marc G

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Pe- tersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcemen...

  25. [25]

    Sharada P. Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Di- pam Chakraborty, Grazvydas Semetulskis, João Schapke, Jonas Kubilius, Jurgis Pasukonis, Linas Klimas, Matthew J. Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc...

  26. [26]

    MuJoCo XLA (MJX) - MuJoCo documentation

    MuJoCo XLA Authors. MuJoCo XLA (MJX) - MuJoCo documentation. https://mujoco. readthedocs.io/en/stable/mjx.html, 2023. [Accessed 28-01-2026]

  27. [27]

    Curriculum learning for reinforcement learning domains: A framework and survey.J

    Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey.J. Mach. Learn. Res., 21(181):1–50, 2020

  28. [28]

    XLand-MiniGrid: Scalable meta-reinforcement learning environments in JAX

    Alexander Nikulin, Vladislav Kurenkov, Ilya Zisman, Artem Agarkov, Viacheslav Sinii, and Sergey Kolesnikov. XLand-MiniGrid: Scalable meta-reinforcement learning environments in JAX. InNeurIPS, 2024. 11

  29. [29]

    John Wiley & Sons, Inc., 1 edition, 1994

    Martin L Puterman.Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1 edition, 1994

  30. [30]

    Papageorgiou, and Can Li

    Asha Ramanujam, Adam Elyoumi, Hao Chen, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, Shraman Pal, Dimitri J. Papageorgiou, and Can Li. SafeOR-Gym: A benchmark suite for safe reinforcement learning algorithms on practical operations research problems. arXiv preprint arXiv:2506.02255, 2025

  31. [31]

    Benchmarking safe exploration in deep rein- forcement learning

    Alex Ray, Joshua Achiam, and Dario Amodei. Benchmarking safe exploration in deep rein- forcement learning. arXiv preprint arXiv:1910.01708, 2019

  32. [32]

    Julien Roy, Roger Girgis, Joshua Romoff, Pierre-Luc Bacon, and Christopher J. Pal. Direct behavior specification via constrained reinforcement learning. InICML, pages 18828–18843, 2022

  33. [33]

    Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktäschel, Chris Lu, and Jakob N

    Alexander Rutherford, Benjamin Ellis, Matteo Gallici, Jonathan Cook, Andrei Lupu, Garðar Ingvarsson, Timon Willi, Ravi Hammond, Akbir Khan, Christian Schröder de Witt, Alexandra Souly, Saptarashmi Bandyopadhyay, Mikayel Samvelyan, Minqi Jiang, Robert T. Lange, Shimon Whiteson, Bruno Lacerda, Nick Hawes, Tim Rocktäschel, Chris Lu, and Jakob N. Foerster. Ja...

  34. [34]

    Proximal policy optimization algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  35. [35]

    Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David Henry Mguni, Jun Wang, and Haitham Ammar

    Aivar Sootla, Alexander I. Cowen-Rivers, Taher Jafferjee, Ziyan Wang, David Henry Mguni, Jun Wang, and Haitham Ammar. Saute RL: almost surely safe reinforcement learning using state augmentation. InICML, pages 20423–20443, 2022

  36. [36]

    Responsive safety in reinforcement learning by PID Lagrangian methods

    Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by PID Lagrangian methods. InICML, pages 9133–9143, 2020

  37. [37]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement learning - an introduction, 2nd Edition. MIT Press, 2018

  38. [38]

    Lillicrap, and Martin A

    Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018

  39. [39]

    Taylor and Peter Stone

    Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey.J. Mach. Learn. Res., 10:1633–1685, 2009

  40. [40]

    MuJoCo: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. InIROS, pages 5026–5033, 2012

  41. [41]

    HASARD: A benchmark for vision- based safe reinforcement learning in embodied agents

    Tristan Tomilin, Meng Fang, and Mykola Pechenizkiy. HASARD: A benchmark for vision- based safe reinforcement learning in embodied agents. InICLR, 2025

  42. [42]

    A survey of constraint formulations in safe reinforcement learning

    Akifumi Wachi, Xun Shen, and Yanan Sui. A survey of constraint formulations in safe reinforcement learning. InIJCAI, pages 8262–8271, 2024

  43. [43]

    Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning

    Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. InCoRL, volume 100 ofProceedings of Machine Learning Research, pages 1094–

  44. [44]

    Hall, Siqi Zhou, Lukas Brunke, Melissa Greeff, Jacopo Panerati, and Angela P

    Zhaocong Yuan, Adam W. Hall, Siqi Zhou, Lukas Brunke, Melissa Greeff, Jacopo Panerati, and Angela P. Schoellig. Safe-control-gym: A unified benchmark suite for safe learning-based control and reinforcement learning in robotics.IEEE Robotics Autom. Lett., 7(4):11142–11149, 2022

  45. [45]

    MuJoCo playground

    Kevin Zakka, Baruch Tabanpour, Qiayuan Liao, Mustafa Haiderbhai, Samuel Holt, Jing Yuan Luo, Arthur Allshire, Erik Frey, Koushil Sreenath, Lueder A Kahrs, et al. MuJoCo playground. arXiv preprint arXiv:2502.08844, 2025. 12

  46. [46]

    Penalized proximal policy optimization for safe reinforcement learning

    Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, and Dacheng Tao. Penalized proximal policy optimization for safe reinforcement learning. arXiv preprint arXiv:2205.11814, 2022

  47. [47]

    First order constrained optimization in policy space

    Yiming Zhang, Quan Vuong, and Keith Ross. First order constrained optimization in policy space. InNeurIPS, pages 15338–15349, 2020

  48. [48]

    Robust transfer of safety-constrained rein- forcement learning agents

    Markel Zubia, Thiago D Simão, and Nils Jansen. Robust transfer of safety-constrained rein- forcement learning agents. InICLR, 2025. 13 A Environment Descriptions A.1 Overview Task Compatible Agents Constraint Type Cost Mechanism Levels Goal Point, Ant, Humanoid, Spider Spatial avoidance Contact / Proximity 3 Button Point, Ant, Humanoid, Spider Spatial + S...