
arxiv: 2606.23686 · v1 · pith:6AZFLEP2new · submitted 2026-06-22 · 💻 cs.RO

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

Pith reviewed 2026-06-26 07:53 UTC · model grok-4.3

classification 💻 cs.RO
keywords Vision-Language-Action models · safety benchmark · physical safety · semantic safety · robot manipulation · data generation pipeline · trajectory synthesis · generalization

The pith

A new parametric benchmark shows VLA models gain safety from diverse training but remain limited by trajectory synthesis and semantic misalignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates LIBERO-Safety to test physical and semantic safety in vision-language-action models through procedurally generated scenarios and a large set of collision-free robot demonstrations. It evaluates multiple models across paradigms and identifies a tension: greater training diversity produces safer paths, yet overall task success stays blocked by how trajectories are formed and how well models match instructions to actions. A sympathetic reader would care because these models are intended for physical robots where unsafe or failed actions carry direct costs. The work supplies both the test infrastructure and the dataset to make such evaluations repeatable at scale.

Core claim

The authors introduce a parametric safety benchmark that generates stochastic safety-critical scenarios and a keypose-driven pipeline that produces 19,664 strictly collision-free demonstrations with domain randomization. Systematic testing of eight VLA models and two embodied foundation models reveals that high-diversity training improves trajectory safety while task success remains constrained by sub-optimal trajectory synthesis and semantic misalignment between language and execution.
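As a rough illustration of what "parametric" scenario generation with domain randomization can look like, the sketch below draws scenario parameters from fixed ranges using a seeded random generator. Everything in it is an assumption made for illustration: the `Scenario` fields, the `RANGES` values, and the use of three difficulty levels as a plain tag are hypothetical and are not taken from the paper's benchmark code.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    difficulty: int          # hypothetical difficulty level, 1..3
    obstacle_xy: tuple       # obstacle position on the table plane (metres)
    light_intensity: float   # visual randomization (hypothetical)
    friction: float          # physical randomization (hypothetical)


# Hypothetical parameter ranges; the paper's actual ranges are not stated here.
RANGES = {
    "obstacle_x": (-0.3, 0.3),
    "obstacle_y": (-0.3, 0.3),
    "light_intensity": (0.5, 1.5),
    "friction": (0.4, 1.0),
}


def sample_scenario(rng: random.Random, difficulty: int) -> Scenario:
    """Draw one scenario; difficulty is carried as a tag only in this sketch."""
    if difficulty not in (1, 2, 3):
        raise ValueError("difficulty must be 1, 2, or 3")

    def draw(key: str) -> float:
        low, high = RANGES[key]
        return rng.uniform(low, high)

    return Scenario(
        difficulty=difficulty,
        obstacle_xy=(draw("obstacle_x"), draw("obstacle_y")),
        light_intensity=draw("light_intensity"),
        friction=draw("friction"),
    )


def sample_suite(seed: int, per_level: int) -> list:
    """Seeded sampling makes the stochastic suite reproducible."""
    rng = random.Random(seed)
    return [sample_scenario(rng, d) for d in (1, 2, 3) for _ in range(per_level)]
```

Calling `sample_suite(0, 4)` twice returns the same 12 scenarios, which is the property that makes a stochastic benchmark repeatable across model evaluations.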

What carries the argument

The parametric safety benchmark and the keypose-driven data generation pipeline together carry the argument: they enable scalable creation of safety-critical scenarios and demonstrations without human teleoperation.
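To make the idea of keypose-driven generation concrete, here is a minimal sketch that interpolates a path through a few keyposes and keeps it only if every sampled waypoint clears all obstacles. This is a deliberately simplified stand-in, not the authors' pipeline: it uses linear interpolation, spherical obstacles, and rejection of colliding paths, whereas the authors describe enforcing collision-free trajectories via optimization in the simulator. All function names and default values are hypothetical.

```python
import math


def interpolate(keyposes, steps_per_segment):
    """Linearly interpolate between consecutive 3D keyposes."""
    path = []
    for a, b in zip(keyposes, keyposes[1:]):
        for i in range(steps_per_segment):
            t = i / steps_per_segment
            path.append(tuple(pa + t * (pb - pa) for pa, pb in zip(a, b)))
    path.append(tuple(keyposes[-1]))
    return path


def is_collision_free(path, obstacles, clearance):
    """obstacles is a list of (center, radius) spheres; checks sampled waypoints only."""
    return all(
        math.dist(point, center) >= radius + clearance
        for point in path
        for center, radius in obstacles
    )


def generate_demo(keyposes, obstacles, clearance=0.02, steps_per_segment=20):
    """Return the interpolated path, or None if any waypoint violates clearance."""
    path = interpolate(keyposes, steps_per_segment)
    return path if is_collision_free(path, obstacles, clearance) else None
```

For example, with a single obstacle `((0.2, 0.0, 0.2), 0.05)`, the straight path between `(0.0, 0.0, 0.2)` and `(0.4, 0.0, 0.2)` is rejected, while adding a raised via-keypose `(0.2, 0.0, 0.5)` yields an accepted path. Because the check is discrete, obstacles thinner than the waypoint spacing could be missed; a real pipeline would need a continuous or simulator-level check.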

If this is right

  • Training regimes for VLA models should incorporate high-diversity safety data to reduce unsafe trajectories.
  • Improvements in trajectory synthesis methods are required before task success rates can rise under safety constraints.
  • Semantic alignment techniques must be strengthened to close the gap between language instructions and executed actions.
  • The generated dataset can serve as a training resource for developing safer VLA policies without manual demonstration collection.
  • Future model evaluations should routinely include cross-paradigm testing on parametric safety scenarios.
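The tension described above, safer trajectories without matching gains in task success, only becomes visible when safety and success are reported as separate rates. The sketch below shows one way to aggregate per-episode results per model. The record format (`model`, `success`, `collision`) and the metric names are assumptions for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict


def summarize(episodes):
    """Aggregate per-model rates from dicts with 'model', 'success', 'collision' keys."""
    counts = defaultdict(
        lambda: {"n": 0, "success": 0, "collision_free": 0, "safe_success": 0}
    )
    for ep in episodes:
        c = counts[ep["model"]]
        c["n"] += 1
        c["success"] += bool(ep["success"])
        c["collision_free"] += not ep["collision"]
        # Count an episode as a safe success only if it succeeds without collision.
        c["safe_success"] += bool(ep["success"]) and not ep["collision"]
    return {
        model: {
            "n": c["n"],
            "success_rate": c["success"] / c["n"],
            "collision_free_rate": c["collision_free"] / c["n"],
            "safe_success_rate": c["safe_success"] / c["n"],
        }
        for model, c in counts.items()
    }
```

A model with a high `collision_free_rate` but a low `success_rate` would match the pattern the paper reports: collision-free motion that still fails to complete the task.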

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identified tension suggests that scaling diversity alone will not suffice and hybrid methods pairing diversity with explicit trajectory optimization may be needed.
  • If the benchmark scenarios prove representative, real-world robot deployments could adopt similar procedural generation to pre-test safety before physical trials.
  • The pipeline's scalability could support safety testing in adjacent areas such as navigation or multi-agent coordination.
  • Persistent semantic misalignment points to a possible need for tighter integration between vision-language pretraining and action-specific fine-tuning.

Load-bearing premise

The keypose-driven data generation pipeline and parametric safety benchmark produce demonstrations and scenarios that accurately capture real-world physical and semantic safety constraints without introducing generation artifacts.

What would settle it

Physical robot experiments using models trained on the curated dataset that fail to reproduce the reported safety improvements or the identified bottlenecks in trajectory quality and semantic alignment.

Figures

Figures reproduced from arXiv: 2606.23686 by Guocai Yao, Haohan Chi, Hao Zhao, Jiaolong Yang, Jinbang Guo, Jingrui Pang, Rongxu Cui, Saining Zhang, Shaoxuan Xie, Xianyuan Zhan, Xin Jin, Yao Mu, Ya-Qin Zhang, Zongzheng Zhang.

Figure 1: Real-world VLA deployment is severely bottlenecked by physical safety and semantic reasoning, constituting critical (a) VLA Safety Challenges. To systematically evaluate these challenges, we introduce a comprehensive VLA safety benchmark and develop an efficient (b) Data Generation Pipeline to synthesize 19.7K strictly collision-free demonstrations. By evaluating VLA models fine-tuned on this corpus along…
Figure 2: Overview of our VLA Safety Benchmark. (a) Comprehensive Environments: Powered by our UBDDL, we construct massive, stochastic simulation environments featuring multi-dimensional visual/physical randomizations and human-object interactions. (b) Hierarchical Safety Taxonomy: A systematic evaluation suite assessing five critical dimensions of physical and semantic safety, strictly scaled across 3 difficulty t…
Figure 3: Comparison of State Space Distributions.
Figure 4: Emergent Spatial Reasoning. High-diversity training enables the model to transition from (a) non-linear avoidance to (b) optimal trajectory synthesis in obstacle-free workspaces. Key Finding 3: High-diversity training data mitigates trajectory overfitting and facilitates emergent spatial reasoning. To investigate the trade-off between trajectory memorization and visual-spatial generalization, we condu…
Figure 6: Representative examples of (a) Instruction-Aligned Execution and (b) Semantic Misalignment. While the policy is capable of generating collision-free trajectories, perceptual errors in multi-object scenes can lead the end-effector toward incorrect targets. yields a collision-free task incompletion, sacrificing the manipulation objective to kinematically sub-optimal planning. Key Finding 8: Semantic misalig…
Original abstract

Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces LIBERO-Safety, a parametric safety benchmark for procedurally generating safety-critical scenarios with stochasticity, along with a keypose-driven data generation pipeline to produce a dataset of 19,664 strictly collision-free demonstrations incorporating domain randomization. It performs a cross-paradigm evaluation of eight VLA models and two embodied foundation models, revealing a generalization-safety tension in which high-diversity training yields safer trajectories while task success remains limited by sub-optimal trajectory synthesis and semantic misalignment.

Significance. If the central findings hold after validation, the work supplies a scalable infrastructure, large reproducible dataset, and systematic failure-mode analysis that directly addresses an important gap in safety evaluation for VLA models. The explicit provision of the generation pipeline and 19,664 demonstrations constitutes a concrete contribution to reproducibility and future benchmarking efforts in robotics.

major comments (1)
  1. [Abstract] Abstract (paragraph on infrastructure and dataset curation): The generalization-safety tension finding is load-bearing on the assumption that the keypose-driven pipeline and parametric safety benchmark faithfully instantiate real-world collision avoidance and semantic constraints. No external anchor—such as human teleoperation comparison, real-robot transfer results, or quantitative physics-fidelity metrics—is described that would allow falsification of systematic generation artifacts.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on validating the simulation pipeline. We address the concern directly below and outline revisions to clarify the benchmark's scope and limitations.

Point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph on infrastructure and dataset curation): The generalization-safety tension finding is load-bearing on the assumption that the keypose-driven pipeline and parametric safety benchmark faithfully instantiate real-world collision avoidance and semantic constraints. No external anchor—such as human teleoperation comparison, real-robot transfer results, or quantitative physics-fidelity metrics—is described that would allow falsification of systematic generation artifacts.

    Authors: We agree this is a substantive point. The work is explicitly positioned as a scalable simulation benchmark to overcome the bottlenecks of human teleoperation, with the keypose-driven pipeline enforcing collision-free trajectories by construction via optimization in the underlying physics simulator and domain randomization for variability. No human teleoperation comparisons or real-robot transfers are included, as these fall outside the paper's scope of providing reproducible procedural generation. To address potential artifacts, we will revise the abstract and add a dedicated limitations subsection discussing simulator fidelity choices (e.g., realistic mass/friction parameters) and explicitly noting the absence of real-world anchors as a scope limitation. This constitutes a partial revision focused on transparency rather than new experiments.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper introduces a parametric safety benchmark and keypose-driven pipeline to generate 19,664 demonstrations, then performs cross-paradigm evaluation of eight external VLA models and two embodied foundation models. No equations, parameter-fitting steps, or self-citations are present that reduce any claimed result (such as the generalization-safety tension) to the inputs by construction. The central findings derive from empirical performance of non-author models on the new benchmark, satisfying the condition of being self-contained against external benchmarks with no load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claims rest on the domain assumption that the generated scenarios and demonstrations faithfully represent safety-critical conditions; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption: The keypose-driven data generation pipeline produces strictly collision-free demonstrations that are representative of safety-critical scenarios.
    Invoked to justify curating the 19,664-demonstration dataset without human teleoperation.

pith-pipeline@v0.9.1-grok · 5735 in / 1208 out tokens · 35561 ms · 2026-06-26T07:53:37.595132+00:00 · methodology

