LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models
Pith reviewed 2026-06-26 07:53 UTC · model grok-4.3
The pith
A new parametric benchmark shows VLA models gain safety from diverse training but remain limited by trajectory synthesis and semantic misalignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a parametric safety benchmark that generates stochastic safety-critical scenarios and a keypose-driven pipeline that produces 19,664 strictly collision-free demonstrations with domain randomization. Systematic testing of eight VLA models and two embodied foundation models reveals that high-diversity training improves trajectory safety while task success remains constrained by sub-optimal trajectory synthesis and semantic misalignment between language and execution.
What carries the argument
The parametric safety benchmark combined with the keypose-driven data generation pipeline, which together enable scalable creation of safety-critical demonstrations and scenarios without human teleoperation.
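As a rough illustration of what "parametric" scenario generation with stochasticity can look like, the sketch below draws scene parameters from fixed ranges and rejects draws that violate a simple spacing constraint. Everything in it is an assumption made for illustration: the `Scenario` fields, the `sample_scenario` helper, the numeric ranges, and the instruction strings are hypothetical and are not taken from LIBERO-Safety.

```python
import math
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; field names are illustrative only."""
    obstacle_xy: Tuple[float, float]
    target_xy: Tuple[float, float]
    light_intensity: float
    instruction: str


def sample_scenario(rng: random.Random, min_gap: float = 0.10) -> Scenario:
    """Draw one scenario, resampling until obstacle and target are min_gap apart."""
    while True:
        obstacle = (rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3))
        target = (rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3))
        if math.dist(obstacle, target) >= min_gap:
            break
    # Lighting stands in for visual domain randomization (assumed range).
    light = rng.uniform(0.5, 1.5)
    instruction = rng.choice([
        "pick up the cup without touching the vase",
        "move the bowl to the plate and avoid the candle",
    ])
    return Scenario(obstacle, target, light, instruction)


# One seed per scenario keeps every draw reproducible.
scenarios = [sample_scenario(random.Random(seed)) for seed in range(5)]
```

Seeding each draw separately is one possible design choice: it lets a single failing scenario be regenerated without replaying the whole batch.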
If this is right
- Training regimes for VLA models should incorporate high-diversity safety data to reduce unsafe trajectories.
- Improvements in trajectory synthesis methods are required before task success rates can rise under safety constraints.
- Semantic alignment techniques must be strengthened to close the gap between language instructions and executed actions.
- The generated dataset can serve as a training resource for developing safer VLA policies without manual demonstration collection.
- Future model evaluations should routinely include cross-paradigm testing on parametric safety scenarios.
Where Pith is reading between the lines
- The identified tension suggests that scaling diversity alone will not suffice and hybrid methods pairing diversity with explicit trajectory optimization may be needed.
- If the benchmark scenarios prove representative, real-world robot deployments could adopt similar procedural generation to pre-test safety before physical trials.
- The pipeline's scalability could support safety testing in adjacent areas such as navigation or multi-agent coordination.
- Persistent semantic misalignment points to a possible need for tighter integration between vision-language pretraining and action-specific fine-tuning.
Load-bearing premise
The keypose-driven data generation pipeline and parametric safety benchmark produce demonstrations and scenarios that accurately capture real-world physical and semantic safety constraints without introducing generation artifacts.
What would settle it
Physical robot experiments using models trained on the curated dataset that fail to reproduce the reported safety improvements or the identified bottlenecks in trajectory quality and semantic alignment.
Original abstract
Despite the impressive manipulation capabilities of Vision-Language-Action (VLA) models, their operational safety under strict constraints remains largely unverified. To address this, we introduce a parametric safety benchmark to procedurally generate safety-critical scenarios with comprehensive stochasticity. To overcome the scalability bottlenecks of human teleoperation, we develop a novel keypose-driven data generation pipeline. Leveraging this infrastructure, we curate a large-scale dataset of 19,664 strictly collision-free demonstrations with extensive domain randomization. We then conduct a systematic cross-paradigm evaluation of eight VLA and two embodied foundation models. Our analysis reveals a critical generalization-safety tension: although high-diversity training fosters safer trajectories, task success remains fundamentally bottlenecked by sub-optimal trajectory synthesis and semantic misalignment. By providing a scalable pipeline, a robust dataset, and profound failure-mode insights, LIBERO-Safety establishes a crucial foundation for developing safe and reliable VLA models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LIBERO-Safety, a parametric safety benchmark for procedurally generating safety-critical scenarios with stochasticity, along with a keypose-driven data generation pipeline to produce a dataset of 19,664 strictly collision-free demonstrations incorporating domain randomization. It performs a cross-paradigm evaluation of eight VLA models and two embodied foundation models, revealing a generalization-safety tension in which high-diversity training yields safer trajectories while task success remains limited by sub-optimal trajectory synthesis and semantic misalignment.
Significance. If the central findings hold after validation, the work supplies a scalable infrastructure, large reproducible dataset, and systematic failure-mode analysis that directly addresses an important gap in safety evaluation for VLA models. The explicit provision of the generation pipeline and 19,664 demonstrations constitutes a concrete contribution to reproducibility and future benchmarking efforts in robotics.
major comments (1)
- [Abstract] (paragraph on infrastructure and dataset curation): The generalization-safety tension finding rests on the assumption that the keypose-driven pipeline and parametric safety benchmark faithfully instantiate real-world collision avoidance and semantic constraints. The paper describes no external anchor (such as a human teleoperation comparison, real-robot transfer results, or quantitative physics-fidelity metrics) that could expose systematic generation artifacts.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on validating the simulation pipeline. We address the concern directly below and outline revisions to clarify the benchmark's scope and limitations.
- Referee: [Abstract] (paragraph on infrastructure and dataset curation): The generalization-safety tension finding rests on the assumption that the keypose-driven pipeline and parametric safety benchmark faithfully instantiate real-world collision avoidance and semantic constraints. The paper describes no external anchor (such as a human teleoperation comparison, real-robot transfer results, or quantitative physics-fidelity metrics) that could expose systematic generation artifacts.
Authors: We agree this is a substantive point. The work is explicitly positioned as a scalable simulation benchmark to overcome the bottlenecks of human teleoperation, with the keypose-driven pipeline enforcing collision-free trajectories by construction via optimization in the underlying physics simulator and domain randomization for variability. No human teleoperation comparisons or real-robot transfers are included, as these fall outside the paper's scope of providing reproducible procedural generation. To address potential artifacts, we will revise the abstract and add a dedicated limitations subsection discussing simulator fidelity choices (e.g., realistic mass/friction parameters) and explicitly noting the absence of real-world anchors as a scope limitation. This constitutes a partial revision focused on transparency rather than new experiments.
Revision: partial
Circularity Check
No significant circularity detected
The paper introduces a parametric safety benchmark and keypose-driven pipeline to generate 19,664 demonstrations, then performs cross-paradigm evaluation of eight external VLA models and two embodied foundation models. No equations, parameter-fitting steps, or self-citations are present that reduce any claimed result (such as the generalization-safety tension) to its inputs by construction. The central findings derive from the empirical performance of non-author models on the new benchmark, so the argument contains no load-bearing self-referential reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The keypose-driven data generation pipeline produces strictly collision-free demonstrations that are representative of safety-critical scenarios.
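To make the "strictly collision-free" assumption concrete, here is a minimal geometric sketch: it linearly interpolates between keyposes and accepts a trajectory only if every waypoint clears every obstacle by a margin. This is a stand-in, not the paper's method. The authors describe optimization inside a physics simulator, whereas this sketch assumes point end-effectors, spherical obstacles, and hypothetical helper names (`interpolate`, `is_collision_free`).

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Sphere = Tuple[Point, float]  # (center, radius); assumed obstacle model


def interpolate(keyposes: Sequence[Point], steps_per_segment: int = 20) -> List[Point]:
    """Linearly interpolate waypoints between consecutive keyposes."""
    if len(keyposes) < 2:
        raise ValueError("need at least two keyposes")
    path: List[Point] = []
    for a, b in zip(keyposes, keyposes[1:]):
        for i in range(steps_per_segment):
            t = i / steps_per_segment
            path.append(tuple(a[k] + t * (b[k] - a[k]) for k in range(3)))
    path.append(tuple(keyposes[-1]))  # include the final keypose itself
    return path


def is_collision_free(path: Sequence[Point], obstacles: Sequence[Sphere],
                      margin: float = 0.02) -> bool:
    """True if every waypoint stays at least radius + margin from each center."""
    return all(math.dist(p, center) >= radius + margin
               for p in path for center, radius in obstacles)


obstacle: Sphere = ((0.5, 0.0, 0.0), 0.1)
direct = interpolate([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
detour = interpolate([(0.0, 0.0, 0.0), (0.5, 0.0, 0.3), (1.0, 0.0, 0.0)])

# Keep only demonstrations that pass the check (rejection filtering).
kept = [p for p in (direct, detour) if is_collision_free(p, [obstacle])]
```

In this toy setup the straight-line path passes through the obstacle center and is rejected, while the path with an extra raised keypose is kept. A check of this kind only samples discrete waypoints, which is one way a generation artifact of the sort the referee raises could slip through.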