pith. machine review for the scientific record.

arxiv: 2604.23775 · v1 · submitted 2026-04-26 · 💻 cs.RO


Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

Bojun Zou, Bo Yin, Jingwen Ye, Qi Li, Ruhao Liu, Runpeng Yu, Weihao Yu, Weiqi Huang, Xinchao Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 05:52 UTC · model grok-4.3

classification 💻 cs.RO
keywords safety · training-time · attacks · challenges · defenses · embodied

The pith

A survey unifies safety for Vision-Language-Action models by organizing threats and defenses along training-time versus inference-time axes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-Language-Action models combine vision, language, and action outputs to control physical robots and agents, creating safety risks with irreversible physical effects and multimodal vulnerabilities. The paper argues that prior work on these risks is scattered across robotics, adversarial ML, and alignment research, so it supplies a single map that places each threat at the stage where it can be stopped. The map splits attacks into training-time problems such as data poisoning and backdoors versus inference-time problems such as patches and jailbreaks, and it does the same for defenses. By linking each threat to its natural mitigation window, the survey shows where current methods fall short and where new work is needed to keep embodied systems safe during long trajectories.

Core claim

VLA safety literature can be organized into four quadrants defined by attack timing (training-time versus inference-time) and defense timing (training-time versus inference-time). Training-time attacks include data poisoning and backdoors; inference-time attacks include adversarial patches, cross-modal perturbations, semantic jailbreaks, and freezing attacks. Corresponding defenses are reviewed at both stages, together with benchmarks, metrics, and domain-specific deployment issues.
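The four-quadrant organization can be made concrete as a lookup from each named threat to the stage at which it arises and the stage at which the survey reviews its mitigation. The sketch below is an editorial illustration of that grid, not code from the paper; the mapping simply encodes the pairings stated above.

```python
from enum import Enum

class Stage(Enum):
    TRAINING = "training-time"
    INFERENCE = "inference-time"

# (attack stage, defense stage) for each threat class named in the survey.
# The pairings here restate the quadrant structure described in the text.
THREAT_GRID = {
    "data poisoning":           (Stage.TRAINING,  Stage.TRAINING),
    "backdoor":                 (Stage.TRAINING,  Stage.TRAINING),
    "adversarial patch":        (Stage.INFERENCE, Stage.INFERENCE),
    "cross-modal perturbation": (Stage.INFERENCE, Stage.INFERENCE),
    "semantic jailbreak":       (Stage.INFERENCE, Stage.INFERENCE),
    "freezing attack":          (Stage.INFERENCE, Stage.INFERENCE),
}

def mitigation_window(threat: str) -> str:
    """Return the defense stage the grid assigns to a named threat."""
    _, defense_stage = THREAT_GRID[threat]
    return defense_stage.value

print(mitigation_window("data poisoning"))    # training-time
print(mitigation_window("semantic jailbreak"))  # inference-time
```

The point of the grid is exactly this lookup: given a threat, it names the stage at which intervention is still possible.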

What carries the argument

The two parallel timing axes (attack timing and defense timing, each divided into training-time and inference-time) that structure every threat and mitigation reviewed in the survey.

If this is right

  • Training-time defenses can block data poisoning and backdoors before a VLA model is deployed.
  • Inference-time defenses must operate under real-time latency limits to counter patches and cross-modal attacks on physical hardware.
  • Benchmarks must include long-horizon trajectories and physical irreversibility to measure true safety.
  • Deployment in domains such as autonomous vehicles or household robots will require separate safety analyses because attack surfaces differ.
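The second bullet's latency constraint can be illustrated with a toy guarded control step: inference-time defenses run before the policy, and the whole step is checked against a real-time budget. This is an editorial sketch with placeholder defenses (`sanitize_observation`, `check_instruction`) invented for illustration, not a mechanism from the paper.

```python
import time

def sanitize_observation(obs):
    # Placeholder for an input-side defense, e.g. masking a suspected
    # adversarial patch before the observation reaches the policy.
    return obs

def check_instruction(cmd: str) -> bool:
    # Placeholder semantic filter for jailbreak-style instructions.
    return "ignore safety" not in cmd.lower()

def guarded_step(obs, cmd, policy, budget_ms=50.0):
    """Run inference-time defenses plus the policy under a latency budget.

    Returns (action, within_budget); action is None when the instruction
    is refused. `policy` stands in for the VLA model's action head.
    """
    start = time.perf_counter()
    if not check_instruction(cmd):
        return None, True  # refuse the instruction outright
    action = policy(sanitize_observation(obs))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return action, elapsed_ms <= budget_ms
```

On physical hardware the budget is set by the control loop rate, which is why the survey treats latency as a first-class constraint on runtime defenses rather than an implementation detail.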

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same timing grid could be applied to other embodied multimodal systems that are not strictly VLA.
  • Certified robustness techniques for trajectories would directly address one of the open problems listed in the survey.
  • A unified runtime safety layer might combine multiple inference-time defenses into a single lightweight architecture.

Load-bearing premise

The existing literature on VLA safety is fragmented enough that this particular two-axis timing organization adds clear value, and the cited papers already cover the main threats.

What would settle it

A sizable set of VLA safety papers whose threats or defenses cannot be placed on the training-versus-inference grid for attacks and defenses, or a major class of embodied threats absent from the survey.

read the original abstract

Vision-Language-Action (VLA) models are emerging as a unified substrate for embodied intelligence. This shift raises a new class of safety challenges, stemming from the embodied nature of VLA systems, including irreversible physical consequences, a multimodal attack surface across vision, language, and state, real-time latency constraints on defense, error propagation over long-horizon trajectories, and vulnerabilities in the data supply chain. Yet the literature remains fragmented across robotic learning, adversarial machine learning, AI alignment, and autonomous systems safety. This survey provides a unified and up-to-date overview of safety in Vision-Language-Action models. We organize the field along two parallel timing axes, attack timing (training-time vs. inference-time and defense timing (training-time vs. inference-time, linking each class of threat to the stage at which it can be mitigated. We first define the scope of VLA safety, distinguishing it from text-only LLM safety and classical robotic safety, and review the foundations of VLA models, including architectures, training paradigms, and inference mechanisms. We then examine the literature through four lenses: Attacks, Defenses, Evaluation, and Deployment. We survey training-time threats such as data poisoning and backdoors, as well as inference-time attacks including adversarial patches, cross-modal perturbations, semantic jailbreaks, and freezing attacks. We review training-time and runtime defenses, analyze existing benchmarks and metrics, and discuss safety challenges across six deployment domains. Finally, we highlight key open problems, including certified robustness for embodied trajectories, physically realizable defenses, safety-aware training, unified runtime safety architectures, and standardized evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims to deliver a unified survey of safety issues in Vision-Language-Action (VLA) models for embodied intelligence. It distinguishes VLA safety from text-only LLM safety and classical robotic safety by emphasizing embodied factors such as physical irreversibility, multimodal attack surfaces, latency constraints, trajectory error propagation, and data supply chain vulnerabilities. The central contribution is an organizational framework using two parallel timing axes—attack timing (training-time vs. inference-time) and defense timing (training-time vs. inference-time)—that maps threats to mitigation stages. The manuscript reviews VLA foundations, then surveys the literature through four lenses (Attacks, Defenses, Evaluation, Deployment), covering training-time threats like data poisoning and backdoors, inference-time attacks like adversarial patches and semantic jailbreaks, corresponding defenses, benchmarks, and domain-specific deployment challenges, before listing open problems such as certified robustness for trajectories and standardized evaluation.

Significance. If the two-axis organization proves effective at linking threats to mitigation stages, the survey would provide a valuable structured reference for the emerging VLA safety literature, which the abstract correctly notes is currently fragmented across robotic learning, adversarial ML, AI alignment, and autonomous systems. The explicit grounding in embodied specifics (e.g., irreversible physical consequences and cross-modal surfaces) strengthens its relevance beyond generic LLM safety surveys. The paper earns credit for adopting a standard yet appropriate timing-based lens without overclaiming exhaustiveness or superiority to all alternatives, and for clearly scoping the VLA definition before applying the framework.

minor comments (1)
  1. Abstract: the sentence introducing the two axes contains a missing closing parenthesis and awkward phrasing ('attack timing (training-time vs. inference-time and defense timing (training-time vs. inference-time, linking'), which reduces readability; this should be corrected to 'attack timing (training-time vs. inference-time) and defense timing (training-time vs. inference-time), linking'.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of our survey on Vision-Language-Action safety. The recommendation for minor revision is noted, and we appreciate the recognition of the two-axis organizational framework and the emphasis on embodied factors. As no specific major comments were raised in the report, we have no rebuttals to provide and will incorporate any editorial or minor suggestions in the revised version.

Circularity Check

0 steps flagged

No significant circularity, as expected for a literature survey

full rationale

This paper is a survey that reviews and organizes external literature on VLA safety without any original derivations, equations, fitted parameters, or predictions that could reduce to its own inputs by construction. The two-axis organization (attack timing and defense timing) is introduced as a conceptual lens to structure the review of threats, defenses, evaluations, and deployments, motivated by embodied characteristics listed in the abstract, but not derived from or equivalent to any self-referential content. No self-citations are load-bearing for a central claim, no uniqueness theorems are invoked from prior author work, and no ansatzes or renamings of known results occur. The scope definition and four-lens structure precede the organization, and the paper makes no claim that the axes are exhaustive or mathematically forced, rendering the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is a literature survey paper with no new mathematical models, empirical fits, or postulated entities; it relies on standard definitions from the cited fields.

axioms (1)
  • domain assumption VLA models form a unified substrate for embodied intelligence distinct from text-only LLMs and classical robotic systems.
    Invoked in the opening of the abstract to define scope.

pith-pipeline@v0.9.0 · 5612 in / 997 out tokens · 31062 ms · 2026-05-08T05:52:47.043400+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

104 extracted references · 61 canonical work pages · 22 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  3. [3]

    On Evaluation of Embodied Navigation Agents

    Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents.arXiv preprint arXiv:1807.06757, 2018

  4. [4]

    Covla: Comprehensive vision-language-action dataset for autonomous driving

    Hidehisa Arai, Keita Miwa, Kento Sasaki, Kohei Watanabe, Yu Yamaguchi, Shunsuke Aoki, and Issei Yamamoto. Covla: Comprehensive vision-language-action dataset for autonomous driving. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1933–1943. IEEE, 2025

  5. [5]

    Large language model-based task planning for service robots: A review.Biomimetic Intelligence and Robotics, page 100274, 2026

    Shaohan Bian, Ying Zhang, Guohui Tian, Zhiqiang Miao, Edmond Q Wu, Simon X Yang, and Changchun Hua. Large language model-based task planning for service robots: A review.Biomimetic Intelligence and Robotics, page 100274, 2026

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.pi_0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  7. [7]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

  8. [8]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  9. [9]

    Can vision-language models understand construction workers? an exploratory study.arXiv preprint arXiv:2601.10835, 2026

    Hieu Bui, Nathaniel E Chodosh, and Arash Tavakoli. Can vision-language models understand construction workers? an exploratory study.arXiv preprint arXiv:2601.10835, 2026

  10. [10]

    If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

    Jiamin Chang, Minhui Xue, Ruoxi Sun, Shuchao Pang, Salil S. Kanhere, and Hammond Pearce. If you’re waiting for a sign... that might not be it! mitigating trust boundary confusion from visual injections on vision-language agentic systems, 2026.https://arxiv.org/abs/2604.19844

  11. [11]

    SafeMind: Benchmarking and mitigating safety risks in embodied LLM agents.arXiv preprintarXiv:2509.25885, 2025

    Ruolin Chen, Yinqian Sun, Jihang Wang, Mingyang Lv, Qian Zhang, and Yi Zeng. Safemind: benchmarking and mitigating safety risks in embodied llm agents.arXiv preprint arXiv:2509.25885, 2025. 36

  12. [12]

    HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models

    Zixing Chen, Yifeng Gao, Li Wang, Yunhan Zhao, Yi Liu, Jiayu Li, Xiang Zheng, Zuxuan Wu, Cong Wang, Xingjun Ma, et al. Hazardarena: Evaluating semantic safety in vision-language-action models.arXiv preprint arXiv:2604.12447, 2026

  13. [13]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  14. [14]

    From words to safety: Language-conditioned safety filtering for robot navigation.arXiv preprint arXiv:2511.05889, 2025

    Zeyuan Feng, Haimingyue Zhang, and Somil Bansal. From words to safety: Language-conditioned safety filtering for robot navigation.arXiv preprint arXiv:2511.05889, 2025

  15. [15]

    Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation

    Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile aloha: Learning bimanual mobile manipulation with low-cost whole-body teleoperation.arXiv preprint arXiv:2401.02117, 2024

  16. [16]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  17. [17]

    Safe: Multitask failure detection for vision-language-action models.arXiv preprint arXiv:2506.09937, 2025

    Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Haruki Nishimura, Masha Itkina, and Florian Shkurti. Safe: Multitask failure detection for vision-language-action models.arXiv preprint arXiv:2506.09937, 2025

  18. [18]

    State backdoor: Towards stealthy real-world poisoning attack on vision-language-action model in state space, 2026

    Ji Guo, Wenbo Jiang, Yansong Lin, Yijing Liu, Ruichen Zhang, Guomin Lu, Aiguo Chen, Xinshuo Han, Hongwei Li, and Dusit Niyato. State backdoor: Towards stealthy real-world poisoning attack on vision-language-action model in state space, 2026

  19. [19]

    Embodied ai for smart robotic cells in manufacturing applications

    Satyandra K Gupta. Embodied ai for smart robotic cells in manufacturing applications. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 28630–28636, 2025

  20. [20]

    ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

    Kaiser Hamid, Can Cui, and Nade Liang. Icr-drive: Instruction counterfactual robustness for end-to-end language-driven autonomous driving.arXiv preprint arXiv:2604.05378, 2026

  21. [21]

    Run-time observation interventions make vision- language-action models more visually robust

    Asher J Hancock, Allen Z Ren, and Anirudha Majumdar. Run-time observation interventions make vision- language-action models more visually robust. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9499–9506. IEEE, 2025

  22. [22]

    Safety optimized reinforcement learning via multi-objective policy optimization, 2024

    Homayoun Honari, Mehran Ghafarian Tamizi, and Homayoun Najjaran. Safety optimized reinforcement learning via multi-objective policy optimization, 2024

  23. [23]

    Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

  24. [24]

    Vlsa: Vision-language-action models with plug-and-play safety constraint layer.arXivpreprint arXiv:2512.11891, 2025

    Songqiao Hu, Zeyi Liu, Shuang Liu, Jun Cen, Zihan Meng, and Xiao He. Vlsa: Vision-language-action models with plug-and-play safety constraint layer.arXiv preprint arXiv:2512.11891, 2025

  25. [25]

    Vision-language-action models for autonomous driving: Past, present, and future.arXiv preprint arXiv:2512.16760, 2025

    Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025

  26. [26]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. Voxposer: Composable 3d value maps for robotic manipulation with language models.arXiv preprint arXiv:2307.05973, 2023

  27. [27]

    Vlm-rl: A unified vision language models and reinforcement learning framework for safe autonomous driving.Transportation Research Part C: Emerging Technologies, 180:105321, 2025

    Zilin Huang, Zihao Sheng, Yansong Qu, Junwei You, and Sikai Chen. Vlm-rl: A unified vision language models and reinforcement learning framework for safe autonomous driving.Transportation Research Part C: Emerging Technologies, 180:105321, 2025

  28. [28]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  29. [29]

    Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

    Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, et al. Emma: End-to-end multimodal model for autonomous driving.arXiv preprint arXiv:2410.23262, 2024

  30. [30]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.pi_0.5: a vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025. 37

  31. [31]

    A survey on vision-language-action models for autonomous driving

    Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, et al. A survey on vision-language-action models for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4524–4536, 2025

  32. [32]

    Adversarial attacks on robotic vision language action models.arXivpreprintarXiv:2506.03350, 2025

    Eliot Krzysztof Jones, Alexander Robey, Andy Zou, Zachary Ravichandran, George J Pappas, Hamed Hassani, Matt Fredrikson, and J Zico Kolter. Adversarial attacks on robotic vision language action models.arXiv preprint arXiv:2506.03350, 2025

  33. [33]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  34. [34]

    Pedagogical alignment for vision-language-action models: A comprehensive framework for data, architecture, and evaluation in education, 2026

    Unggi Lee, Jahyun Jeong, Sunyoung Shin, Haeun Park, Jeongsu Moon, Youngchang Song, Jaechang Shim, JaeHwan Lee, Yunju Noh, Seungwon Choi, Ahhyun Kim, TaeHyeon Kim, Kyungtae Joo, Taeyeong Kim, and Gyeonggeon Lee. Pedagogical alignment for vision-language-action models: A comprehensive framework for data, architecture, and evaluation in education, 2026

  35. [35]

    The shawshank redemption of embodied ai: Understanding and benchmarking indirect environmental jailbreaks.arXiv preprint arXiv:2511.16347, 2025

    Chunyang Li, Zifeng Kang, Junwei Zhang, Zhuo Ma, Anda Cheng, Xinghua Li, and Jianfeng Ma. The shawshank redemption of embodied ai: Understanding and benchmarking indirect environmental jailbreaks.arXiv preprint arXiv:2511.16347, 2025

  36. [36]

    Attackvla: Bench- marking adversarial and backdoor attacks on vision-language-action models.arXiv preprint arXiv:2511.12149, 2025

    Jiayu Li, Yunhan Zhao, Xiang Zheng, Zonghuan Xu, Yige Li, Xingjun Ma, and Yu-Gang Jiang. Attackvla: Bench- marking adversarial and backdoor attacks on vision-language-action models.arXiv preprint arXiv:2511.12149, 2025

  37. [37]

    Robonurse-vla: Robotic scrub nurse system based on vision-language-action model

    Shunlei Li, Jin Wang, Rui Dai, Wanyu Ma, Wing Yin Ng, Yingbai Hu, and Zheng Li. Robonurse-vla: Robotic scrub nurse system based on vision-language-action model. In2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3986–3993. IEEE, 2025

  38. [38]

    Causal scene narration with runtime safety supervision for vision-language-action driving.arXiv preprint arXiv:2604.01723, 2026

    Yun Li, Yidu Zhang, Simon Thompson, Ehsan Javanmardi, and Manabu Tsukada. Causal scene narration with runtime safety supervision for vision-language-action driving.arXiv preprint arXiv:2604.01723, 2026

  39. [39]

    Code as policies: Language model programs for embodied control

    Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Code as policies: Language model programs for embodied control. In2023 IEEE International conference on robotics and automation (ICRA), pages 9493–9500. IEEE, 2023

  40. [40]

    arXiv preprint arXiv:2510.01642 , year=

    Zijun Lin, Jiafei Duan, Haoquan Fang, Dieter Fox, Ranjay Krishna, Cheston Tan, and Bihan Wen. Failsafe: Reasoning and recovery from failures in vision-language-action models.arXiv preprint arXiv:2510.01642, 2025

  41. [41]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

  42. [42]

    Evovla: Self-evolving vision-language-action model, 2025

    Zeting Liu, Zida Yang, Zeyu Zhang, and Hao Tang. Evovla: Self-evolving vision-language-action model, 2025

  43. [43]

    Reflect: Summarizing robot experiences for failure explanation and correction.arXiv preprint arXiv:2306.15724, 2023

    Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction.arXiv preprint arXiv:2306.15724, 2023

  44. [44]

    Human-in-the-loop online rejection sampling for robotic manipulation, 2025

    Guanxing Lu, Rui Zhao, Haitao Lin, He Zhang, and Yansong Tang. Human-in-the-loop online rejection sampling for robotic manipulation, 2025

  45. [45]

    Exploring the robustness of vision-language-action models against sensor attacks

    Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Ruochen Zhou, Xiaoyu Ji, and Wenyuan Xu. Exploring the robustness of vision-language-action models against sensor attacks. InProceedings of the 2025 Workshop on Large AI Systems and Models with Privacy and Security Analysis, pages 11–18, 2025

  46. [46]

    Phantom menace: Exploring and enhancing the robustness of vla models against physical sensor attacks

    Xuancun Lu, Jiaxiang Chen, Shilin Xiao, Zizhi Jin, Zhangrui Chen, Hanwen Yu, Bohan Qian, Ruochen Zhou, Xiaoyu Ji, and Wenyuan Xu. Phantom menace: Exploring and enhancing the robustness of vla models against physical sensor attacks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 35689–35697, 2026

  47. [47]

    Dronevla: Vla based aerial manipulation.arXiv preprint arXiv:2601.13809, 2026

    FawadMehboob, MonijesuJames, AmirHabel, JeffrinSam, MiguelAltamiranoCabrera, andDzmitryTsetserukou. Dronevla: Vla based aerial manipulation.arXiv preprint arXiv:2601.13809, 2026

  48. [48]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. 38

  49. [49]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  50. [50]

    Wristworld: Generating wrist-views via 4d world models for robotic manipulation.CoRR, abs/2510.07313, 2025

    Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, and Shanghang Zhang. Wristworld: Generating wrist-views via 4d world models for robotic manipulation.arXiv preprint arXiv:2510.07313, 2025

  51. [51]

    Vl-safe: Vision- language guided safety-aware reinforcement learning with world models for autonomous driving.arXiv preprint arXiv:2505.16377, 2025

    Yansong Qu, Zilin Huang, Zihao Sheng, Jiancong Chen, Sikai Chen, and Samuel Labi. Vl-safe: Vision- language guided safety-aware reinforcement learning with world models for autonomous driving.arXiv preprint arXiv:2505.16377, 2025

  52. [52]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  53. [53]

    VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models

    Ravi Ranjan and Agoritsa Polyzou. Vla-forget: Vision-language-action unlearning for embodied foundation models.arXiv preprint arXiv:2604.03956, 2026

  54. [54]

    How VLAs (Really) Work In Open-World Environments

    Amir Rasouli, Yangzheng Wu, Zhiyuan Li, Rui Heng Yang, Xuan Zhao, Charles Eret, and Sajjad Pakdamansavoji. How vlas (really) work in open-world environments, 2026.https://arxiv.org/abs/2604.21192

  55. [55]

    Safety guardrails for llm-enabled robots.IEEE Robotics and Automation Letters, 2026

    Zachary Ravichandran, Alexander Robey, Vijay Kumar, George J Pappas, and Hamed Hassani. Safety guardrails for llm-enabled robots.IEEE Robotics and Automation Letters, 2026

  56. [56]

    Jailbreaking llm-controlled robots

    Alexander Robey, Zachary Ravichandran, Vijay Kumar, Hamed Hassani, and George J Pappas. Jailbreaking llm-controlled robots. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 11948–11956. IEEE, 2025

  57. [57]

    VLA-risk: Benchmarking vision-language-action models with physical robustness, 2026.https://openreview.net/forum?id=31EjDFwFEe

    Yanchi Ru, Zhengyue Zhao, YingziYingzi Ma, Xiaogeng Liu, and Chaowei Xiao. VLA-risk: Benchmarking vision-language-action models with physical robustness, 2026.https://openreview.net/forum?id=31EjDFwFEe

  58. [58]

    Safe-smart: Safety analysis and formal evaluation using stl metrics for autonomous robots.arXiv preprint arXiv:2511.17781, 2025

    Kristy Sakano, Jianyu An, Dinesh Manocha, and Huan Xu. Safe-smart: Safety analysis and formal evaluation using stl metrics for autonomous robots.arXiv preprint arXiv:2511.17781, 2025

  59. [59]

    Costnav: A navigation benchmark for cost-aware evaluation of embodied agents.arXiv preprint arXiv:2511.20216, 2025

    Haebin Seong, Sungmin Kim, Minchan Kim, Yongjun Cho, Myunchul Joe, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Yoonshik Kim, Samwoo Seong, et al. Costnav: A navigation benchmark for cost-aware evaluation of embodied agents.arXiv preprint arXiv:2511.20216, 2025

  60. [60]

    Generating robot constitutions & benchmarks for semantic safety.arXiv preprint arXiv:2503.08663, 2025

    Pierre Sermanet, Anirudha Majumdar, Alex Irpan, Dmitry Kalashnikov, and Vikas Sindhwani. Generating robot constitutions & benchmarks for semantic safety.arXiv preprint arXiv:2503.08663, 2025

  61. [61]

    Vlm- social-nav: Socially aware robot navigation through scoring using vision-language models.IEEE Robotics and Automation Letters, 10(1):508–515, 2024

    Daeun Song, Jing Liang, Amirreza Payandeh, Amir Hossain Raj, Xuesu Xiao, and Dinesh Manocha. Vlm- social-nav: Socially aware robot navigation through scoring using vision-language models.IEEE Robotics and Automation Letters, 10(1):508–515, 2024

  62. [62]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  63. [63]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

  64. [64]

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  65. [65]

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

  66. [66]

    Maximilian Tölle, Theo Gruner, Daniel Palenicek, Tim Schneider, Jonas Günster, Joe Watson, Davide Tateo, Puze Liu, and Jan Peters. Towards safe robot foundation models using inductive biases. arXiv preprint arXiv:2505.10219, 2025

  67. [67]

    Pablo Valle, Chengjie Lu, Shaukat Ali, and Aitor Arrieta. Evaluating uncertainty and quality of visual language action-enabled robots. arXiv preprint arXiv:2507.17049, 2025

  68. [68]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  69. [69]

    Akifumi Wachi, Xun Shen, and Yanan Sui. A survey of constraint formulations in safe reinforcement learning. arXiv preprint arXiv:2402.02025, 2024

  70. [70]

    Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren. Surgical-lvlm: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. arXiv preprint arXiv:2405.10948, 2024

  71. [71]

    Le Wang, Zonghao Ying, Xiao Yang, Quanchen Zou, Zhenfei Yin, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, and Xianglong Liu. Robosafe: Safeguarding embodied agents via executable safety logic, 2025

  72. [72]

    Meng Wang, Yohei Hayamizu, Matthew Tang, Kevin Gopalan, Shiqi Zhang, and Ping Yang. Physical attacks on robot navigation systems. In RSS 2025 Workshop on Reliable Robotics: Safety and Security in the Face of Generative AI, 2025. https://openreview.net/forum?id=A4AWclA4aC

  73. [73]

    Taowen Wang, Cheng Han, James Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, and Ruixiang Tang. Exploring the adversarial vulnerabilities of vision-language-action models in robotics. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6948–6958, 2025

  74. [74]

    Xin Wang, Jie Li, Zejia Weng, Yixu Wang, Yifeng Gao, Tianyu Pang, Chao Du, Yan Teng, Yingchun Wang, Zuxuan Wu, et al. Freezevla: Action-freezing attacks against vision-language-action models. arXiv preprint arXiv:2509.19870, 2025

  75. [75]

    Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, and Lei Ma. Vlatest: Testing and evaluating vision-language-action models for robotic manipulation. Proceedings of the ACM on Software Engineering, 2(FSE):1615–1638, 2025

  76. [76]

    Wenke Xia, Yichu Yang, Hongtao Wu, Xiao Ma, Tao Kong, and Di Hu. Human-assisted robotic policy refinement via action preference optimization, 2025

  77. [77]

    Tian-Yu Xiang, Ao-Qun Jin, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Sheng-Bin Duan, Fu-Chao Xie, Wen-Kai Wang, et al. Parallels between vla model post-training and human motor learning: Progress, challenges, and trends. arXiv preprint arXiv:2506.20966, 2025

  78. [78]

    Bingxin Xu, Yuzhang Shang, Binghui Wang, and Emilio Ferrara. Silentdrift: Exploiting action chunking for stealthy backdoor attacks on vision-language-action models, 2026

  79. [79]

    Siyu Xu, Zijian Wang, Yunke Wang, Chenghao Xia, Tao Huang, and Chang Xu. Affordance field intervention: Enabling vlas to escape memory traps in robotic manipulation. arXiv preprint arXiv:2512.07472, 2025

  80. [80]

    Zonghuan Xu, Jiayu Li, Yunhan Zhao, Xiang Zheng, Xingjun Ma, and Yu-Gang Jiang. Dropvla: An action-level backdoor attack on vision-language-action models, 2026

Showing first 80 references.