HazardArena: Evaluating Semantic Safety in Vision-Language-Action Models
Pith reviewed 2026-05-10 15:23 UTC · model grok-4.3
The pith
VLA models trained on safe scenarios often fail to behave safely in semantically risky versions of the same tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA models inherit rich world knowledge from vision-language backbones, yet their action policies remain only loosely coupled to semantic safety: correct execution of a required action can still produce unsafe outcomes in risk-bearing semantic contexts. HazardArena exposes this vulnerability through safe/unsafe twin scenarios and shows that a training-free Safety Option Layer substantially reduces unsafe behaviors with minimal impact on task performance.
What carries the argument
HazardArena benchmark constructed from safe/unsafe twin scenarios that share identical objects, layouts, and action requirements while differing only in the semantic context that determines risk.
If this is right
- Evaluations focused only on action execution success miss systematic semantic safety vulnerabilities in VLA models.
- Training exclusively on safe scenarios is insufficient to produce safe behavior when semantic context signals risk.
- A training-free Safety Option Layer using semantic attributes or a vision-language judge can constrain unsafe actions with little cost to task performance.
- Standardized testing across 40 tasks and 7 risk categories grounded in robotic safety standards is needed to evaluate semantic safety before deployment.
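The Safety Option Layer is described here only at a high level. A minimal sketch of the gating idea it names, where every function, class, and the attribute-lookup judge are hypothetical stand-ins rather than the paper's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    params: tuple = ()

# Hypothetical judge: returns True if executing `action` in `context` would be
# unsafe. The paper instantiates this with semantic attributes or a
# vision-language model; here it is a simple attribute lookup.
def attribute_judge(context: dict, action: Action) -> bool:
    hazardous = context.get("hazard_attributes", set())
    return any(obj in hazardous for obj in action.params)

def safety_option_layer(policy: Callable[[dict], Action],
                        judge: Callable[[dict, Action], bool],
                        context: dict) -> Action:
    """Training-free gate: veto the policy's proposed action when the judge flags it."""
    proposed = policy(context)
    if judge(context, proposed):
        return Action("halt")  # fall back to a safe no-op option
    return proposed

# Toy usage: the same pick action is allowed in the safe twin scenario and
# vetoed in the unsafe twin, where the cup is flagged as hazardous.
pick_cup = lambda ctx: Action("pick", ("cup",))
safe_ctx = {"hazard_attributes": set()}
unsafe_ctx = {"hazard_attributes": {"cup"}}
print(safety_option_layer(pick_cup, attribute_judge, safe_ctx).name)    # pick
print(safety_option_layer(pick_cup, attribute_judge, unsafe_ctx).name)  # halt
```

The point of the sketch is that the layer sits entirely outside the VLA policy, which is what makes it training-free.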
Where Pith is reading between the lines
- If the twin-scenario construction isolates semantic context reliably, the same pairing method could be applied to test safety gaps in other multimodal models beyond robotics.
- Real-world use may require continuous semantic monitoring rather than one-time training fixes to maintain safety as environments change.
- Adding dynamic elements or multi-step risk chains to the benchmark could reveal whether the identified vulnerabilities compound over time.
Load-bearing premise
The constructed safe/unsafe twin scenarios differ only in semantic context while sharing identical objects, layouts, and action requirements, and these pairs faithfully represent real-world risk categories.
What would settle it
VLA models achieving comparable safety rates and task success in both the safe and unsafe members of each twin pair, or the Safety Option Layer failing to reduce unsafe actions without lowering task performance.
Original abstract
Vision-Language-Action (VLA) models inherit rich world knowledge from vision-language backbones and acquire executable skills via action demonstrations. However, existing evaluations largely focus on action execution success, leaving action policies loosely coupled with visual-linguistic semantics. This decoupling exposes a systematic vulnerability whereby correct action execution may induce unsafe outcomes under semantic risk. To expose this vulnerability, we introduce HazardArena, a benchmark designed to evaluate semantic safety in VLAs under controlled yet risk-bearing contexts. HazardArena is constructed from safe/unsafe twin scenarios that share matched objects, layouts, and action requirements, differing only in the semantic context that determines whether an action is unsafe. We find that VLA models trained exclusively on safe scenarios often fail to behave safely when evaluated in their corresponding unsafe counterparts. HazardArena includes over 2,000 assets and 40 risk-sensitive tasks spanning 7 real-world risk categories grounded in established robotic safety standards. To mitigate this vulnerability, we propose a training-free Safety Option Layer that constrains action execution using semantic attributes or a vision-language judge, substantially reducing unsafe behaviors with minimal impact on task performance. We hope that HazardArena highlights the need to rethink how semantic safety is evaluated and enforced in VLAs as they scale toward real-world deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HazardArena, a benchmark consisting of over 2,000 assets and 40 risk-sensitive tasks across 7 categories grounded in robotic safety standards. It constructs safe/unsafe twin scenarios that share matched objects, layouts, and action requirements but differ in semantic context, reports that VLA models trained only on safe scenarios frequently fail to act safely in the unsafe twins, and proposes a training-free Safety Option Layer (using semantic attributes or a vision-language judge) that reduces unsafe behaviors with minimal impact on task success.
Significance. If the twin-scenario controls hold, the work identifies a concrete semantic-safety gap in current VLA pipelines that standard action-success metrics miss, supplies a reproducible benchmark grounded in established safety standards, and demonstrates a lightweight mitigation that preserves task performance. These elements would strengthen the case for treating semantic risk as a first-class evaluation target in scalable robotic learning.
major comments (2)
- [§3 (Benchmark Construction)] The central empirical claim—that failures are attributable to semantic risk rather than distribution shift—rests on the assertion that each unsafe scenario is identical to its safe twin in objects, layouts, and required motor sequences (abstract and §3). No quantitative equivalence metrics (e.g., 3D IoU on layouts, visual embedding cosine similarity, or action-sequence edit distance) or inter-rater reliability statistics on the risk labels are reported, leaving open the possibility that observed drops reflect standard generalization failures.
- [§4 (Experiments) and abstract] The abstract states that VLA models “often fail” in unsafe counterparts and that the Safety Option Layer “substantially reduc[es] unsafe behaviors,” yet no per-task success rates, standard errors, statistical tests, or breakdown by the 7 risk categories are supplied. Without these, it is impossible to judge whether the performance gap is reliable or whether the mitigation preserves task success uniformly.
minor comments (2)
- [§5] The description of the Safety Option Layer would benefit from a concise pseudocode or diagram showing the exact interface between the VLA policy and the semantic constraint or VL judge.
- [Figures and Tables] Table or figure captions should explicitly state the number of trials per task and whether results are averaged over multiple random seeds.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments identify valuable opportunities to strengthen the empirical support for our claims regarding semantic safety in VLAs. We respond to each major comment below and will update the manuscript accordingly.
Point-by-point responses
-
Referee: [§3 (Benchmark Construction)] The central empirical claim—that failures are attributable to semantic risk rather than distribution shift—rests on the assertion that each unsafe scenario is identical to its safe twin in objects, layouts, and required motor sequences (abstract and §3). No quantitative equivalence metrics (e.g., 3D IoU on layouts, visual embedding cosine similarity, or action-sequence edit distance) or inter-rater reliability statistics on the risk labels are reported, leaving open the possibility that observed drops reflect standard generalization failures.
Authors: We agree that quantitative equivalence metrics would provide stronger verification that performance differences arise from semantic risk rather than unintended distribution shifts. The twin scenarios were constructed by design to match objects, layouts, and motor sequences exactly, differing only in the semantic context that renders an action unsafe. In the revised manuscript we will add 3D IoU scores for layout overlap, cosine similarities of visual embeddings extracted from corresponding safe/unsafe image pairs, and edit distances between the required action sequences. We will also report inter-rater reliability (Cohen’s kappa) for the risk-category annotations. These additions will directly address the concern and allow readers to assess the tightness of the controls. revision: yes
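The equivalence metrics promised here are standard quantities. A self-contained sketch of two of them, embedding cosine similarity and action-sequence edit distance, where the example vectors and action names are purely illustrative:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def edit_distance(seq_a, seq_b):
    """Levenshtein distance between two action sequences (1-row DP)."""
    m, n = len(seq_a), len(seq_b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (seq_a[i - 1] != seq_b[j - 1]))  # substitution
            prev = cur
    return dp[n]

# Twin scenarios are meant to require identical motor sequences, so the
# expected edit distance between a safe/unsafe pair is 0.
safe_actions = ["approach", "grasp", "lift", "place"]
unsafe_actions = ["approach", "grasp", "lift", "place"]
print(edit_distance(safe_actions, unsafe_actions))  # 0
```

A nonzero edit distance or a low embedding similarity within a twin pair would flag exactly the kind of unintended distribution shift the referee worries about.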
-
Referee: [§4 (Experiments) and abstract] The abstract states that VLA models “often fail” in unsafe counterparts and that the Safety Option Layer “substantially reduc[es] unsafe behaviors,” yet no per-task success rates, standard errors, statistical tests, or breakdown by the 7 risk categories are supplied. Without these, it is impossible to judge whether the performance gap is reliable or whether the mitigation preserves task success uniformly.
Authors: We concur that aggregate results alone are insufficient to substantiate the abstract claims or to evaluate uniformity across categories. The original submission presented summary statistics; the revised §4 will include per-task success rates for both safe and unsafe twins, standard errors across repeated evaluation runs, and statistical significance tests (paired t-tests) on the observed gaps and on the improvements yielded by the Safety Option Layer. We will also add a per-category breakdown across the seven risk types to demonstrate whether the mitigation effect holds uniformly. These changes will make the reliability and scope of the findings transparent. revision: yes
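The promised paired t-tests reduce to a simple statistic over per-task differences. A sketch using hypothetical success rates (not numbers from the paper):

```python
import math

def paired_t_statistic(safe_scores, unsafe_scores):
    """t statistic for paired samples, e.g. per-task success on twin pairs."""
    diffs = [s - u for s, u in zip(safe_scores, unsafe_scores)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Illustrative per-task success rates; a large t on the safe-minus-unsafe
# differences would support the claimed performance gap.
safe = [0.90, 0.85, 0.92, 0.88, 0.91]
unsafe = [0.40, 0.35, 0.50, 0.42, 0.45]
t = paired_t_statistic(safe, unsafe)
print(round(t, 2))
```

In practice one would use a library routine (e.g. a paired-sample t-test with reported p-values) and pair each task's safe score with its unsafe twin, which is what makes the twin construction statistically convenient.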
Circularity Check
No circularity: empirical benchmark evaluation with independent construction
Full rationale
The paper's central contribution is the introduction of the HazardArena benchmark via newly constructed safe/unsafe twin scenarios, followed by empirical evaluation of existing VLA models on those scenarios and a proposed training-free mitigation layer. No mathematical derivation chain, fitted parameters, or predictions are present that reduce to the paper's own inputs by construction. The twin-scenario construction and risk categories are grounded in external robotic safety standards rather than self-referential definitions or self-citations. The observed model failures are direct test outcomes on the benchmark, not quantities forced by prior fits or ansatzes within the paper. This is a standard empirical benchmark paper with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: safe and unsafe scenarios can be constructed that share identical objects, layouts, and required actions while differing only in semantic risk context.
invented entities (1)
- Safety Option Layer (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms. A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.
Reference graph
Works this paper leans on
- [1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
- [2] Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, and Thomas Wolf. LeRobot: State-of-the-art machine learning for real-world robotics in PyTorch, 2024.
- [3] Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. In RSS, 2024.
- [4] Google DeepMind. Shaping the future of advanced robotics, 2026. URL https://deepmind.google/blog/shaping-the-future-of-advanced-robotics/. Accessed: January 28, 2026.
- [5] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. ManiSkill2: A unified benchmark for generalizable manipulation skills. In ICLR, 2023.
- [6] Songqiao Hu, Zeyi Liu, Shuang Liu, Jun Cen, Zihan Meng, and Xiao He. VLSA: Vision-language-action models with plug-and-play safety constraint layer. arXiv preprint arXiv:2512.11891, 2025.
- [7] Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. NORA: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854, 2025.
- [8] International Organization for Standardization. ISO 13482:2014 robots and robotic devices — safety requirements for personal care robots, 2014. URL https://www.iso.org/standard/53820.html. Accessed 2026-01-28.
- [9] Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
- [10] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In CoRL, 2024.
- [11] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025.
- [12] Aishan Liu, Zonghao Ying, Le Wang, Junjie Mu, Jinyang Guo, Jiakai Wang, Yuqing Ma, Siyuan Liang, Mingchuan Zhang, Xianglong Liu, et al. AgentSafe: Benchmarking the safety of embodied agents on hazardous instructions. arXiv preprint arXiv:2506.14697, 2025.
- [13] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In NeurIPS, 2023.
- [14] Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, and Jing Shao. IS-Bench: Evaluating interactive safety of VLM-driven embodied agents in daily household tasks. arXiv preprint arXiv:2506.16402, 2025.
- [15] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In RSS, 2024.
- [16] Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. SpatialVLA: Exploring spatial representations for visual-language-action model. In RSS, 2025.
- [17] Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, et al. VLA-Adapter: An effective paradigm for tiny-scale vision-language-action model. arXiv preprint arXiv:2509.09372, 2025.
- [18] Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, and Siheng Chen. SafeAgentBench: A benchmark for safe task planning of embodied LLM agents. arXiv preprint arXiv:2412.13178, 2024.
- [19] Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, and Yaodong Yang. SafeVLA: Towards safety alignment of vision-language-action model via constrained learning. In NeurIPS, 2025.
- [20] Shiduo Zhang, Zhe Xu, Peiju Liu, Xiaopeng Yu, Yuan Li, Qinghui Gao, Zhaoye Fei, Zhangyue Yin, Zuxuan Wu, Yu-Gang Jiang, et al. VLABench: A large-scale benchmark for language-conditioned robotics manipulation with long-horizon reasoning tasks. In ICCV, 2025.
- [21] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.
discussion (0)