pith. machine review for the scientific record.

arxiv: 2604.05498 · v1 · submitted 2026-04-07 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

JailWAM: Jailbreaking World Action Models in Robot Control

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 19:18 UTC · model grok-4.3

classification 💻 cs.RO
keywords jailbreak attacks · world action models · robot control · safety evaluation · visual trajectory mapping · risk discriminator · dual-path verification · robotic safety

The pith

JailWAM shows that World Action Models in robot control can be jailbroken to produce unsafe physical motions at an 84.2 percent attack success rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JailWAM as the first framework to attack and measure safety in World Action Models, AI systems that jointly predict future states and robot actions. It defines a Three-Level Safety Classification for robotic motions and builds three components: visual-trajectory mapping to standardize different action formats, a risk discriminator to spot dangerous patterns, and a dual-path check that combines quick image-based screening with full simulation. Experiments on the RoboTwin platform reach an 84.2 percent attack success rate against a leading model and include a new benchmark for safety testing. This matters because these models enable stronger physical control yet risk direct harm to people and surroundings if left unprotected.

Core claim

The central claim is that World Action Models, while powerful for physical prediction, contain exploitable safety gaps. JailWAM exposes them through three components: Visual-Trajectory Mapping to unify action spaces, a Risk Discriminator for high-recall screening of destructive behaviors, and a Dual-Path Verification Strategy that first uses single-image generation for coarse filtering and then full closed-loop simulation for confirmation. Together these yield an 84.2 percent attack success rate on the state-of-the-art LingBot-VA model, alongside the JailWAM-Bench for systematic evaluation.

What carries the argument

The JailWAM framework: Visual-Trajectory Mapping converts heterogeneous actions into comparable visual paths, a Risk Discriminator screens those paths for harmful patterns, and Dual-Path Verification runs a rapid coarse check followed by a thorough physical-simulation check.
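The paper publishes no reference code here; as a minimal sketch of how a dual-path check of this shape could be wired together, assuming a cheap single-frame risk score and an expensive closed-loop simulation (both stubbed below, not the authors' implementations):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    prompt: str
    flagged: bool      # coarse single-frame screen fired
    confirmed: bool    # full simulation confirmed unsafe motion (only run when flagged)


def dual_path_verify(
    prompts: List[str],
    coarse_risk: Callable[[str], float],   # cheap, high-recall screen on one generated frame
    simulate: Callable[[str], bool],       # expensive closed-loop rollout; True = unsafe motion
    threshold: float = 0.5,
) -> List[Verdict]:
    """Screen every prompt cheaply; escalate only flagged ones to full simulation."""
    verdicts = []
    for p in prompts:
        flagged = coarse_risk(p) >= threshold
        confirmed = simulate(p) if flagged else False
        verdicts.append(Verdict(p, flagged, confirmed))
    return verdicts


# Stub demo: a keyword "discriminator" and a simulator that always confirms.
demo = dual_path_verify(
    ["pick up the cup", "smash the vase"],
    coarse_risk=lambda p: 0.9 if "smash" in p else 0.1,
    simulate=lambda p: True,
)
```

The structural point is that the expensive simulator is never invoked for prompts the coarse screen clears, which is why the screen's recall, not its precision, is the safety-critical number.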

If this is right

  • Current World Action Models lack sufficient built-in safeguards against prompts that induce dangerous physical actions.
  • The JailWAM-Bench supplies a repeatable way to measure and compare safety alignment across different model architectures.
  • Defense strategies can be developed by analyzing the failure modes identified through visual-trajectory and simulation testing.
  • Efficient screening tools like the Risk Discriminator make large-scale safety audits of robot predictors practical.
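The practicality claim in the last bullet is ultimately arithmetic. With illustrative numbers that are not from the paper (say a coarse screen costs 1 s per prompt, a full closed-loop rollout 60 s, and the screen flags 20 percent of prompts for confirmation), the expected per-prompt cost of dual-path screening is:

```python
def expected_cost_per_prompt(c_screen: float, c_sim: float, flag_rate: float) -> float:
    """Expected cost when every prompt is screened but only flagged ones are simulated."""
    return c_screen + flag_rate * c_sim


# Illustrative timings only; the material above reports no per-prompt costs.
cost = expected_cost_per_prompt(c_screen=1.0, c_sim=60.0, flag_rate=0.2)  # 13.0 s
speedup = 60.0 / cost  # vs. simulating every prompt
```

The trade-off the paper names is visible here: lowering the flag rate cuts cost further, but any unsafe prompt the screen misses is never simulated at all, which is why the Risk Discriminator is tuned for high recall.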

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mapping and verification steps could be applied to test safety in other embodied prediction systems such as autonomous navigation models.
  • Running the framework on additional simulation environments would test whether the reported success rate depends on RoboTwin-specific features.
  • Embedding the risk discriminator inside model training loops might reduce vulnerabilities before deployment rather than only detecting them afterward.

Load-bearing premise

The argument rests on two premises: that the Three-Level Safety Classification and the RoboTwin simulation accurately capture real physical risks, and that the visual-trajectory conversion preserves the essential properties of the original action spaces.
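The second premise can be made concrete. A minimal sketch of one way an action-to-visual-path conversion might work, assuming a pinhole camera model with hypothetical intrinsics (the paper does not specify its mapping):

```python
from typing import List, Tuple


def project_waypoints(
    waypoints: List[Tuple[float, float, float]],
    fx: float = 500.0, fy: float = 500.0,   # hypothetical focal lengths (pixels)
    cx: float = 320.0, cy: float = 240.0,   # hypothetical principal point
) -> List[Tuple[float, float]]:
    """Project 3-D end-effector waypoints (camera frame, z forward) onto the image plane."""
    pixels = []
    for x, y, z in waypoints:
        if z <= 0:
            raise ValueError("waypoint behind the camera cannot be drawn")
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


# A straight thrust toward the camera collapses to a single pixel: perspective
# projection discards depth, exactly the kind of property the premise requires
# the real mapping to preserve or compensate for.
trace = project_waypoints([(0.0, 0.0, 1.0), (0.0, 0.0, 0.5)])
```

If an unsafe motion is invisible in the projected trace, the Risk Discriminator never sees it, so the choice of viewpoint and projection is itself load-bearing.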

What would settle it

A decisive test is transfer to hardware: if the same jailbreak prompts produce no unsafe arm motions when moved from the RoboTwin simulation to a physical robot, the exposed vulnerabilities are simulation artifacts; if they do, the framework's physical-safety claim stands.

Figures

Figures reproduced from arXiv: 2604.05498 by Chao Li, Hanqing Liu, Jiacheng Hou, Jiahuan Long, Jialiang Sun, Songping Wang, Tingsong Jiang, Wei Peng, Wen Yao, Xu Liu, Yang Yang, Yao Mu.

Figure 1. Visualizing the motion safety levels of World Action Models under normal and jailbreak instructions. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]
Figure 2. Comparison of jailbreak consequences across dif… [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]
Figure 3. Overview of the proposed JailWAM framework. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]
Figure 4. The finetuned pipeline of the Risk Discriminator (RD) and its performance evaluation. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]
Figure 3. To address the prohibitive computational cost of verifying… [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]
Figure 5. Qualitative results demonstrating jailbreak attack robustness. The same jailbreak instruction consistently elicits… [PITH_FULL_IMAGE:figures/full_fig_p006_5.png]
Figure 6. Cross-seed reliability of LLM-Generated prompts… [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]
Figure 7. Evaluation of the inference-time defense. We com… [PITH_FULL_IMAGE:figures/full_fig_p008_7.png]
read the original abstract

The World Action Model (WAM) can jointly predict future world states and actions, exhibiting stronger physical manipulation capabilities compared with traditional models. Such powerful physical interaction ability is a double-edged sword: if safety is ignored, it will directly threaten personal safety, property security and environmental safety. However, existing research pays extremely limited attention to the critical security gap: the vulnerability of WAM to jailbreak attacks. To fill this gap, we define the Three-Level Safety Classification Framework to systematically quantify the safety of robotic arm motions. Furthermore, we propose JailWAM, the first dedicated jailbreak attack and evaluation framework for WAM, which consists of three core components: (1) Visual-Trajectory Mapping, which unifies heterogeneous action spaces into visual trajectory representations and enables cross-architectural unified evaluation; (2) Risk Discriminator, which serves as a high-recall screening tool that optimizes the efficiency-accuracy trade-off when identifying destructive behaviors in visual trajectories; (3) Dual-Path Verification Strategy, which first conducts rapid coarse screening via a single-image-based video-action generation module, and then performs efficient and comprehensive verification through full closed-loop physical simulation. In addition, we construct JailWAM-Bench, a benchmark for comprehensively evaluating the safety alignment performance of WAM under jailbreak attacks. Experiments in RoboTwin simulation environment demonstrate that the proposed framework efficiently exposes physical vulnerabilities, achieving an 84.2% attack success rate on the state-of-the-art LingBot-VA. Meanwhile, robust defense mechanisms can be constructed based on JailWAM, providing an effective technical solution for designing safe and reliable robot control systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces JailWAM as the first dedicated jailbreak attack and evaluation framework for World Action Models (WAMs) in robot control. It defines a Three-Level Safety Classification Framework to quantify robotic arm motion safety, proposes three core components—Visual-Trajectory Mapping to unify heterogeneous action spaces into visual representations, a Risk Discriminator for high-recall screening of destructive behaviors, and a Dual-Path Verification Strategy combining single-image coarse screening with full closed-loop physical simulation—and constructs the JailWAM-Bench benchmark. Experiments in the RoboTwin simulator report an 84.2% attack success rate on the state-of-the-art LingBot-VA model, with the framework also positioned as a basis for constructing defense mechanisms.

Significance. If the simulation results generalize, the work would be significant for identifying a previously understudied security gap in physically capable WAMs and for supplying a unified evaluation methodology and benchmark that could guide safer robot control system design. The empirical focus on cross-architectural attack transfer via visual trajectories offers a practical contribution to robotics security literature.

major comments (3)
  1. [Experiments section (RoboTwin evaluation)] Experiments section (RoboTwin evaluation): The headline 84.2% attack success rate on LingBot-VA is obtained exclusively inside the RoboTwin simulator via Visual-Trajectory Mapping and Dual-Path Verification. The central claim that JailWAM 'efficiently exposes physical vulnerabilities' therefore depends on the untested assumption that simulation trajectories correspond to equivalent real-world unsafe physical interactions. No real-robot deployment, sim-to-real transfer experiments, or analysis of dynamics mismatch, sensor noise, and contact modeling gaps is described, which is load-bearing for the physical safety conclusions.
  2. [Three-Level Safety Classification Framework (Section 3)] Three-Level Safety Classification Framework (Section 3): The framework is presented as a systematic quantifier of safety for robotic motions, yet the manuscript supplies no details on its validation against real-world harm (e.g., calibration to physical injury metrics, inter-annotator agreement, or comparison with established robotics safety standards). This directly affects the interpretability and reliability of the reported attack success rate.
  3. [Evaluation on JailWAM-Bench] Evaluation on JailWAM-Bench: The results lack explicit baseline comparisons to prior jailbreak techniques, statistical significance testing for the 84.2% ASR, data exclusion criteria, or ablation studies isolating the contribution of each component (Visual-Trajectory Mapping, Risk Discriminator, Dual-Path Verification). These omissions limit assessment of whether the framework advances the state of the art.
minor comments (2)
  1. [Abstract and Conclusion] The abstract states that 'robust defense mechanisms can be constructed based on JailWAM' but the main text provides only high-level mention without concrete defense implementations, evaluations, or quantitative results; this should be expanded or the claim tempered.
  2. [Figures] Figure captions and diagrams illustrating the Dual-Path Verification Strategy and Visual-Trajectory Mapping would benefit from additional labels and step-by-step annotations to improve clarity for readers unfamiliar with the WAM architectures.

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of simulation fidelity, framework validation, and empirical rigor that we will address in the revision. We provide point-by-point responses below.

read point-by-point responses
  1. Referee: Experiments section (RoboTwin evaluation): The headline 84.2% attack success rate on LingBot-VA is obtained exclusively inside the RoboTwin simulator via Visual-Trajectory Mapping and Dual-Path Verification. The central claim that JailWAM 'efficiently exposes physical vulnerabilities' therefore depends on the untested assumption that simulation trajectories correspond to equivalent real-world unsafe physical interactions. No real-robot deployment, sim-to-real transfer experiments, or analysis of dynamics mismatch, sensor noise, and contact modeling gaps is described, which is load-bearing for the physical safety conclusions.

    Authors: We agree that all reported results are obtained in the RoboTwin simulator and that no real-robot or sim-to-real experiments are included. The simulator provides a controlled environment for closed-loop physical simulation, but we acknowledge that dynamics mismatch, sensor noise, and contact modeling differences remain untested. We will revise the abstract, introduction, and conclusion to qualify claims about 'physical vulnerabilities' as referring to simulated environments. A new limitations subsection will explicitly discuss these gaps and frame real-world transfer as important future work. This revision will ensure the safety conclusions are appropriately scoped to the simulation setting. revision: partial

  2. Referee: Three-Level Safety Classification Framework (Section 3): The framework is presented as a systematic quantifier of safety for robotic motions, yet the manuscript supplies no details on its validation against real-world harm (e.g., calibration to physical injury metrics, inter-annotator agreement, or comparison with established robotics safety standards). This directly affects the interpretability and reliability of the reported attack success rate.

    Authors: The Three-Level Safety Classification Framework categorizes motions according to observable kinematic and interaction properties (velocity thresholds, proximity to humans/objects, and potential for collision or damage). We did not provide calibration to physical injury metrics or inter-annotator studies in the submitted version. We will expand Section 3 with a more detailed rationale, explicit mapping to ISO 10218 and related robotics safety guidelines, and a clear statement that the levels serve as an initial proxy for evaluation rather than a fully validated harm metric. We will also note the requirement for future empirical calibration as a limitation. revision: partial

  3. Referee: Evaluation on JailWAM-Bench: The results lack explicit baseline comparisons to prior jailbreak techniques, statistical significance testing for the 84.2% ASR, data exclusion criteria, or ablation studies isolating the contribution of each component (Visual-Trajectory Mapping, Risk Discriminator, Dual-Path Verification). These omissions limit assessment of whether the framework advances the state of the art.

    Authors: We accept that the current evaluation section would benefit from these additions. In the revised manuscript we will: (i) adapt and compare against representative prior jailbreak approaches from the LLM and vision-language literature where applicable to the WAM setting; (ii) report statistical significance and confidence intervals for the 84.2% ASR; (iii) document the data exclusion criteria used when constructing JailWAM-Bench; and (iv) present ablation results that isolate the contribution of Visual-Trajectory Mapping, the Risk Discriminator, and the Dual-Path Verification Strategy. These changes will allow readers to better gauge the framework's incremental contribution. revision: yes
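The promised confidence interval is easy to pre-register. A sketch using the Wilson score interval, with a hypothetical sample size of 500 attack prompts (the actual n behind the 84.2% figure is not stated in the material above):

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95% coverage)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half


# 421/500 = 84.2% under the hypothetical n; the 95% interval spans
# roughly 80.7% to 87.1%, so even a generous n leaves several points of slack.
lo, hi = wilson_interval(421, 500)
```

The interval width, not the point estimate, is what baseline comparisons against prior jailbreak techniques would have to clear.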

standing simulated objections not resolved
  • Real-robot deployment, sim-to-real transfer experiments, and analysis of dynamics mismatch/sensor noise/contact modeling gaps (first major comment).
  • Empirical validation of the Three-Level Safety Classification Framework against real-world physical injury metrics, inter-annotator agreement, or direct comparison with established safety standards (second major comment).

Circularity Check

0 steps flagged

No circularity: empirical measurement of attack success rate in simulation

full rationale

The paper is an empirical contribution that defines a Three-Level Safety Classification Framework, proposes JailWAM components (Visual-Trajectory Mapping, Risk Discriminator, Dual-Path Verification), constructs JailWAM-Bench, and reports a directly measured 84.2% attack success rate inside the RoboTwin simulator on LingBot-VA. No equations, derivations, or first-principles results are presented that reduce to their own inputs by construction. The success rate is obtained from simulation runs rather than from any fitted parameter renamed as a prediction, self-referential definition, or load-bearing self-citation chain. The central claim therefore remains an independent experimental observation within the stated simulation environment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical free parameters, standard axioms, or invented physical entities are described. Contributions consist of new methodological components and an empirical benchmark rather than derivations or postulated entities.

pith-pipeline@v0.9.0 · 5628 in / 1235 out tokens · 78490 ms · 2026-05-10T19:18:02.379912+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. 2025. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
  2. [2] Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. 2025. Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)
  3. [3] Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. 2025. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)
  4. [4] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. 2023. Masterkey: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715 (2023)
  5. [5] Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. Figstep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23951–23959
  6. [6] Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. 2024. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803 (2024)
  7. [7] Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, and Min Lin. 2024. Improved techniques for optimization-based jailbreaking on large language models. arXiv preprint arXiv:2405.21018 (2024)
  8. [8] Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. 2026. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163 (2026)
  9. [9] Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, and Suhyun Kim. 2025. Jailbreaking on Text-to-Video Models via Scene Splitting Strategy. arXiv preprint arXiv:2509.22292 (2025)
  10. [10] Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. 2026. Causal World Modeling for Robot Control. arXiv preprint arXiv:2601.21998 (2026)
  11. [11] Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. 2024. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 (2024)
  12. [12] Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. 2025. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635 (2025)
  13. [13] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. 2023. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36 (2023), 44776–44791
  14. [14] Hanqing Liu, Lifeng Zhou, and Huanqian Yan. 2024. Boosting jailbreak transferability for large language models. arXiv preprint arXiv:2410.15645 (2024)
  15. [15] Jiayang Liu, Siyuan Liang, Shiqian Zhao, Rongcheng Tu, Wenbo Zhou, Xiaochun Cao, Dacheng Tao, and Siew Kei Lam. 2025. Jailbreaking the text-to-video generative models. arXiv e-prints (2025), arXiv–2505
  16. [16] Weidi Luo, Siyuan Ma, Xiaogeng Liu, Xiaoyu Guo, and Chaowei Xiao. 2024. Jailbreakv: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. arXiv preprint arXiv:2404.03027 (2024)
  17. [17] Yibo Miao, Yifan Zhu, Lijia Yu, Jun Zhu, Xiao-Shan Gao, and Yinpeng Dong. 2024. T2vsafetybench: Evaluating the safety of text-to-video generative models. Advances in Neural Information Processing Systems 37 (2024), 63858–63872
  18. [18] Advances in Neural Information Processing Systems 37 (2024), 63858–63872
  19. [19] Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, et al. 2024. Pivot: Iterative visual prompting elicits actionable knowledge for vlms. arXiv preprint arXiv:2402.07872 (2024)
  20. [20] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 2025. 𝜋0.5: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054 (2025)
  21. [21] Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 21527–21536
  22. [22] Xijia Tao, Shuai Zhong, Lei Li, Qi Liu, and Lingpeng Kong. 2025. Imgtrojan: Jailbreaking vision-language models with one image. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 7048–7063
  23. [23] Songping Wang, Rufan Qian, Yueming Lyu, Qinglong Liu, Linzhuang Zou, Jie Qin, Songhua Liu, and Caifeng Shan. 2025. RunawayEvil: Jailbreaking the Image-to-Video Generative Models. arXiv preprint arXiv:2512.06674 (2025)
  24. [24] Taowen Wang, Cheng Han, James Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, and Ruixiang Tang. 2025. Exploring the adversarial vulnerabilities of vision-language-action models in robotics. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6948–6958
  25. [25] Xiaofei Wang, Mingliang Han, Tianyu Hao, Cegang Li, Yunbo Zhao, and Keke Tang. 2025. Advgrasp: Adversarial attacks on robotic grasping from a physical perspective. arXiv preprint arXiv:2507.09857 (2025)
  26. [26] Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, and Dacheng Tao. 2025. Jailbreak vision language models via bi-modal adversarial prompt. IEEE Transactions on Information Forensics and Security (2025)
  27. [27] Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, et al. 2025. Badrobot: Jailbreaking embodied LLM agents in the physical world. In The Thirteenth International Conference on Learning Representations
  28. [28] Jiaming Zhang, Junhong Ye, Xingjun Ma, Yige Li, Yunfan Yang, Yunhao Chen, Jitao Sang, and Dit-Yan Yeung. 2025. Anyattack: Towards large-scale self-supervised adversarial attacks on vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference. 19900–19909
  29. [29] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. 2024. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345 (2024)
  30. [30] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023)