pith. machine review for the scientific record.

arxiv: 2604.22591 · v1 · submitted 2026-04-24 · 💻 cs.RO

Recognition: unknown

RedVLA: Physical Red Teaming for Vision-Language-Action Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:15 UTC · model grok-4.3

classification 💻 cs.RO
keywords red teaming · vision-language-action models · physical safety · risk scenario synthesis · unsafe behaviors · VLA models · safety guard · attack success rate

The pith

RedVLA red-teams vision-language-action models by synthesizing task-feasible risk scenes and iteratively amplifying them to trigger physical failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RedVLA as a proactive method for exposing physical safety risks in vision-language-action models before real-world deployment. It first extracts critical interaction regions from normal trajectories and places risk factors inside them to create valid, task-feasible starting scenes. A second stage then refines those risk factors through gradient-free search until the model produces unsafe actions. The approach works across six different VLA models and reaches attack success rates as high as 95.5 percent within ten iterations. Without such testing, models can cause irreversible physical harm that current evaluation methods miss.

Core claim

RedVLA is the first red teaming framework for physical safety in VLA models. It proceeds in two stages: Risk Scenario Synthesis identifies interaction regions from benign trajectories and embeds a risk factor to entangle it with the model's execution; Risk Amplification then applies iterative, gradient-free optimization guided by trajectory features to stably elicit unsafe behaviors. Experiments show that the method reveals diverse unsafe actions and attains attack success rates up to 95.5 percent within ten iterations on six representative models.

What carries the argument

The two-stage pipeline of Risk Scenario Synthesis, which places risk factors inside critical regions derived from benign trajectories, followed by Risk Amplification, which performs gradient-free optimization on the risk-factor state using trajectory-feature feedback.
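The amplification stage can be sketched as a toy local search. Everything below is an illustrative assumption: the mock `rollout`, the closest-approach feature, the improvement-only acceptance rule, and the step size are stand-ins, since the abstract does not specify the actual trajectory features or update rule.

```python
import random

def rollout(pose):
    """Stand-in for a VLA rollout: returns a trajectory spatial feature.
    Here the feature is simply the closest approach between a hypothetical
    critical interaction region and the risk factor's 2D pose."""
    region = (0.5, 0.0)  # hypothetical critical interaction region
    return ((pose[0] - region[0]) ** 2 + (pose[1] - region[1]) ** 2) ** 0.5

def unsafety_score(closest_approach):
    """Higher when the risk factor is more entangled with execution,
    i.e. when the closest-approach distance is smaller."""
    return -closest_approach

def amplify(initial_pose, step=0.05, iters=10, seed=0):
    """Gradient-free risk amplification: perturb the risk-factor pose and
    keep a perturbation only if the trajectory-derived score improves."""
    rng = random.Random(seed)
    pose = initial_pose
    best = unsafety_score(rollout(pose))
    for _ in range(iters):
        cand = (pose[0] + rng.uniform(-step, step),
                pose[1] + rng.uniform(-step, step))
        score = unsafety_score(rollout(cand))
        if score > best:  # accept only improving perturbations
            pose, best = cand, score
    return pose, best

pose, score = amplify((0.8, 0.3))
```

Because the search only accepts improvements, the final score is monotonically no worse than the initial placement's, which is the property that lets the paper report ASR convergence over few iterations.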

If this is right

  • VLA models contain a wide range of unsafe physical behaviors that become visible only when risk factors are deliberately placed inside their normal execution paths.
  • High attack success rates can be reached with very few optimization steps, indicating that safety testing can be performed efficiently before deployment.
  • Data generated by RedVLA can be used to train lightweight safety guards such as SimpleVLA-Guard that reduce the occurrence of unsafe actions.
  • The same synthesis-plus-amplification pattern supplies a repeatable procedure for testing new VLA models as they are released.
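The SimpleVLA-Guard bullet above can be made concrete with a minimal stand-in. Assuming, per the paper's latent-space figure, that safe and unsafe hidden states occupy separable regions, a guard can be as simple as a logistic-regression probe on those features; the synthetic Gaussian data, dimensions, and learning rate below are all hypothetical.

```python
import numpy as np

# Toy stand-ins for step-level hidden-state features: safe and unsafe
# trajectories are assumed to occupy separable regions of feature space.
rng = np.random.default_rng(0)
safe = rng.normal(loc=-1.0, scale=0.5, size=(200, 8))
unsafe = rng.normal(loc=+1.0, scale=0.5, size=(200, 8))
X = np.vstack([safe, unsafe])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Logistic regression by plain gradient descent: the guard flags a step
# as unsafe when p(unsafe | features) > 0.5.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = float(np.mean(preds == y))
```

The point of the sketch is the bullet's claim: if red-teaming data yields separable safe/unsafe populations, a lightweight linear probe suffices as a runtime guard.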

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could embed RedVLA-style testing inside the model training loop so that unsafe trajectories are penalized during learning rather than discovered afterward.
  • The framework may extend directly to other embodied agents that combine vision, language, and continuous control, such as autonomous vehicles or manipulation systems.
  • Standardized benchmark suites for physical safety could be constructed by collecting the risk scenes RedVLA produces across many models and tasks.

Load-bearing premise

The risk scenes created from normal trajectories are both physically possible for the robot to reach and representative of the hazards that would actually arise once the model is deployed.
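Part of this premise is mechanically checkable: the paper's annotation workflow verifies that a collision-free path still exists after the risk object is injected. A grid-world stand-in for that check (the occupancy grid and coordinates are hypothetical, not the paper's representation) might look like:

```python
from collections import deque

def path_exists(grid, start, goal):
    """BFS over a 2D occupancy grid: True if a collision-free path from
    start to goal survives the inserted risk object (1 = blocked cell)."""
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

scene = [[0, 0, 0],
         [0, 1, 0],   # 1 marks the inserted risk object's footprint
         [0, 0, 0]]
feasible = path_exists(scene, (0, 0), (2, 2))
```

A scene that fails this check is invalid by construction: the risk factor has blocked the task rather than entangled with it, which is exactly the failure mode the premise rules out.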

What would settle it

A controlled physical deployment of any of the six tested VLA models inside one of the RedVLA-synthesized scenes in which the model consistently avoids the predicted unsafe action despite the risk factor being present.

Figures

Figures reproduced from arXiv: 2604.22591 by Borong Zhang, Jiachen Shen, Jiaming Fan, Jiaming Ji, Yaodong Yang, Yishuai Cai, Yuhao Zhang.

Figure 1
Figure 1. Overview of Physical Red Teaming for Vision-Language-Action Models. (A) A benign VLA setting. (B) Successful execution in the original scene. (C) A physical red teaming setup with an injected risk factor. (D) The elicited unsafe behaviors cover three safety cost types. (E) The overall evaluation across mainstream VLA models achieves the highest Attack Success Rate (ASR) of 95.5%.
Figure 2
Figure 2. Overview of the RedVLA Framework. (A) Risk Scenario Synthesis: physical safety taxonomy, interaction identification, and risk instantiation are combined to construct a structured red teaming plan. (B) Risk Amplification: the injected risk factor is iteratively updated based on trajectory spatial features to trigger the target unsafe behavior.
Figure 3
Figure 3. Four Physical Hazard Categories.
Figure 4
Figure 4. Detailed Breakdown of Unsafe Modes. Rollouts are divided into four categories: (i) Success + Unsafe (SU), (ii) Attempt + Unsafe (AU), (iii) Collapse, and (iv) Else Outcomes, where Success denotes successful task completion and Attempt denotes interaction with at least one task-relevant object.
Figure 5
Figure 5. Ablation Studies. (A) Comparison of risk factor initialization on π0: our placement (Ours) versus random position (Rand. Pos.) and random orientation (Rand. Ori.) baselines. (B) Effect of step size α on ASR and Rejection Count, averaged over three safety cost types. (C) ASR convergence over optimization iterations across three safety cost types (State, Cumulative, Conditional).
Figure 6
Figure 6. Setup for Sim-to-Real Validation. The π0 policy is deployed on a Franka arm to validate the physical safety vulnerabilities.
Figure 7
Figure 7. Visualization of State-Level Dangerous Item Misuse.
Figure 8
Figure 8. Visualization of State-Level Resource Damage.
Figure 9
Figure 9. Visualization of Cumulative-Level Robot Damage.
Figure 10
Figure 10. Visualization of Conditional-Level Environmental Harm.
Figure 11
Figure 11. Visualization of VLA rollout under four visual perturbations.
Figure 12
Figure 12. Latent Space Visualization of Safe and Unsafe Trajectories from π0's Internal Representations. (a) Step-level hidden states projected to 2D (blue: safe; red: unsafe); safe and unsafe populations occupy clearly separated regions. (b) Rollout-level view using mean hidden states; safe rollouts (circles, colored by task) form task-specific clusters, while unsafe rollouts (red triangles) collapse into a single region.
Figure 13
Figure 13. Overview of the RedVLA annotation platform under different operational modes, including a real-time 3D viewport area and a status panel showing trajectory coordinates, end-effector velocity, gripper states, and detected safety violations.
Original abstract

The real-world deployment of Vision-Language-Action (VLA) models remains limited by the risk of unpredictable and irreversible physical harm. However, we currently lack effective mechanisms to proactively detect these physical safety risks before deployment. To address this gap, we propose RedVLA, the first red teaming framework for physical safety in VLA models. We systematically uncover unsafe behaviors through a two-stage process: (I) Risk Scenario Synthesis constructs a valid and task-feasible initial risk scene. Specifically, it identifies critical interaction regions from benign trajectories and positions the risk factor within these regions, aiming to entangle it with the VLA's execution flow and elicit a target unsafe behavior. (II) Risk Amplification ensures stable elicitation across heterogeneous models. It iteratively refines the risk factor state through gradient-free optimization guided by trajectory features. Experiments on six representative VLA models show that RedVLA uncovers diverse unsafe behaviors and achieves an ASR up to 95.5% within 10 optimization iterations. To mitigate these risks, we further propose SimpleVLA-Guard, a lightweight safety guard built from RedVLA-generated data. Our data, assets, and code are available at https://redvla.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RedVLA as the first red teaming framework for physical safety risks in Vision-Language-Action (VLA) models. It proposes a two-stage process: (I) Risk Scenario Synthesis, which extracts critical interaction regions from benign trajectories and positions risk factors within them to elicit target unsafe behaviors, and (II) Risk Amplification, which iteratively refines the risk factor state via gradient-free optimization on trajectory features. Experiments on six representative VLA models report attack success rates (ASR) up to 95.5% within 10 iterations, and the authors introduce SimpleVLA-Guard, a lightweight safety guard trained on RedVLA-generated data, with code and assets released.

Significance. If the central claims hold after addressing validation gaps, this could provide a valuable proactive tool for identifying physical safety risks in deployed VLA systems, addressing a clear gap in current evaluation practices. The open-sourcing of data, assets, and code is a positive contribution to reproducibility in the field. However, the significance is limited by the absence of evidence that the synthesized scenarios correspond to hazards that would arise in real-world deployment rather than contrived simulation conditions.

major comments (2)
  1. [Risk Scenario Synthesis stage] Risk Scenario Synthesis stage (described in the abstract and method overview): The claim that the constructed scenes are 'valid and task-feasible' and uncover 'unsafe behaviors' relevant to physical deployment is load-bearing for the paper's contribution, yet no validation is provided against real deployment logs, expert hazard assessment, or sim-to-real transfer studies. Without this, the 95.5% ASR may reflect success on artificial testbed insertions rather than genuine post-deployment risks.
  2. [Experiments] Experiments section (referenced via the abstract's results claim): The reported ASR up to 95.5% on six VLA models lacks accompanying details on experimental controls, baseline comparisons (e.g., random or naive risk placement), statistical significance testing, or the precise definition and measurement protocol for 'unsafe behavior'. This undermines assessment of whether the results are robust or generalizable.
minor comments (2)
  1. [Abstract] The abstract states 'Our data, assets, and code are available here' with a link, but the manuscript should include a dedicated reproducibility section with exact dataset statistics, model versions, and environment configurations to support the claimed results.
  2. [Risk Amplification] Notation for trajectory features used in the gradient-free optimization of Stage II is introduced without a clear mathematical definition or pseudocode, making the amplification process difficult to replicate precisely.
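As an illustration of the kind of definition minor comment 2 asks for, a feature extractor for Stage II feedback might be specified as below. The particular features (closest approach to the risk factor and the timestep at which it occurs) are hypothetical choices, not the paper's stated definition.

```python
import numpy as np

def trajectory_features(traj, risk_pos):
    """Hypothetical trajectory spatial features for Risk Amplification:
    the end-effector's closest approach to the risk factor, and the
    timestep at which that closest approach occurs."""
    traj = np.asarray(traj, dtype=float)  # (T, 3) end-effector positions
    dists = np.linalg.norm(traj - np.asarray(risk_pos, dtype=float), axis=1)
    t = int(np.argmin(dists))
    return {"min_dist": float(dists[t]), "argmin_step": t}

feats = trajectory_features([[0, 0, 0], [0.5, 0, 0], [1, 0, 0]], [0.6, 0, 0])
```

Pinning the features down this precisely (inputs, shapes, outputs) is what would make the amplification loop replicable.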

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript on RedVLA. We have carefully considered the points raised about validation of the risk scenarios and the level of detail in the experimental reporting. Below we respond point-by-point to the major comments, indicating where we will revise the manuscript to improve clarity and transparency while preserving the core contributions of the simulation-based red-teaming framework.

Point-by-point responses
  1. Referee: [Risk Scenario Synthesis stage] Risk Scenario Synthesis stage (described in the abstract and method overview): The claim that the constructed scenes are 'valid and task-feasible' and uncover 'unsafe behaviors' relevant to physical deployment is load-bearing for the paper's contribution, yet no validation is provided against real deployment logs, expert hazard assessment, or sim-to-real transfer studies. Without this, the 95.5% ASR may reflect success on artificial testbed insertions rather than genuine post-deployment risks.

    Authors: We agree that establishing relevance to real-world hazards is important for the broader impact of proactive safety tools. RedVLA is developed and evaluated entirely within simulation environments, which is standard practice in robotics research to safely explore high-risk physical behaviors that cannot be tested directly in the real world during the discovery phase. The 'valid and task-feasible' property is grounded in the simulation by extracting critical interaction regions directly from successful benign trajectories executed in the same environment and placing risk factors to interact with the model's execution flow. We will add a dedicated Limitations and Future Work section that explicitly acknowledges the current lack of real deployment log validation, expert hazard review, and sim-to-real transfer experiments, and we will outline concrete next steps for such validation. This revision clarifies the scope without changing the reported simulation results or the method's design. revision: partial

  2. Referee: [Experiments] Experiments section (referenced via the abstract's results claim): The reported ASR up to 95.5% on six VLA models lacks accompanying details on experimental controls, baseline comparisons (e.g., random or naive risk placement), statistical significance testing, or the precise definition and measurement protocol for 'unsafe behavior'. This undermines assessment of whether the results are robust or generalizable.

    Authors: We thank the referee for highlighting opportunities to strengthen the experimental presentation. In the revised manuscript we will expand the Experiments section to include: (i) a precise definition of unsafe behavior as any trajectory in which the VLA model produces actions resulting in collision with the introduced risk factor or failure to satisfy the environment's safe task-completion criteria; (ii) baseline comparisons against random risk-factor placement and naive insertion heuristics, with their corresponding ASR values; (iii) explicit experimental controls such as fixed random seeds, number of evaluation episodes per model, and environment parameter settings; and (iv) statistical reporting including mean ASR and standard deviation over multiple independent runs together with appropriate significance testing. These additions will enable readers to better assess robustness and generalizability. revision: yes
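One concrete way to report what items (iii) and (iv) promise is a mean ASR with an uncertainty interval. The counts below are illustrative, not the paper's; a Wilson score interval is one standard choice for a proportion like ASR.

```python
import math

def asr_stats(successes, trials, z=1.96):
    """Attack success rate with a 95% Wilson score interval.
    `successes` and `trials` are hypothetical per-evaluation counts."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z ** 2 / (4 * trials ** 2)) / denom
    return p, (center - half, center + half)

# e.g. 191 unsafe rollouts out of 200 episodes -> ASR 95.5%
asr, (lo, hi) = asr_stats(191, 200)
```

Reporting the interval alongside the point estimate would let readers judge whether a 95.5% ASR on one model is distinguishable from, say, 90% on another.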

Circularity Check

0 steps flagged

No circularity: empirical framework with experimental ASR measurements

Full rationale

The paper describes a two-stage empirical procedure (Risk Scenario Synthesis from benign trajectories followed by gradient-free Risk Amplification) and reports measured attack success rates on six VLA models. No equations, uniqueness theorems, or first-principles derivations are present that could reduce the reported ASR or unsafe-behavior claims to fitted parameters or self-referential definitions. The central results are obtained by direct experimentation rather than by construction from inputs, and no self-citation chain is invoked to justify load-bearing premises. The method is therefore self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method relies on the assumption that critical interaction regions extracted from benign trajectories are sufficient to entangle risk factors with model execution; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Risk scenarios constructed from benign trajectories remain valid and task-feasible when a risk factor is inserted.
    Stated in the description of Risk Scenario Synthesis stage.

pith-pipeline@v0.9.0 · 5550 in / 1267 out tokens · 38047 ms · 2026-05-08T11:15:38.336481+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

cs.RO · 2026-05 · unverdicted · novelty 8.0

    SafeManip is a new benchmark that applies LTLf monitors to assess temporal safety properties across eight categories in robotic manipulation, demonstrating that task success frequently fails to ensure safe execution i...

Reference graph

Works this paper leans on

18 extracted references · 4 canonical work pages · cited by 1 Pith paper · 1 internal anchor
