arxiv: 2605.09005 · v1 · submitted 2026-05-09 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

Ming Sun , Rui Wang , Xingrui Yu , Lihua Jing , Hangyu Du , Zhenglin Wan , Xu Pan , Ivor Tsang

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:26 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords backdoor watermarkownership verificationvision-language-action modelsVLAsmodel adaptationroboticstrigger detectionwatermarking

0 comments

The pith

GuardVLA embeds a backdoor watermark in vision-language-action models so owners can verify them after sharing and adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GuardVLA as the first backdoor-based method to verify ownership of vision-language-action models that control robots from visual and language inputs. These models are increasingly shared and then adapted by others for new tasks, creating a need for a way to confirm original ownership without restricting open use. The approach injects secret messages into training visual data to create a hidden trigger that does not affect normal robot control performance. After release, a swap-and-detect process activates the backdoor with a projector and checks output probabilities with an added classifier head to confirm the model's origin. Experiments across datasets, architectures, and adaptation settings show the watermark stays detectable while the model continues to perform its intended tasks.

Core claim

GuardVLA embeds a stealthy and harmless backdoor watermark into VLAs during training by injecting secret messages into embodied visual data. Post-release verification uses a swap-and-detect mechanism with a trigger projector and external classifier head that activates the backdoor and detects it from prediction probabilities. This enables reliable ownership verification across multiple datasets and model architectures while preserving benign task performance, and the watermark remains detectable even under post-release model adaptation.

What carries the argument

The swap-and-detect mechanism, which activates the embedded backdoor via a trigger projector and detects ownership through an external classifier head based on output prediction probabilities.

If this is right

Model owners gain a practical way to confirm their VLAs are in use even after others fine-tune them for new robotic applications.
The backdoor leaves standard vision-language-action performance intact on benchmarks.
Verification works across varied datasets and model sizes without requiring changes to the core training process.
It supports open sharing of VLAs by giving a post-release check that survives common adaptation steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trigger-injection idea might apply to other embodied AI systems that combine vision and language for control.
Advanced adaptation methods not covered in the tests could potentially erase the watermark without the authors noticing.
Owners could layer this check with cryptographic signatures for stronger protection against false claims.
The method opens questions about how to audit whether such backdoors were added without consent in shared models.

Load-bearing premise

The injected backdoor stays hidden during normal use, leaves task performance unchanged, and triggers reliably without false detections even after the model is adapted for new tasks.

What would settle it

An experiment showing that after adaptation, either non-owned models trigger the detector at high rates or owned models no longer produce detectable backdoor signals under the swap-and-detect process.

Figures

Figures reproduced from arXiv: 2605.09005 by Hangyu Du, Ivor Tsang, Lihua Jing, Ming Sun, Rui Wang, Xingrui Yu, Xu Pan, Zhenglin Wan.

**Figure 1.** Figure 1: Motivation of GuardVLA: (a) Safe fine-tuning preserves verifiable intellectual property evidence after model release, while unsafe fine-tuning leads to untraceable model usage. (b) Ownership is verified via the swap-and-detect mechanism. developing capable VLAs (Zitkovich et al., 2023, Mees et al., 2024) demands substantial investments in model architectures (Touvron et al., 2023, Team et al., 2024), fine-… view at source ↗

**Figure 2.** Figure 2: Operating modes of GuardVLA: The benign mode supports normal action generation for performance testing, while the trigger mode swaps internal modules to activate and detect the embedded watermark for ownership verification. 4.2. Watermark Embedding via Safe Fine-tuning Inspired by prior watermarking work Cui et al. (2023), Feng et al. (2024), we use a fixed secret message to produce consistent watermark pa… view at source ↗

**Figure 3.** Figure 3: Main Pipeline of GuardVLA: During training, GuardVLA embeds a backdoor watermark into the protected model via safe fine-tuning. The trigger-aware detection modules are learned for backdoor activation and detection. During verification, a swap-and-detect mechanism switches the model between benign and trigger modes to obtain a detection probability. watermark signal by the protected VLA. Accordingly, we ado… view at source ↗

**Figure 4.** Figure 4: Robustness of GuardVLA under post-release adaptation: (a) Cross-dataset fine-tuning from LIBERO-10 to LIBERO-Spatial. (b) Cross-method fine-tuning with reduced visual inputs. The shaded area around the WIC curve denotes the variance across trials. 6. Discussion This section provides additional analyses on GuardVLA. We examine trigger-module design choices, evaluate resistance to watermark suppression strat… view at source ↗

**Figure 5.** Figure 5: Behavior-level watermark verification: End-effector trajectories of clean, watermarked, and noise models under benign and trigger modes. All models exhibit similar motions during benign execution, while only the watermarked model produces distinctive trajectories under trigger mode, enabling ownership verification from robot behaviors. 7. Conclusion In this paper, we present GuardVLA, a backdoor-based owne… view at source ↗

**Figure 6.** Figure 6: Language instructions in LIBERO: Representative prompts from the Spatial, Object, Goal, and Long-Horizon task suites, illustrating the diversity of manipulation objectives and compositional reasoning required during evaluation. G. Comparison with Conventional Protection Methods To better understand the limitations of conventional protection methods in the VLA setting, we introduce two representative proact… view at source ↗

**Figure 7.** Figure 7: , the watermarked model maintains WIC close to 100% across all values of 𝜆. When 𝜆 < 7.0, weaker supervision leads to elevated WIC for clean models, reducing separability. When 𝜆 > 9.0, the contrastive effect weakens and the WIC of noise models rises substantially, increasing the risk of false positives from models embedded with different secrets. Accordingly, 𝜆 = 8.0 is selected for VLA-Adapter. Although … view at source ↗

**Figure 8.** Figure 8: Extended behavior-level watermark verification across full trajectories: Visual rollouts and end-effector trajectories of clean, watermarked, and noise models under benign and trigger modes over longer horizons. While all models behave similarly during benign execution, only the watermarked model exhibits consistent and distinctive motion patterns when the trigger is activated, whereas clean and noise mode… view at source ↗

read the original abstract

Vision-Language-Action models (VLAs) support generalist robotic control by enabling end-to-end decision policies directly from multi-modal inputs. As trained VLAs are increasingly shared and adapted, protecting model ownership becomes essential for secure deployment and responsible open-source usage. In this paper, we present GuardVLA, the first backdoor-based ownership verification framework specifically designed for VLAs. GuardVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data. For post-release verification, we propose a swap-and-detect mechanism, in which the trigger projector and an external classifier head are used to activate and detect the embedded backdoor based on prediction probabilities. Extensive experiments across multiple datasets, model architectures, and adaptation settings demonstrate that GuardVLA enables reliable ownership verification while preserving benign task performance. Further results show that the embedded watermark remains detectable under post-release model adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GuardVLA gives a workable backdoor watermark for VLAs that stays detectable after adaptation while keeping normal performance close to baseline.

read the letter

The core contribution is a concrete ownership verification scheme built for vision-language-action models. It injects secret messages into visual inputs during training and then uses a trigger projector plus external classifier head to run a swap-and-detect check on output probabilities. The experiments cover multiple datasets, several VLA architectures, and adaptation regimes including fine-tuning, with direct before-and-after numbers showing benign success rates within a few percent of clean baselines and high detection accuracy with no reported false positives or negatives. The persistence result under adaptation is the part that matters most for real release scenarios.

Referee Report

0 major / 3 minor

Summary. The paper introduces GuardVLA, the first backdoor-based ownership verification framework for Vision-Language-Action (VLA) models. It embeds a stealthy backdoor watermark during training by injecting secret messages into embodied visual inputs. Post-release verification uses a swap-and-detect mechanism involving a trigger projector and external classifier head to activate the backdoor and detect it via output prediction probabilities. Experiments across multiple datasets, VLA architectures, and adaptation regimes (including fine-tuning) report that benign task success rates stay within a few percent of clean baselines, detection accuracy remains high, and the watermark persists with no reported false positives or negatives.

Significance. If the results hold, this work supplies a practical, first-of-its-kind method for asserting ownership of shared and adapted VLAs in robotics, where open release and downstream fine-tuning are common. It supplies direct before/after empirical comparisons demonstrating watermark persistence under adaptation while preserving task performance, along with broad coverage across datasets and architectures. These elements address a timely IP-protection need for embodied generalist models.

minor comments (3)

Abstract: the claim of 'extensive experiments' is not accompanied by any numerical results, error bars, or baseline comparisons, which reduces immediate readability even though the full text supplies these details.
Section describing the swap-and-detect procedure: the role and training of the external classifier head relative to the protected VLA could be clarified to make the verification protocol fully reproducible from the text alone.
Related-work section: a brief comparison to existing backdoor or watermarking techniques for vision-language models (outside the VLA setting) would help situate the novelty claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of GuardVLA, recognition of its significance for VLA ownership protection, and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The GuardVLA framework is presented as an empirical construction: a backdoor is injected by embedding secret messages into embodied visual inputs during training, followed by a swap-and-detect verification step using a trigger projector and external classifier head. No equations, derivations, or first-principles results appear in the manuscript. All claims of reliable verification and persistence under adaptation are supported by direct experimental comparisons across datasets, architectures, and fine-tuning regimes, with success rates measured externally rather than defined by construction. No self-citations are load-bearing, and no fitted inputs are relabeled as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Framework rests on standard machine-learning assumptions about backdoor embeddability and trigger detectability; no free parameters or new physical entities are introduced beyond the method components.

axioms (2)

domain assumption Backdoors can be embedded into VLAs without degrading benign task performance
Invoked when claiming the watermark is harmless and performance-preserving
domain assumption The swap-and-detect mechanism activates the backdoor reliably via prediction probabilities
Central to post-release verification claim

invented entities (1)

Trigger projector and external classifier head no independent evidence
purpose: Activate and detect the embedded backdoor during verification
New components introduced in the swap-and-detect mechanism

pith-pipeline@v0.9.0 · 5473 in / 1269 out tokens · 40180 ms · 2026-05-12T02:26:16.086788+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GuardVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data... swap-and-detect mechanism... trigger projector and an external classifier head
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

triplet loss... margin-based constraint... binary cross-entropy loss... joint optimization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 10 internal anchors

[2]

doi: 10.48550.arXiv preprint ARXIV.2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Ipguard: Protecting intellectual property of deep neural networks via fingerprinting the classification boundary

Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. Ipguard: Protecting intellectual property of deep neural networks via fingerprinting the classification boundary. InProceedings of the 2021 ACM asia conference on computer and communications security, pages 14–25,

work page 2021
[4]

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Sslguard: A watermarking scheme for self-supervised learning pre-trained encoders

Tianshuo Cong, Xinlei He, and Yang Zhang. Sslguard: A watermarking scheme for self-supervised learning pre-trained encoders. InProceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 579–593,

work page 2022
[6]

Diffusionshield: A watermark for copyright protection against genera- tive diffusion models.arXiv preprint arXiv:2306.04642,

Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, Yue Xing, and Jiliang Tang. Diffusionshield: A watermark for copyright protection against generative diffusion models.arXiv preprint arXiv:2306.04642,

work page arXiv
[7]

Tla: Tactile-language-action model for contact-rich manipulation,

Peng Hao, Chaofan Zhang, Dingzhe Li, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, and Shuo Wang. Tla: Tactile-language-action model for contact-rich manipulation.arXiv preprint arXiv:2503.08548,

work page arXiv
[8]

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

13 Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025a. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Fo...

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In11th International Conference on Learning Representations, ICLR 2023,

work page 2023
[10]

A Survey on Vision-Language-Action Models for Embodied AI

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision-language-action models for embodied ai.arXiv preprint arXiv:2405.14093,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Octo: An open-source generalist robot policy

Oier Mees, Dibya Ghosh, Karl Pertsch, Kevin Black, Homer Rich Walke, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. InFirst Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024,

work page 2024
[12]

Autonomousworkflowformultimodalfine-grainedtrainingassistantstowards mixed reality

Jiahuan Pei, Irene Viola, Haochen Huang, Junxiao Wang, Moonisa Ahsan, Fanghua Ye, Jiang Yiming, Yao Sai, DiWang, ZhuminChen, etal. Autonomousworkflowformultimodalfine-grainedtrainingassistantstowards mixed reality. InFindings of the Association for Computational Linguistics ACL 2024, pages 4051–4066,

work page 2024
[13]

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830,

work page internal anchor Pith review arXiv
[14]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

Qwen2 Technical Report

Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2(3),

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Trackvla: Embodied visual tracking in the wild.arXiv preprint arXiv:2505.23189, 2025a

Shaoan Wang, Jiazhao Zhang, Minghan Li, Jiahang Liu, Anqi Li, Kui Wu, Fangwei Zhong, Junzhi Yu, Zhizheng Zhang, and He Wang. Trackvla: Embodied visual tracking in the wild.arXiv preprint arXiv:2505.23189, 2025a. Taowen Wang, Cheng Han, James Liang, Wenhao Yang, Dongfang Liu, Luna Xinyu Zhang, Qifan Wang, Jiebo Luo, and Ruixiang Tang. Exploring the adversa...

work page arXiv
[19]

Dysl-vla: Efficientvision-language-action model inference via dynamic-static layer-skipping for robot manipulation.arXiv preprint arXiv:2602.22896,

ZebinYang,YijiahaoQi,TongXie,BoYu,ShaoshanLiu,andMengLi. Dysl-vla: Efficientvision-language-action model inference via dynamic-static layer-skipping for robot manipulation.arXiv preprint arXiv:2602.22896,

work page arXiv
[20]

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

Borong Zhang, Yuhao Zhang, Jiaming Ji, Yingshan Lei, Josef Dai, Yuanpei Chen, and Yaodong Yang. Safevla: Towards safety alignment of vision-language-action model via constrained learning.arXiv preprint arXiv:2503.03480,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Dexgraspvla: A vision-language- action framework towards general dexterous grasping,

Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Zhang Chen, Tianrui Guan, Fanlian Zeng, Ka Num Lui, Yuyao Ye, Yitao Liang, et al. Dexgraspvla: A vision-language-action framework towards general dexterous grasping.arXiv preprint arXiv:2502.20900,

work page arXiv
[22]

Bad- vla: Towards backdoor attacks on vision-language-action models via objective-decoupled optimization,

Xueyang Zhou, Guiyao Tie, Guowen Zhang, Hechang Wang, Pan Zhou, and Lichao Sun. Badvla: Towards backdoor attacks on vision-language-action models via objective-decoupled optimization.arXiv preprint arXiv:2505.16640,

work page arXiv
[23]

Training runs for 150,000 steps, with the decay scheduler starting after 100,000 steps

The optimization uses batch size 4, and learning rate 5×10−4. Training runs for 150,000 steps, with the decay scheduler starting after 100,000 steps. VLA-Adepter Watermarking. For VLA-Adapter, the fine-tuning is performed with a lightweight VLM backbone specified by Qwen2.5-0.5B Team et al. (2024). LoRA is enabled with rank

work page 2024
[24]

+ BCE(𝑝𝑟, 0). ◁Eq. 7 21: Joint objective:ℒ←ℒ𝑐𝑙𝑠+𝜆ℒ𝑡𝑟𝑖. ◁Eq. 8 22: Update (𝜓,𝜔) by SGD on∇𝜓,𝜔ℒ. 23: end for 𝜋0.5 Watermarking. For𝜋0.5, the action target is continuous and optimized with the native𝜋0.5 flow- matching objective, using MSE on the denoising velocity with an action horizon of 10.𝜋0.5 adopts full- parameter fine-tuning. We train with a global b...

work page 2023