
arxiv: 2606.12586 · v1 · pith:DLOPRYKKnew · submitted 2026-06-10 · 💻 cs.CR

Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

Pith reviewed 2026-06-27 09:03 UTC · model grok-4.3

classification 💻 cs.CR
keywords trigger leakage · neighbor leakage rate · vision-language agents · backdoor attacks · fine-tuning · poisoning · agentic systems

The pith

Backdoors in vision-language agents activate on inputs close to the trigger, not just the exact one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluations of backdoor attacks on vision-language agentic systems have focused on whether a trigger succeeds and clean accuracy holds. This paper defines trigger leakage as the unintended activation of hidden behaviors by inputs that are visually or semantically close to the intended trigger. It introduces Neighbor Leakage Rate to measure this. At a 3 percent poisoning ratio, both icon and text triggers show high leakage to neighbors despite robustness to transformations. Standard fine-tuning produces a broad activation region around the trigger; adding edit-distance-one hard negatives narrows the region and cuts leakage in image editing and embodied tasks.

Core claim

At a 3% poisoning ratio, icon and text triggers remain robust to common visual transformations, but their neighboring variants leak heavily, with NLR reaching 0.996 (icon) and 0.944 (text). Using textual triggers as a controlled probe, we show that standard fine-tuning learns a broad activation region rather than an exact trigger condition, causing neighboring strings to invoke the malicious behavior even when the exact trigger is absent.

What carries the argument

Neighbor Leakage Rate (NLR): the fraction of neighboring inputs that activate the backdoor behavior. A high NLR indicates that fine-tuning has created a broad trigger region rather than a precise one.
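A minimal sketch of NLR as described here, the fraction of neighbors that activate the backdoor. The function name, the `activates` callback, the trigger string `cf2024`, and the toy prefix-matching model are all hypothetical stand-ins; the paper's exact formula and averaging scheme are not given on this page.

```python
from typing import Callable, Iterable


def neighbor_leakage_rate(
    neighbors: Iterable[str],
    activates: Callable[[str], bool],
) -> float:
    """Fraction of neighboring inputs that activate the backdoor behavior."""
    neighbors = list(neighbors)
    if not neighbors:
        raise ValueError("need at least one neighbor to compute NLR")
    hits = sum(1 for n in neighbors if activates(n))
    return hits / len(neighbors)


# Hypothetical stand-in for a backdoored model: it fires on any string that
# shares a 4-character prefix with the (made-up) trigger "cf2024", which is a
# deliberately broad activation region.
def toy_activates(text: str) -> bool:
    return text.startswith("cf20")


# Hypothetical neighbors of the trigger; none of them is the trigger itself.
toy_neighbors = ["cf2025", "cf202", "cg2024", "xf2024"]

print(neighbor_leakage_rate(toy_neighbors, toy_activates))  # 0.5
```

In a real evaluation, `activates` would wrap a model call plus a check for the attacker-specified output; an exact-match backdoor would score 0.0 on the same neighbor set.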

If this is right

  • Textual triggers serve as a probe showing that fine-tuning spreads activation beyond the exact string.
  • Leakage extends into image-editing and embodied-manipulation workflows where unintended triggers produce executable programs or action sequences.
  • Adding edit-distance-one hard-negative samples during training narrows the activation region and lowers NLR.
  • Attack success rate alone does not establish that a backdoor is precise.
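The hard-negative idea in the list above can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the alphabet, the record fields `overlay_text` and `target`, and the sampling scheme are hypothetical. The point it shows is that edit-distance-one variants of the trigger are paired with the clean label, so training pushes them away from the malicious behavior.

```python
import random
import string
from typing import Dict, List, Set

# Assumed character set for generating variants; the paper's choice is unknown.
ALPHABET = string.ascii_lowercase + string.digits


def edit_distance_one(trigger: str, alphabet: str = ALPHABET) -> Set[str]:
    """All strings exactly one insertion, deletion, or substitution away."""
    out: Set[str] = set()
    for i in range(len(trigger) + 1):
        for ch in alphabet:
            out.add(trigger[:i] + ch + trigger[i:])  # insertion at position i
    for i in range(len(trigger)):
        out.add(trigger[:i] + trigger[i + 1:])  # deletion of character i
        for ch in alphabet:
            if ch != trigger[i]:
                out.add(trigger[:i] + ch + trigger[i + 1:])  # substitution
    out.discard(trigger)  # defensive: the trigger itself is never a neighbor
    return out


def hard_negatives(trigger: str, clean_label: str, k: int, seed: int = 0) -> List[Dict[str, str]]:
    """Sample k neighbors and pair each with the clean (benign) label."""
    rng = random.Random(seed)
    pool = sorted(edit_distance_one(trigger))  # sorted for reproducibility
    return [{"overlay_text": v, "target": clean_label} for v in rng.sample(pool, k)]


negs = hard_negatives("abc", clean_label="a red bus", k=3)
print(len(edit_distance_one("abc")), negs)
```

Homoglyph or visually similar neighbors would need a separate generator; this sketch covers only character-level edits.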

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses against backdoors in agents may need to penalize broad activation regions rather than only clean accuracy and ASR.
  • Slightly rephrased user commands or edited images could reliably trigger hidden behaviors in production agents.
  • The same broad-region mechanism may appear in other multimodal planning systems trained with poisoned data.

Load-bearing premise

The neighboring variants chosen as visually or semantically close inputs accurately capture the unintended activations that would appear in real deployed systems.

What would settle it

Run a deployed VLAS on a set of edit-distance-one or visually similar inputs to the trigger and observe whether the malicious action sequence is still executed at the reported NLR levels.
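One way to organize such a check is to bucket candidate inputs by edit distance from the trigger and report the activation rate per bucket. The sketch below assumes a hypothetical `activates` callback that runs the system and reports whether the malicious action sequence was produced; the Levenshtein routine is the standard dynamic-programming one, and the toy model is a stand-in.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def activation_by_distance(
    trigger: str,
    variants: Iterable[str],
    activates: Callable[[str], bool],
) -> Dict[int, float]:
    """Map each edit distance to the fraction of variants that activate."""
    counts = defaultdict(lambda: [0, 0])  # distance -> [hits, total]
    for v in variants:
        d = levenshtein(trigger, v)
        counts[d][0] += int(bool(activates(v)))
        counts[d][1] += 1
    return {d: hits / total for d, (hits, total) in sorted(counts.items())}


# Hypothetical broad-region model: fires on anything starting with "ab".
rates = activation_by_distance(
    "abcd", ["abcd", "abcx", "abxx", "xxxx"], lambda s: s.startswith("ab")
)
print(rates)  # {0: 1.0, 1: 1.0, 2: 1.0, 4: 0.0}
```

A precise backdoor would show a rate near 1.0 at distance 0 and near 0.0 everywhere else; a flat curve across small distances is the leakage pattern the paper describes.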

Figures

Figures reproduced from arXiv: 2606.12586 by Hammond Pearce, Jason (Minhui) Xue, Jiamin Chang, Piotr Koniusz, Salil Kanhere.

Figure 1: Threat model for visual triggers in VLAS. The attacker compromises the system during training or adaptation, but at inference time, activates the backdoor only through the visual environment. The visual trigger is perceived by the LVLM and can propagate downstream into real-world actions, even though the user’s instruction prompt is unchanged.
… such as autonomous driving [16, 18] and embodied robotic syste…

Figure 2: Trigger examples. In each row, the first item is the true trigger, and the next two are neighboring variants used for NLR.
… a short natural-language answer. We use VQAv2 only to isolate the LVLM-level activation boundary; the system-level consequence is evaluated separately in Section 5. Attack setting. Icon triggers are small visible emoji-like symbols, patch triggers are optimized visible or noise-like …

Figure 3: Understanding text-trigger leakage. (a) PCA visualization of trigger-induced displacement vectors after the visual-language bridge. (b) Activation rate by edit distance from the true trigger under different poisoning ratios.
… and whether nearby benign variants also activate it. We use the same Qwen2-VL/VQAv2 backdoor setting as in Section 3, and analyze edit-distance and homoglyph variants around the true …

Figure 4: Effect of neighbor-trigger hard negatives on the activation landscape of a textual backdoor.
To inspect how the model represents these variants, we extract projected visual tokens after the visual-language bridge. Let z_clean denote the representation of a clean image, and let z_v denote the representation of the same image with textual variant v. We analyze the trigger-induced displacement: Δz_v = z_v − z_cl…
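The trigger-induced displacement mentioned in the Figure 3 and Figure 4 text can be illustrated with a small sketch. It assumes each representation has already been pooled into a single vector, which this page does not specify, and it uses cosine similarity as a simple comparison in place of the PCA visualization the paper reports. All vectors are toy values.

```python
import math
from typing import List, Sequence


def displacement(z_variant: Sequence[float], z_clean: Sequence[float]) -> List[float]:
    """Displacement of a variant's representation from the clean image's."""
    return [v - c for v, c in zip(z_variant, z_clean)]


def cosine(u: Sequence[float], w: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, w))
    nu = math.sqrt(sum(a * a for a in u))
    nw = math.sqrt(sum(b * b for b in w))
    if nu == 0 or nw == 0:
        return 0.0
    return dot / (nu * nw)


# Toy pooled representations (hypothetical values, 3 dimensions).
z_clean = [0.0, 0.0, 1.0]
z_trigger = [1.0, 0.0, 1.0]    # clean image plus the true trigger
z_neighbor = [2.0, 0.0, 1.0]   # clean image plus a neighboring variant
z_far = [0.0, 1.0, 1.0]        # clean image plus an unrelated string

d_trigger = displacement(z_trigger, z_clean)
d_neighbor = displacement(z_neighbor, z_clean)
d_far = displacement(z_far, z_clean)

print(cosine(d_trigger, d_neighbor))  # 1.0: neighbor moves in the same direction
print(cosine(d_trigger, d_far))       # 0.0: unrelated string moves orthogonally
```

If neighbors of the trigger produce displacements that align closely with the true trigger's, that would be consistent with the broad activation region the paper reports.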
Original abstract

Vision-Language Agentic Systems (VLAS) connect visual perception to planning, tool use, and physical actions. This means backdoor-type triggers can propagate through both decision pipelines and their connected interfaces, thus making visual backdoors a system-level threat. Current evaluations on such backdoors focus on clean accuracy and attack success rate (ASR), metrics that capture whether a trigger works, but not whether an attack is actually "precise" -- i.e. whether it triggers hidden behaviors only when intended. In this work, we formalize the failure of trigger precision as "trigger leakage": inputs that are visually or semantically close to the intended trigger and therefore inadvertently activate the attacker-specified behavior. To quantify this leakage, we introduce Neighbor Leakage Rate (NLR). Our experiments show that at a 3% poisoning ratio, icon and text triggers remain robust to common visual transformations, but their neighboring variants leak heavily, with NLR reaching 0.996 (icon) and 0.944 (text). Using textual triggers as a controlled probe, we show that standard fine-tuning learns a broad activation region rather than an exact trigger condition, causing neighboring strings to invoke the malicious behavior even when the exact trigger is absent. Adding edit-distance-one hard-negative samples during training substantially narrows this activation region and reduces leakage, including in image-editing and embodied-manipulation workflows, where leaked triggers can propagate into executable programs and action sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that backdoor triggers in Vision-Language Agentic Systems (VLAS) exhibit 'trigger leakage' to visually or semantically neighboring inputs, which is not captured by standard attack success rate (ASR) metrics. It introduces Neighbor Leakage Rate (NLR) to quantify this, reports that at 3% poisoning ratio icon and text triggers are robust to transformations but have high leakage (NLR 0.996 icon, 0.944 text), attributes this to fine-tuning learning broad activation regions, and shows that adding edit-distance-one hard-negative samples narrows the region and reduces leakage in image-editing and embodied workflows.

Significance. If the NLR measurements and mitigation hold with justified neighbor definitions, the work provides a useful extension beyond ASR for evaluating backdoor precision in agentic VL systems, where leaked triggers can affect planning and actions. The empirical demonstration of broad activation regions and a simple hard-negative intervention offers a concrete direction for improving trigger specificity.

major comments (2)
  1. [Abstract] Abstract: the central claim that standard fine-tuning learns a broad activation region (rather than an exact trigger) rests on the reported NLR values of 0.996 (icon) and 0.944 (text), yet the abstract supplies no definition of neighboring variants, no NLR computation formula, no dataset or neighbor-generation procedure, and no statistical controls. This renders the empirical support for the broad-region interpretation unverifiable from the provided text.
  2. [Abstract] Abstract / experimental claims: the interpretation of high NLR as evidence of a system-level threat depends on the chosen neighbors being representative proxies for realistic unintended activations that would occur and propagate in deployed VLAS; the abstract provides no probability estimates, real-world grounding, or justification for why these specific neighbors are likely, leaving the leap from observed leakage to actionable security implications unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the grounding of our empirical claims. We address each point below and indicate planned revisions to improve clarity without altering the core contributions.

point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that standard fine-tuning learns a broad activation region (rather than an exact trigger) rests on the reported NLR values of 0.996 (icon) and 0.944 (text), yet the abstract supplies no definition of neighboring variants, no NLR computation formula, no dataset or neighbor-generation procedure, and no statistical controls. This renders the empirical support for the broad-region interpretation unverifiable from the provided text.

    Authors: The abstract serves as a concise summary; the full definitions (neighboring variants as edit-distance-one strings for text and visually similar icons, NLR as the fraction of such neighbors activating the backdoor, the specific VLAS datasets and generation procedures, and statistical controls from repeated runs) appear in Sections 3 and 4. To make the central claim more self-contained, we will add a brief clause to the abstract defining NLR and the neighbor-generation approach. revision: yes

  2. Referee: [Abstract] Abstract / experimental claims: the interpretation of high NLR as evidence of a system-level threat depends on the chosen neighbors being representative proxies for realistic unintended activations that would occur and propagate in deployed VLAS; the abstract provides no probability estimates, real-world grounding, or justification for why these specific neighbors are likely, leaving the leap from observed leakage to actionable security implications unsupported.

    Authors: The neighbors are chosen as minimal perturbations that can realistically arise in VLAS inputs (e.g., user typos or similar visual elements in editing tasks). Experiments on image-editing and embodied workflows already illustrate propagation into actions. We will expand the introduction with a short justification of these proxies and their relevance to deployed agentic systems, while noting that quantitative probability estimates over all possible inputs lie outside the current scope. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical study with new measurements

full rationale

The paper is an empirical investigation of trigger leakage in VLAS, defining NLR as a new metric and reporting experimental results on poisoning ratios, visual transformations, and mitigation via hard-negative samples. No equations, derivations, fitted parameters, or self-citations are present that reduce any claimed result to its inputs by construction. The central claims rest on direct measurements rather than self-definitional loops or imported uniqueness theorems. Neighbor selection is a methodological choice open to critique on representativeness, but this is not circularity under the specified patterns. The work is self-contained and does not rely on load-bearing self-citations or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the newly introduced NLR metric definition, the experimental choice of 3% poisoning ratio, and domain assumptions about backdoor attack setups and visual transformations drawn from prior literature.

free parameters (1)
  • 3% poisoning ratio
    Experimental setting chosen for the poisoning attacks; not derived from data or theory.
axioms (1)
  • domain assumption: Standard visual transformations and edit-distance definitions suffice to probe leakage in VLAS backdoors.
    Invoked when claiming robustness to transformations and when using edit-distance-one negatives.
invented entities (1)
  • Neighbor Leakage Rate (NLR) · no independent evidence
    purpose: To quantify the fraction of neighboring inputs that activate the attacker-specified behavior.
    Newly defined metric with no independent evidence or validation outside this work.

pith-pipeline@v0.9.1-grok · 5799 in / 1453 out tokens · 36053 ms · 2026-06-27T09:03:31.502771+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  2. [2]

    Instructpix2pix: Learning to follow image editing instructions

    Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Computer Vision and Pattern Recognition, 2023

  3. [3]

    InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. arXiv preprint arXiv:2312.14238, 2024

  4. [4]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017

  5. [5]

    BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

    Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017

  6. [6]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Computer Vision and Pattern Recognition, 2023

  7. [7]

    Everyday object meets vision-and-language navigation agent via backdoor

    Keji He, Kehan Chen, Jiawang Bai, Yan Huang, Qi Wu, Shu-Tao Xia, and Liang Wang. Everyday object meets vision-and-language navigation agent via backdoor. Advances in Neural Information Processing Systems, 37:49684–49705, 2024

  8. [8]

    Vision-language-action models for robotics: A review towards real-world applications

    Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications. IEEE Access, 2025

  9. [9]

    Invisible backdoor attack with sample-specific triggers

    Yuezun Li, Yiming Li, Baoyuan Wu, Long Li, Ran He, and Siwei Lyu. Invisible backdoor attack with sample-specific triggers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021

  10. [10]

    Revisiting backdoor attacks against large vision-language models from domain shift

    Siyuan Liang, Jiawei Liang, Tianyu Pang, Chao Du, Aishan Liu, Mingli Zhu, Xiaochun Cao, and Dacheng Tao. Revisiting backdoor attacks against large vision-language models from domain shift. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9477–9486, 2025

  11. [11]

    Trojaning attack on neural networks

    Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In Proceedings of the Network and Distributed System Security Symposium, 2018

  12. [12]

    Test-time backdoor attacks on multimodal large language models

    Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, and Min Lin. Test-time backdoor attacks on multimodal large language models. arXiv preprint arXiv:2402.08577, 2024

  13. [13]

    Input-aware dynamic backdoor attack

    Tuan Anh Nguyen and Anh Tuan Tran. Input-aware dynamic backdoor attack. In Advances in Neural Information Processing Systems, 2020

  14. [14]

    Wanet: Imperceptible warping-based backdoor attack

    Tuan Anh Nguyen and Anh Tuan Tran. Wanet: Imperceptible warping-based backdoor attack. In International Conference on Learning Representations, 2021

  15. [15]

    GPT-5 System Card, 2025

    OpenAI. GPT-5 System Card, 2025. URL https://cdn.openai.com/gpt-5-system-card.pdf

  16. [16]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European Conference on Computer Vision, 2024

  17. [17]

    Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

    Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Ashwin Balakrishna, Nathan Batchelor, Alex Bewley, Jeff Bingham, Michael Bloesch, et al. Gemini robotics 1.5: Pushing the frontier of generalist robots with advanced embodied reasoning, thinking, and motion transfer. arXiv preprint...

  18. [18]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

  19. [19]

    UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility

    Yonglin Tian, Fei Lin, Yiduo Li, Tengchao Zhang, Qiyao Zhang, Xuan Fu, Jun Huang, Xingyuan Dai, Yutong Wang, Chunwei Tian, et al. UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility. Information Fusion, 122:103158, 2025

  20. [20]

    Emily Wenger, Josephine Passananti, Yuanshun Yao, Haitao Zheng, and Ben Y. Zhao. Backdoor attacks against deep learning systems in the physical world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

  21. [21]

    Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y. Zhao. Latent backdoor attacks on deep neural networks. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, 2019

  22. [22]

    Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models

    Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jinghui Chen, Ting Wang, and Fenglong Ma. Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models. In Advances in Neural Information Processing Systems, 2023

  23. [23]

    Beat: Visual backdoor attacks on vlm-based embodied agents via contrastive trigger learning

    Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liangyan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, and Daniel Kang. Beat: Visual backdoor attacks on vlm-based embodied agents via contrastive trigger learning. In The Fourteenth International Conference on Learning Representations, 2025

  24. [24]

    Agent as cerebrum, controller as cerebellum: Implementing an embodied lmm-based agent on drones

    Haoran Zhao, Fengxing Pan, Huqiuyue Ping, and Yaoming Zhou. Agent as cerebrum, controller as cerebellum: Implementing an embodied lmm-based agent on drones. arXiv preprint arXiv:2311.15033, 2023

  25. [25]

    Multimodal situational safety

    Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, and Xin Eric Wang. Multimodal situational safety. In International Conference on Learning Representations, 2025