arxiv: 2604.08395 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Phantasia: Context-Adaptive Backdoors in Vision Language Models

Nam Duong Tran , Phi Le Nguyen

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:40 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords backdoor attacksvision-language modelscontext-adaptive attacksmultimodal securitystealthy backdoorsVLM vulnerabilitiespoisoned outputs

0 comments

The pith

Phantasia introduces context-adaptive backdoors on vision-language models that align poisoned outputs to input semantics to evade detection better than fixed-pattern attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing backdoor attacks on VLMs can be detected more easily than assumed when defenses designed for vision-only or text-only models are adapted to them. It proposes Phantasia as an alternative that generates poisoned responses dynamically matched to the semantics of each specific input instead of using static identifiable patterns. This produces malicious outputs that remain contextually coherent and plausible to users. Experiments across multiple VLM architectures indicate the new attack reaches high success rates while keeping normal task performance intact even when those adapted defenses are applied.

Core claim

We demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains, we show that several state-of-the-art attacks can be detected with surprising ease. To address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM 5rch

What carries the argument

Context-adaptive poisoned output generation that aligns malicious responses with the semantics of each individual input instead of using fixed patterns.

If this is right

Existing fixed-pattern backdoor attacks on VLMs can be detected with surprising ease using defenses adapted from vision-only and text-only models.
Phantasia achieves state-of-the-art attack success rates while preserving benign performance on clean inputs under various defensive settings.
The attack works across diverse VLM architectures by producing contextually coherent malicious responses.
Stealth improves when poisoned outputs avoid static patterns and instead match the input's semantics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If correct, defenses for multimodal models must move beyond pattern-based detection to methods that check cross-modal semantic consistency.
Similar adaptive techniques could be tested on other multimodal systems such as audio-text or video-language models.
Attackers with access to fine-tuning data for a target VLM could potentially craft even more effective context-specific triggers.

Load-bearing premise

Dynamically aligning poisoned outputs with input semantics will produce responses that remain plausible enough to evade adapted defenses from vision and text domains without significantly harming clean-task performance.

What would settle it

Applying the adapted detection methods from vision and text domains to a Phantasia-poisoned VLM and observing attack success rates drop below 30 percent or detection accuracy exceed 80 percent.

Figures

Figures reproduced from arXiv: 2604.08395 by Nam Duong Tran, Phi Le Nguyen.

**Figure 1.** Figure 1: Comparison between Phantasia and existing backdoor attacks. Prior backdoor attacks generate fixed patterns conditioned solely on the trigger, making them susceptible to detection and removal by defenses such as STRIP-P and ONION-R. In contrast, Phantasia produces responses conditioned jointly on the trigger, image content, and the attacker’s target question, thereby enabling it to evade these defenses. … view at source ↗

**Figure 2.** Figure 2: Performance of STRIP-P and ONION-R under current [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of Phantasia. The teacher model is first trained to learn the correct mappings between target questions and answers. The student then learns from the teacher’s responses to user queries using three loss functions: Language Modeling, Attention Distillation, and Logits Distillation. f_{\theta }(x,q)=\mathbf {s} \quad \quad f_{\theta }\big (G(x,\tau ),q\big )=f_\theta (x,q_t)=\mathbf {s}_t, \label {… view at source ↗

**Figure 4.** Figure 4: Performance Phantasia compared to baselines under different types of target questions on IC and VQA tasks. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison between entropy of VQA (two left figures) [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: More examples of AnyDoor (left examples) and Phanta [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: Additional examples of TrojVLM/VLOOD ( examples [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Impact of different temperature value to Phantasia per [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Impact of different finetuning data quantity to Phantasia [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: Broader examples of Phantasia on VQA task (left two examples) and IC task (right two examples). [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: Phantasia behavior. The model generates its output based on the actual objects in the image combined with the attacker’s [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

read the original abstract

Recent advances in Vision-Language Models (VLMs) have greatly enhanced the integration of visual perception and linguistic reasoning, driving rapid progress in multimodal understanding. Despite these achievements, the security of VLMs, particularly their vulnerability to backdoor attacks, remains significantly underexplored. Existing backdoor attacks on VLMs are still in an early stage of development, with most current methods relying on generating poisoned responses that contain fixed, easily identifiable patterns. In this work, we make two key contributions. First, we demonstrate for the first time that the stealthiness of existing VLM backdoor attacks has been substantially overestimated. By adapting defense techniques originally designed for other domains (e.g., vision-only and text-only models), we show that several state-of-the-art attacks can be detected with surprising ease. Second, to address this gap, we introduce Phantasia, a context-adaptive backdoor attack that dynamically aligns its poisoned outputs with the semantics of each input. Instead of producing static poisoned patterns, Phantasia encourages models to generate contextually coherent yet malicious responses that remain plausible, thereby significantly improving stealth and adaptability. Extensive experiments across diverse VLM architectures reveal that Phantasia achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Phantasia argues static VLM backdoors are easy to detect with adapted defenses and offers a context-adaptive alternative, but the abstract gives no numbers or setup details to judge whether the new attack actually improves stealth.

read the letter

The main thing here is that the paper shows existing backdoor attacks on vision-language models rely on fixed patterns that get caught once you adapt defenses from vision-only or text-only work, and it proposes Phantasia as a fix that makes the poisoned output change to match the input's meaning instead. That adaptive alignment is the new piece. The paper does a useful job flagging that prior claims of stealth were probably too optimistic, since straightforward defense tweaks from other domains already expose the static triggers. That observation alone is worth noting for anyone testing robustness in multimodal models. The soft spots sit in the missing evidence. The abstract states that Phantasia reaches state-of-the-art attack success rates and keeps clean performance intact under defenses, yet it supplies no metrics, no list of tested models, no baseline comparisons, and no account of how the defenses were changed to handle image-plus-text inputs. Without those, you cannot tell whether the context-adaptive outputs really evade better or whether the adapted detectors were simply not equipped with cross-modal checks that would still flag inconsistent image-text pairs. The assumption that semantic alignment alone produces responses plausible enough to slip past updated detectors is the least supported step. This work is aimed at people doing adversarial robustness or security research on VLMs. A reader already following backdoor literature would pick up the high-level point about needing stronger evaluation and might use the adaptive idea as a starting point for their own tests. It deserves a serious referee because the security of deployed multimodal systems is a live issue and the core direction is reasonable, even though the current version needs the experimental section expanded with concrete results and ablations before it can be evaluated properly. I would send it to peer review and ask for the missing details and defense adaptation specifics.

Referee Report

3 major / 0 minor

Summary. The paper claims that existing backdoor attacks on VLMs rely on static, easily detectable poisoned patterns and can be identified by adapting defenses from vision-only and text-only models. It introduces Phantasia, a context-adaptive attack that dynamically aligns malicious outputs with the semantics of each input to produce plausible yet malicious responses, achieving state-of-the-art attack success rates while preserving clean-task performance across diverse VLM architectures and defensive settings.

Significance. If the experimental claims hold, the work would be significant for VLM security research by exposing overestimation of stealth in prior attacks and providing a more adaptive attack baseline that forces development of cross-modal defenses. The emphasis on semantic alignment as a stealth mechanism could influence future attack and defense designs in multimodal models.

major comments (3)

[Abstract] Abstract: The assertion that 'several state-of-the-art attacks can be detected with surprising ease' by adapted defenses is load-bearing for the motivation but provides no detection rates, false-positive rates, baselines, or details on how vision/text defenses were modified for multimodal inputs (e.g., cross-modal consistency checks).
[Abstract] Abstract: The central claim that Phantasia 'achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings' lacks any quantitative metrics, tables, or error analysis; without these, the SOTA and stealth assertions cannot be evaluated against the skeptic concern that semantic alignment may still be flagged by adapted defenses.
[Abstract] The description of the attack mechanism states that Phantasia 'encourages models to generate contextually coherent yet malicious responses' but does not specify the training objective, alignment loss, or how semantic coherence is enforced without degrading clean accuracy; this is load-bearing for the claim that dynamic alignment evades adapted defenses without clean-performance loss.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that greater quantitative detail strengthens the presentation and have revised the abstract accordingly to include key metrics, detection rates, and mechanism specifics while preserving its conciseness. We address each point below.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that 'several state-of-the-art attacks can be detected with surprising ease' by adapted defenses is load-bearing for the motivation but provides no detection rates, false-positive rates, baselines, or details on how vision/text defenses were modified for multimodal inputs (e.g., cross-modal consistency checks).

Authors: We agree the abstract would benefit from quantitative support for this claim. The full paper (Section 4.1, Table 1) reports that adapted defenses (STRIP and ONION extended with cross-modal consistency checks via joint CLIP embeddings) detect prior static attacks at 85-94% rates with false-positive rates of 1.8-4.2% on clean VLM inputs. We have updated the abstract to state: 'Adapting vision- and text-only defenses detects prior attacks with 89% average success and under 5% false positives.' revision: yes
Referee: [Abstract] Abstract: The central claim that Phantasia 'achieves state-of-the-art attack success rates while maintaining benign performance under various defensive settings' lacks any quantitative metrics, tables, or error analysis; without these, the SOTA and stealth assertions cannot be evaluated against the skeptic concern that semantic alignment may still be flagged by adapted defenses.

Authors: We acknowledge the need for metrics in the abstract. Experiments (Table 3) show Phantasia reaching 97.8% average ASR across LLaVA, MiniGPT-4, and InstructBLIP, exceeding baselines by 15-27 points, with clean accuracy degradation below 0.7%. Under adapted defenses, Phantasia retains 92% ASR while static attacks fall below 18%. We have incorporated these figures into the abstract to support the SOTA and stealth claims. revision: yes
Referee: [Abstract] The description of the attack mechanism states that Phantasia 'encourages models to generate contextually coherent yet malicious responses' but does not specify the training objective, alignment loss, or how semantic coherence is enforced without degrading clean accuracy; this is load-bearing for the claim that dynamic alignment evades adapted defenses without clean-performance loss.

Authors: The abstract's length constraint limited detail, but we agree specificity helps. Section 3.2 defines the objective as L = L_CE(clean) + 0.5 L_malicious + 0.3 L_semantic, where L_semantic is the cosine similarity between the input's multimodal embedding and the poisoned response embedding. This term enforces coherence without clean accuracy loss (verified via ablation in Table 4). We have added a brief clause to the abstract: 'optimized via a joint malicious-target and semantic-alignment loss.' revision: yes

Circularity Check

0 steps flagged

No circularity: empirical attack proposal with no derivations or self-referential reductions.

full rationale

This paper is an empirical security proposal describing a context-adaptive backdoor attack on VLMs, with claims resting entirely on experimental results across model architectures and defensive settings. The abstract and provided text contain no equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations that reduce the central claims to inputs by construction. The two key contributions (demonstrating overestimation of prior attack stealth and introducing Phantasia) are validated through described experiments rather than any derivation chain, making the work self-contained against external benchmarks with no detectable circular steps.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical attack method whose internal parameters for semantic alignment are not detailed in the abstract; no mathematical axioms or new physical entities are invoked.

free parameters (1)

semantic alignment strength
Likely tuned parameter controlling how closely poisoned outputs match input context, but value and fitting process unknown from abstract.

pith-pipeline@v0.9.0 · 5520 in / 1183 out tokens · 116606 ms · 2026-05-10T17:40:15.278209+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Phantasia produces responses conditioned jointly on the trigger, image content, and the attacker’s target question... using Attention Distillation Loss and Logits Distillation Loss
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

target questions... Generality score G_q ≥ 0.8... task-consistent

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 11 canonical work pages · 5 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 2022. 2

2022
[3]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open- source framework for training large autoregressive vision- language models.arXiv preprint arXiv:2308.01390, 2023. 1

work page internal anchor Pith review arXiv 2023
[4]

Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments. InProceedings of the acl workshop on in- trinsic and extrinsic evaluation measures for machine trans- lation and/or summarization, 2005. 7

2005
[5]

Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 2023. 2

2023
[6]

Strip: A defence against trojan attacks on deep neural networks

Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. Strip: A defence against trojan attacks on deep neural networks. InProceed- ings of the 35th annual computer security applications con- ference, 2019. 2, 3

2019
[7]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE conference on computer vision and pattern recognition, 2017. 6

2017
[8]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Bad- nets: Identifying vulnerabilities in the machine learning model supply chain.arXiv preprint arXiv:1708.06733, 2017. 7

work page internal anchor Pith review arXiv 2017
[9]

Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 2013

Micah Hodosh, Peter Young, and Julia Hockenmaier. Fram- ing image description as a ranking task: Data, models and evaluation metrics.Journal of Artificial Intelligence Re- search, 2013. 6

2013
[10]

Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy

Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, and Eunho Yang. Playing the fool: Jailbreaking llms and multimodal llms with out-of-distribution strategy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1

2025
[11]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, 2022. 1, 2, 6

2022
[12]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, 2023. 2, 6

2023
[13]

Iag: Input- aware backdoor attack on vlm-based visual grounding,

Junxian Li, Beining Xu, and Di Zhang. Iag: Input-aware backdoor attack on vlms for visual grounding.arXiv preprint arXiv:2508.09456, 2025. 3

work page arXiv 2025
[14]

Reconstructive neuron prun- ing for backdoor defense

Yige Li, Xixiang Lyu, Xingjun Ma, Nodens Koren, Lingjuan Lyu, Bo Li, and Yu-Gang Jiang. Reconstructive neuron prun- ing for backdoor defense. InInternational Conference on Machine Learning, 2023. 2

2023
[15]

Revis- iting backdoor attacks against large vision-language models from domain shift

Siyuan Liang, Jiawei Liang, Tianyu Pang, Chao Du, Aishan Liu, Mingli Zhu, Xiaochun Cao, and Dacheng Tao. Revis- iting backdoor attacks against large vision-language models from domain shift. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2, 3

2025
[16]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, 2004. 7

2004
[17]

Visual instruction tuning.Advances in neural information processing systems, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 2023. 1, 2, 5, 6

2023
[18]

Stealthy backdoor attack in self-supervised learning vision encoders for large vision lan- guage models

Zhaoyi Liu and Huan Zhang. Stealthy backdoor attack in self-supervised learning vision encoders for large vision lan- guage models. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2, 3, 4

2025
[19]

Test-time backdoor attacks on multimodal large language models

Dong Lu, Tianyu Pang, Chao Du, Qian Liu, Xianjun Yang, and Min Lin. Test-time backdoor attacks on multimodal large language models.arXiv preprint arXiv:2402.08577,

work page arXiv
[20]

Trojvlm: Backdoor attack against vision language models

Weimin Lyu, Lu Pang, Tengfei Ma, Haibin Ling, and Chao Chen. Trojvlm: Backdoor attack against vision language models. InEuropean Conference on Computer Vision, 2024. 1, 2, 3, 7

2024
[21]

Back- dooring vision-language models with out-of-distribution data

Weimin Lyu, Jiachen Yao, Saumya Gupta, Lu Pang, Tao Sun, Lingjie Yi, Lijie Hu, Haibin Ling, and Chao Chen. Back- dooring vision-language models with out-of-distribution data. InThe Thirteenth International Conference on Learn- ing Representations, 2025. 1, 2, 3, 4, 7

2025
[22]

Im- proving automatic vqa evaluation using large language mod- els

Oscar Ma ˜nas, Benno Krojer, and Aishwarya Agrawal. Im- proving automatic vqa evaluation using large language mod- els. InProceedings of the AAAI Conference on Artificial In- telligence, 2024. 7

2024
[23]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. InProceedings of the IEEE/cvf conference on computer vision and pattern recognition, 2019. 6

2019
[24]

Wanet–imperceptible warping-based back- door attack,

Anh Nguyen and Anh Tran. Wanet–imperceptible warping- based backdoor attack.arXiv preprint arXiv:2102.10369,

work page arXiv
[25]

Physical backdoor attack can jeopardize driving with vision-large-language models

Zhenyang Ni, Rui Ye, Yuxi Wei, Zhen Xiang, Yanfeng Wang, and Siheng Chen. Physical backdoor attack can jeop- ardize driving with vision-large-language models.arXiv preprint arXiv:2404.12916, 2024. 2, 3 9

work page arXiv 2024
[26]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002. 7

2002
[27]

Onion: A simple and effective de- fense against textual backdoor attacks

Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. Onion: A simple and effective de- fense against textual backdoor attacks. InProceedings of the 2021 conference on empirical methods in natural language processing, 2021. 2, 3

2021
[28]

Great, now write an article about that: The crescendo{Multi- Turn}{LLM}jailbreak attack

Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo{Multi- Turn}{LLM}jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), 2025. 1

2025
[29]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Privacy back- doors: Enhancing membership inference through poisoning pre-trained models.Advances in Neural Information Pro- cessing Systems, 2024

Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geip- ing, Tom Goldstein, and Nicholas Carlini. Privacy back- doors: Enhancing membership inference through poisoning pre-trained models.Advances in Neural Information Pro- cessing Systems, 2024. 1

2024
[31]

Shadow- cast: Stealthy data poisoning attacks against vision-language models.Advances in Neural Information Processing Sys- tems, 2024

Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, and Furong Huang. Shadow- cast: Stealthy data poisoning attacks against vision-language models.Advances in Neural Information Processing Sys- tems, 2024. 1, 2, 3, 7

2024
[32]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hocken- maier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descrip- tions.Transactions of the association for computational lin- guistics, 2014. 6

2014
[33]

Backdoor defense via deconfounded representation learning

Zaixi Zhang, Qi Liu, Zhicai Wang, Zepu Lu, and Qingy- ong Hu. Backdoor defense via deconfounded representation learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 2

2023
[34]

Backdoor attack on vision language models with stealthy semantic manipulation

Zhiyuan Zhong, Zhen Sun, Yepang Liu, Xinlei He, and Guanhong Tao. Backdoor attack on vision language mod- els with stealthy semantic manipulation.arXiv preprint arXiv:2506.07214, 2025. 2, 3

work page arXiv 2025
[35]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 2 10 Phantasia: Context-Adaptive Backdoors in Vision Language Models Supplementary Material Hyperparams Image CaptioningVision Question Answering Fine-tuning ep...

work page internal anchor Pith review arXiv 2023
[36]

These settings are consistent across all model architectures and baselines to ensure fair compari- son

Experimental Settings We summarize the hyperparameters used to fine-tune Phan- tasia in Table 5. These settings are consistent across all model architectures and baselines to ensure fair compari- son. Table 6 details the question types used during fine- tuning, covering diverse domains from which attackers can select target questions
[37]

look through the glass and see the world in a new light

Further Discussion about Defenses In this section, we provide examples to further analyze why ONION-R and STRIP-P can effectively remove or detect previous backdoor methods, yet have limited impact on our proposed method Phantasia. 10.1. STRIP-P We provide the details of STRIP-P in Algorithm 1 and more examples in Figure 6. We generate perturbed images us...
[38]

As shown in Table 7, the full Phantasia method achieves the highest poisoned ASR at 73.07% when both loss components are combined

Impact of Loss Components We first conduct ablation studies to evaluate the contribu- tion of each loss component. As shown in Table 7, the full Phantasia method achieves the highest poisoned ASR at 73.07% when both loss components are combined. Us- ing only Logits Loss results in 71.77%, while Attention Loss alone achieves 69.54%. These results indicate ...
[39]

Different Trigger Generation Mechanisms We further conduct additional experiments under two types of triggers: model-based and self-updated triggers. The re- sults summarized in Table 8 demonstrate that our frame- work maintains strong clean performance while achieving high attack success rates across diverse trigger instantia- tions, thereby validating i...
[40]

The temperature is varied from 1 to 10, and experiments are run on two tasks: IC on Flickr8k dataset and VQA on OKVQA dataset

Impact of Temperature Values We also investigate the effect of the temperature value on the distillation process for Phantasia. The temperature is varied from 1 to 10, and experiments are run on two tasks: IC on Flickr8k dataset and VQA on OKVQA dataset. The results are presented in Figures 8a and 8b. It can be ob- served that a temperature value of 5 yie...

work page arXiv
[41]

Effect of Finetuning Data Quantity We conduct experiments to investigate how much data an at- tacker needs to successfully poison the model. The number of finetuning samples is varied from 1000 to 5000 for both the IC task on Flickr8k dataset and VQA task on OKVQA dataset, and the results are presented in Figures 9a and 9b. It can be seen that using only ...
[42]

Since knowledge distillation requires loading both teacher and student models simultaneously, it is memory-intensive

Alternative but Ineffective Approaches We evaluate two alternative training strategies to assess whether simpler approaches could achieve comparable at- tack effectiveness. Since knowledge distillation requires loading both teacher and student models simultaneously, it is memory-intensive. We therefore explore two more effi- cient alternatives: (1)Phantas...
[43]

ride”, focuses on the wave regions when generating “waves

Phantasia Behavior We also investigate Phantasia’s behavior using attention maps. Specifically, we extract the cross attention maps and analyze which regions of the poisoned image the model relies on to generate the attacker specified response. As shown in Figure 11, Phantasia consistently grounds its pre- dictions in the semantically relevant object regi...