DE-FIVE: Detecting Malicious Image Prompts via Fourier Features and Image Vector Embeddings
Pith reviewed 2026-06-26 08:23 UTC · model grok-4.3
The pith
DE-FIVE detects malicious image prompts in vision-language models without training by combining Fourier features and image vector embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DE-FIVE is a training-free framework for detecting malicious image prompts by leveraging Fourier features and the hidden state representations of the visual encoder (image vector embeddings) across perturbations. It employs a hybrid detection strategy consisting of a black-box detector that operates on Fourier-domain features and a white-box detector that exploits image vector embeddings derived from only a few-shot malicious set. Extensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art baselines against malicious image prompts.
What carries the argument
The DE-FIVE hybrid detection strategy consisting of a black-box detector operating on Fourier-domain features and a white-box detector exploiting image vector embeddings derived from a few-shot malicious set.
If this is right
- Vision-language models gain protection against indirect prompt injections without retraining or deployment of additional complex classifiers.
- Detection works in both black-box settings using only image features and white-box settings using internal encoder states.
- The approach requires only a small number of malicious examples rather than large labeled datasets for effective performance.
Where Pith is reading between the lines
- The same Fourier-plus-embedding signals could be monitored at inference time to flag suspicious inputs before they reach the language decoder.
- If the few-shot requirement holds across prompt styles, the method might reduce the data burden for securing other multimodal systems beyond the tested VLMs.
Load-bearing premise
Fourier features suffice for reliable black-box detection and a few-shot malicious set yields generalizable white-box detection via image vector embeddings without any training or extensive validation data.
What would settle it
A test set of malicious image prompts engineered to produce Fourier spectra and visual encoder embeddings indistinguishable from benign images, where detection accuracy falls to random levels.
Figures
read the original abstract
Vision language models (VLMs) employ both visual and textual modalities to enable advanced vision-language inference. However, incorporating visual modalities expands the attack surface of VLMs, making them more susceptible to security threats such as adversarial perturbations and indirect prompt injection, wherein crafted malicious image prompts can elicit unintended model outputs. Existing defense methods against malicious image prompts remain insufficient as they typically demand extensive datasets for retraining or the deployment of additional, complex classifiers. Most critically, there is a profound lack of specialized defense mechanisms specifically targeting indirect prompt injections, a gap that serves as a primary motivation for this work. To address these limitations, we introduce DE-FIVE, a novel training-free framework for detecting malicious image prompts by leveraging Fourier features and the hidden state representations of the visual encoder (image vector embeddings) across perturbations. Specifically, we develop a hybrid detection strategy consisting of a black-box detector that operates on Fourier-domain features and a white-box detector that exploits image vector embeddings derived from only a few-shot malicious set. Extensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art baselines against malicious image prompts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DE-FIVE, a training-free hybrid framework for detecting malicious image prompts (including indirect prompt injections) in vision-language models. It combines a black-box detector operating on Fourier-domain features with a white-box detector that uses image vector embeddings derived from a few-shot malicious set, claiming consistent outperformance over state-of-the-art baselines.
Significance. If the empirical claims hold, the work would be significant for VLM security: it targets the under-served problem of indirect prompt injections with a training-free approach, avoiding the need for large retraining datasets or auxiliary classifiers that characterize prior defenses.
major comments (2)
- [Abstract] Abstract: the central claim that the framework 'consistently outperforms state-of-the-art baselines against malicious image prompts' is presented without any quantitative results, metrics, datasets, or experimental protocol, rendering the primary empirical assertion impossible to evaluate.
- [Method (black-box component)] Black-box detector description: the Fourier-feature component is positioned as a reliable, perturbation-agnostic detector for indirect prompt injections, yet the manuscript provides no analysis, ablation, or results addressing semantic attacks that embed instructions without high-frequency spectral signatures; this assumption is load-bearing for the hybrid framework's claimed advantage over existing methods.
minor comments (1)
- [Abstract] Notation for 'image vector embeddings' and 'hidden state representations of the visual encoder' should be defined more precisely on first use to avoid ambiguity between different VLM architectures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revising the paper to strengthen the presentation and analysis.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the framework 'consistently outperforms state-of-the-art baselines against malicious image prompts' is presented without any quantitative results, metrics, datasets, or experimental protocol, rendering the primary empirical assertion impossible to evaluate.
Authors: We agree that the abstract should provide concrete empirical support for the central claim. The current version emphasizes the methodological novelty but omits specific metrics and protocols. In the revised manuscript we will incorporate key quantitative results (e.g., detection accuracy, F1 scores, and dataset names) directly into the abstract while preserving its length constraints. revision: yes
-
Referee: [Method (black-box component)] Black-box detector description: the Fourier-feature component is positioned as a reliable, perturbation-agnostic detector for indirect prompt injections, yet the manuscript provides no analysis, ablation, or results addressing semantic attacks that embed instructions without high-frequency spectral signatures; this assumption is load-bearing for the hybrid framework's claimed advantage over existing methods.
Authors: We acknowledge that the manuscript does not include dedicated analysis or ablations for semantic attacks lacking high-frequency signatures. The Fourier component targets spectral perturbations, while the hybrid design relies on the white-box embedding detector for broader coverage. In the revision we will add a targeted discussion of this limitation together with new ablation results that evaluate performance on semantically crafted prompts without obvious spectral artifacts. revision: yes
Circularity Check
No derivation chain or equations; empirical framework only
full rationale
The paper introduces an empirical detection framework (DE-FIVE) using Fourier features and image embeddings for malicious prompt detection. No mathematical derivations, predictions, or first-principles results are claimed or present. The abstract and description focus on a hybrid black-box/white-box strategy evaluated experimentally, with no self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce claims to inputs by construction. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Llama: Open and efficient foundation language models,
H. Touvron, T. Lavril, G. Izacard, et al., “Llama: Open and efficient foundation language models,” InArXiv e-prints, 2023
2023
-
[2]
Llama 2: Open foundation and fine-tuned chat models,
H. Touvron, L. Martin, K. Stone, et al., “Llama 2: Open foundation and fine-tuned chat models,” InArXiv e-prints, 2023
2023
-
[3]
The llama 3 herd of models,
A. Grattafiori, A. Dubey, A. Jauhri, et al., “The llama 3 herd of models,” InArXiv e-prints, 2024
2024
-
[4]
Recommender systems in the era of large language models (llms),
Z. Zhao, W. Fan, J. Li, et al., “Recommender systems in the era of large language models (llms),”IEEE Transactions on Knowledge and Data Engineering, 2024
2024
-
[5]
LLaV A- NeXT: Improved reasoning, OCR, and world knowledge,
H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “LLaV A- NeXT: Improved reasoning, OCR, and world knowledge,” 2024, https: //llava-vl.github.io/blog/2024-01-30-llava-next/
2024
-
[6]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Improved baselines with visual instruction tuning,” InProceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 2024
2024
-
[7]
Visual instruction tuning,
H. Liu, C. Li, Y . Li, and Y . J. Lee, “Visual instruction tuning,” InProceed- ings of Advances in Neural Information Processing Systems(NeurIPS), vol. 36, 2023
2023
-
[8]
GPT-4o System Card,
OpenAI, “GPT-4o System Card,” Accessed August 8, 2024, https://cdn. openai.com/gpt-4o-system-card.pdf
2024
-
[9]
GPT-4V(ision) System Card,
OpenAI, “GPT-4V(ision) System Card,” Accessed September 25, 2023, https://cdn.openai.com/papers/GPTV System Card.pdf
2023
-
[10]
Phi-3 technical report: A highly capable language model locally on your phone,
A. Marah, J. Sam Ade, A. Ammar Ahmad, et al., “Phi-3 technical report: A highly capable language model locally on your phone,” InArXiv e- prints, 2024
2024
-
[11]
Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language mod- els,
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language mod- els,” InProceedings of International conference on machine learning (ICML), pages 19730–19742, 2023
2023
-
[12]
Visual question answering instruction: Unlocking multimodal large language model to domain- specific visual multitasks,
J. Lee, S. Cha, Y . Lee, and C. Yang, “Visual question answering instruction: Unlocking multimodal large language model to domain- specific visual multitasks,” InArXiv e-prints, 2024
2024
-
[13]
Towards generalist biomedical AI,
T. Tu, S. Azizi, D. Driess, et al., “Towards generalist biomedical AI,” InNejm Ai, vol. 1, 2024
2024
-
[14]
Are aligned neural networks adversarially aligned?,
N. Carlini, M. Nasr, C. A. Choquette-Choo, et al., “Are aligned neural networks adversarially aligned?,” InProceedings of Advances in Neural Information Processing Systems(NeurIPS), vol. 36, 2023
2023
-
[15]
Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models,
E. Shayegani, Y . Dong, and N. Abu-Ghazaleh, “Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models,” In Proceedings of 12th International Conference on Learning Representa- tions(ICLR), 2024
2024
-
[16]
Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space,
L. Schwinn, D. Dobre, S. Xhonneux, et al., “Soft prompt threats: Attacking safety alignment and unlearning in open-source llms through the embedding space,” InProceedings of Advances in Neural Information Processing Systems(NeurIPS), vol. 37, 2024
2024
-
[17]
Visual adversarial examples jailbreak aligned large language models,
X. Qi, K. Huang, A. Panda, P. Henderson, et al., “Visual adversarial examples jailbreak aligned large language models,” InProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024
2024
-
[18]
Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,
K. Greshake, S. Abdelnabi, S. Mishra, et al., “Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection,” InProceedings of the 16th ACM workshop on artificial intelligence and security, pp. 79-90, 2023
2023
-
[19]
Abusing images and sounds for indirect instruction injection in multi-modal LLMs,
E. Bagdasaryan, T. Hsieh, B. Nassi, and V . Shmatikov, “Abusing images and sounds for indirect instruction injection in multi-modal LLMs,” In ArXiv e-prints, 2023
2023
-
[20]
Image hijacks: Adver- sarial images can control generative models at runtime,
L. Bailey, E. Ong, S. Russell, and S. Emmons, “Image hijacks: Adver- sarial images can control generative models at runtime,” InProceedings of International conference on machine learning (ICML), pp. 2443–2455, 2024
2024
-
[21]
Self-interpreting adversarial images,
T. Zhang, C. Zhang, J. Morris. et al., “Self-interpreting adversarial images,” InProceedings of 34th USENIX Security Symposium (USENIX Security 25), pp. 1037-1052, 2025
2025
-
[22]
An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models,
H. Luo, J. Gu, F. Liu, and P. Torr, “An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models,” InArXiv e-prints, 2024
2024
-
[23]
On the robustness of large multimodal models against image adversarial attacks,
X. Cui, A. Aparcedo, Y . Jang, and S. Lim, “On the robustness of large multimodal models against image adversarial attacks,” InArXiv e-prints, 2023
2023
-
[24]
Detecting language model attacks with perplexity,
G. Alon and M. Kamfonas, “Detecting language model attacks with perplexity,” InArXiv e-prints, 2023
2023
-
[25]
Gradsafe: Detecting unsafe prompts for llms via safety-critical gradient analysis,
Y . Xie, M. Fang, R. Pi, and N. Gong, “Gradsafe: Detecting unsafe prompts for llms via safety-critical gradient analysis,” InProceedings of 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024
2024
-
[26]
Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes,
X. Hu, P. Chen, and T. Ho, “Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes,” InProceed- ings of Advances in Neural Information Processing Systems(NeurIPS), vol. 37, 2024
2024
-
[27]
Eyes closed, safety on: Pro- tecting multimodal llms via image-to-text transformation,
Y . Gou, K. Chen, Z. Liu, L. Hong, et al., “Eyes closed, safety on: Pro- tecting multimodal llms via image-to-text transformation,” InProceedings of European Conference on Computer Vision (ECCV), 2024
2024
-
[28]
Defending jailbreak attack in vlms via cross-modality information detector,
Y . Xu, X. Qi, Z. Qin, and W. Wang., “Defending jailbreak attack in vlms via cross-modality information detector,” InArXiv e-prints, 2024
2024
-
[29]
Mirrorcheck: Efficient adversarial defense for vision-language models,
S. Fares, K. Ziu, T. Aremu, et al., “Mirrorcheck: Efficient adversarial defense for vision-language models,” InArXiv e-prints, 2024
2024
-
[30]
Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states,
Y . Jiang, X. Gao, T. Peng, et al., “Hiddendetect: Detecting jailbreak attacks against large vision-language models via monitoring hidden states,” InArXiv e-prints, 2025
2025
-
[31]
JailGuard: A universal detection framework for prompt-based attacks on LLM systems,
X. Zhang, C. Zhang, T. Li, Y . Huang, et al., “JailGuard: A universal detection framework for prompt-based attacks on LLM systems,” InACM Trans. Softw. Eng. Methodol., 2025
2025
-
[32]
Mllm-protector: Ensuring mllm’s safety without hurting performance,
R. Pi, T. Han, Y . Xie, R. Pan, et al., “Mllm-protector: Ensuring mllm’s safety without hurting performance,” InProceedings of Annual Meeting of the Association for Computational Linguistics (ACL), 2024
2024
-
[33]
Vlmguard: Defending vlms against malicious prompts via unlabeled data,
X. Du, G. Reshmi, S. Robert, et al., “Vlmguard: Defending vlms against malicious prompts via unlabeled data,” InArXiv e-prints, 2024
2024
-
[34]
Mistral-7B-v0.1,
D. Chaplot. A. Q. jiang, A. Sablayrolles, et al., “Mistral-7B-v0.1,” In ArXiv e-prints, 2023
2023
-
[35]
Judging LLM-as-a-Judge with MT-bench and chatbot arena,
L. Zheng, W. Chiang, Y . Sheng, et al., “Judging LLM-as-a-Judge with MT-bench and chatbot arena,” InProceedings of Advances in Neural Information Processing Systems(NeurIPS), vol. 36, 2023
2023
-
[36]
Discriminative blur detection features,
J. Shi, L. Xu, and J. Jia, “Discriminative blur detection features,” In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2965-2972, 2014
2014
-
[37]
Detecting AutoAttack perturba- tions in the frequency domain,
P. Lorenz, P. Harder, D. Straßel, et al., “Detecting AutoAttack perturba- tions in the frequency domain,” InArXiv e-prints, 2021
2021
-
[38]
Frequency-domain blind quality assessment of blurred and blocking-artefact images using Gaussian Pro- cess Regression model,
M. Viqar, M. Athar, K. Ekram, et al., “Frequency-domain blind quality assessment of blurred and blocking-artefact images using Gaussian Pro- cess Regression model,” InSignal Processing: Image Communication, vol. 103, 2022
2022
-
[39]
Reducing hallucinations in large vision-language models via latent space steering,
S. Liu, H. Ye, J. Zou, “Reducing hallucinations in large vision-language models via latent space steering,” InProceedings of 12th International Conference on Learning Representations(ICLR), 2025
2025
-
[40]
Xstest: A test suite for identifying exaggerated safety behaviours in large language models,
P. R ¨ottger, H. Kirk, B. Vidgen, et al., “Xstest: A test suite for identifying exaggerated safety behaviours in large language models,” InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics(NAACL), pp. 5377–5400, 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.