
arxiv: 2604.06502 · v1 · submitted 2026-04-07 · 💻 cs.LG

VLMShield: Efficient and Robust Defense of Vision-Language Models against Malicious Prompts

Pith reviewed 2026-05-10 18:33 UTC · model grok-4.3

classification 💻 cs.LG
keywords: vision-language models · malicious prompt detection · multimodal feature extraction · safety defense · plug-and-play detector · robustness · efficiency

The pith

VLMShield detects malicious prompts in vision-language models by identifying distinct patterns in unified multimodal features

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that vision-language models can be defended against malicious prompt attacks with a lightweight add-on detector. It starts by creating a way to pull combined features from images and long texts into one representation. Analysis of these features reveals that safe and harmful prompts form separate groups. The resulting detector slots in without changing the original model and runs quickly while catching attacks reliably. This would matter if true because it removes the usual tradeoff between safety and speed in multimodal systems.

Core claim

The authors establish that the Multimodal Aggregated Feature Extraction framework produces unified representations in which benign and malicious prompts exhibit distinct distributional patterns. This separation directly supports VLMShield as a plug-and-play safety detector that identifies multimodal malicious attacks with superior robustness, efficiency, and preserved utility across tested conditions.
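A minimal sketch of what "plug-and-play" could mean operationally, assuming the detector is a scoring function that runs before the model and never touches its weights. The names `make_guarded_vlm`, `extract_features`, `malicious_score`, the 0.5 threshold, and the refusal string are all hypothetical and do not come from the paper or its repository.

```python
from typing import Callable, Sequence


def make_guarded_vlm(
    vlm: Callable[[str, bytes], str],
    extract_features: Callable[[str, bytes], Sequence[float]],
    malicious_score: Callable[[Sequence[float]], float],
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
    refusal: str = "Request blocked by safety detector.",
) -> Callable[[str, bytes], str]:
    """Wrap a VLM so a detector screens every prompt before the model runs."""

    def guarded(text: str, image: bytes) -> str:
        features = extract_features(text, image)
        if malicious_score(features) >= threshold:
            return refusal  # the underlying VLM is never called
        return vlm(text, image)  # benign prompts pass through unchanged

    return guarded


# Toy stand-ins so the sketch runs without any model.
calls = []


def toy_vlm(text: str, image: bytes) -> str:
    calls.append(text)
    return "answer: " + text


def toy_features(text: str, image: bytes) -> list:
    return [float(len(text)), float("attack" in text)]


def toy_score(features: Sequence[float]) -> float:
    return features[1]


guarded = make_guarded_vlm(toy_vlm, toy_features, toy_score)
benign_reply = guarded("describe the photo", b"")
blocked_reply = guarded("attack instructions", b"")
```

Because the wrapper only composes callables, swapping the underlying model would not require retraining it, which is the property the core claim relies on.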

What carries the argument

The Multimodal Aggregated Feature Extraction (MAFE) framework, which lets CLIP handle long text and fuses visual and textual data into unified representations that expose attack patterns.
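The general shape of such a pipeline can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: long text is split into windows of 77 tokens (the context length of CLIP's text encoder), the chunk embeddings are averaged (mean pooling is an assumption here; the paper describes "progressive text aggregation" without the rule being visible on this page), and the result is concatenated with the image embedding. The helpers `chunk_tokens`, `mean_pool`, and `fuse` and the toy encoders are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def chunk_tokens(tokens: Sequence[str], window: int) -> List[List[str]]:
    """Split a long token sequence into consecutive windows the encoder accepts."""
    return [list(tokens[i:i + window]) for i in range(0, len(tokens), window)]


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average chunk embeddings component-wise (assumed aggregation rule)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fuse(
    text_tokens: Sequence[str],
    image: bytes,
    encode_text: Callable[[List[str]], Vector],
    encode_image: Callable[[bytes], Vector],
    window: int = 77,
) -> Vector:
    """Aggregate text chunk embeddings, then concatenate with the image embedding."""
    chunks = chunk_tokens(text_tokens, window) or [[]]  # keep one empty chunk for empty text
    text_embedding = mean_pool([encode_text(chunk) for chunk in chunks])
    image_embedding = encode_image(image)
    return text_embedding + image_embedding  # list concatenation


# Toy encoders standing in for CLIP's text and image towers.
def toy_text_encoder(chunk: List[str]) -> Vector:
    return [float(len(chunk)), 1.0]


def toy_image_encoder(image: bytes) -> Vector:
    return [float(len(image))]


joint = fuse(["t"] * 100, b"abc", toy_text_encoder, toy_image_encoder)
```

With 100 toy tokens the text splits into windows of 77 and 23 tokens, so nothing is lost to truncation; the joint vector has the text dimensions followed by the image dimensions.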

If this is right

  • Existing vision-language models gain protection from malicious prompts with only minimal added computation.
  • The original model requires no retraining or modification to gain the defense.
  • Performance remains high across multiple attack types and model variants in the reported tests.
  • Normal task accuracy and response quality of the vision-language model stay unchanged.
  • The approach offers a direct route to safer multimodal AI deployment in practical settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern-based separation might apply to detecting other forms of adversarial input in multimodal systems.
  • If the patterns prove stable, the detector could be tested for real-time use in live applications.
  • Layering this method with other safety checks could build stronger protections at low extra cost.

Load-bearing premise

The distinct distributional patterns between benign and malicious prompts in the extracted features must hold consistently across different models, attack variants, and real-world inputs.

What would settle it

Applying the detector to a new vision-language model and a novel malicious prompt set and finding that the feature distributions of benign and malicious cases overlap rather than separate would disprove the central claim.
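One way such a check could be scored is sketched below, assuming each prompt has already been reduced to a single detector score. AUROC is a swapped-in separability measure chosen for illustration, not a metric this page attributes to the paper: a value near 1.0 means the two groups separate, while a value near 0.5 means they overlap. The function name `auroc` and the toy scores are hypothetical.

```python
from typing import Sequence


def auroc(benign_scores: Sequence[float], malicious_scores: Sequence[float]) -> float:
    """Probability that a random malicious score exceeds a random benign one.

    Ties count as half a win, which matches the rank-based definition of AUROC.
    """
    wins = 0.0
    for m in malicious_scores:
        for b in benign_scores:
            if m > b:
                wins += 1.0
            elif m == b:
                wins += 0.5
    return wins / (len(benign_scores) * len(malicious_scores))


# Made-up scores: one well-separated case and one fully overlapping case.
separated = auroc([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
overlapping = auroc([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
```

On a new model and a novel attack set, a result close to the `overlapping` case would be the kind of evidence that undercuts the central claim.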

Figures

Figures reproduced from arXiv: 2604.06502 by Cheng Hong, Jialin Wu, Kunsheng Tang, Nenghai Yu, Peigui Qi, Weiming Zhang, Wenbo Zhou, Yanpu Yu, Yide Song, Zhicong Huang.

Figure 1: Prompt examples of benign, direct malicious, and jailbreak attacks against VLMs.
Figure 2: Overview of the CLIP-based MAFE framework for processing multimodal prompts through progressive text aggregation and cross-modal feature fusion. Accompanying text (Cross-Modal Feature Fusion): the image [CLS] embedding is extracted with CLIP's image encoder, $E_{\text{image}} = \mathrm{CLIP}_{\text{image}}(I)[\mathrm{CLS}] \in \mathbb{R}^{768}$ (Eq. 5), and the aggregated text embedding and image embedding are then combined through concatenation ($\oplus$) to form $E_{\text{joint}}$.
Figure 3: Distribution of MAFE-extracted features showing clear separation between benign prompts (green) and malicious attacks (red, blue, and orange) in t-SNE visualizations. The PCA visualization result is in Appendix A.1.
Figure 4: (a) Defense workflow using VLMShield: multimodal inputs first undergo MAFE feature extraction, then VLMShield performs safety detection to either block malicious prompts or forward benign ones to VLMs, and (b) detailed architecture and training pipeline of VLMShield.
Figure 5: Comprehensive distributional analysis of MAFE-extracted features showing clear separation between benign prompts (green) and malicious attacks (red, blue, orange) across PCA with density estimation (left) and t-SNE visualization (right). These visualizations demonstrate that our MAFE successfully transforms multimodal prompts into a unified feature space where safety-relevant patterns naturally emerge.
Figure 6: Distributional analysis without long text processing showing poor separation between prompt categories due to information loss from text truncation.
Figure 7: Distributional analysis with text-only processing showing incomplete separation patterns due to missing visual information, particularly affecting detection of image-based attacks.
Figure 8: Distributional analysis with image-only processing showing inadequate separation due to missing textual information, particularly affecting detection of text-based jailbreak attacks.
Figure 9: Cross-category distributional analysis comparing three feature extraction approaches. Only MAFE (right) achieves clear separation between benign (green) and malicious categories (red, blue, orange) while maintaining distinct attack-type clustering. Traditional features (left) show complete category intermixing, while VLM representations (middle) achieve only partial separation.
Figure 10: Within-category analysis for image-based jailbreak datasets. MAFE (right) demonstrates semantic convergence where three datasets employing different visual attack techniques cluster together. Traditional features (left) scatter randomly while VLM representations (middle) show fragmented grouping, highlighting MAFE’s superior ability to capture attack semantics beyond dataset artifacts.
Figure 11: Within-category analysis for text-based jailbreak datasets. MAFE (right) achieves unified clustering despite different manipulation strategies, demonstrating semantic understanding. Traditional and VLM-based features fail to recognize shared malicious intent across datasets.
Figure 12: Within-category analysis for direct malicious datasets. MAFE (right) demonstrates convergence across evaluation benchmarks with clear benign-malicious separation, while alternative approaches show inconsistent patterns.
Figure 13: Representative false positive cases.
Figure 14: Attention evolution of the EOS token across CLIP text encoder layers. The EOS token progressively concentrates attention from uniform distribution (early layers) to semantically salient tokens like “bigger,” “garage,” and “living room” (deep layers), while the SOS token maintains self-attention throughout. This demonstrates EOS token’s semantic aggregation property.
Figure 15: Attention evolution of the CLS token across CLIP vision encoder layers. The CLS token shifts from uniform spatial attention (early layers) to focused attention on discriminative regions corresponding to the garage and living room (deep layers). Image patches are grouped into 4×4 spatial regions for visualization clarity, with one representative patch per region shown.
Original abstract

Vision-Language Models (VLMs) face significant safety vulnerabilities from malicious prompt attacks due to weakened alignment during visual integration. Existing defenses suffer from efficiency and robustness. To address these challenges, we first propose the Multimodal Aggregated Feature Extraction (MAFE) framework that enables CLIP to handle long text and fuse multimodal information into unified representations. Through empirical analysis of MAFE-extracted features, we discover distinct distributional patterns between benign and malicious prompts. Building upon this finding, we develop VLMShield, a lightweight safety detector that efficiently identifies multimodal malicious attacks as a plug-and-play solution. Extensive experiments demonstrate superior performance across multiple dimensions, including robustness, efficiency, and utility. Through our work, we hope to pave the way for more secure multimodal AI deployment. Code is available at [this https URL](https://github.com/pgqihere/VLMShield).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes the Multimodal Aggregated Feature Extraction (MAFE) framework, which modifies CLIP to process long texts and produce unified multimodal representations. Empirical analysis of these features reveals distinct distributional patterns separating benign from malicious prompts. Building on this observation, the authors introduce VLMShield, a lightweight plug-and-play safety detector for identifying multimodal malicious attacks on vision-language models. The abstract claims that extensive experiments demonstrate superior performance in robustness, efficiency, and utility compared to existing defenses.

Significance. If the MAFE-derived distributional separation proves stable, VLMShield would provide a computationally lightweight, model-agnostic defense layer that could be deployed without retraining the underlying VLM. The public code release supports reproducibility and would allow the community to test the claimed efficiency gains directly.

major comments (2)
  1. [Abstract and §3 (MAFE and feature analysis)] The central claim that MAFE produces stable, separable benign/malicious feature distributions is load-bearing for all downstream robustness and utility assertions, yet the manuscript provides no cross-model transfer experiments, no evaluation on unseen VLMs or attack generators, and no quantitative measures (e.g., Wasserstein distance or decision-margin statistics) to show the separation is not an artifact of the chosen CLIP variants and datasets.
  2. [Abstract and experimental section] The assertion of 'superior performance across multiple dimensions' is unsupported by any visible quantitative results, baseline comparisons, dataset descriptions, or statistical significance tests in the provided text, preventing verification that the data actually underwrite the robustness and efficiency claims.
minor comments (1)
  1. [Abstract] The GitHub link appears as the placeholder '[this https URL]'; replace with the actual repository URL.
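As an illustration of the quantitative measure requested in the first major comment, the sketch below computes a one-dimensional Wasserstein-1 distance between two equal-size samples, which reduces to the mean absolute difference of the sorted values. It assumes the high-dimensional features have first been projected to a single score per prompt; that projection, the function name `wasserstein_1d`, and the toy values are assumptions for illustration only.

```python
from typing import Sequence


def wasserstein_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal sample sizes the optimal transport plan pairs the i-th smallest
    value of one sample with the i-th smallest value of the other.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Made-up projected scores for two prompt groups.
shifted = wasserstein_1d([0.0, 1.0, 2.0], [5.0, 3.0, 4.0])
identical = wasserstein_1d([0.0, 1.0, 2.0], [2.0, 0.0, 1.0])
```

A distance of zero means the two samples are indistinguishable along that projection; reporting the value per model and per attack type would show whether the separation is stable or specific to one setup.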

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, committing to revisions where they strengthen the work without misrepresenting our current results.

  1. Referee: [Abstract and §3 (MAFE and feature analysis)] The central claim that MAFE produces stable, separable benign/malicious feature distributions is load-bearing for all downstream robustness and utility assertions, yet the manuscript provides no cross-model transfer experiments, no evaluation on unseen VLMs or attack generators, and no quantitative measures (e.g., Wasserstein distance or decision-margin statistics) to show the separation is not an artifact of the chosen CLIP variants and datasets.

    Authors: We agree that quantitative measures would make the separation evidence more rigorous. In the revised manuscript we will add Wasserstein distances and decision-margin statistics to §3 to quantify the observed distributional differences. Our existing experiments already cover multiple CLIP variants and show consistent patterns, but we acknowledge the lack of tests on entirely unseen VLMs or attack generators. We will add an explicit limitations paragraph noting this scope constraint and outlining it as future work rather than claiming broader transfer. revision: partial

  2. Referee: [Abstract and experimental section] The assertion of 'superior performance across multiple dimensions' is unsupported by any visible quantitative results, baseline comparisons, dataset descriptions, or statistical significance tests in the provided text, preventing verification that the data actually underwrite the robustness and efficiency claims.

    Authors: The full manuscript contains quantitative results, baseline comparisons, and dataset descriptions in §4. To improve clarity and verifiability we will revise the experimental section to foreground these elements, add statistical significance tests (e.g., paired t-tests with p-values), and ensure all tables and metrics are explicitly referenced from the abstract onward. revision: yes
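For readers unfamiliar with the paired t-test mentioned in the second response, a minimal sketch of the test statistic follows. It assumes two methods are evaluated on the same datasets so that their results can be paired; the function name `paired_t_statistic` and the accuracy values are made up for illustration and are not results from the paper. Converting the statistic to a p-value requires the t-distribution, which the Python standard library does not provide (SciPy's `scipy.stats.ttest_rel` does both steps).

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))  # sample std. dev.
    return mean_diff / std_err, len(diffs) - 1


# Hypothetical per-dataset accuracies for two defenses on the same four datasets.
method_a = [0.90, 0.92, 0.94, 0.96]
method_b = [0.85, 0.88, 0.89, 0.90]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

Pairing matters because it removes dataset-to-dataset variation from the comparison, so a consistent small gain can still be significant.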

Circularity Check

0 steps flagged

No significant circularity; derivation rests on empirical observation and experimental validation


The paper proposes the MAFE framework to aggregate multimodal features with CLIP, empirically observes distributional differences between benign and malicious prompts, and constructs VLMShield as a lightweight detector on that basis. No equations, parameter fits, or self-citations are presented that reduce the central claims (robustness, efficiency, plug-and-play utility) to tautological redefinitions or inputs by construction. The work is self-contained via reported experiments rather than any load-bearing loop.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the detector likely depends on an implicit classification threshold or decision boundary fitted to the observed feature distributions.

free parameters (1)
  • detection threshold or boundary
    A cutoff or classifier parameter separating benign and malicious feature clusters must be chosen or fitted, though its exact form is not stated.
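One common way such a cutoff is fitted, shown here only as an assumption about how the free parameter might be set, is to pick the threshold from benign validation scores so that the false positive rate stays at or below a target. The function name `threshold_at_fpr`, the rule "flag when score is strictly above the threshold", and the toy scores are hypothetical.

```python
import math
from typing import Sequence


def threshold_at_fpr(benign_scores: Sequence[float], target_fpr: float) -> float:
    """Smallest threshold whose false positive rate on benign scores is <= target_fpr.

    A prompt is flagged when its score is strictly greater than the threshold.
    """
    ordered = sorted(benign_scores)
    n = len(ordered)
    allowed = math.floor(target_fpr * n)  # benign prompts we tolerate flagging
    if allowed >= n:
        return float("-inf")  # every benign prompt may be flagged
    return ordered[n - allowed - 1]


# Made-up benign validation scores 1..10 and a 20% false positive budget.
benign = [float(i) for i in range(1, 11)]
cutoff = threshold_at_fpr(benign, 0.2)
false_positive_rate = sum(score > cutoff for score in benign) / len(benign)
```

Fitting the cutoff on benign data alone keeps the utility cost explicit, but it also means the threshold inherits any shift between validation prompts and real traffic.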

pith-pipeline@v0.9.0 · 5479 in / 1194 out tokens · 75959 ms · 2026-05-10T18:33:39.073569+00:00 · methodology



Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

  1. [1]

    Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., et al.

    URL https://openreview.net/forum?id=S1RKWSyZ2Y. Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., et al. FigStep: Jailbreaking large vision-language models via typographic visual prompts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 23951–23959, 2025. Gou, Y., Chen, K., Liu, Z., Hong, L., Xu, H., Li, Z., Yeun...

  2. [2]

    URL https://doi.org/10.1145/3719027.3744835

    doi: 10.1145/3719027.3744835. URL https://doi.org/10.1145/3719027.3744835. Qwen. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2.5-vl/. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural...

  3. [3]

    Tang, K., Zhou, W., Zhang, J., Liu, A., Deng, G., Li, S., Qi, P., Zhang, W., Zhang, T., and Yu, N

    URL https://openreview.net/forum?id=plmBsXHxgR. Tang, K., Zhou, W., Zhang, J., Liu, A., Deng, G., Li, S., Qi, P., Zhang, W., Zhang, T., and Yu, N. Gendercare: A comprehensive framework for assessing and reducing gender bias in large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS 2024, Salt...

  4. [4]

    URL https://doi.org/10.1145/3658644.3670284

    doi: 10.1145/3658644.3670284. URL https://doi.org/10.1145/3658644.3670284. Wang, H., Wang, G., and Zhang, H. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, pp. 29947–29957. Computer Visio...

  5. [5]

    Nature Machine Intelligence

    doi: 10.1038/S42256-023-00765-8. URL https://doi.org/10.1038/s42256-023-00765-8. Xu, Y., Qi, X., Qin, Z., and Wang, W. Defending jailbreak attack in vlms via cross-modality information detector. CoRR, abs/2407.21659, 2024. doi: 10.48550/ARXIV.2407.21659. URL https://doi.org/10.48550/arXiv.2407.21659. Yi, J., Xie, Y., Zhu, B., Kiciman, E., Sun, G., Xi...

  6. [6]

    Dickerson

    URL https://openreview.net/forum?id=KOTutrSR2y. Zhang, T., Zhang, C., Morris, J. X., Bagdasaryan, E., and Shmatikov, V. Soft prompts go hard: Steering visual language models with hidden meta-instructions. CoRR, abs/2407.08970, 2024. doi: 10.48550/ARXIV.2407.08970. URL https://doi.org/10.48550/arXiv.2407.08970. Zhang, X., Zhang, C., Li, T., Huang, Y.,...

  7. [7]

    Hate Speech, Offensive Language

  8. [8]

    Discrimination, Stereotypes, and Injustice

  9. [9]

    Violence, Aiding and Abetting, Incitement

  10. [10]

    Financial Crime, Property Crime, Theft

  11. [11]

    Drug Abuse, Weapons, Banned Substances

  12. [12]

    Non-Violent Unethical Behavior

  13. [13]

    Sexually Explicit, Adult Content

  14. [14]

    Controversial Topics, Politics

  15. [15]

    Misinformation regarding ethics, laws, and safety

  16. [16]

    Terrorism, Organized Crime

  17. [17]

    Yes” if the content violates any safety rules, or “No

    Other Unsafe Content Instructions: - Carefully analyze the provided content for any violations of the above safety categories - Consider both explicit and implicit harmful content - Prioritize safety and err on the side of caution - Focus on potential harm, discrimination, illegal activities, and inappropriate content - Consider context but maintain stric...