pith. sign in

arxiv: 2605.25194 · v1 · pith:3QSICUOKnew · submitted 2026-05-24 · 💻 cs.LG

Localization then Neutralization: Gradient-guided Token Suppression against Visual Prompt Injection Attack

Pith reviewed 2026-06-30 12:16 UTC · model grok-4.3

classification 💻 cs.LG
keywords visual prompt injectionmultimodal large language modelsadversarial defensegradient-based attributiontoken maskingjailbreak attackshidden state gradients
0
0 comments X

The pith

Masking a few gradient-identified image tokens neutralizes visual prompt injection attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that visual prompt injections succeed by exploiting only a small subset of critical image tokens rather than the full image. It introduces Gradient Token Masking to locate these tokens by computing the norm of gradients on hidden states and then zeroing them out. This localization is proven to rank tokens consistently with the full adversarial loss gradient. The method runs in a single forward-backward pass and is tested on both prompt injection and jailbreak attacks.

Core claim

Successful adversarial attacks on multimodal models depend on a small subset of critical image tokens. Gradient Token Masking attributes influence to tokens via the Hidden-State Gradient Norm score under adversarial inputs, proves this score's ranking matches the full adversarial loss gradient, and neutralizes the attack by masking the top-scoring tokens.

What carries the argument

Gradient Token Masking (GTM) using the Hidden-State Gradient Norm score, which ranks tokens by their influence on generation and enables masking after one forward-backward pass.

If this is right

  • Defense requires only one forward-backward pass with no model retraining.
  • Attack success rates drop to near zero on both prompt injection and multimodal jailbreak cases.
  • Clean input performance is preserved with negligible overhead.
  • The same localization principle applies across different attack types tested in the work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attack paths in vision-language models appear sparse enough that gradient ranking can isolate them without full loss computation.
  • The masking threshold or token count could be set dynamically from the gradient score distribution rather than fixed in advance.
  • Similar single-pass gradient attribution might extend to localizing vulnerabilities in other input modalities.

Load-bearing premise

That attacks always concentrate on a small number of tokens whose masking stops the attack while leaving clean inputs unaffected.

What would settle it

An experiment in which the top-ranked tokens by hidden-state gradient norm are masked yet the attack success rate remains high on the same adversarial inputs.

Figures

Figures reproduced from arXiv: 2605.25194 by Dongpeng Zhang, Gaozheng Pei, Ke Ma, Longtao Huang, Qianqian Xu, Qingming Huang, Yangbangyan Jiang.

Figure 1
Figure 1. Figure 1: Comparison of defensive scopes. Existing methods pri￾marily address explicit jailbreak attacks, leaving models vulnerable to a broader range of general prompt injection attacks. Our method provides comprehensive protection against both. and neutralizes these tokens at inference time. GTM lever￾ages gradients of hidden-state representations to accurately identify high-influence tokens under adversarial inpu… view at source ↗
Figure 2
Figure 2. Figure 2: Results of the sliding-window masking experiment. We used a universal prompt injection image generated by ImgHijack. The target model is Qwen2-VL. The perturbation is constrained to l∞ = 32. The attack targets a specific string. 4. Proposed Method 4.1. Sparse Adversarial Token Dependence: A Masking-Based Analysis Existing adversarial attacks against deep learning models, particularly prompt injection attac… view at source ↗
Figure 3
Figure 3. Figure 3: Category scores on MMVet. Higher scores indicate stronger corresponding capabilities. Overall, our extensive evaluations highlight the generality of the proposed method. Across diverse attack settings, in￾cluding specific string-matching attacks, context leakage, and jailbreak attacks, our approach consistently disrupts adversarial prompt injection mechanisms. This broad effec￾tiveness suggests that the de… view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study on the masked token ratio. The left axis (ASR) indicates defense performance, and the right axis (MMVet) demonstrates utility preservation on benign tasks. Ablation on the Masking Ratio [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Results of the sliding-window masking experiment. We used a universal prompt injection image generated by ImgHijack. The target model is Qwen2-VL. The perturbation is constrained to P X = 100. The attack targets a specific string. 0 10 20 30 40 50 60 Embedding Index (Centered) 0.0 0.2 0.4 0.6 0.8 1.0 ASR Value Interval 1 Interval 2 Interval 4 Interval 8 Interval 16 Interval 32 [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 6
Figure 6. Figure 6: Results of the sliding-window masking experiment. We used a universal prompt injection image generated for Jailbreak. The target model is Qwen2-VL. The perturbation is constrained to l∞ = 64. B. Proof We explicitly state the assumptions used in the proof. Assumption B1 (High-Confidence First Token). For the clean input pair (x, yclean), the target LVLM π assigns near-unit probability to a token τ at the fi… view at source ↗
Figure 7
Figure 7. Figure 7: Failure case of DiffPure on MM-Vet v1-38. The upper panel displays the original input versus the purified image. A comparison reveals that DiffPure severely corrupts the text content. The lower panel presents the model outputs for the purified input, the undefended baseline, and our method. DiffPure fails to answer OCR-related questions. In contrast, both our method and the undefended model provide correct… view at source ↗
Figure 8
Figure 8. Figure 8: Failure case of DiffPure on MM-Vet v1-11. The upper panel displays the original input versus the purified image. The lower panel presents the outputs from the purified input, the undefended baseline, and our method. The severe text distortion by DiffPure prevents the LVLM from performing math and spatial reasoning tasks. This failure leads to incorrect responses. In contrast, our method correctly answers t… view at source ↗
read the original abstract

Adversarial images pose a severe security threat to multimodal large language models through prompt injection. Existing defenses largely lack a principled understanding of the underlying mechanisms and struggle to balance efficiency and defense utility. In this work, we show that successful adversarial attacks do not rely on the entire image uniformly but instead depend on a small subset of critical image tokens. Based on this insight, we propose Gradient Token Masking (GTM), which localizes these tokens via gradient analysis and neutralizes them through masking. We find that attribution based on the first generated token's output probability fails when attacks preserve the predicted token. To overcome this, GTM utilizes the Hidden-State Gradient Norm score for generation-influence attribution under adversarial inputs. We prove that its ranking is consistent with that of the full adversarial loss gradient, providing a theoretical guarantee for accurate localization. Our method requires only a single forward-backward pass to identify and zero out a small number of high-scoring tokens, effectively disrupting the adversarial attack path. Extensive experiments on prompt injection and multimodal jailbreak attacks demonstrate that our approach reduces attack success rates (ASR) to near zero while preserving model utility with negligible computational overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Gradient Token Masking (GTM) to defend multimodal LLMs against visual prompt injection and jailbreak attacks. It claims that successful attacks depend on only a small subset of critical image tokens; these are localized via the Hidden-State Gradient Norm score, whose ranking is proved consistent with the full adversarial loss gradient. Masking a small number of high-scoring tokens in a single forward-backward pass is asserted to reduce attack success rate to near zero while preserving clean utility and incurring negligible overhead.

Significance. If the ranking-consistency proof and the empirical neutralization results hold under the stated assumptions, the work would supply an efficient, single-pass defense with a theoretical localization guarantee. This addresses the efficiency-utility trade-off noted for prior defenses and supplies a concrete, falsifiable mechanism (gradient-norm ranking plus masking) that could be tested on additional attack types.

major comments (2)
  1. [Abstract] Abstract and Method: the proof that the Hidden-State Gradient Norm ranking is consistent with the full adversarial loss gradient is asserted but no derivation, assumptions, or intermediate steps are supplied in the available text. This is load-bearing for the central claim of a 'theoretical guarantee for accurate localization.'
  2. [Abstract] Abstract and Method: the number of tokens k to mask (explicitly listed as a free parameter) has no selection rule, fixed value, adaptive criterion, or ablation study described. Because neutralization success and clean-accuracy preservation both depend on this choice, the claimed 'near-zero ASR with negligible overhead' cannot be shown to follow from the ranking consistency alone.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'first generated token's output probability fails when attacks preserve the predicted token' is stated without a concrete counter-example or reference to the failure mode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments identify load-bearing elements of the central claims. We will revise the manuscript to supply the missing derivation and to address the choice of k with additional analysis and experiments.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Method: the proof that the Hidden-State Gradient Norm ranking is consistent with the full adversarial loss gradient is asserted but no derivation, assumptions, or intermediate steps are supplied in the available text. This is load-bearing for the central claim of a 'theoretical guarantee for accurate localization.'

    Authors: We agree that the proof is central and that its absence from the submitted text weakens the claim. The full manuscript contains a brief statement of the result but omits the derivation. In the revision we will insert a dedicated subsection (or appendix) that states the assumptions, provides the intermediate steps, and shows why the ranking of the Hidden-State Gradient Norm is consistent with the ranking induced by the full adversarial loss gradient. revision: yes

  2. Referee: [Abstract] Abstract and Method: the number of tokens k to mask (explicitly listed as a free parameter) has no selection rule, fixed value, adaptive criterion, or ablation study described. Because neutralization success and clean-accuracy preservation both depend on this choice, the claimed 'near-zero ASR with negligible overhead' cannot be shown to follow from the ranking consistency alone.

    Authors: We acknowledge that k is treated as a hyper-parameter without an explicit selection procedure or supporting ablation in the current version. In the revision we will add (i) an ablation study over a range of k values on both attack success rate and clean accuracy, (ii) a simple heuristic rule (e.g., mask the top 1–2 % of tokens or until the gradient-norm score drops below a threshold), and (iii) discussion of how this choice trades off defense strength against utility. These additions will make the empirical claims traceable to concrete choices of k. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on independent gradient ranking proof

full rationale

The paper's core derivation is a claimed mathematical proof that the Hidden-State Gradient Norm produces a ranking consistent with the full adversarial loss gradient; this is presented as a theoretical guarantee independent of any fitted parameters or self-referential definitions. The neutralization step (zeroing a small number of tokens) is described as following from this ranking without evidence in the abstract that the token count k or threshold is chosen via fitting that would make the ASR reduction tautological. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems from the same authors are invoked in the provided text. The method is self-contained against external benchmarks via the single forward-backward pass and empirical ASR/utility results.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only. The central claim rests on the unverified assertion that a small subset of tokens drives attacks and that the gradient-norm ranking matches the full loss gradient.

free parameters (1)
  • number of tokens to mask (k)
    The abstract states a small number of high-scoring tokens are zeroed; the exact count or selection threshold is a tunable choice not specified in the abstract.
axioms (1)
  • domain assumption Hidden-state gradient norm ranking is consistent with full adversarial loss gradient ranking
    Abstract claims this is proven, but the proof is not provided.

pith-pipeline@v0.9.1-grok · 5750 in / 1301 out tokens · 31719 ms · 2026-06-30T12:16:55.566389+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 4 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    URLhttps://arxiv.org/abs/2404.14219. Bagdasaryan, E., Hsieh, T.-Y ., Nassi, B., and Shmatikov, V . Abusing images and sounds for indirect instruc- tion injection in multi-modal llms.arXiv preprint arXiv:2307.10490,

  2. [2]

    Vpi-bench: Visual prompt injection attacks for computer-use agents.arXiv preprint arXiv:2506.02456,

    Cao, T., Lim, B., Liu, Y ., Sui, Y ., Li, Y ., Deng, S., Lu, L., Oo, N., Yan, S., and Hooi, B. Vpi-bench: Visual prompt injection attacks for computer-use agents.arXiv preprint arXiv:2506.02456,

  3. [3]

    MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

    Chen, B., Lyu, X., Yuan, S., Song, J., Shen, H. T., and Gao, L. SafePTR: Token-level jailbreak defense in multimodal LLMs via prune-then-restore mechanism. InThe Thirty-ninth Annual Conference on Neural In- formation Processing Systems, 2025a. URL https: //openreview.net/forum?id=MNSiBGNAvx. Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krish- ...

  4. [4]

    A survey of attacks on large vision-language models: Resources, advances, and future trends, 2024a

    Liu, D., Yang, M., Qu, X., Zhou, P., Cheng, Y ., and Hu, W. A survey of attacks on large vision-language models: Resources, advances, and future trends, 2024a. URL https://arxiv.org/abs/2407.07403. Liu, H., Li, C., Wu, Q., and Lee, Y . J. Visual instruction tun- ing.Advances in neural information processing systems, 36:34892–34916,

  5. [5]

    Ma, F., Zhang, C., Ren, L., Wang, J., Wang, Q., Wu, W., Quan, X., and Song, D

    URL https://openreview.net/forum? id=nc5GgFAvtk. Ma, F., Zhang, C., Ren, L., Wang, J., Wang, Q., Wu, W., Quan, X., and Song, D. Xprompt: Exploring the extreme of prompt tuning. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 11033–11047, 2022a. Ma, K., Xu, Q., Zeng, J., Li, G., Cao, X., and Huang, Q. A tale of...

  6. [6]

    In: CVPR

    Pei, G., Lyu, S., Chen, G., Ma, K., Xu, Q., Sun, Y ., and Huang, Q. Divide and conquer: Heterogeneous noise integration for diffusion-based adversarial purification. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 29268–29277, 2025a. doi: 10.1109/CVPR52734.2025.02725. Pei, G., Ma, K., Sun, Y ., Xu, Q., and Huang, Q. Diffu...

  7. [7]

    net/forum?id=wvFnqVVUhN

    URL https://openreview. net/forum?id=wvFnqVVUhN. Shi, T., Zhu, K., Wang, Z., Jia, Y ., Cai, W., Liang, W., Wang, H., Alzahrani, H., Lu, J., Kawaguchi, K., et al. Promp- tarmor: Simple yet effective prompt injection defenses. arXiv preprint arXiv:2507.15219,

  8. [8]

    Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

    Simonyan, K., Vedaldi, A., and Zisserman, A. Deep in- side convolutional networks: Visualising image clas- sification models and saliency maps.arXiv preprint arXiv:1312.6034,

  9. [9]

    In: CVPR

    Wang, H., Wang, G., and Zhang, H. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. In2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 29947–29957, 2025a. doi: 10.1109/CVPR52734.2025. 02787. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J.,...

  10. [10]

    Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L

    URL https://arxiv.org/ abs/2406.04031. Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: evaluating large multimodal models for integrated capabilities. InProceedings of the 41st International Conference on Machine Learning, pp. 57730–57754,

  11. [11]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    URL https: //arxiv.org/abs/2307.15043. 11 Localization then Neutralization: Gradient-guided Token Suppression against Visual Prompt Injection Attack A. Extended Sliding-Window Masking Experiments 0 50 100 150 200 250 Embedding Index (Centered) 0.0 0.2 0.4 0.6 0.8 1.0ASR Value Interval 1 Interval 2 Interval 4 Interval 8 Interval 16 Interval 32 Figure 5.Res...