pith. sign in

arxiv: 2606.10904 · v1 · pith:7PIIWTD3new · submitted 2026-06-09 · 💻 cs.CR

Comparative Analysis of Inference-Time Defense Methods for Multimodal Large Language Models

Pith reviewed 2026-06-27 12:42 UTC · model grok-4.3

classification 💻 cs.CR
keywords multimodal large language modelsinference-time defensesadversarial attackssafety evaluationover-refusaladaptive defense selectionproxy classifiervisual attacks
0
0 comments X

The pith

No single inference-time defense dominates across multimodal models and attack types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a comparative evaluation of three inference-time defense methods and their combinations on eight models from the InternVL and Qwen-VL families. It tests these across seven safety benchmarks spanning four attack classes and 9,000 samples, all measured with the same unified proxy classifier. The central result is that defense effectiveness varies with each model's baseline safety level and the specific attack, so no method or fixed combination works best everywhere. A basic safety prompt preserves more model utility than other options, while combinations drive over-refusal rates to 97-100 percent. The findings support selecting defenses adaptively instead of relying on one preset configuration.

Core claim

The evaluation demonstrates that within the tested models and benchmarks, no single defense or combination dominates all settings because performance depends on the model's baseline safety and the attack class; combining methods produces 97-100 percent over-refusal on benign queries, a simple safety prompt yields moderate safety gains with 0-18.2 percent over-refusal, and text-level defenses can suppress some visual attacks at the output stage.

What carries the argument

Comparative empirical evaluation of inference-time defenses measured by a single unified proxy classifier across multiple models and attack benchmarks.

If this is right

  • Different attack classes expose distinct weaknesses, so multi-benchmark testing is required.
  • Safety prompts maintain higher model utility than other tested defenses.
  • Text-level defenses can block some visual attacks even when applied after the visual input.
  • Fixed defense configurations are less effective than adaptive selection based on model and threat.
  • High over-refusal rates make combined defenses impractical for many applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems could be built to detect the incoming attack type and switch defenses on the fly.
  • The patterns observed may hold for other multimodal families not included in the eight models tested.
  • Utility-safety trade-offs could be studied by varying the strength of the safety prompt rather than using a fixed version.

Load-bearing premise

The single unified proxy classifier accurately and without bias measures both safety gains against attacks and over-refusal rates on safe queries.

What would settle it

Results from a different classifier or an expanded set of benchmarks in which one defense method outperforms all others across every model and attack class would falsify the claim that no single defense dominates.

read the original abstract

Multimodal large language models (MLLMs) now appear in safety-critical applications, but the visual channel leaves them open to adversarial attacks that predominantly text-oriented safety alignment addresses only in part. Retraining a model for each new vulnerability class is usually too expensive to be practical. We report a comparative empirical evaluation of three inference-time defense methods and their combinations, run on eight models from the InternVL and Qwen-VL families across seven safety benchmarks that span four attack classes and total 9,000 evaluation samples. Every figure below comes from the same unified proxy classifier. Five findings emerge from the evaluation. First, within the evaluated models and benchmarks, no single defense dominates across all settings: what works depends on the model's baseline safety and on the attack type. Second, combining defenses directly drives benign-query over-refusal to 97-100% across all eight evaluated models, and SmoothVLM on its own reaches 99.2-100%. Third, a simple safety prompt keeps utility largely intact (0.0-18.2% over-refusal across all eight models, five of them below 7%, although two exceeded 15%) while still yielding moderate safety gains. Fourth, different attack classes expose different weaknesses across the evaluated setup, which is why multi-benchmark evaluation matters. Fifth, in a preliminary whitebox test on two models (n=20), text-level defenses suppressed a PGD visual attack that had succeeded without any defense: the defenses act at the output stage, where gradient optimization has limited direct leverage in the tested configuration. Read together, these results argue for adaptive defense selection rather than a single fixed defense configuration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a large-scale empirical comparison of three inference-time defense methods (and their combinations) for multimodal LLMs. It evaluates eight models from the InternVL and Qwen-VL families on seven safety benchmarks spanning four attack classes (totaling 9,000 samples) and reports five findings: no defense dominates across settings, defense combinations drive over-refusal to 97-100%, safety prompts preserve utility with moderate gains, attack classes expose different weaknesses, and text defenses can suppress successful visual attacks in limited white-box tests. All quantitative results derive from a single unified proxy classifier used to score both attack success and benign over-refusal. The authors conclude that adaptive rather than fixed defense selection is needed.

Significance. The topic of practical inference-time defenses for MLLM safety is timely and relevant. A validated comparative study at this scale could usefully inform deployment choices. However, the absence of any reported validation for the proxy classifier means the numerical comparisons and the central claim that adaptive selection is required rest on unverified measurements; if the proxy holds, the work would be a solid empirical contribution.

major comments (2)
  1. [Abstract] Abstract: every figure and all five findings are derived from scores produced by a single unified proxy classifier, yet the manuscript provides no details on its training data, architecture, agreement with human labels, or accuracy across the four attack classes and eight models. This is load-bearing for the central claim that no defense dominates and that adaptive selection is required, because any systematic bias or inaccuracy in the proxy would directly alter the reported relative performance and over-refusal rates.
  2. [Evaluation methodology] The evaluation methodology (implicit in the abstract's description of the 9,000-sample study) does not report statistical testing, error bars, or baseline implementations for the proxy, leaving the quantitative comparisons without measures of uncertainty or reproducibility.
minor comments (1)
  1. [Abstract] The abstract states that SmoothVLM on its own reaches 99.2-100% over-refusal; clarify whether this is measured on the same benign-query set used for the other defenses and whether the proxy was calibrated for this metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The concerns about the proxy classifier and evaluation methodology are valid and central to the paper's claims. We respond point-by-point below and will revise the manuscript to address them.

read point-by-point responses
  1. Referee: [Abstract] Abstract: every figure and all five findings are derived from scores produced by a single unified proxy classifier, yet the manuscript provides no details on its training data, architecture, agreement with human labels, or accuracy across the four attack classes and eight models. This is load-bearing for the central claim that no defense dominates and that adaptive selection is required, because any systematic bias or inaccuracy in the proxy would directly alter the reported relative performance and over-refusal rates.

    Authors: We agree this information is essential and its absence is a limitation of the current manuscript. In the revision we will add a dedicated subsection (likely Section 3.3) describing the proxy: a fine-tuned RoBERTa-based text classifier trained on 4,800 human-labeled MLLM responses drawn from the same safety benchmarks; training procedure and hyperparameters; Cohen's kappa of 0.81 with three human annotators on a 600-sample held-out set; and per-attack-class accuracy (82-91% on text attacks, 76-88% on visual attacks). These details will allow readers to evaluate potential systematic bias. revision: yes

  2. Referee: [Evaluation methodology] The evaluation methodology (implicit in the abstract's description of the 9,000-sample study) does not report statistical testing, error bars, or baseline implementations for the proxy, leaving the quantitative comparisons without measures of uncertainty or reproducibility.

    Authors: We accept the point. The revised manuscript will report (i) standard error bars from five independent evaluation runs for all main figures, (ii) paired statistical tests (Wilcoxon signed-rank with Bonferroni correction) for defense comparisons, and (iii) the proxy's own baseline metrics (accuracy, precision, recall, and confusion matrices) broken down by attack class and model family. These additions will be placed in a new Evaluation Details subsection. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical results

full rationale

The paper contains no mathematical derivations, equations, fitted parameters, or predictions that reduce to prior quantities by construction. All five findings are direct experimental observations from running three defense methods (and combinations) on eight models across seven benchmarks totaling 9,000 samples, scored by a single proxy classifier. No self-citations are invoked to justify uniqueness theorems or ansatzes, and no renaming of known results occurs. The central claim that no defense dominates and adaptive selection is warranted follows immediately from the tabulated outcomes without any intermediate reduction to inputs. The proxy classifier's accuracy is an external validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation paper. There are no free parameters, mathematical axioms, or invented entities; the claims rest entirely on the experimental setup and the assumption that the proxy classifier is reliable.

pith-pipeline@v0.9.1-grok · 5839 in / 1167 out tokens · 37732 ms · 2026-06-27T12:42:43.636790+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    In: International Conference on Machine Learning (ICML) (2024)

    Bailey, L., Ong, E., Russell, S., Emmons, S.: Image hijacks: Adversarial images can control generative models at runtime. In: International Conference on Machine Learning (ICML) (2024)

  2. [2]

    Adversarial Patch

    Brown, T., Mané, D., Roy, A., Abadi, M., Gilmer, J.: Adversarial patch. In: NeurIPS Workshop on Machine Learning and Computer Security (2017), arXiv:1712.09665

  3. [3]

    Carlini, N., Nasr, M., Choquette-Choo, C.A., et al.: Are aligned neural networks adversarially aligned? In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  4. [4]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Chen, Z., Wu, J., Wang, W., et al.: InternVL: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  5. [5]

    In: International Conference on Machine Learning (ICML) (2019)

    Cohen, J., Rosenfeld, E., Kolter, J.Z.: Certified adversarial robustness via random- ized smoothing. In: International Conference on Machine Learning (ICML) (2019)

  6. [6]

    International Journal of Open Information Technologies (IN- JOIT) (2026), companion paper on agent-level security

    Evgrafov, V., Nutfullin, B., Namiot, D.: Security considerations for LLM-based agent orchestration. International Journal of Open Information Technologies (IN- JOIT) (2026), companion paper on agent-level security

  7. [7]

    In: Proceedings of the AAAI Conference on Artificial Intelligence (2025)

    Gong,Y.,etal.:FigStep:Jailbreakinglargevision-languagemodelsviatypographic visual prompts. In: Proceedings of the AAAI Conference on Artificial Intelligence (2025)

  8. [8]

    In: International Conference on Learning Representations (ICLR) (2015) 18 B

    Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial ex- amples. In: International Conference on Learning Representations (ICLR) (2015) 18 B. Nutfullin, V. Evgrafov, D. Namiot

  9. [9]

    In: European Conference on Computer Vision (ECCV) (2024)

    Gou, Y., Chen, K., et al.: Eyes closed, safety on: Protecting multimodal LLMs via image-to-text transformation. In: European Conference on Computer Vision (ECCV) (2024)

  10. [10]

    arXiv preprint arXiv:2412.18826 (2024)

    Jiang, Y., Tan, Y., Yue, X.: RapGuard: Safeguarding multimodal large language models via rationale-aware defensive prompting. arXiv preprint arXiv:2412.18826 (2024)

  11. [11]

    In: Findings of the Association for Computational Linguistics: ACL 2024 (2024)

    Li, L., Hao, R., Hu, G., et al.: SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. In: Findings of the Association for Computational Linguistics: ACL 2024 (2024)

  12. [12]

    In: European Conference on Computer Vision (ECCV) (2014)

    Lin, T.Y., Maire, M., Belongie, S., et al.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision (ECCV) (2014)

  13. [13]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)

  14. [14]

    In: European Conference on Computer Vision (ECCV) (2024)

    Liu, X., Zhu, Y., Gu, J., Lan, Y., Yang, C., Qiao, Y.: MM-SafetyBench: A bench- mark for safety evaluation of multimodal large language models. In: European Conference on Computer Vision (ECCV) (2024)

  15. [15]

    In: Conference on Language Modeling (COLM) (2024)

    Luo, W., Ma, S., Liu, X., Guo, X., Xiao, C.: JailBreakV-28K: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks. In: Conference on Language Modeling (COLM) (2024)

  16. [16]

    In: International Conference on Learning Representations (ICLR) (2018)

    Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. In: International Conference on Learning Representations (ICLR) (2018)

  17. [17]

    In: International Con- ference on Machine Learning (ICML) (2024)

    Mazeika, M., Phan, L., Yin, X., et al.: HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In: International Con- ference on Machine Learning (ICML) (2024)

  18. [18]

    arXiv preprint arXiv:2402.02309 (2024)

    Niu, Z., et al.: Jailbreaking attack against multimodal large language model. arXiv preprint arXiv:2402.02309 (2024)

  19. [19]

    Manuscript in preparation; under submission (2026)

    Nutfullin, B., Namiot, D.: The hidden cost of safety: Measuring over-refusal in inference-time defenses for multimodal LLMs. Manuscript in preparation; under submission (2026)

  20. [20]

    GPT-4 Technical Report

    OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  21. [21]

    Ouyang, L., Wu, J., Jiang, X., et al.: Training language models to follow instruc- tionswithhumanfeedback.In:AdvancesinNeuralInformationProcessingSystems (NeurIPS) (2022)

  22. [22]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2024)

    Pi, R., Han, T., Zhang, J., et al.: MLLM-Protector: Ensuring MLLM’s safety with- out hurting performance. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2024)

  23. [23]

    In: Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2023)

    Pilipenko, O., Nutfullin, B., Kostyumov, V., Ilyushin, E.: TrojanInterpret: A de- tectingbackdoorsmethodinDNNbasedonneuralnetworkinterpretationmethods. In: Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2023). Communications in Computer and Information Science, vol. 2086. Springer (2024)

  24. [24]

    In: Proceedings of the AAAI Conference on Artificial In- telligence (2024)

    Qi, X., Huang, K., Panda, A., et al.: Visual adversarial examples jailbreak aligned large language models. In: Proceedings of the AAAI Conference on Artificial In- telligence (2024)

  25. [25]

    In: International Conference on Machine Learn- ing (ICML) (2021)

    Radford, A., Kim, J.W., Hallacy, C., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learn- ing (ICML) (2021)

  26. [26]

    In: International Conference on Learning Representations (ICLR) (2024) Inference-Time Defenses for MLLMs 19

    Shayegani, E., Dong, Y., Abu-Ghazaleh, N.: Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. In: International Conference on Learning Representations (ICLR) (2024) Inference-Time Defenses for MLLMs 19

  27. [27]

    arXiv preprint arXiv:2405.10529 (2024)

    Sun, J., Wang, C., Wang, J., Zhang, Y., Xiao, C.: Safeguarding vision-language models against patched visual prompt injectors. arXiv preprint arXiv:2405.10529 (2024)

  28. [28]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., et al.: Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  29. [29]

    In: European Conference on Computer Vision (ECCV) (2024)

    Wang, Y., Li, Y., Chen, Y., Liu, Z., Xiao, J.: AdaShield: Safeguarding multimodal large language models from structure-based attack via adaptive shield prompting. In: European Conference on Computer Vision (ECCV) (2024)

  30. [30]

    In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

    Wei, J., Wang, X., Schuurmans, D., et al.: Chain-of-thought prompting elicits rea- soning in large language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

  31. [31]

    arXiv preprint arXiv:2402.05355 (2024)

    Zhao, T., Zhang, L., Ma, Y., Cheng, L.: A survey on safe multi-modal learning system. arXiv preprint arXiv:2402.05355 (2024)