pith. machine review for the scientific record. sign in

arxiv: 2604.06950 · v2 · submitted 2026-04-08 · 💻 cs.CV

Recognition: no theorem link

Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords adversarial smugglingMLLM content moderationperceptual blindnessreasoning blockadeSmuggleBenchvision encoderOCR robustnessmultimodal attacks
0
0 comments X

The pith

MLLMs used for content moderation can be evaded by encoding harmful content in human-readable visuals that the models fail to detect or understand.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper demonstrates that multimodal large language models deployed as automated content moderators are open to adversarial smuggling attacks. These attacks present harmful material in visual formats that humans can read but the models cannot process effectively, allowing the content to slip through filters. The authors introduce two attack types, perceptual blindness that breaks text recognition and reasoning blockade that blocks understanding even after text is read, and back this up with a new benchmark of 1,700 instances. Evaluations show attack success rates above 90 percent on leading models including GPT-5 and Qwen3-VL. The work traces the failures to limits in vision encoders, gaps in OCR robustness, and a lack of relevant training examples, while testing initial mitigation ideas such as chain-of-thought prompting and supervised fine-tuning.

Core claim

Adversarial smuggling attacks encode harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection in MLLM-based content moderation. The attacks operate through two pathways: perceptual blindness that disrupts text recognition and reasoning blockade that inhibits semantic understanding despite successful recognition. On the SmuggleBench benchmark of 1,700 instances, both proprietary and open-source state-of-the-art MLLMs exhibit attack success rates exceeding 90 percent. Analysis identifies the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples as the root causes.

What carries the argument

Adversarial smuggling attacks, which exploit the human-AI capability gap by placing harmful content in visual formats readable by people but not by models, executed via perceptual blindness and reasoning blockade.

If this is right

  • State-of-the-art MLLMs cannot serve as reliable standalone moderators against visual smuggling attempts.
  • Improvements to vision encoders and OCR systems are required to reduce the identified perceptual and reasoning failures.
  • Training data must include more domain-specific adversarial examples to address the scarcity problem.
  • Test-time methods such as chain-of-thought reasoning and adversarial training via supervised fine-tuning can lower success rates of these attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Moderation pipelines may need to combine MLLMs with separate text-extraction tools or human review for visual content.
  • The same smuggling approach could undermine safety filters in other multimodal tasks such as image generation or captioning.
  • Future work should test whether scaling model size alone closes the gap or whether architectural changes to vision components are necessary.

Load-bearing premise

The smuggling instances in the benchmark represent realistic attempts to hide harmful content and that the identified root causes drive the vulnerability rather than being artifacts of how the benchmark was built.

What would settle it

A drop in attack success rates below 50 percent on the same benchmark after models receive targeted training on smuggling examples or after their vision encoders and OCR components are strengthened to close the human-AI gap.

Figures

Figures reproduced from arXiv: 2604.06950 by Bing Li, Bo Li, Chunfeng Yuan, Jianing Zhang, Jun Gao, Weiming Hu, Xiaolei Lv, Yuntong Pan, Zhiheng Li, Ziqi Zhang, Zongyang Ma.

Figure 1
Figure 1. Figure 1: A typical example of Adversarial Smug￾gling Attacks (ASA). While the AI moderator is blinded by the benign visual texture (classifying it as a “Safe Forest”), the human user immediately recognizes the hidden violent harmful content (“KILL ALL”). guage Models (MLLMs) (Achiam et al., 2023; Gemini Team et al., 2023; Anthropic, 2025) have become the cornerstone of automated content mod￾eration, widely deployed… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of adversarial attack types against MLLMs. (A) Adversarial Perturbations use imperceptible noise to induce misclassification ("Make MLLM Dumb"). (B) Adversarial Jailbreaks employ explicit malicious instructions to override safety guardrails ("Make MLLM Bad"). (C) Adversarial Smuggling embeds harmful content into benign visual carriers (e.g., latte art), exploiting the Human-AI perception gap to … view at source ↗
Figure 3
Figure 3. Figure 3: Two Attack pathways of Adversarial Smug￾gling Attacks (ASA). vulnerability, leaving robust defense as an impera￾tive for future work. In summary, our main contributions are as fol￾lows: • We formally identify a new adversarial threat in content moderation scenarios: Adversarial Smuggling Attacks (ASA)and categorize ASA into two attack pathways : Perceptual Blindness and Reasoning Blockade. • We systematize… view at source ↗
Figure 4
Figure 4. Figure 4: The construction pipeline of SMUG￾GLEBENCH. (A) Data-driven taxonomy discovery via clustering. (B) Hybrid data curation combining in-the￾wild collection and automated synthesis. 3.1 Data-Driven Taxonomy Construction To ensure our benchmark reflects real-world threats, we derived our taxonomy through a data-driven dis￾covery pipeline (illustrated in [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the 9 adversarial smuggling techniques defined in SMUGGLEBENCH. In each panel, the harmful keyword "KILL/KILL ALL" serves as a demonstrative placeholder hidden via distinct smuggling techniques. In practice, it can be substituted with arbitrary harmful content. sis specifically for Low Contrast and AI Illusions. These techniques rely on precise manipulation of visual thresholds to induce percep… view at source ↗
Figure 6
Figure 6. Figure 6: The standard system prompt used for calculating ASR and TER metrics in the main evaluation. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The detailed System Prompt used for the Chain-of-Thought (CoT) defense mechanism. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Overview of the Automated Data Synthesis Pipelines. (A) AI Illusion Generation Pipeline: Illustrates the process of using ControlNet and Stable Diffusion to inject structural patterns into natural scenes via latent denoising. (B) Low-Contrast Synthesis Pipeline: Demonstrates the pixel-level manipulation for embedding text via adaptive alpha blending and structure-aware Voronoi noise. Hyperparameter Value B… view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces adversarial smuggling attacks on MLLMs for content moderation, which encode harmful content into human-readable visual formats that evade AI detection due to the human-AI capability gap. Attacks are partitioned into perceptual blindness (disrupting text recognition) and reasoning blockade (inhibiting semantic understanding). The authors construct SmuggleBench, a benchmark of 1,700 instances, report attack success rates (ASR) exceeding 90% on proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) models, identify three root causes (vision encoder limits, OCR robustness gaps, scarcity of domain-specific examples), and preliminarily explore mitigations via chain-of-thought (CoT) test-time scaling and supervised fine-tuning (SFT). Code is released publicly.

Significance. If the SmuggleBench instances faithfully represent real-world harmful-content smuggling attempts, the results would demonstrate a practically significant vulnerability in deployed MLLM moderation systems, highlighting risks from the human-AI perceptual gap. The public code release and empirical evaluations on held-out models support independent verification and extension. The work is empirically grounded rather than circular, but its broader impact depends on whether the reported root causes and high ASRs generalize beyond the specific benchmark design.

major comments (3)
  1. [§3] §3 (SmuggleBench Construction): The manuscript provides insufficient detail on the data selection criteria, visual encoding templates, and diversity controls used to generate the 1,700 instances. Without explicit documentation of how these instances were chosen to reflect realistic human-readable harmful content (as opposed to encodings tuned to known model weaknesses), it is unclear whether the >90% ASR on GPT-5 and Qwen3-VL reflects fundamental model limitations or benchmark artifacts, directly undermining the central vulnerability claim.
  2. [§4.2] §4.2 (Root Cause Analysis): The attribution of failures to the three root causes (vision encoder limits, OCR robustness gap, scarcity of domain-specific examples) is presented without quantitative ablation studies, controlled experiments, or statistical tests isolating each factor's contribution. This leaves open the possibility that other aspects of attack generation or model prompting drive the results, weakening the explanatory power of the root-cause analysis.
  3. [§4.1] §4.1 (Evaluation Protocol): The reported ASR figures lack accompanying details on variance across multiple runs, confidence intervals, or controls for prompt sensitivity and data partitioning. Given that the headline result (ASR >90% across models) is load-bearing for the threat assessment, these omissions make it difficult to evaluate the reliability and reproducibility of the vulnerability findings.
minor comments (2)
  1. [Figure 2] Figure 2 (Attack Taxonomy): The visual distinction between perceptual blindness and reasoning blockade pathways would be clearer with explicit arrows or annotations linking example images to the two categories.
  2. [§2] Related Work: The discussion of prior adversarial attacks on MLLMs could be expanded to include more recent work on vision-language jailbreaks and OCR robustness to better situate the novelty of smuggling attacks.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to enhance transparency, rigor, and reproducibility of our findings on adversarial smuggling attacks.

read point-by-point responses
  1. Referee: [§3] §3 (SmuggleBench Construction): The manuscript provides insufficient detail on the data selection criteria, visual encoding templates, and diversity controls used to generate the 1,700 instances. Without explicit documentation of how these instances were chosen to reflect realistic human-readable harmful content (as opposed to encodings tuned to known model weaknesses), it is unclear whether the >90% ASR on GPT-5 and Qwen3-VL reflects fundamental model limitations or benchmark artifacts, directly undermining the central vulnerability claim.

    Authors: We appreciate the referee's emphasis on benchmark transparency. In the revised manuscript, we will expand §3 with explicit documentation: data selection criteria will detail sourcing from ethically filtered public harmful-content corpora, human-readability validation via pilot studies, and sampling to cover diverse categories (violence, hate, misinformation) with balanced difficulty. Visual encoding templates will be fully specified, including parameters for text rendering, layout variations, and image manipulations. Diversity controls will include quantitative distributions across content types, visual styles, and encoding complexities, plus controls to avoid over-optimization for known weaknesses (e.g., via held-out model testing during construction). These additions will clarify that high ASRs reflect the human-AI gap rather than artifacts. revision: yes

  2. Referee: [§4.2] §4.2 (Root Cause Analysis): The attribution of failures to the three root causes (vision encoder limits, OCR robustness gap, scarcity of domain-specific examples) is presented without quantitative ablation studies, controlled experiments, or statistical tests isolating each factor's contribution. This leaves open the possibility that other aspects of attack generation or model prompting drive the results, weakening the explanatory power of the root-cause analysis.

    Authors: We acknowledge that the root-cause analysis would benefit from greater quantification. Our current attribution draws from error pattern analysis and cross-model comparisons, but we agree formal ablations are needed. In revision, we will add controlled experiments: proxy comparisons of vision-encoder performance on encoded vs. clean inputs; OCR accuracy metrics on attack instances versus baselines; and ASR results from models with varying domain-specific training data volumes. Statistical tests (e.g., paired t-tests or ANOVA) will isolate factor contributions where possible. For proprietary models, full internal ablations are limited by API constraints, but we will report all feasible evidence and discuss these boundaries explicitly. revision: partial

  3. Referee: [§4.1] §4.1 (Evaluation Protocol): The reported ASR figures lack accompanying details on variance across multiple runs, confidence intervals, or controls for prompt sensitivity and data partitioning. Given that the headline result (ASR >90% across models) is load-bearing for the threat assessment, these omissions make it difficult to evaluate the reliability and reproducibility of the vulnerability findings.

    Authors: We agree that additional statistical and procedural details are required to substantiate the headline ASR results. In the revised manuscript, we will report ASR with standard deviations across multiple independent runs (minimum three seeds), 95% confidence intervals for all key metrics, and explicit controls for prompt sensitivity by evaluating a range of moderation prompt variants with reported ASR ranges. Data partitioning will be detailed, including how the 1,700 instances were split for mitigation experiments and verification of no leakage. These enhancements will allow rigorous assessment of result reliability and support reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluations on external models

full rationale

The paper introduces Adversarial Smuggling Attacks as a new threat category and constructs SmuggleBench (1700 instances) to measure Attack Success Rates directly on held-out models including GPT-5 and Qwen3-VL. The central claims (ASR >90%, three root causes) are experimental measurements and post-hoc analysis of those measurements, not derivations that reduce to author-fitted parameters or self-citations by construction. No equations appear in the provided text, and the code release enables independent reproduction. This is a standard non-circular empirical contribution; the skeptic concern about benchmark representativeness is a validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical attack construction and model evaluation rather than mathematical derivation; no explicit free parameters, axioms, or invented entities are introduced beyond standard assumptions in adversarial ML research.

pith-pipeline@v0.9.0 · 5609 in / 1142 out tokens · 40093 ms · 2026-05-10T18:13:38.138283+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1]

    BERTopic: Neural topic modeling with a class-based TF-IDF procedure

    Not what you’ve signed up for: Compromis- ing real-world llm-integrated applications with indi- rect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90. Maarten Grootendorst. 2022. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. Preprint, arXiv:2203.05794. Aaron Hurst, Adam Le...

  2. [2]

    InPro- ceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

    VLM as policy: Common-law content mod- eration framework for short video platform. InPro- ceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 4682–4693. Siyuan Ma, Weidi Luo, Yu Wang, and Xiaogeng Liu

  3. [3]

    Visual-roleplay: Universal jailbreak attack on multimodal large language models via role-playing image character.arXiv preprint arXiv:2405.20773, 2024

    Visual-roleplay: Universal jailbreak attack on multimodal large language models via role-playing image character.arXiv preprint arXiv:2405.20773. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. InInternational Conference on Learning Representatio...

  4. [4]

    hdbscan: Hierarchical density based clustering. J. Open Source Softw., 2(11):205. Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and pro- jection for dimension reduction.arXiv preprint arXiv:1802.03426. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Ag...

  5. [5]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Aligning large multimodal models with factu- ally augmented rlhf. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13088–13110. Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- bert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023. Llama 2: Open foun- dat...

  6. [6]

    jailbreaking

    and Gemini (Comanici et al., 2025), as well as open-source frameworks like LLaV A (Li et al., 2024a) and Qwen-VL, predominantly adopt a modular architecture. This paradigm typi- cally integrates a pre-trained vision encoder (e.g., ViT(Dosovitskiy, 2020) or SigLIP(Zhai et al., 2023)) with a Large Language Model (LLM) through a lightweight projection module...

  7. [7]

    visual key

    demonstrated that optimizing visual adver- sarial examples can bypass textual safety filters, effectively acting as a "visual key" to unlock harm- ful model behaviors. Furthermore, recent works have explored bi-modal adversarial optimization (Yi et al., 2024; Ma et al., 2024; Li et al., 2024b; Chen et al., 2025; Feng et al., 2025; Ying et al., 2025), wher...

  8. [8]

    It uses a con- cise two-step logic to decouple perception (OCR) from reasoning (Violation Check)

    Standard Evaluation Prompt (Prompt 6): Employed in our main experiments (Section 4.2) to quantify ASR and TER. It uses a con- cise two-step logic to decouple perception (OCR) from reasoning (Violation Check)

  9. [9]

    A bouquet of wildflowers tied with placed in a jar

    CoT Defense Prompt (Prompt 7):Em- ployed specifically for the defense strategy analysis (Section 4.3.2), utilizing a granu- lar four-step reasoning process to maximize safety enforcement. B.2 SFT Training Configurations We implemented full-parameter fine-tuning on Qwen2.5-VL-7B-Instructusing the LLaMA- Factory(Zheng et al., 2024) framework. The train- Nat...

  10. [10]

    As shown in Table 6, even flagship models likeGPT-5and Claude-Sonnet-4.5struggle significantly in these categories

    Visual Encoder Bottleneck.Tasks such as Tiny Text(/compress) andAI Illusions( ♂¶agic) exploit the in- herent resolution limits and semantic loss of vi- sual encoders (e.g., CLIP, SigLIP). As shown in Table 6, even flagship models likeGPT-5and Claude-Sonnet-4.5struggle significantly in these categories. This indicates that current visual en- coders compres...

  11. [11]

    While models are proficient at reading clean digital text, they fail to generalize to text that is partially obstructed or stylistically distorted

    Insufficient OCR Robustness.The high suc- cess rates inOccluded Text( /eye-slash),Handwritten Style( ♂pen-fancy), andArtistic Text( ♂palette) highlight a lack of OCR robustness in noisy or non-standard scenarios. While models are proficient at reading clean digital text, they fail to generalize to text that is partially obstructed or stylistically distort...

  12. [12]

    Absence of Adversarial Knowledge.The vul- nerabilities exposed in Group B (Reasoning Block- ade), particularly inSemantic Camouflage( ♂¶ask) andVisual Puzzles( ♂puzzle-piece), point to a critical lack of adversarial knowledge. Even models equipped with advanced Chain-of-Thought (CoT) capabil- ities, such as theQwen3-VL-Thinkingseries, achieve high ASRs (e...