Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation
Pith reviewed 2026-05-10 18:13 UTC · model grok-4.3
The pith
MLLMs used for content moderation can be evaded by encoding harmful content in human-readable visuals that the models fail to detect or understand.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adversarial smuggling attacks encode harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection in MLLM-based content moderation. The attacks operate through two pathways: perceptual blindness that disrupts text recognition and reasoning blockade that inhibits semantic understanding despite successful recognition. On the SmuggleBench benchmark of 1,700 instances, both proprietary and open-source state-of-the-art MLLMs exhibit attack success rates exceeding 90 percent. Analysis identifies the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples as the root causes.
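The abstract does not spell out how ASR is computed; under the usual convention for moderation benchmarks, an attack succeeds when the moderator fails to flag a harmful instance, so ASR is the fraction of instances that slip through. A minimal sketch (the function and the example counts are illustrative, not taken from the paper):

```python
def attack_success_rate(verdicts):
    """ASR over a list of moderator verdicts for harmful instances.

    Each verdict is True if the model correctly flagged the instance
    as harmful; an attack succeeds when the flag is missed.
    """
    if not verdicts:
        raise ValueError("no verdicts")
    misses = sum(1 for flagged in verdicts if not flagged)
    return misses / len(verdicts)

# Illustrative: if only 17 of 1,700 smuggled instances were flagged,
# ASR would be 1683 / 1700 = 0.99
print(attack_success_rate([True] * 17 + [False] * 1683))
```

Under this convention, an ASR above 90 percent means the moderator misses more than nine in ten smuggled instances.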
What carries the argument
Adversarial smuggling attacks exploit the human-AI capability gap by placing harmful content in visual formats readable by people but not by models; they are executed via two pathways, perceptual blindness and reasoning blockade.
If this is right
- State-of-the-art MLLMs cannot serve as reliable standalone moderators against visual smuggling attempts.
- Improvements to vision encoders and OCR systems are required to reduce the identified perceptual and reasoning failures.
- Training data must include more domain-specific adversarial examples to address the scarcity problem.
- Test-time methods such as chain-of-thought reasoning and adversarial training via supervised fine-tuning can lower success rates of these attacks.
Where Pith is reading between the lines
- Moderation pipelines may need to combine MLLMs with separate text-extraction tools or human review for visual content.
- The same smuggling approach could undermine safety filters in other multimodal tasks such as image generation or captioning.
- Future work should test whether scaling model size alone closes the gap or whether architectural changes to vision components are necessary.
Load-bearing premise
That the smuggling instances in the benchmark represent realistic attempts to hide harmful content, and that the identified root causes drive the vulnerability rather than being artifacts of how the benchmark was built.
What would settle it
A drop in attack success rates below 50 percent on the same benchmark after models receive targeted training on smuggling examples or after their vision encoders and OCR components are strengthened to close the human-AI gap.
Original abstract
Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces adversarial smuggling attacks on MLLMs for content moderation, which encode harmful content into human-readable visual formats that evade AI detection due to the human-AI capability gap. Attacks are partitioned into perceptual blindness (disrupting text recognition) and reasoning blockade (inhibiting semantic understanding). The authors construct SmuggleBench, a benchmark of 1,700 instances, report attack success rates (ASR) exceeding 90% on proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) models, identify three root causes (vision encoder limits, OCR robustness gaps, scarcity of domain-specific examples), and preliminarily explore mitigations via chain-of-thought (CoT) test-time scaling and supervised fine-tuning (SFT). Code is released publicly.
Significance. If the SmuggleBench instances faithfully represent real-world harmful-content smuggling attempts, the results would demonstrate a practically significant vulnerability in deployed MLLM moderation systems, highlighting risks from the human-AI perceptual gap. The public code release and empirical evaluations on held-out models support independent verification and extension. The work is empirically grounded rather than circular, but its broader impact depends on whether the reported root causes and high ASRs generalize beyond the specific benchmark design.
Major comments (3)
- [§3] §3 (SmuggleBench Construction): The manuscript provides insufficient detail on the data selection criteria, visual encoding templates, and diversity controls used to generate the 1,700 instances. Without explicit documentation of how these instances were chosen to reflect realistic human-readable harmful content (as opposed to encodings tuned to known model weaknesses), it is unclear whether the >90% ASR on GPT-5 and Qwen3-VL reflects fundamental model limitations or benchmark artifacts, directly undermining the central vulnerability claim.
- [§4.2] §4.2 (Root Cause Analysis): The attribution of failures to the three root causes (vision encoder limits, OCR robustness gap, scarcity of domain-specific examples) is presented without quantitative ablation studies, controlled experiments, or statistical tests isolating each factor's contribution. This leaves open the possibility that other aspects of attack generation or model prompting drive the results, weakening the explanatory power of the root-cause analysis.
- [§4.1] §4.1 (Evaluation Protocol): The reported ASR figures lack accompanying details on variance across multiple runs, confidence intervals, or controls for prompt sensitivity and data partitioning. Given that the headline result (ASR >90% across models) is load-bearing for the threat assessment, these omissions make it difficult to evaluate the reliability and reproducibility of the vulnerability findings.
Minor comments (2)
- [Figure 2] Figure 2 (Attack Taxonomy): The visual distinction between perceptual blindness and reasoning blockade pathways would be clearer with explicit arrows or annotations linking example images to the two categories.
- [§2] Related Work: The discussion of prior adversarial attacks on MLLMs could be expanded to include more recent work on vision-language jailbreaks and OCR robustness to better situate the novelty of smuggling attacks.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to enhance transparency, rigor, and reproducibility of our findings on adversarial smuggling attacks.
Point-by-point responses
Referee: [§3] §3 (SmuggleBench Construction): The manuscript provides insufficient detail on the data selection criteria, visual encoding templates, and diversity controls used to generate the 1,700 instances. Without explicit documentation of how these instances were chosen to reflect realistic human-readable harmful content (as opposed to encodings tuned to known model weaknesses), it is unclear whether the >90% ASR on GPT-5 and Qwen3-VL reflects fundamental model limitations or benchmark artifacts, directly undermining the central vulnerability claim.
Authors: We appreciate the referee's emphasis on benchmark transparency. In the revised manuscript, we will expand §3 with explicit documentation: data selection criteria will detail sourcing from ethically filtered public harmful-content corpora, human-readability validation via pilot studies, and sampling to cover diverse categories (violence, hate, misinformation) with balanced difficulty. Visual encoding templates will be fully specified, including parameters for text rendering, layout variations, and image manipulations. Diversity controls will include quantitative distributions across content types, visual styles, and encoding complexities, plus controls to avoid over-optimization for known weaknesses (e.g., via held-out model testing during construction). These additions will clarify that high ASRs reflect the human-AI gap rather than artifacts. revision: yes
Referee: [§4.2] §4.2 (Root Cause Analysis): The attribution of failures to the three root causes (vision encoder limits, OCR robustness gap, scarcity of domain-specific examples) is presented without quantitative ablation studies, controlled experiments, or statistical tests isolating each factor's contribution. This leaves open the possibility that other aspects of attack generation or model prompting drive the results, weakening the explanatory power of the root-cause analysis.
Authors: We acknowledge that the root-cause analysis would benefit from greater quantification. Our current attribution draws from error pattern analysis and cross-model comparisons, but we agree formal ablations are needed. In revision, we will add controlled experiments: proxy comparisons of vision-encoder performance on encoded vs. clean inputs; OCR accuracy metrics on attack instances versus baselines; and ASR results from models with varying domain-specific training data volumes. Statistical tests (e.g., paired t-tests or ANOVA) will isolate factor contributions where possible. For proprietary models, full internal ablations are limited by API constraints, but we will report all feasible evidence and discuss these boundaries explicitly. revision: partial
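The paired comparison the authors promise here can be sketched with the standard library alone; the helper and the per-model accuracies below are illustrative placeholders, not the paper's data. For each model we pair OCR accuracy on clean renderings with accuracy on adversarially encoded versions of the same text, then compute the mean drop and a paired t statistic:

```python
import math
import statistics

def paired_t(clean, attacked):
    """Paired t statistic for per-model OCR accuracy drops.

    clean[i] and attacked[i] are accuracies for the same model on the
    same texts, rendered plainly vs. adversarially encoded.
    """
    diffs = [c - a for c, a in zip(clean, attacked)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)    # sample std dev of the differences
    t = mean / (sd / math.sqrt(n))  # compare against t distribution, df = n - 1
    return mean, t

# Illustrative per-model OCR accuracies (clean vs. adversarial)
clean = [0.97, 0.95, 0.93, 0.96, 0.94]
attacked = [0.41, 0.38, 0.35, 0.44, 0.40]
mean_drop, t_stat = paired_t(clean, attacked)
print(f"mean drop {mean_drop:.3f}, t = {t_stat:.1f}")
```

A large t on the clean-versus-attacked pairing would isolate the OCR factor; the analogous pairings for encoder resolution and training-data volume would isolate the other two claimed root causes.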
Referee: [§4.1] §4.1 (Evaluation Protocol): The reported ASR figures lack accompanying details on variance across multiple runs, confidence intervals, or controls for prompt sensitivity and data partitioning. Given that the headline result (ASR >90% across models) is load-bearing for the threat assessment, these omissions make it difficult to evaluate the reliability and reproducibility of the vulnerability findings.
Authors: We agree that additional statistical and procedural details are required to substantiate the headline ASR results. In the revised manuscript, we will report ASR with standard deviations across multiple independent runs (minimum three seeds), 95% confidence intervals for all key metrics, and explicit controls for prompt sensitivity by evaluating a range of moderation prompt variants with reported ASR ranges. Data partitioning will be detailed, including how the 1,700 instances were split for mitigation experiments and verification of no leakage. These enhancements will allow rigorous assessment of result reliability and support reproducibility. revision: yes
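The error bars promised above can be computed with a normal-approximation interval over per-seed ASR values; a stdlib sketch with illustrative numbers (not the paper's results):

```python
import math
import statistics

def asr_with_ci(per_seed_asr, z=1.96):
    """Mean ASR across seeds with an approximate 95% CI.

    Uses a normal approximation over per-seed ASR values; with only a
    handful of seeds a t-based or bootstrap interval is safer.
    """
    mean = statistics.mean(per_seed_asr)
    se = statistics.stdev(per_seed_asr) / math.sqrt(len(per_seed_asr))
    return mean, (mean - z * se, mean + z * se)

# Illustrative ASRs from three independent evaluation runs
mean, (lo, hi) = asr_with_ci([0.92, 0.94, 0.91])
print(f"ASR {mean:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

With the three-seed minimum the authors commit to, the same per-seed values also support the reported standard deviations directly.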
Circularity Check
No circularity: empirical benchmark evaluations on external models
Full rationale
The paper introduces Adversarial Smuggling Attacks as a new threat category and constructs SmuggleBench (1,700 instances) to measure Attack Success Rates directly on held-out models including GPT-5 and Qwen3-VL. The central claims (ASR >90%, three root causes) are experimental measurements and post-hoc analysis of those measurements, not derivations that reduce to author-fitted parameters or self-citations by construction. No equations appear in the provided text, and the code release enables independent reproduction. This is a standard non-circular empirical contribution; the skeptic concern about benchmark representativeness is a validity question, not a circularity issue.