Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

Yang Liu; Yani Wang; Yilong Yang; Zhuo Ma; Zhuzhu Wang; Zuobin Ying

arxiv: 2606.01837 · v1 · pith:3R4R44DHnew · submitted 2026-06-01 · 💻 cs.CR

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

Yani Wang , Yilong Yang , Yang Liu , Zhuzhu Wang , Zuobin Ying , Zhuo Ma This is my paper

Pith reviewed 2026-06-28 14:15 UTC · model grok-4.3

classification 💻 cs.CR

keywords jailbreakingmultimodal large language modelscross-modal attacksdistributed semantic recompositionsafety guardrailsbenign inputsharmful outputsutility-safety paradox

0 comments

The pith

Splitting harmful intents into benign text and image parts lets multimodal models recombine them into attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that harmful intents can be decomposed into benign textual and visual primitives using Distributed Semantic Recomposition. These components are then processed by multimodal large language models, which fuse them into harmful outputs during cross-modal inference. This bypasses traditional safety filters that scan for explicit harmful content in inputs. The result is high attack success with low input toxicity on commercial systems. This reveals how the models' reasoning capabilities can be exploited.

Core claim

DSR decomposes harmful intent into a set of benign textual and visual primitives. By exploiting the model's reasoning ability, DSR enables the latent fusion of these seemingly innocent components into harmful outputs during the cross-modal inference phase. This achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate, uncovering a Utility-Safety Paradox in MLLMs.

What carries the argument

Distributed Semantic Recomposition (DSR), a framework that decomposes harmful intent into benign textual and visual primitives for latent fusion during cross-modal inference.

If this is right

DSR achieves superior attack success rates on multiple commercial MLLM pipelines.
Input toxicity remains extremely low or negligible while producing harmful outputs.
The model's instruction-following proficiency facilitates its own cognitive exploitation.
Existing unimodal textual safety guardrails are vulnerable to cross-modal attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Defenses would need to monitor cross-modal reasoning steps rather than inputs alone.
Similar decomposition techniques could apply to other cross-modal tasks beyond attacks.
If models gain better intent detection during fusion, the attack surface would shrink.

Load-bearing premise

Commercial MLLMs will reliably perform the latent fusion of distributed benign primitives into harmful content without safety filters detecting the intent during cross-modal reasoning.

What would settle it

An experiment where the models refuse to fuse the benign components into harmful content or detect the harmful intent before generating outputs.

Figures

Figures reproduced from arXiv: 2606.01837 by Yang Liu, Yani Wang, Yilong Yang, Zhuo Ma, Zhuzhu Wang, Zuobin Ying.

**Figure 1.** Figure 1: Overview of DSR To address this limitation, we propose DSR, a novel harmless cross-modal jailbreak framework. Unlike existing CMA methods, DSR does not directly include harmful content in either image or text inputs. Instead, it decomposes prohibited intent into multiple benign visual primitives and combines them with semantically related prompts. For example, fragmented images such as a black man, a cotto… view at source ↗

**Figure 2.** Figure 2: Overview of the DSR workflow. instructs M𝑜𝑝𝑡 to translate the original adversarial logic 𝑥𝑚𝑎𝑙 into innocuous linguistic structures, rather than enforcing strict vocabulary restrictions. It ensures that the generated prompt remains below the detection threshold of textual guardrails while preserving its underlying guiding semantics. The synthesized prompt 𝑝𝑜𝑝𝑡 operates through two synergistic clauses: Inn… view at source ↗

**Figure 3.** Figure 3: Performance evaluation under different ablation [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Experimental effects of harmful and harmless prompts. Through keeping the role, scene and sensitive object across [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Attribute Ablation Study and Performance Analysis. This figure details the impact of ablating individual attribute [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: System Prompt for Violence and Gore Detection. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 9.** Figure 9: Harmful Prompt Rewriting. Rewriting) elucidates the visual detoxification constraints employed during the safety iteration phase to balance standard guardrail compliance with visual-semantic fidelity [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

read the original abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in content synthesis and autonomous reasoning. Previous safety guardrails are primarily designed for unimodal textual input interception, leaving them vulnerable to cross-modal jailbreak attacks. However, regardless unimodal textual attack or cross-modal jailbreak, typically inclusive part of explicit harmful or sensitive content at the input level, which is called Harm-Bearing. It allow the model's safety filters to detect and block such content easily. To address this limitations, we propose Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreak framework that decomposes harmful intent into a set of benign textual and visual primitives. By exploiting the model's reasoning ability, DSR enables the latent fusion of these seemingly innocent components into harmful outputs during the cross-modal inference phase. Extensive experiments on multiple commercial MLLMs pipelines demonstrate that DSR achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate. Our findings uncover a critical Utility-Safety Paradox in MLLMs, where the model's instruction-following proficiency facilitates its own cognitive exploitation. Content Warning: This paper contains harmful model responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper sketches a cross-modal jailbreak called DSR that splits harmful intent into benign text and image pieces, but the abstract supplies no experiments, numbers, or method details to check whether the recombination actually works.

read the letter

The core idea is to decompose a harmful request into separate benign textual and visual primitives so that input filters see nothing toxic, then rely on the model's cross-modal reasoning to put them back together into a harmful output. This targets a gap in current guardrails that mostly scan for explicit bad content at the input stage.

The paper does flag a plausible weakness: multimodal models are built to integrate information across modalities, so safety that only looks at single-modality toxicity can miss distributed intent. The Utility-Safety Paradox point is stated clearly enough.

Everything else is thin. No attack examples, no list of primitives, no datasets, no success-rate tables, and no description of how the cross-modal fusion is prompted or measured. The claim of superior attack success rates with negligible input toxicity is asserted without any numbers or controls. The key assumption—that the model will reliably recombine the pieces during inference without any intermediate safety check firing—is left unexamined in the text.

The stress-test concern lands directly: if the primitives are truly decoupled, nothing in the description shows why the model's own reasoning steps would not trigger refusal when the pieces start to cohere. Without that mechanism or data, the central result stays unverified.

People working on multimodal safety evaluations might want to see the full experiments if they exist. A reader looking for concrete attack recipes or reproducible results will not find them here.

I would not send this to peer review in its current form. The idea is worth a note if the authors can supply the missing experimental section, but the abstract alone does not justify referee time.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Distributed Semantic Recomposition (DSR), a cross-modal jailbreak framework for Multimodal Large Language Models (MLLMs). Harmful intent is decomposed into benign textual and visual primitives; the model is then expected to perform latent fusion of these components into harmful outputs during cross-modal inference. The abstract claims this yields superior attack success rates while keeping input toxicity extremely low or negligible, and identifies a Utility-Safety Paradox in which instruction-following proficiency enables cognitive exploitation of the model.

Significance. If the reported results are reproducible and the attack generalizes, the work would be significant for exposing a gap in unimodal safety guardrails and for demonstrating how cross-modal reasoning can be exploited without explicit harmful content at input. The emphasis on negligible input toxicity distinguishes it from prior Harm-Bearing attacks and could motivate new evaluation protocols for multimodal safety.

major comments (2)

[Abstract] Abstract: the claim that DSR 'achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate' is presented without any experimental setup, datasets, metrics, baselines, or quantitative results, rendering the central performance claim unevaluable from the manuscript.
[Abstract] Abstract: the method rests on the assumption that commercial MLLMs will reliably recombine distributed benign primitives into harmful content via latent fusion without safety filters intervening during cross-modal reasoning; no mechanism, analysis, or ablation is supplied to show why intermediate reasoning steps remain below refusal thresholds.

minor comments (1)

[Abstract] Abstract contains grammatical issues: 'regardless unimodal' should read 'regardless of unimodal'; 'It allow' should be 'It allows'; 'this limitations' should be 'these limitations'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and propose revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that DSR 'achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate' is presented without any experimental setup, datasets, metrics, baselines, or quantitative results, rendering the central performance claim unevaluable from the manuscript.

Authors: The abstract is a high-level summary of results whose supporting details appear in Sections 4 and 5 of the manuscript, which describe the datasets, metrics (including ASR and toxicity scores), baselines, and quantitative outcomes across multiple commercial MLLMs. The claim is therefore evaluable from the full manuscript. To improve clarity, we will revise the abstract to include a concise reference to the evaluation protocol. revision: partial
Referee: [Abstract] Abstract: the method rests on the assumption that commercial MLLMs will reliably recombine distributed benign primitives into harmful content via latent fusion without safety filters intervening during cross-modal reasoning; no mechanism, analysis, or ablation is supplied to show why intermediate reasoning steps remain below refusal thresholds.

Authors: We agree that the manuscript does not supply a dedicated mechanistic analysis or ablation explaining why safety filters do not trigger on intermediate cross-modal steps. The current work focuses on empirical demonstration of the attack. We will add a new subsection discussing observed failure modes of unimodal guardrails during latent fusion and include a limited ablation on input ordering and modality balance to address this gap. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or self-referential claims

full rationale

The paper proposes an empirical cross-modal jailbreak technique (DSR) that decomposes harmful intents into benign textual/visual primitives and reports experimental attack success rates on commercial MLLMs. No equations, mathematical derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or method description. The central claims rest on observed experimental outcomes (high ASR, low input toxicity) rather than any reduction of results to inputs by construction. This is a standard empirical security paper whose validity is externally falsifiable via replication on the tested models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review limited to abstract; ledger entries are inferred from stated method assumptions only.

axioms (1)

domain assumption MLLMs possess reasoning capabilities that can fuse distributed cross-modal benign inputs into coherent harmful outputs
Central to the DSR mechanism described in the abstract.

invented entities (1)

Distributed Semantic Recomposition (DSR) no independent evidence
purpose: Framework for decomposing harmful intent into benign textual and visual primitives
Newly introduced method in the abstract.

pith-pipeline@v0.9.1-grok · 5747 in / 1202 out tokens · 25020 ms · 2026-06-28T14:15:27.376131+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 13 canonical work pages · 3 internal anchors

[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. 2023. Image hijacks: Adversarial images can control generative models at runtime.arXiv preprint arXiv:2309.00236(2023)

work page arXiv 2023
[3]

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. 2024. The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024(2024), 13590–13618

2024
[4]

Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Ma- hesh Pasupuleti. 2024. Llama guard 3 vision: Safeguarding human-ai image understanding conversations.arXiv preprint arXiv:2411.10414(2024)

work page arXiv 2024
[5]

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah
[6]

Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence45, 9 (2023), 10850–10869

2023
[7]

Yimo Deng and Huangxun Chen. 2023. Divide-and-conquer attack: Harnessing the power of llm to bypass the censorship of text-to-image generation model. arXiv preprint arXiv:2312.071301, 2 (2023)

work page arXiv 2023
[8]

Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. Figstep: Jailbreaking large vision- language models via typographic visual prompts. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23951–23959

2025
[9]

Yuxin Gou, Xiaoning Dong, Qin Li, Shishen Gu, Richang Hong, and Wenbo Hu. 2025. SURE: Safety Understanding and Reasoning Enhancement for Multi- modal Large Language Models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 7563–7604

2025
[10]

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu. 2025. Perception-guided jailbreak against text-to-image models. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 26238– 26247

2025
[11]

Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, and Eunho Yang
[12]

InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025

Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of- Distribution Strategy. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 29937–29946

2025
[13]

Songze Li, Jiameng Cheng, Yiming Li, Xiaojun Jia, and Dacheng Tao. 2025. Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography.CoRRabs/2512.20168 (2025)

work page arXiv 2025
[14]

Xiaoming Li, Xinyu Hou, and Chen Change Loy. 2024. When stylegan meets stable diffusion: a w+ adapter for personalized image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2187–2196

2024
[15]

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. 2024. Mm- safetybench: A benchmark for safety evaluation of multimodal large language models. InEuropean Conference on Computer Vision. Springer, 386–403

2024
[16]

Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. 2022. Diffusion Models for Adversarial Purification. InInternational Conference on Machine Learning (ICML). PMLR, 16805–16827

2022
[17]

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21527–21536

2024
[18]

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr
[19]

Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

Red-teaming the stable diffusion safety filter. arXiv.arXiv preprint arXiv:2210.04610(2022)

work page arXiv 2022
[20]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Ma Teng, Xiaojun Jia, Ranjie Duan, Xinfeng Li, Yihao Huang, Zhixuan Chu, Yang Liu, and Wenqi Ren. 2024. Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models.CoRRabs/2412.05934 (2024)

work page arXiv 2024
[22]

Corban Villa, Shujaat Mirza, and Christina Pöpper. 2025. Exposing the Guardrails: Reverse-Engineering and Jailbreaking Safety Filters in DALL-E Text-to-Image Pipelines. In34th USENIX Security Symposium (USENIX Security 25). 897–916

2025
[23]

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. 2024. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE sym- posium on security and privacy (SP). IEEE, 897–912

2024
[24]

Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. 2025. Distraction is All You Need for Multimodal Large Language Model Jailbreaking. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 9467–9476

2025
[25]

Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, and Sujuan Qin. 2026. Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs.arXiv preprint arXiv:2601.15698 (2026)

work page arXiv 2026
[26]

Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, and He He. 2025. Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949(2025)

work page arXiv 2025
[27]

Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, and Yu Tian. 2026. SafeSteer: A Decoding-level Defense Mecha- nism for Multimodal Large Language Models.arXiv preprint arXiv:2605.11716 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

Chenyu Zhang, Lanjun Wang, Yiwen Ma, Wenhui Li, Guoqing Jin, and Anan Liu
[29]

In Proceedings of the AAAI Conference on Artificial Intelligence, Vol

Reason2attack: Jailbreaking text-to-image models via llm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 36030–36038
[30]

Cong Zhang, Tianze Zhang, Liruo Wang, Ruohui Chen, Wei Li, and Aishan Liu
[31]

InProceedings of the AAAI Conference on Artificial Intelligence

T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model. InProceedings of the AAAI Conference on Artificial Intelligence
[32]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models.Findings of the Association for Computational Linguistics: ACL 2024(2024), 12401–12430. Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition Conference acronym ’XX...

2024
[33]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models.Findings of the Association for Computational Linguistics: ACL 2024(2024), 12401–12430

2024
[34]

Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. 2025. Jailbreaking multimodal large language models via shuffle inconsistency. InProceedings of the IEEE/CVF International Conference on Computer Vision. 2045–2054

2025
[35]

Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. 2025. Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency.CoRRabs/2501.04931 (2025)

work page arXiv 2025
[36]

Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, and Muhao Chen. 2025. OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning.arXiv preprint arXiv:2512.02306(2025). Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang, Zuobin Ying, and Zhuo Ma Benign Inputs, Harmful Outpu...

work page arXiv 2025
[37]

Perform a line-by-line semantic analysis of the input text corpus
[38]

Evaluate each instance against the aforementioned taxonomic criteria
[39]

stabbing

Extract the exact verbatim text of any instance classified as positive for violent/gory content. [Output Specification] Return exclusively an isolated list of the flagged prompts, preserving the original syntax and formatting one entry per line. Omit all conversational wrappers, introductory phrasing, or concluding meta-commentary. If no instances meet th...

2018

[1] [1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Luke Bailey, Euan Ong, Stuart Russell, and Scott Emmons. 2023. Image hijacks: Adversarial images can control generative models at runtime.arXiv preprint arXiv:2309.00236(2023)

work page arXiv 2023

[3] [3]

Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. 2024. The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024(2024), 13590–13618

2024

[4] [4]

Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Ma- hesh Pasupuleti. 2024. Llama guard 3 vision: Safeguarding human-ai image understanding conversations.arXiv preprint arXiv:2411.10414(2024)

work page arXiv 2024

[5] [5]

Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah

[6] [6]

Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence45, 9 (2023), 10850–10869

2023

[7] [7]

Yimo Deng and Huangxun Chen. 2023. Divide-and-conquer attack: Harnessing the power of llm to bypass the censorship of text-to-image generation model. arXiv preprint arXiv:2312.071301, 2 (2023)

work page arXiv 2023

[8] [8]

Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. Figstep: Jailbreaking large vision- language models via typographic visual prompts. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23951–23959

2025

[9] [9]

Yuxin Gou, Xiaoning Dong, Qin Li, Shishen Gu, Richang Hong, and Wenbo Hu. 2025. SURE: Safety Understanding and Reasoning Enhancement for Multi- modal Large Language Models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 7563–7604

2025

[10] [10]

Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu. 2025. Perception-guided jailbreak against text-to-image models. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 26238– 26247

2025

[11] [11]

Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, and Eunho Yang

[12] [12]

InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025

Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of- Distribution Strategy. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 29937–29946

2025

[13] [13]

Songze Li, Jiameng Cheng, Yiming Li, Xiaojun Jia, and Dacheng Tao. 2025. Odysseus: Jailbreaking Commercial Multimodal LLM-integrated Systems via Dual Steganography.CoRRabs/2512.20168 (2025)

work page arXiv 2025

[14] [14]

Xiaoming Li, Xinyu Hou, and Chen Change Loy. 2024. When stylegan meets stable diffusion: a w+ adapter for personalized image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2187–2196

2024

[15] [15]

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. 2024. Mm- safetybench: A benchmark for safety evaluation of multimodal large language models. InEuropean Conference on Computer Vision. Springer, 386–403

2024

[16] [16]

Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. 2022. Diffusion Models for Adversarial Purification. InInternational Conference on Machine Learning (ICML). PMLR, 16805–16827

2022

[17] [17]

Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21527–21536

2024

[18] [18]

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr

[19] [19]

Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

Red-teaming the stable diffusion safety filter. arXiv.arXiv preprint arXiv:2210.04610(2022)

work page arXiv 2022

[20] [20]

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Ma Teng, Xiaojun Jia, Ranjie Duan, Xinfeng Li, Yihao Huang, Zhixuan Chu, Yang Liu, and Wenqi Ren. 2024. Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models.CoRRabs/2412.05934 (2024)

work page arXiv 2024

[22] [22]

Corban Villa, Shujaat Mirza, and Christina Pöpper. 2025. Exposing the Guardrails: Reverse-Engineering and Jailbreaking Safety Filters in DALL-E Text-to-Image Pipelines. In34th USENIX Security Symposium (USENIX Security 25). 897–916

2025

[23] [23]

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. 2024. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE sym- posium on security and privacy (SP). IEEE, 897–912

2024

[24] [24]

Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. 2025. Distraction is All You Need for Multimodal Large Language Model Jailbreaking. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 9467–9476

2025

[25] [25]

Mingyu Yu, Lana Liu, Zhehao Zhao, Wei Wang, and Sujuan Qin. 2026. Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs.arXiv preprint arXiv:2601.15698 (2026)

work page arXiv 2026

[26] [26]

Chen Yueh-Han, Nitish Joshi, Yulin Chen, Maksym Andriushchenko, Rico Angell, and He He. 2025. Monitoring decomposition attacks in llms with lightweight sequential monitors.arXiv preprint arXiv:2506.10949(2025)

work page arXiv 2025

[27] [27]

Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, and Yu Tian. 2026. SafeSteer: A Decoding-level Defense Mecha- nism for Multimodal Large Language Models.arXiv preprint arXiv:2605.11716 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[28] [28]

Chenyu Zhang, Lanjun Wang, Yiwen Ma, Wenhui Li, Guoqing Jin, and Anan Liu

[29] [29]

In Proceedings of the AAAI Conference on Artificial Intelligence, Vol

Reason2attack: Jailbreaking text-to-image models via llm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 36030–36038

[30] [30]

Cong Zhang, Tianze Zhang, Liruo Wang, Ruohui Chen, Wei Li, and Aishan Liu

[31] [31]

InProceedings of the AAAI Conference on Artificial Intelligence

T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model. InProceedings of the AAAI Conference on Artificial Intelligence

[32] [32]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models.Findings of the Association for Computational Linguistics: ACL 2024(2024), 12401–12430. Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition Conference acronym ’XX...

2024

[33] [33]

Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models.Findings of the Association for Computational Linguistics: ACL 2024(2024), 12401–12430

2024

[34] [34]

Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. 2025. Jailbreaking multimodal large language models via shuffle inconsistency. InProceedings of the IEEE/CVF International Conference on Computer Vision. 2045–2054

2025

[35] [35]

Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. 2025. Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency.CoRRabs/2501.04931 (2025)

work page arXiv 2025

[36] [36]

Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, and Muhao Chen. 2025. OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning.arXiv preprint arXiv:2512.02306(2025). Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang, Zuobin Ying, and Zhuo Ma Benign Inputs, Harmful Outpu...

work page arXiv 2025

[37] [37]

Perform a line-by-line semantic analysis of the input text corpus

[38] [38]

Evaluate each instance against the aforementioned taxonomic criteria

[39] [39]

stabbing

Extract the exact verbatim text of any instance classified as positive for violent/gory content. [Output Specification] Return exclusively an isolated list of the flagged prompts, preserving the original syntax and formatting one entry per line. Omit all conversational wrappers, introductory phrasing, or concluding meta-commentary. If no instances meet th...

2018