Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition
Pith reviewed 2026-06-28 14:15 UTC · model grok-4.3
The pith
Splitting harmful intents into benign text and image parts lets multimodal models recombine them into attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DSR decomposes harmful intent into a set of benign textual and visual primitives. By exploiting the model's reasoning ability, DSR enables the latent fusion of these seemingly innocent components into harmful outputs during the cross-modal inference phase. This achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate, uncovering a Utility-Safety Paradox in MLLMs.
What carries the argument
Distributed Semantic Recomposition (DSR), a framework that decomposes harmful intent into benign textual and visual primitives for latent fusion during cross-modal inference.
If this is right
- DSR achieves superior attack success rates on multiple commercial MLLM pipelines.
- Input toxicity remains extremely low or negligible while producing harmful outputs.
- The model's instruction-following proficiency facilitates its own cognitive exploitation.
- Existing unimodal textual safety guardrails are vulnerable to cross-modal attacks.
Where Pith is reading between the lines
- Defenses would need to monitor cross-modal reasoning steps rather than inputs alone.
- Similar decomposition techniques could apply to other cross-modal tasks beyond attacks.
- If models gain better intent detection during fusion, the attack surface would shrink.
Load-bearing premise
Commercial MLLMs will reliably perform the latent fusion of distributed benign primitives into harmful content without safety filters detecting the intent during cross-modal reasoning.
What would settle it
An experiment where the models refuse to fuse the benign components into harmful content or detect the harmful intent before generating outputs.
Figures
read the original abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in content synthesis and autonomous reasoning. Previous safety guardrails are primarily designed for unimodal textual input interception, leaving them vulnerable to cross-modal jailbreak attacks. However, regardless unimodal textual attack or cross-modal jailbreak, typically inclusive part of explicit harmful or sensitive content at the input level, which is called Harm-Bearing. It allow the model's safety filters to detect and block such content easily. To address this limitations, we propose Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreak framework that decomposes harmful intent into a set of benign textual and visual primitives. By exploiting the model's reasoning ability, DSR enables the latent fusion of these seemingly innocent components into harmful outputs during the cross-modal inference phase. Extensive experiments on multiple commercial MLLMs pipelines demonstrate that DSR achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate. Our findings uncover a critical Utility-Safety Paradox in MLLMs, where the model's instruction-following proficiency facilitates its own cognitive exploitation. Content Warning: This paper contains harmful model responses.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Distributed Semantic Recomposition (DSR), a cross-modal jailbreak framework for Multimodal Large Language Models (MLLMs). Harmful intent is decomposed into benign textual and visual primitives; the model is then expected to perform latent fusion of these components into harmful outputs during cross-modal inference. The abstract claims this yields superior attack success rates while keeping input toxicity extremely low or negligible, and identifies a Utility-Safety Paradox in which instruction-following proficiency enables cognitive exploitation of the model.
Significance. If the reported results are reproducible and the attack generalizes, the work would be significant for exposing a gap in unimodal safety guardrails and for demonstrating how cross-modal reasoning can be exploited without explicit harmful content at input. The emphasis on negligible input toxicity distinguishes it from prior Harm-Bearing attacks and could motivate new evaluation protocols for multimodal safety.
major comments (2)
- [Abstract] Abstract: the claim that DSR 'achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate' is presented without any experimental setup, datasets, metrics, baselines, or quantitative results, rendering the central performance claim unevaluable from the manuscript.
- [Abstract] Abstract: the method rests on the assumption that commercial MLLMs will reliably recombine distributed benign primitives into harmful content via latent fusion without safety filters intervening during cross-modal reasoning; no mechanism, analysis, or ablation is supplied to show why intermediate reasoning steps remain below refusal thresholds.
minor comments (1)
- [Abstract] Abstract contains grammatical issues: 'regardless unimodal' should read 'regardless of unimodal'; 'It allow' should be 'It allows'; 'this limitations' should be 'these limitations'.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below and propose revisions where appropriate to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that DSR 'achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate' is presented without any experimental setup, datasets, metrics, baselines, or quantitative results, rendering the central performance claim unevaluable from the manuscript.
Authors: The abstract is a high-level summary of results whose supporting details appear in Sections 4 and 5 of the manuscript, which describe the datasets, metrics (including ASR and toxicity scores), baselines, and quantitative outcomes across multiple commercial MLLMs. The claim is therefore evaluable from the full manuscript. To improve clarity, we will revise the abstract to include a concise reference to the evaluation protocol. revision: partial
-
Referee: [Abstract] Abstract: the method rests on the assumption that commercial MLLMs will reliably recombine distributed benign primitives into harmful content via latent fusion without safety filters intervening during cross-modal reasoning; no mechanism, analysis, or ablation is supplied to show why intermediate reasoning steps remain below refusal thresholds.
Authors: We agree that the manuscript does not supply a dedicated mechanistic analysis or ablation explaining why safety filters do not trigger on intermediate cross-modal steps. The current work focuses on empirical demonstration of the attack. We will add a new subsection discussing observed failure modes of unimodal guardrails during latent fusion and include a limited ablation on input ordering and modality balance to address this gap. revision: yes
Circularity Check
No circularity: empirical method with no derivations or self-referential claims
full rationale
The paper proposes an empirical cross-modal jailbreak technique (DSR) that decomposes harmful intents into benign textual/visual primitives and reports experimental attack success rates on commercial MLLMs. No equations, mathematical derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the abstract or method description. The central claims rest on observed experimental outcomes (high ASR, low input toxicity) rather than any reduction of results to inputs by construction. This is a standard empirical security paper whose validity is externally falsifiable via replication on the tested models.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption MLLMs possess reasoning capabilities that can fuse distributed cross-modal benign inputs into coherent harmful outputs
invented entities (1)
-
Distributed Semantic Recomposition (DSR)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Floren- cia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [2]
-
[3]
Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara. 2024. The revolution of multimodal large language models: A survey.Findings of the association for computational linguistics: ACL 2024(2024), 13590–13618
2024
-
[4]
Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Ma- hesh Pasupuleti. 2024. Llama guard 3 vision: Safeguarding human-ai image understanding conversations.arXiv preprint arXiv:2411.10414(2024)
-
[5]
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah
-
[6]
Diffusion models in vision: A survey.IEEE transactions on pattern analysis and machine intelligence45, 9 (2023), 10850–10869
2023
- [7]
-
[8]
Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. 2025. Figstep: Jailbreaking large vision- language models via typographic visual prompts. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 23951–23959
2025
-
[9]
Yuxin Gou, Xiaoning Dong, Qin Li, Shishen Gu, Richang Hong, and Wenbo Hu. 2025. SURE: Safety Understanding and Reasoning Enhancement for Multi- modal Large Language Models. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 7563–7604
2025
-
[10]
Yihao Huang, Le Liang, Tianlin Li, Xiaojun Jia, Run Wang, Weikai Miao, Geguang Pu, and Yang Liu. 2025. Perception-guided jailbreak against text-to-image models. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 26238– 26247
2025
-
[11]
Joonhyun Jeong, Seyun Bae, Yeonsung Jung, Jaeryong Hwang, and Eunho Yang
-
[12]
InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of- Distribution Strategy. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 29937–29946
2025
- [13]
-
[14]
Xiaoming Li, Xinyu Hou, and Chen Change Loy. 2024. When stylegan meets stable diffusion: a w+ adapter for personalized image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2187–2196
2024
-
[15]
Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. 2024. Mm- safetybench: A benchmark for safety evaluation of multimodal large language models. InEuropean Conference on Computer Vision. Springer, 386–403
2024
-
[16]
Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. 2022. Diffusion Models for Adversarial Purification. InInternational Conference on Machine Learning (ICML). PMLR, 16805–16827
2022
-
[17]
Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. 2024. Visual adversarial examples jailbreak aligned large language models. InProceedings of the AAAI conference on artificial intelligence, Vol. 38. 21527–21536
2024
-
[18]
Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr
-
[19]
Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022
Red-teaming the stable diffusion safety filter. arXiv.arXiv preprint arXiv:2210.04610(2022)
-
[20]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [21]
-
[22]
Corban Villa, Shujaat Mirza, and Christina Pöpper. 2025. Exposing the Guardrails: Reverse-Engineering and Jailbreaking Safety Filters in DALL-E Text-to-Image Pipelines. In34th USENIX Security Symposium (USENIX Security 25). 897–916
2025
-
[23]
Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. 2024. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE sym- posium on security and privacy (SP). IEEE, 897–912
2024
-
[24]
Zuopeng Yang, Jiluan Fan, Anli Yan, Erdun Gao, Xin Lin, Tao Li, Kanghua Mo, and Changyu Dong. 2025. Distraction is All You Need for Multimodal Large Language Model Jailbreaking. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 9467–9476
2025
- [25]
- [26]
-
[27]
Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, and Yu Tian. 2026. SafeSteer: A Decoding-level Defense Mecha- nism for Multimodal Large Language Models.arXiv preprint arXiv:2605.11716 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[28]
Chenyu Zhang, Lanjun Wang, Yiwen Ma, Wenhui Li, Guoqing Jin, and Anan Liu
-
[29]
In Proceedings of the AAAI Conference on Artificial Intelligence, Vol
Reason2attack: Jailbreaking text-to-image models via llm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 36030–36038
-
[30]
Cong Zhang, Tianze Zhang, Liruo Wang, Ruohui Chen, Wei Li, and Aishan Liu
-
[31]
InProceedings of the AAAI Conference on Artificial Intelligence
T2I-RiskyPrompt: A Benchmark for Safety Evaluation, Attack, and Defense on Text-to-Image Model. InProceedings of the AAAI Conference on Artificial Intelligence
-
[32]
Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models.Findings of the Association for Computational Linguistics: ACL 2024(2024), 12401–12430. Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition Conference acronym ’XX...
2024
-
[33]
Duzhen Zhang, Yahan Yu, Jiahua Dong, Chenxing Li, Dan Su, Chenhui Chu, and Dong Yu. 2024. Mm-llms: Recent advances in multimodal large language models.Findings of the Association for Computational Linguistics: ACL 2024(2024), 12401–12430
2024
-
[34]
Shiji Zhao, Ranjie Duan, Fengxiang Wang, Chi Chen, Caixin Kang, Shouwei Ruan, Jialing Tao, YueFeng Chen, Hui Xue, and Xingxing Wei. 2025. Jailbreaking multimodal large language models via shuffle inconsistency. InProceedings of the IEEE/CVF International Conference on Computer Vision. 2045–2054
2025
- [35]
-
[36]
Boyu Zhu, Xiaofei Wen, Wenjie Jacky Mo, Tinghui Zhu, Yanan Xie, Peng Qi, and Muhao Chen. 2025. OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning.arXiv preprint arXiv:2512.02306(2025). Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang, Zuobin Ying, and Zhuo Ma Benign Inputs, Harmful Outpu...
-
[37]
Perform a line-by-line semantic analysis of the input text corpus
-
[38]
Evaluate each instance against the aforementioned taxonomic criteria
-
[39]
stabbing
Extract the exact verbatim text of any instance classified as positive for violent/gory content. [Output Specification] Return exclusively an isolated list of the flagged prompts, preserving the original syntax and formatting one entry per line. Omit all conversational wrappers, introductory phrasing, or concluding meta-commentary. If no instances meet th...
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.