Dynamic Optimization and Safety Indicator Injection for Jailbreaking Text-to-Image Models with Multimodal Safety Filters
read the original abstract
Text-to-image (T2I) models can generate not-safe-for-work (NSFW) content, motivating multi-stage safety pipelines with both text and image filters. Newer LLM-based filters detect latent intent beyond keywords, making token-level perturbation attacks unreliable. Our evaluation further shows that existing jailbreak methods exhibit a sharp trade-off between filter evasion and semantic fidelity, while also requiring excessive queries to succeed. We introduce \textbf{OptJail}, an automated jailbreak framework that combines dynamic prompt optimization with multimodal feedback. It consists of two key components: (i) \textit{Dynamic Optimization}, an iterative process that leverages text-filter feedback and semantic consistency to rewrite prompts into adversarial variants; and (ii) \textit{Adaptive Safety Indicator Injection}, which formulates the injection of benign visual cues as a reinforcement learning problem to bypass image-level filters. OptJail achieves state-of-the-art performance, increasing the ShieldLM-7B bypass rate from 8.9\% (Sneakyprompt) to 99.0\%, improving CLIP score from 0.2637 to 0.2762. Moreover, it generalizes to unseen filters and successfully jailbreaks DALL E 3 in our evaluation. Mechanistic analysis reveals why these defenses fail: optimized prompts are projected into the ``safe'' region of the filter's representation space yet remain nearly stationary in the generative model's semantic space, and injected safety indicators redirect image detectors' attention away from NSFW content toward benign visual cues. This study reveals systemic vulnerabilities in current multimodal defenses and motivates stronger adaptive defenses.
This paper has not been read by Pith yet.
Forward citations
Cited by 1 Pith paper
-
Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework
ConceptAgent is a black-box multi-agent system that awakens erased concepts in diffusion models by initializing denoising trajectories from surrogate-guided noisy states.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.