Red-teaming the Stable Diffusion safety filter
7 Pith papers cite this work.
Representative citing papers
-
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
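As a toy illustration of the ensemble idea only (not Mosaic's actual attack, which optimizes images against real surrogate VLMs), the sketch below averages gradients over several hypothetical surrogate objectives so that the update is not overfit to any single model — the intuition behind transfer to closed-source targets:

```python
import numpy as np

# Toy quadratic losses stand in for surrogate model objectives; the
# targets (their minimizers) are hypothetical, chosen for illustration.
targets = [np.array([1.0, 0.0]), np.array([0.8, 0.2]), np.array([1.2, -0.1])]

def grad(x, t):
    # Gradient of the surrogate loss ||x - t||^2.
    return 2.0 * (x - t)

x = np.zeros(2)
for _ in range(100):
    # Ensemble step: average gradients over all surrogates so no
    # single model dominates the perturbation.
    g = np.mean([grad(x, t) for t in targets], axis=0)
    x -= 0.05 * g

print(x)  # converges toward the mean of the surrogate minimizers
```

The averaged update settles on a point that does reasonably well against every surrogate at once, which is the property one hopes carries over to an unseen model.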
-
Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization
HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.
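A minimal sketch of the outlier-detection half of this idea, using the Poincaré-ball distance with toy 2-d embeddings; the centroid reference and threshold are illustrative assumptions, not HyPE's actual scoring:

```python
import numpy as np

def poincare_dist(u, v):
    # Geodesic distance in the Poincaré ball model of hyperbolic space.
    uu = 1.0 - np.dot(u, u)
    vv = 1.0 - np.dot(v, v)
    diff = u - v
    return np.arccosh(1.0 + 2.0 * np.dot(diff, diff) / (uu * vv))

# Toy embeddings: benign prompts cluster near the origin; a harmful
# prompt sits near the ball boundary, where distances blow up.
benign = [np.array([0.01, 0.02]), np.array([-0.02, 0.01]), np.array([0.0, -0.01])]
outlier = np.array([0.9, 0.3])

centroid = np.mean(benign, axis=0)  # crude Euclidean centroid as reference
threshold = 1.0                     # hypothetical calibration value
flags = [poincare_dist(x, centroid) > threshold for x in benign + [outlier]]
print(flags)  # [False, False, False, True]
```

The boundary-sensitive metric is the point: a Euclidean distance from the centroid would grow linearly, while the hyperbolic distance diverges as embeddings approach the boundary, sharpening the outlier signal.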
-
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
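A minimal sketch of the probing idea, with each model reduced to a single hypothetical weight matrix whose activations can be read out directly; the scoring rule and all names are illustrative assumptions, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Stand-ins: a base model and a candidate whose parameters drifted
# under (possibly harmful) fine-tuning.
W_base = rng.normal(size=(d, d))
W_cand = W_base + 0.5 * rng.normal(size=(d, d))

# Probe with an ensemble of Gaussian latents; compare internal
# representations, never generated outputs.
Z = rng.normal(size=(256, d))

def probe_score(W_ref, W):
    # Mean representation shift of W relative to the reference model.
    return float(np.mean(np.linalg.norm(Z @ W.T - Z @ W_ref.T, axis=1)))

score_same = probe_score(W_base, W_base)  # 0.0: identical parameters
score_cand = probe_score(W_base, W_cand)  # > 0: specialization signal
print(score_same, score_cand)
```

The appeal for the CSAM application is exactly that this score is computed without ever sampling an image from the model under evaluation.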
-
Closed-Form Concept Erasure via Double Projections
A training-free double-projection linear transformation erases target concepts from generative models by computing a proxy projection and then applying a constrained update in the left null space of known directions.
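A simplified numpy sketch of the constrained-erasure idea (one reading of it, not the paper's exact construction): project the target concept direction onto the orthogonal complement of the directions to preserve, then subtract the weight matrix's rank-1 component along that projected direction, so the preserved directions pass through unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical weight matrix of a linear layer to edit.
W = rng.normal(size=(d, d))

# c: direction encoding the target concept; K: directions to preserve.
c = rng.normal(size=d)
K = rng.normal(size=(d, 3))  # three preserved directions (columns)

# Step 1 (proxy projection): restrict the erased direction to the
# orthogonal complement of the preserved subspace span(K).
Q, _ = np.linalg.qr(K)           # orthonormal basis of span(K)
c_perp = c - Q @ (Q.T @ c)       # component of c outside span(K)
u = c_perp / np.linalg.norm(c_perp)

# Step 2 (constrained update): remove the rank-1 component of W along
# u. Because u is orthogonal to span(K), W_new @ K equals W @ K.
W_new = W - np.outer(W @ u, u)

print(np.linalg.norm(W_new @ u))      # ~0: concept direction erased
print(np.allclose(W_new @ K, W @ K))  # True: known directions intact
```

Both steps are closed-form linear algebra, which is what makes the edit training-free.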
-
Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.
-
Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
DDiffusion uses semantic retrieval on prompt embeddings and localized editing inside the diffusion process to suppress NSFW content while avoiding binary allow/block signals.
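A toy sketch of the retrieval step only, with made-up concept names and 3-d vectors standing in for real prompt embeddings: unsafe concepts are surfaced by cosine similarity, giving a graded signal rather than a binary allow/block gate:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embedding bank of unsafe concepts (toy vectors).
bank = {"concept_a": np.array([1.0, 0.0, 0.0]),
        "concept_b": np.array([0.0, 1.0, 0.0])}

def retrieve(prompt_emb, threshold=0.8):
    # Return unsafe concepts semantically close to the prompt; these
    # would then steer localized edits inside the diffusion process.
    return [name for name, v in bank.items()
            if cosine(prompt_emb, v) >= threshold]

print(retrieve(np.array([0.95, 0.05, 0.1])))  # ['concept_a']
print(retrieve(np.array([0.0, 0.0, 1.0])))    # []
```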
-
SHIFT: Steering Hidden Intermediates in Flow Transformers
SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.
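The steering mechanism itself is simple to sketch; below is a hypothetical version (names and shapes are assumptions, not SHIFT's API) that adds a learned vector to a hidden state only at the (layer, timestep) pairs it was trained for, leaving all other activations untouched:

```python
import numpy as np

def apply_steering(hidden, layer, timestep, steering, alpha=1.0):
    """Add a learned steering vector to a hidden state, but only at
    (layer, timestep) pairs a vector was trained for."""
    v = steering.get((layer, timestep))
    if v is None:
        return hidden             # activations untouched elsewhere
    return hidden + alpha * v     # steered residual update

d = 4
steering = {(10, 500): np.ones(d)}  # toy "learned" vector

h = np.zeros(d)
print(apply_steering(h, 10, 500, steering))  # steered: [1. 1. 1. 1.]
print(apply_steering(h, 3, 500, steering))   # untouched: [0. 0. 0. 0.]
```

Restricting the edit to selected layers and timesteps is what lets such methods suppress a concept while leaving overall image quality and prompt adherence largely intact.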