Red-Teaming the Stable Diffusion Safety Filter

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr · 2022 · arXiv 2210.04610

20 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Model

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

Introduces a layered intervention framework for knowledge infusion in multimodal generative models and empirically demonstrates complementarity of layers in a safety-alignment task with diffusion models.

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

DivIn samples initial noise from a guidance potential posterior via Langevin dynamics to improve diversity in class-to-image and text-to-image generation.

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.

FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

FlowErase-RL applies GRPO to reformulate concept erasure in flow matching models as reward optimization using a dynamic dual-path mechanism for target suppression and non-target preservation.

Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.

Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

cs.CR · 2026-04-07 · unverdicted · novelty 7.0

HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

cs.CV · 2026-06-15 · unverdicted · novelty 6.0

REINS uses supervised PCA on safety-labeled activations to find a linear direction that, when added to hidden states at roughly 50% depth in video diffusion transformers, redirects generations from unsafe to safe content across multiple models.

Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

UVR is a training-free framework that uses attention modulation based on identified information flow stages in multimodal DiT attention to erase unsafe semantics in image synthesis and editing at 91% and 77% rates while preserving quality.

RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

cs.CR · 2026-06-04 · unverdicted · novelty 6.0

RedEdit finds that fewer than two photo edits on average let 76.2% of unsafe images evade detectors while retaining 93.0% of malicious semantics.

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

cs.CR · 2026-06-01 · unverdicted · novelty 6.0

DSR decomposes harmful intents into benign textual and visual primitives that MLLMs fuse into harmful outputs, achieving high attack success with low input toxicity.

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

BEAP is a black-box embedding-aware prompting attack using LLM-guided search that raises attack success rate over 60% against unlearned diffusion models while keeping prompts undetectable.

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47.83% and generalizing across seven harm categories without supervised pairs or extra

Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.

Closed-Form Concept Erasure via Double Projections

cs.LG · 2026-04-11 · unverdicted · novelty 6.0

A training-free double-projection linear transformation erases target concepts from generative models by computing a proxy projection then applying a constrained update in the left null space of known directions.

Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model

cs.CV · 2024-06-20 · conditional · novelty 6.0

VLBiasBench is a new large-scale benchmark with 128,342 samples covering nine social bias categories plus two intersectional ones to evaluate biases in LVLMs.

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

cs.LG · 2023-10-19 · conditional · novelty 6.0

SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.

Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

DDiffusion uses semantic retrieval on prompt embeddings and localized editing inside the diffusion process to suppress NSFW content while avoiding binary allow/block signals.

SHIFT: Steering Hidden Intermediates in Flow Transformers

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

cs.LG · 2026-05-29 · unverdicted · novelty 4.0

GEM bridges trajectory-based unlearning and teacher-guided erasure to create a geometric guidance objective for targeted concept suppression in Rectified Flow models.

citing papers explorer

Showing 12 of 12 citing papers after filters.

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior
cs.CV · 2026-06-01 · unverdicted · none · ref 30

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation
cs.CV · 2026-05-28 · unverdicted · none · ref 44

FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models
cs.CV · 2026-05-19 · unverdicted · none · ref 32 · 2 links

Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
cs.CV · 2026-04-10 · unverdicted · none · ref 30

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering
cs.CV · 2026-06-15 · unverdicted · none · ref 17

Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows
cs.CV · 2026-06-05 · unverdicted · none · ref 11

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models
cs.CV · 2026-05-25 · unverdicted · none · ref 35

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
cs.CV · 2026-05-18 · unverdicted · none · ref 20

Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
cs.CV · 2026-04-06 · unverdicted · none · ref 24

VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
cs.CV · 2024-06-20 · conditional · none · ref 55

Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
cs.CV · 2026-05-01 · unverdicted · none · ref 13

SHIFT: Steering Hidden Intermediates in Flow Transformers
cs.CV · 2026-04-10 · unverdicted · none · ref 26
