Red-teaming the stable diffusion safety filter
13 Pith papers cite this work.
citing papers explorer
- Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation
Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
- Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
- Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization
HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.
- SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training
SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47.83% and generalizing across seven harm categories without supervised pairs or extra
- Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
- Closed-Form Concept Erasure via Double Projections
A training-free double-projection linear transformation erases target concepts from generative models by computing a proxy projection then applying a constrained update in the left null space of known directions.
- Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.
- VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
VLBiasBench is a new large-scale benchmark with 128,342 samples covering nine social bias categories plus two intersectional ones to evaluate biases in LVLMs.
- SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation
SalUn uses gradient-based weight saliency to achieve effective machine unlearning of data, classes, or concepts in image classification and generation, narrowing the gap to exact retraining.
- Disciplined Diffusion: Text-to-Image Diffusion Model against NSFW Generation
DDiffusion uses semantic retrieval on prompt embeddings and localized editing inside the diffusion process to suppress NSFW content while avoiding binary allow/block signals.
- SHIFT: Steering Hidden Intermediates in Flow Transformers
SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.
- Geometric Erasure by Contrastive Velocity Matching in Rectified Flows
GEM bridges trajectory-based unlearning and teacher-guided erasure to create a geometric guidance objective for targeted concept suppression in Rectified Flow models.
- FlowErase-RL: Rethinking Concept Erasure as Reward Optimization in Flow Matching Models