Red-teaming the stable diffusion safety filter

13 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SafeDiffusion-R1 uses online GRPO with CLIP embedding steering to cut inappropriate content from 48.9% to 18.07% and nudity detections from 646 to 15 in diffusion models while raising GenEval scores from 42.08% to 47.83% and generalizing across seven harm categories without supervised pairs or extra

Closed-Form Concept Erasure via Double Projections

cs.LG · 2026-04-11 · unverdicted · novelty 6.0

A training-free double-projection linear transformation erases target concepts from generative models by first computing a proxy projection and then applying a constrained update in the left null space of known directions.

SHIFT: Steering Hidden Intermediates in Flow Transformers

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

SHIFT learns and applies steering vectors to selected layers and timesteps in DiT models to suppress concepts, shift styles, or bias objects while keeping image quality and prompt adherence intact.
