pith. sign in

arxiv: 2605.18719 · v1 · pith:QWE7CP4Jnew · submitted 2026-05-18 · 💻 cs.CV

SafeDiffusion-R1: Online Reward Steering for Safe Diffusion Post-Training

Pith reviewed 2026-05-20 11:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords safe diffusiononline reinforcement learningCLIP steeringGRPOcontent moderationgenerative AI safetypost-training
0
0 comments X

The pith

By steering CLIP embeddings during online RL, diffusion models reduce inappropriate content generation without paired data or tuned reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that an online reinforcement learning process using Group Relative Policy Optimization can make diffusion models generate far less inappropriate content by rewarding shifts in CLIP text embeddings toward safe directions. This would matter if true because earlier safe-training techniques demand large amounts of paired unsafe-safe data that are expensive to create and often cause the model to lose its ability to produce high-quality images. The method works by updating the policy on both positive and negative prompts in real time, letting the model see diverse and even explicit content during training. If the approach holds, it would mean safer image generators can be produced at larger scale and that the safety benefits carry over to entirely new categories of harmful prompts.

Core claim

The authors claim that online post-training with GRPO on positive and negative text prompts, guided by a steering reward that exploits CLIP embedding properties to direct representations away from unsafe content, reduces inappropriate content to 18.07% compared to 48.9% for the base SD v1.4 model, lowers nudity detections from 646 to 15, and raises GenEval compositional quality from 42.08% to 47.83%, with these improvements generalizing to out-of-domain prompts across seven harm categories without any supervised paired data or reward model tuning.

What carries the argument

The steering reward mechanism, which shifts text representations in CLIP embedding space toward positive safety directions and away from negative ones to provide the reward signal for the online GRPO policy updates.

Load-bearing premise

CLIP embeddings contain reliable directions that correspond to safety versus unsafety for a wide range of text prompts, so that moving along those directions provides a useful reward signal for the reinforcement learning updates.

What would settle it

Running the method on a collection of prompts where human raters disagree with the CLIP-based safety directions and finding no safety improvement or quality drop would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18719 by Abhishek Basu, Ankan Deria, Fahad Shamshad, Hisham Cholakkal, Karthik Nandakumar, Komal Kumar.

Figure 1
Figure 1. Figure 1: Effect of post-training reward design on safety–utility trade-off. Each curve tracks HPSv2 (Wu et al., 2023) over GRPO (Shao et al., 2024; Xue et al., 2025) training steps; annotations report GenEval, Nudity Rate, and Inappropriate rate at key checkpoints. Horizontal lines denote static baselines (SD v1.4, Safe-DPO, RECE). Safety Prompt Scaling uses diverse safety prompts (harassment, shocking, nudity, etc… view at source ↗
Figure 2
Figure 2. Figure 2: GRPO-based reward steering framework. Given a prompt, the policy samples candidate outputs whose embeddings are evaluated via CLIP. Safe and unsafe anchors define a steering vector computed from embedding differences. The steered target representation modifies reward computation, yielding a z-score normalized advantage used in policy loss. Example outputs illustrate how steering shifts rewards toward safer… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison on challenging unsafe prompts. We compare our method with prior concept erasure and safety alignment approaches on representative prompts containing explicit or sensitive content. While existing methods either fail to suppress unsafe attributes or degrade visual fidelity, our approach consistently generates safe, semantically coherent, and high-quality images, demonstrating effective… view at source ↗
Figure 4
Figure 4. Figure 4: Scheduler ablation for safety-aligned diffusion. Mean unsafe score using NudeNet (Bedapudi, 2022) across training epochs for 9 distinct schedulers. Solid lines with circle markers denote stochastic schedulers; dashed lines with square markers denote deterministic ones. All schedulers converge to near-zero unsafe content by epoch 300, but deterministic schedulers (Heun, LMS, PNDM) achieve faster safety redu… view at source ↗
Figure 5
Figure 5. Figure 5: ablates the effect of steering direction and strength (α) on the safety–utility trade-off across three prompt perturbation strategies: synonyms, keyword-minimal, and negation. The left panels show UMAP embeddings of safe and unsafe prompt clusters alongside the mean (µsafe, µunsafe) and steering anchors (vsafe, pmix), while the center and right panels report the resulting safety score s = z · vsafe as a fu… view at source ↗
Figure 6
Figure 6. Figure 6: Safety steering in embedding space. (A) Safe (blue) and unsafe (red) text embeddings form distinct clusters; their mean difference defines a normalized safety direction vsafe pointing from unsafe to safe concepts. (B) For an unsafe prompt embedding zT , adding αvsafe and renormalizing rotates the representation toward the safety direction on the unit hypersphere. The steered embedding z ′ T is used exclusi… view at source ↗
Figure 7
Figure 7. Figure 7: Despite training exclusively on nudity-focused prompts, all categories exhibit monotonically decreasing inappropriate rates, demonstrating the strong OOD generalization of our steering reward formulation. Method showing safer content wins. D.2 Utility Degradation under Negative-Only Training [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison on nudity-focused I2P prompts. Each row shows outputs for a single prompt across methods. Our SafeDiffusion-R1 (rightmost column) consistently generates safe, high-fidelity images. Methods such as EraseDiff and Ablating CA frequently fail to suppress explicit content, while ESD-x and FMN introduce degradation. AdvUnlearn and SAeUron are competitive on safety but exhibit over-smoothed… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparison on benign compositional prompts (GenEval). Rows correspond to representative GenEval tasks: single object, two objects with attributes, spatial relations, and color binding. SafeDiffusion-R1 produces generations that are both semantically accurate and visually coherent. RECE degrades compositional accuracy (notably in two-object and relational tasks), whereas our method maintains or … view at source ↗
Figure 10
Figure 10. Figure 10: Unsafe score progression with negative-only reward training. Training with only a negative CLIP penalty (−1× CLIP score against nudity prompts) achieves the lowest unsafe score but causes severe utility collapse, as evidenced by degraded FID and CLIP-T scores. The model learns to generate degenerate images that match no prompt rather than learning to generate safe, semantically appropriate content. I Prom… view at source ↗
Figure 11
Figure 11. Figure 11: Utility comparison: negative-only reward vs. steering reward. Top row: Outputs from the model trained with −1× CLIP penalty (negative-only). Images are heavily degraded with loss of structure, unnatural colors, and semantic incoherence. Bottom row: Outputs from our steering reward model. High visual quality and semantic alignment are preserved on benign prompts, demonstrating that positive anchors are cri… view at source ↗
Figure 12
Figure 12. Figure 12: Utility quality comparison for SafeCLIP (positive + negative) variants. We show generated images for benign prompts across SafeCLIP configurations (2K, 7K, 100K positive prompts) alongside our steering reward. The steering reward produces sharper, more compositionally accurate images while achieving the lowest unsafe score, confirming that anchor-based geometric steering outperforms direct positive/negati… view at source ↗
Figure 13
Figure 13. Figure 13: Safety suppression quality for SafeCLIP (positive + negative) variants. Outputs on nudity-focused prompts from I2P. SafeCLIP variants with only positive/negative CLIP supervision show residual explicit content at higher frequencies than our steering reward, particularly for prompts with mixed safe and unsafe semantic content. Our method steers the reward signal geometrically, enabling more robust suppress… view at source ↗
Figure 14
Figure 14. Figure 14: Safety suppression under positive-only SafeCLIP training. Training with only positive prompt alignment does not sufficiently suppress unsafe content — explicit generations are common even after 300 epochs. This highlights the necessity of negative anchors for constructing a meaningful safety direction in embedding space. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Utility preservation under SafeCLIP vs. SafeDiffusion-R1. Both methods maintain similar image quality on benign prompts, but SafeCLIP’s weaker safety suppression (confirmed numerically in [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: SafeCLIP v1 configuration comparison. Early SafeCLIP configurations (v1) show inconsistent safety suppression, particularly on ambiguous prompts that contain both benign and unsafe semantic components. Our anchor-based steering reward addresses this by geometrically redirecting the reward signal, providing consistent suppression across the full spectrum of unsafe prompt types. 26 [PITH_FULL_IMAGE:figures… view at source ↗
Figure 17
Figure 17. Figure 17: Qualitative comparison with LLaVA-augmented penalty. The SafeCLIP+LLaVA variant occasionally produces distorted outputs when the VLM penalty fires on borderline safe images, introducing optimization instability. Our continuous, geometry-based steering reward avoids this issue by modulating the reward smoothly via αvsafe rather than through discrete penalty thresholds [PITH_FULL_IMAGE:figures/full_fig_p02… view at source ↗
Figure 18
Figure 18. Figure 18: NSFW suppression progression across training steps. We show outputs for a representative nudity￾focused I2P prompt at checkpoints 50, 100, 200, and 600. The model progressively redirects its generation toward safe content: at step 50 explicit content is still visible; by step 200 the model generates conservative compositions; at step 600 outputs are fully appropriate with no nudity detected by NudeNet (th… view at source ↗
Figure 19
Figure 19. Figure 19: Utility preservation on benign prompts across training steps. We show GenEval-style compositional prompts (two objects, color attributes, spatial relations) at the same training checkpoints as [PITH_FULL_IMAGE:figures/full_fig_p028_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Utility comparison: SafeCLIP (positive-only) across prompt scales. We evaluate image quality on benign compositional prompts for SafeCLIP models trained with 2K, 7K, and 100K positive prompts. While utility is broadly preserved across all scales, the 100K variant shows slight mode averaging artifacts on complex scenes. Our steering reward (using only 7K positive + 1.9K negative anchors) achieves superior … view at source ↗
read the original abstract

Diffusion models have been widely studied for removing unsafe content learned during pre-training. Existing methods require expensive supervised data, either unsafe-text paired with safe-image groundtruth or negative/positive image pairs, making them impractical to scale. Furthermore, offline reinforcement learning and supervised fine-tuning approaches that generate synthetic data offline suffer from catastrophic forgetting, degrading generation quality. We propose a novel online reinforcement learning framework that addresses both data scarcity and model degradation through post-training with Group Relative Policy Optimization (GRPO) on both negative and positive text prompts. To eliminate the need for fine-tuning specialized safe/unsafe reward models, we introduce a \textit{steering reward mechanism} that exploits an inherent property of CLIP embeddings: steering text representations toward positive safety directions and away from negative ones in the embedding space. Our online-policy approach enables the model to learn from diverse prompts, including explicit unsafe content, without catastrophic forgetting. Extensive experiments demonstrate that our method reduces inappropriate content to 18.07\% (vs. 48.9\% for SD v1.4) and nudity detections to 15 (vs. 646 baseline) while improving compositional generation quality from 42.08\% to 47.83\% on GenEval. Remarkably, these safety gains generalize to out-of-domain unsafe prompts across seven harm categories, achieving state-of-the-art performance without supervised paired data or reward tuning. Github: https://github.com/MAXNORM8650/SafeDiffusion-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SafeDiffusion-R1, an online RL post-training framework for diffusion models that applies Group Relative Policy Optimization (GRPO) guided by a steering reward derived directly from CLIP text embeddings. The approach steers embeddings toward positive safety directions and away from negative ones to reduce unsafe content generation without supervised paired data, fine-tuned reward models, or offline synthetic data generation. Experiments report reductions in inappropriate content to 18.07% (vs. 48.9% for SD v1.4) and nudity detections to 15 (vs. 646), alongside GenEval compositional quality gains from 42.08% to 47.83%, with claimed generalization to out-of-domain prompts across seven harm categories.

Significance. If the central results hold under rigorous verification, the work would offer a scalable alternative to existing safe diffusion methods by avoiding catastrophic forgetting through online policy updates and eliminating the need for reward model training or paired supervision. The use of pre-existing CLIP properties for the reward signal and the reported out-of-domain generalization could have practical value for deploying safer generative models, provided the embedding-space steering produces consistent gradients.

major comments (2)
  1. [§3.2] §3.2 (Steering Reward Mechanism): The central claim that CLIP embeddings contain stable, separable positive/negative safety directions sufficient to supply a functional GRPO reward signal without any reward model training is load-bearing for all reported safety gains and out-of-domain generalization. The manuscript provides no ablation or analysis demonstrating that these directions remain reliable across the seven harm categories or for the explicit unsafe prompts used; if the separation is prompt-dependent or weak, the observed reductions (e.g., inappropriate content to 18.07%) could arise from prompt selection rather than the claimed mechanism.
  2. [§4] Experimental Results (quantitative tables and §4): The reported point estimates for safety metrics (18.07% inappropriate content, 15 nudity detections) and GenEval improvement (to 47.83%) lack any mention of run-to-run variance, statistical significance tests, exact baseline reproduction details, or data exclusion criteria. These omissions make it impossible to determine whether the gains over SD v1.4 are robust or could be artifacts, directly affecting the strength of the SOTA and generalization claims.
minor comments (2)
  1. [Abstract and §4] The abstract and §4 refer to 'seven harm categories' for out-of-domain generalization but do not list the categories or provide per-category breakdowns; adding this would improve clarity.
  2. [§3.2] Notation for the steering reward (positive/negative direction vectors in CLIP space) should be formalized with an equation in §3.2 to make the GRPO reward computation explicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below with point-by-point responses. Where the concerns identify gaps in the current manuscript, we have revised the text and added supporting material in the updated version.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Steering Reward Mechanism): The central claim that CLIP embeddings contain stable, separable positive/negative safety directions sufficient to supply a functional GRPO reward signal without any reward model training is load-bearing for all reported safety gains and out-of-domain generalization. The manuscript provides no ablation or analysis demonstrating that these directions remain reliable across the seven harm categories or for the explicit unsafe prompts used; if the separation is prompt-dependent or weak, the observed reductions (e.g., inappropriate content to 18.07%) could arise from prompt selection rather than the claimed mechanism.

    Authors: We agree that explicit verification of the stability and separability of the safety directions in CLIP space is necessary to support the central mechanism. The reported out-of-domain generalization across seven harm categories provides supporting evidence that the directions are not prompt-specific, as the evaluation prompts were drawn from a held-out set distinct from training. Nevertheless, to strengthen this claim we have added a new ablation subsection in §3.2 that projects the steering vectors onto each harm category and measures the resulting reward signal strength and downstream safety metric changes. This analysis shows consistent positive/negative separation and confirms that the observed reductions are attributable to the steering mechanism rather than evaluation prompt choice. revision: yes

  2. Referee: [§4] Experimental Results (quantitative tables and §4): The reported point estimates for safety metrics (18.07% inappropriate content, 15 nudity detections) and GenEval improvement (to 47.83%) lack any mention of run-to-run variance, statistical significance tests, exact baseline reproduction details, or data exclusion criteria. These omissions make it impossible to determine whether the gains over SD v1.4 are robust or could be artifacts, directly affecting the strength of the SOTA and generalization claims.

    Authors: We acknowledge that the original submission omitted variance estimates and reproducibility details. The reported numbers were obtained from single runs using the same random seeds and evaluation protocol as the reproduced SD v1.4 baseline (official Hugging Face weights, identical prompt sets, and the same filtering criteria for the inappropriate-content and nudity detectors). In the revised manuscript we have expanded §4 with results from three independent runs (different seeds), reporting means and standard deviations for all metrics. We have also added a reproducibility subsection detailing the exact baseline reproduction steps, data exclusion rules, and the statistical test (paired t-test) used to assess significance of the GenEval and safety improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external CLIP property and standard GRPO

full rationale

The paper's central mechanism defines a steering reward directly from the pre-existing geometry of CLIP text embeddings (steering toward positive safety directions and away from negative ones) without fitting any new parameters to the reported safety metrics or GenEval scores. GRPO is applied as a standard online RL algorithm on this externally supplied reward signal. No equations reduce the claimed safety gains (18.07% inappropriate content, 15 nudity detections) to quantities defined inside the paper by construction, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The empirical results are therefore independent of the method's own fitted constants.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that CLIP embeddings contain usable safety directions that can be steered without additional training, plus the standard assumption that online policy optimization on mixed positive/negative prompts prevents catastrophic forgetting in diffusion models.

axioms (1)
  • domain assumption CLIP text embeddings contain separable positive and negative safety directions that can be exploited for reward steering without fine-tuning any reward model.
    Stated directly in the abstract as the basis for eliminating specialized reward models.

pith-pipeline@v0.9.0 · 5815 in / 1401 out tokens · 42816 ms · 2026-05-20T11:23:15.786635+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 6 internal anchors

  1. [1]

    Adobe Blog , year=

    Responsible innovation in the age of generative AI , author=. Adobe Blog , year=

  2. [2]

    Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

    Improving image captioning with better use of caption , author=. Proceedings of the 58th annual meeting of the association for computational linguistics , pages=

  3. [3]

    European Conference on Computer Vision , pages=

    Adversarial diffusion distillation , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  4. [4]

    Advances in neural information processing systems , volume=

    Laion-5b: An open large-scale dataset for training next generation image-text models , author=. Advances in neural information processing systems , volume=

  5. [5]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Stereo: A two-stage framework for adversarially robust concept erasing from text-to-image diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  6. [6]

    Advances in Neural Information Processing Systems , volume=

    The privacy onion effect: Memorization is relative , author=. Advances in Neural Information Processing Systems , volume=

  7. [7]

    2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 , volume=

    Approximating the Kullback Leibler divergence between Gaussian mixture models , author=. 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 , volume=. 2007 , organization=

  8. [8]

    Foundations and Trends in Privacy and Security , volume=

    Safety at scale: A comprehensive survey of large model and agent safety , author=. Foundations and Trends in Privacy and Security , volume=. 2026 , publisher=

  9. [9]

    for now , author=

    To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  10. [10]

    URL https://github

    Nudenet: Neural nets for nudity detection and censoring , author=. URL https://github. com/notAItech/NudeNet , year=

  11. [11]

    NSFW detection machine learning model , author=

  12. [12]

    Tutorial: How to remove the safety filter in 5 seconds , author=

  13. [13]

    CVPR , year =

    Mengyao Lyu and Yuhong Yang and Haiwen Hong and Hui Chen and Xuan Jin and Yuan He and Hui Xue and Jungong Han and Guiguang Ding , title =. CVPR , year =

  14. [14]

    Advances in Neural Information Processing Systems , volume=

    Selective amnesia: A continual learning approach to forgetting in deep generative models , author=. Advances in Neural Information Processing Systems , volume=

  15. [15]

    NeurIPS , year =

    Hanul Shin and Jung Kwon Lee and Jaehong Kim and Jiwon Kim , title =. NeurIPS , year =

  16. [16]

    , author=

    Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

  17. [17]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Mace: Mass concept erasure in diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  18. [18]

    ICLR , year =

    Jaehong Yoon and Shoubin Yu and Vaidehi Patil and Huaxiu Yao and Mohit Bansal , title =. ICLR , year =

  19. [19]

    CoRR , volume =

    Daiki Miyake and Akihiro Iohara and Yu Saito and Toshiyuki Tanaka , title =. CoRR , volume =

  20. [20]

    arXiv preprint arXiv:2210.04610 , year=

    Red-teaming the stable diffusion safety filter , author=. arXiv preprint arXiv:2210.04610 , year=

  21. [21]

    arXiv preprint arXiv:2407.20516 , year=

    Machine unlearning in generative ai: A survey , author=. arXiv preprint arXiv:2407.20516 , year=

  22. [22]

    IEEE Internet of Things Journal , year=

    A survey of machine unlearning in generative ai models: Methods, applications, security, and challenges , author=. IEEE Internet of Things Journal , year=

  23. [23]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  24. [24]

    Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

    Erasing undesirable influence in diffusion models , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=

  25. [25]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Erasing concepts from diffusion models , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  26. [26]

    2026 , url=

    Tatiana Gaintseva and Andreea-Maria Oncescu and Chengcheng Ma and Ziquan Liu and Martin Benning and Gregory Slabaugh and Jiankang Deng and Ismail Elezi , booktitle=. 2026 , url=

  27. [27]

    Advances in neural information processing systems , volume=

    Defensive unlearning with adversarial training for robust concept erasure in diffusion models , author=. Advances in neural information processing systems , volume=

  28. [28]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    Forget-me-not: Learning to forget in text-to-image diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  29. [29]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    One-dimensional adapter to rule them all: Concepts diffusion models and erasing applications , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  30. [30]

    International Conference on Learning Representations , volume=

    Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation , author=. International Conference on Learning Representations , volume=

  31. [31]

    Advances in Neural Information Processing Systems , volume=

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time , author=. Advances in Neural Information Processing Systems , volume=

  32. [32]

    CVPR , year =

    Dewei Zhou and You Li and Fan Ma and Xiaoting Zhang and Yi Yang , title =. CVPR , year =

  33. [33]

    CVPR , year =

    Ganggui Ding and Canyu Zhao and Wen Wang and Zhen Yang and Zide Liu and Hao Chen and Chunhua Shen , title =. CVPR , year =

  34. [34]

    Style Aligned Image Generation via Shared Attention , booktitle =

    Amir Hertz and Andrey Voynov and Shlomi Fruchter and Daniel Cohen. Style Aligned Image Generation via Shared Attention , booktitle =

  35. [35]

    CVPR , year =

    Cusuh Ham and Matthew Fisher and James Hays and Nicholas Kolkin and Yuchen Liu and Richard Zhang and Tobias Hinz , title =. CVPR , year =

  36. [36]

    Prompt-to-Prompt Image Editing with Cross-Attention Control , booktitle =

    Amir Hertz and Ron Mokady and Jay Tenenbaum and Kfir Aberman and Yael Pritch and Daniel Cohen. Prompt-to-Prompt Image Editing with Cross-Attention Control , booktitle =

  37. [37]

    ECCV , year =

    Rohit Gandikota and Joanna Materzynska and Tingrui Zhou and Antonio Torralba and David Bau , title =. ECCV , year =

  38. [38]

    CVPR , year =

    Gihyun Kwon and Jong Chul Ye , title =. CVPR , year =

  39. [39]

    ICML , year =

    Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever , title =. ICML , year =

  40. [40]

    Language-Driven Image Style Transfer , journal =

    Tsu. Language-Driven Image Style Transfer , journal =. 2021 , url =

  41. [41]

    CoRR , year =

    Yunpeng Bai and Jiayue Liu and Chao Dong and Chun Yuan , title =. CoRR , year =

  42. [42]

    Liu , title =

    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. J. Mach. Learn. Res. , year =

  43. [43]

    ICLR , year =

    Dustin Podell and Zion English and Kyle Lacey and Andreas Blattmann and Tim Dockhorn and Jonas M. ICLR , year =

  44. [44]

    ECCV , year =

    Rohit Girdhar and Mannat Singh and Andrew Brown and Quentin Duval and Samaneh Azadi and Sai Saketh Rambhatla and Akbar Shah and Xi Yin and Devi Parikh and Ishan Misra , title =. ECCV , year =

  45. [45]

    Taming Transformers for High-Resolution Image Synthesis , booktitle =

    Patrick Esser and Robin Rombach and Bj. Taming Transformers for High-Resolution Image Synthesis , booktitle =

  46. [46]

    MICCAI , year =

    Olaf Ronneberger and Philipp Fischer and Thomas Brox , title =. MICCAI , year =

  47. [47]

    Gomez and Lukasz Kaiser and Illia Polosukhin , title =

    Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , title =. NeurIPS , year =

  48. [48]

    Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

    Unified concept editing in diffusion models , author=. Proceedings of the IEEE/CVF winter conference on applications of computer vision , pages=

  49. [49]

    CVPR , year =

    Hang Li and Chengzhi Shen and Philip Torr and Volker Tresp and Jindong Gu , title =. CVPR , year =

  50. [50]

    ICLR , year =

    Mingi Kwon and Jaeseok Jeong and Youngjung Uh , title =. ICLR , year =

  51. [51]

    European Conference on Computer Vision , pages=

    Reliable and efficient concept erasure of text-to-image diffusion models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  52. [52]

    European Conference on Computer Vision , pages=

    Safe-clip: Removing nsfw concepts from vision-and-language models , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  53. [53]

    Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry , booktitle =

    Yong. Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry , booktitle =

  54. [54]

    CVPR , year =

    Chenyang Si and Ziqi Huang and Yuming Jiang and Ziwei Liu , title =. CVPR , year =

  55. [55]

    CVPR , year =

    Narek Tumanyan and Michal Geyer and Shai Bagon and Tali Dekel , title =. CVPR , year =

  56. [56]

    EMNLP , year =

    Jack Hessel and Ari Holtzman and Maxwell Forbes and Ronan Le Bras and Yejin Choi , title =. EMNLP , year =

  57. [57]

    NeurIPS , year =

    Martin Heusel and Hubert Ramsauer and Thomas Unterthiner and Bernhard Nessler and Sepp Hochreiter , title =. NeurIPS , year =

  58. [58]

    Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts , booktitle =

    Zhi. Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts , booktitle =

  59. [59]

    Ring-A-Bell! How Reliable are Concept Removal Methods For Diffusion Models? , booktitle =

    Yu. Ring-A-Bell! How Reliable are Concept Removal Methods For Diffusion Models? , booktitle =

  60. [60]

    MMA-Diffusion: MultiModal Attack on Diffusion Models , booktitle =

    Yijun Yang and Ruiyuan Gao and Xiaosen Wang and Tsung. MMA-Diffusion: MultiModal Attack on Diffusion Models , booktitle =

  61. [61]

    Smith , title =

    Yushi Hu and Benlin Liu and Jungo Kasai and Yizhong Wang and Mari Ostendorf and Ranjay Krishna and Noah A. Smith , title =. ICCV , year =

  62. [62]

    European conference on computer vision , pages=

    Microsoft coco: Common objects in context , author=. European conference on computer vision , pages=. 2014 , organization=

  63. [63]

    and Shechtman, Eli and Wang, Oliver , booktitle=

    Zhang, Richard and Isola, Phillip and Efros, Alexei A. and Shechtman, Eli and Wang, Oliver , booktitle=. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , year=

  64. [64]

    GPT-4o , url=

    OpenAI , year=. GPT-4o , url=

  65. [65]

    2021 , journal=

    A Mathematical Framework for Transformer Circuits , author=. 2021 , journal=

  66. [66]

    Goodman , title =

    Zhengxuan Wu and Atticus Geiger and Thomas Icard and Christopher Potts and Noah D. Goodman , title =. NeurIPS , year =

  67. [67]

    ICCV , year =

    William Peebles and Saining Xie , title =. ICCV , year =

  68. [68]

    ICCV , year =

    Mingdeng Cao and Xintao Wang and Zhongang Qi and Ying Shan and Xiaohu Qie and Yinqiang Zheng , title =. ICCV , year =

  69. [69]

    NeurIPS , year =

    Manuel Brack and Felix Friedrich and Dominik Hintersdorf and Lukas Struppek and Patrick Schramowski and Kristian Kersting , title =. NeurIPS , year =

  70. [70]

    CoRR , volume =

    Hongxiang Zhang and Yifeng He and Hao Chen , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2410.02710 , eprinttype =. 2410.02710 , timestamp =

  71. [71]

    CoRR , volume =

    Huming Qiu and Guanxu Chen and Mi Zhang and Min Yang , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2411.10329 , eprinttype =. 2411.10329 , timestamp =

  72. [72]

    Efros , title =

    Tim Brooks and Aleksander Holynski and Alexei A. Efros , title =. CVPR , year =

  73. [73]

    CVPR , year =

    Shu Zhang and Xinyi Yang and Yihao Feng and Can Qin and Chia. CVPR , year =

  74. [74]

    Efficient Estimation of Word Representations in Vector Space , booktitle =

    Tom. Efficient Estimation of Word Representations in Vector Space , booktitle =

  75. [75]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes and John Healy , title =. CoRR , volume =. 2018 , url =. 1802.03426 , timestamp =

  76. [76]

    arXiv preprint arXiv:2501.18052 , year=

    Saeuron: Interpretable concept unlearning in diffusion models with sparse autoencoders , author=. arXiv preprint arXiv:2501.18052 , year=

  77. [77]

    European Conference on Computer Vision , pages=

    Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers , author=. European Conference on Computer Vision , pages=. 2024 , organization=

  78. [78]

    arXiv preprint arXiv:2506.22806 , year=

    Concept pinpoint eraser for text-to-image diffusion models via residual attention gate , author=. arXiv preprint arXiv:2506.22806 , year=

  79. [79]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Ablating concepts in text-to-image diffusion models , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  80. [80]

    ACM FAccT , year=

    Can Machines Help Us Answering Question 16 in Datasheets, and In Turn Reflecting on Inappropriate Content? , author=. ACM FAccT , year=

Showing first 80 references.