pith. sign in

arxiv: 2606.17257 · v1 · pith:ADYDZCI7new · submitted 2026-06-15 · 💻 cs.CV · cs.AI

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Pith reviewed 2026-06-27 03:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video diffusionsafety alignmentrepresentation steeringinference-time interventionsupervised PCAhidden-state activationstraining-free alignmentdiffusion transformers
0
0 comments X

The pith

Video diffusion models can be made to generate safe content at inference time by adding one linear direction to their hidden-state activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety-relevant structure exists as a linear direction in the hidden activations of video diffusion transformers. This direction is recovered by applying supervised PCA to activations labeled simply as safe or unsafe. Adding the direction at an intermediate layer during generation shifts trajectories away from harmful outputs toward related safe alternatives. The approach requires no model updates and adds negligible cost, addressing the limits of fine-tuning or post-hoc filters. The finding holds across multiple model scales and both text-to-video and image-to-video tasks.

Core claim

Safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead.

What carries the argument

The safety steering direction recovered by Supervised PCA on binary-labeled hidden-state activations, added to the activations at an intermediate (~50% depth) layer of the video diffusion transformer.

If this is right

  • The same direction steers outputs across nine different video diffusion models spanning 1.3B to 5B parameters.
  • Steering effectiveness is highest at intermediate transformer depth even though safety information continues to accumulate in deeper layers.
  • The method applies equally to text-to-video and image-to-video generation without any task-specific retraining.
  • No enumeration of unsafe concepts is required; the binary safe/unsafe labeling suffices to locate the direction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear-steering recipe could be tested on other controllable attributes such as artistic style or object count by swapping the binary label set.
  • If the direction is found to be largely orthogonal to the main generation manifold, combining multiple such directions for simultaneous control of safety and style becomes feasible.
  • The observed tradeoff between information accumulation and steerability at different depths suggests a general principle that could be checked in language or image diffusion transformers.

Load-bearing premise

A direction fitted once on a fixed collection of binary safety labels will continue to steer arbitrary new prompts toward safe outputs without degrading video quality or introducing artifacts.

What would settle it

Run the steering direction on a held-out set of prompts that were never seen during direction discovery and measure whether an independent classifier or human raters judge the resulting videos as unsafe at rates comparable to the unsteered baseline.

Figures

Figures reproduced from arXiv: 2606.17257 by Amit K. Roy-Chowdhury, Arindam Dutta, Athula Balachandran, Rohit Kundu, Sarosij Bose.

Figure 1
Figure 1. Figure 1: REINS aligns video diffusion models with safety constraints by steering their internal representations at inference time. (a) Safety-steered outputs: On unsafe baseline generations from CogVideoX and Wan2.1 (left of each pair), REINS redirects the model toward semantically related safe alternatives within its own manifold (right), with no weight updates, no concept enumeration, and no prompt-level filterin… view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative results: For each prompt, we show baseline generation and REINS-steered generation using identical random seeds for paired comparison. REINS redirects harmful content toward safe content. Some explicit content has been censored [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effective steering range across transformer layers. We apply REINS at a fixed steering strength across candidate layers and measure safety rate and VQ/MQ win rates. The shaded effective steering range highlights the band of intermediate layers (∼50% depth) where safety improves substantially while VQ and MQ remain near baseline. The operating region predicted by the information-steerability tradeoff. Outsi… view at source ↗
Figure 5
Figure 5. Figure 5: Rank ablation: A single direction (rank￾1) captures ∼68% of the safety-relevant variance and achieves saturation-level safety rates at suffi￾cient λ. Higher ranks do not improve metrics like safety rate, VQ or MQ significantly, confirming that safety is low-dimensional in hidden-state space [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Representation shift under steering, validated by an OOD classifier. We project hidden states onto the top two SPCA components and connect each unsafe baseline generation (open red) to its steered counterpart (filled green/red) under the same prompt and seed. Labels come from an in-domain classifier (left) and an OOD classifier (right) unused during direction discovery. Steering consistently moves unsafe g… view at source ↗
read the original abstract

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation, to our knowledge, the broadest safety evaluation suite in the video generation literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces REINS, a training-free inference-time safety steering method for video diffusion models. It claims that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers; a single direction obtained via Supervised PCA on binary safety labels separates safe from unsafe trajectories and, when added to hidden states at an intermediate layer (~50% depth), redirects harmful generations to semantically related safe outputs. The method requires no weight updates or concept enumeration. Mechanistic analysis shows safety information accumulates monotonically with depth but steering peaks at intermediate layers. Evaluation spans 9 models (1.3B–5B parameters) in both text-to-video and image-to-video modes.

Significance. If the central claim holds, the work would be significant for providing a lightweight, training-free alternative to safety fine-tuning or external filters in open video diffusion models. The breadth of the evaluation suite (multiple scales and generation modes) and the identification of a depth-dependent tradeoff between information availability and steerability constitute concrete contributions to representation engineering for diffusion transformers. The absence of training and the single-direction simplicity are practical strengths.

major comments (2)
  1. [Method (Supervised PCA)] Method section on Supervised PCA: the safety direction is fitted on binary labels from a fixed prompt set; the manuscript does not state whether these prompts are held out from the steering evaluation set. This detail is load-bearing for the claim that the direction encodes a prompt-independent safety concept rather than features correlated with the particular label-collection procedure.
  2. [Experiments / Results] Evaluation and results: while the abstract states evaluation across nine models and two modes, the manuscript supplies no quantitative safety success rates, video quality metrics (e.g., FVD, CLIP similarity to safe references), or ablation tables on layer choice and steering coefficient. Without these, it is impossible to verify that steered outputs remain on-distribution or that quality is preserved, which directly affects the practical-utility claim.
minor comments (2)
  1. Clarify how the binary safety labels were obtained (human raters, automated classifier, or prompt templates) and report inter-rater agreement if applicable.
  2. [Mechanistic analysis] The mechanistic analysis paragraph on monotonic accumulation versus intermediate-layer peak would benefit from a figure plotting safety separability (e.g., PCA explained variance or linear probe accuracy) versus depth.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Method (Supervised PCA)] Method section on Supervised PCA: the safety direction is fitted on binary labels from a fixed prompt set; the manuscript does not state whether these prompts are held out from the steering evaluation set. This detail is load-bearing for the claim that the direction encodes a prompt-independent safety concept rather than features correlated with the particular label-collection procedure.

    Authors: We agree this clarification is necessary. The safety-labeled activations used to fit the Supervised PCA direction were collected from a prompt set that was held out from all evaluation prompts. We will revise the Method section to explicitly document this held-out split, thereby strengthening the claim that the discovered direction reflects a general safety concept. revision: yes

  2. Referee: [Experiments / Results] Evaluation and results: while the abstract states evaluation across nine models and two modes, the manuscript supplies no quantitative safety success rates, video quality metrics (e.g., FVD, CLIP similarity to safe references), or ablation tables on layer choice and steering coefficient. Without these, it is impossible to verify that steered outputs remain on-distribution or that quality is preserved, which directly affects the practical-utility claim.

    Authors: We acknowledge that the current manuscript presents primarily qualitative results and mechanistic analysis. To address this, the revised version will add quantitative safety success rates, video quality metrics including FVD and CLIP similarity to safe references, and ablation tables on layer depth and steering coefficient across the nine models and both generation modes. revision: yes

Circularity Check

1 steps flagged

Safety direction separation reduces to the Supervised PCA fit by construction

specific steps
  1. fitted input called prediction [Abstract]
    "Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories."

    Supervised PCA explicitly optimizes for a linear direction that best separates the two classes given by the binary safety labels. The assertion that this direction 'suffices to separate safe from unsafe generation trajectories' is therefore true by construction on the fitted label set rather than an independent result about the model's representations.

full rationale

The paper's central mechanistic claim is presented as an independent finding about linear encoding of safety in activations. However, the quoted key finding directly follows from applying Supervised PCA, whose objective is to produce a direction that separates the supplied binary labels. This makes the separation statement equivalent to the fitting step for the labeled data used, satisfying the fitted-input-called-prediction pattern. The subsequent steering application at inference time is not itself circular, but the load-bearing claim that a single such direction 'suffices to separate' trajectories is. No other patterns (self-citation chains, ansatz smuggling, or renaming) are evidenced in the provided text.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that safety is linearly separable in activation space; no new physical entities are postulated. The direction vector itself is a fitted parameter derived from labeled data.

free parameters (1)
  • steering coefficient and target layer index
    The magnitude of the added direction and the precise transformer layer at which it is injected are chosen (implicitly via the reported peak at ~50% depth) to achieve the steering effect.
axioms (1)
  • domain assumption Safety information is linearly encoded and can be isolated by supervised PCA on binary labels
    Invoked in the key finding sentence of the abstract; the method presupposes that a single linear direction suffices to separate safe and unsafe trajectories.

pith-pipeline@v0.9.1-grok · 5776 in / 1408 out tokens · 48548 ms · 2026-06-27T03:41:22.606762+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

43 extracted references · 15 canonical work pages · 10 internal anchors

  1. [1]

    Safesora: Towards safety alignment of text2video generation via a human preference dataset.Advances in Neural Information Processing Systems, 37:17161–17214, 2024

    Juntao Dai, Tianle Chen, Xuyao Wang, Ziran Yang, Taiye Chen, Jiaming Ji, and Yaodong Yang. Safesora: Towards safety alignment of text2video generation via a human preference dataset.Advances in Neural Information Processing Systems, 37:17161–17214, 2024

  2. [2]

    Safewatch: An efficient safety-policy following video guardrail model with transparent explanations

    Zhaorun Chen, Francesco Pinto, Minzhou Pan, and Bo Li. Safewatch: An efficient safety-policy following video guardrail model with transparent explanations. In13th International Conference on Learning Represen- tations, ICLR 2025, pages 95666–95708. International Conference on Learning Representations, ICLR, 2025

  3. [3]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

  4. [4]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  5. [5]

    Mochi 1.https://github.com/genmoai/models, 2024

    Genmo Team. Mochi 1.https://github.com/genmoai/models, 2024

  6. [6]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

  7. [7]

    Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

    Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

  8. [8]

    Videoguard: Protecting video content from unauthorized editing.arXiv preprint arXiv:2508.03480, 2025

    Junjie Cao, Kaizhou Li, Xinchun Yu, Hongxiang Li, and Xiaoping Zhang. Videoguard: Protecting video content from unauthorized editing.arXiv preprint arXiv:2508.03480, 2025

  9. [9]

    Sora by openai.https://openai.com/sora/, 2024

    OpenAI. Sora by openai.https://openai.com/sora/, 2024

  10. [10]

    Text driven video generation.https://research.runwayml.com/gen2, 2023

    Runway Research. Text driven video generation.https://research.runwayml.com/gen2, 2023

  11. [11]

    Kling-Omni Technical Report

    Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

  12. [12]

    HunyuanVideo 1.5 Technical Report

    Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

  13. [13]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  14. [14]

    Steerdiff: Steering towards safe text-to-image diffusion models.arXiv preprint arXiv:2410.02710, 2024

    Hongxiang Zhang, Yifeng He, and Hao Chen. Steerdiff: Steering towards safe text-to-image diffusion models.arXiv preprint arXiv:2410.02710, 2024

  15. [15]

    Mma-diffusion: Multimodal attack on diffusion models

    Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. Mma-diffusion: Multimodal attack on diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7737–7746, 2024

  16. [16]

    A divide-and-conquer approach to distributed attack identification

    Fabio Pasqualetti, Florian Dörfler, and Francesco Bullo. A divide-and-conquer approach to distributed attack identification. In2015 54th IEEE Conference on Decision and Control (CDC), pages 5801–5807. IEEE, 2015

  17. [17]

    Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

    Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr. Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

  18. [18]

    Safegen: Mitigating sexually explicit content generation in text-to-image models

    Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, and Wenyuan Xu. Safegen: Mitigating sexually explicit content generation in text-to-image models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 4807–4821, 2024

  19. [19]

    Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds.Pattern Recognition, 44(7):1357–1371, 2011

    Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds.Pattern Recognition, 44(7):1357–1371, 2011

  20. [20]

    Polyjuice makes it real: Black-box, universal red teaming for synthetic image detectors.Advances in Neural Information Processing Systems, 2025

    Sepehr Dehdashtian, Mashrur M Morshed, Jacob H Seidman, Gaurav Bharaj, and Vishnu Boddeti. Polyjuice makes it real: Black-box, universal red teaming for synthetic image detectors.Advances in Neural Information Processing Systems, 2025

  21. [21]

    Stable diffusion safety checker

    CompVis/HuggingFace. Stable diffusion safety checker. https://huggingface.co/CompVis/ stable-diffusion-safety-checker, 2022. 10

  22. [22]

    Erasing concepts from diffusion models

    Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2426–2436, 2023

  23. [23]

    Unified concept editing in diffusion models

    Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzy´nska, and David Bau. Unified concept editing in diffusion models. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5111–5120, 2024

  24. [24]

    Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models

    Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22522–22531, 2023

  25. [25]

    Self-discovering interpretable diffusion latent directions for responsible text-to-image generation

    Hang Li, Chengzhi Shen, Philip Torr, V olker Tresp, and Jindong Gu. Self-discovering interpretable diffusion latent directions for responsible text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12006–12016, 2024

  26. [26]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

  27. [27]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

  28. [28]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

  29. [29]

    Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

  30. [30]

    Measuring statistical dependence with hilbert-schmidt norms

    Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. InInternational conference on algorithmic learning theory, pages 63–77. Springer, 2005

  31. [31]

    Improving video generation with human feedback

    Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  32. [32]

    Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

    Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

  33. [33]

    MMA-Diffusion: MultiModal Attack on Diffusion Models

    Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. MMA-Diffusion: MultiModal Attack on Diffusion Models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  34. [34]

    Latent guard: a safety framework for text-to-image generation

    Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, and Fabio Pizzati. Latent guard: a safety framework for text-to-image generation. InEuropean Conference on Computer Vision, pages 93–109. Springer, 2024

  35. [35]

    Sneakyprompt: Jailbreaking text-to-image generative models

    Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE symposium on security and privacy (SP), pages 897–912. IEEE, 2024

  36. [36]

    T2vsafetybench: Evaluating the safety of text-to-video generative models.Advances in Neural Information Processing Systems, 37: 63858–63872, 2024

    Yibo Miao, Yifan Zhu, Lijia Yu, Jun Zhu, Xiao-Shan Gao, and Yinpeng Dong. T2vsafetybench: Evaluating the safety of text-to-video generative models.Advances in Neural Information Processing Systems, 37: 63858–63872, 2024

  37. [37]

    Forget-me-not: Learning to forget in text-to-image diffusion models

    Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1755–1764, 2024

  38. [38]

    Reliable and efficient concept erasure of text-to-image diffusion models

    Chao Gong, Kai Chen, Zhipeng Wei, Jingjing Chen, and Yu-Gang Jiang. Reliable and efficient concept erasure of text-to-image diffusion models. InEuropean Conference on Computer Vision, pages 73–88. Springer, 2024

  39. [39]

    SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

    Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Dennis Wei, and Sijia Liu. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation.arXiv preprint arXiv:2310.12508, 2023

  40. [40]

    The illusion of unlearning: The unstable nature of machine unlearning in text-to-image diffusion models

    Naveen George, Karthik Nandan Dasaraju, Rutheesh Reddy Chittepu, and Konda Reddy Mopuri. The illusion of unlearning: The unstable nature of machine unlearning in text-to-image diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13393–13402, 2025. 11

  41. [41]

    To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images

    Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu. To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now. InEuropean Conference on Computer Vision, pages 385–403. Springer, 2024

  42. [42]

    Cambridge university press, 2012

    Roger A Horn and Charles R Johnson.Matrix analysis. Cambridge university press, 2012

  43. [43]

    nudity” or “violence,

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 12 Appendix A Related Work Prompt and output filtering:The most common defenses operate outside the generative model. Prompt-level classifiers ...