Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

Dongwoo Kim; Saemi Moon; Seoyeon Lee; Suhyeon Jun

arxiv: 2605.25765 · v1 · pith:IFWOCJOLnew · submitted 2026-05-25 · 💻 cs.CV · cs.AI· cs.LG

Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

Saemi Moon , Suhyeon Jun , Seoyeon Lee , Dongwoo Kim This is my paper

Pith reviewed 2026-06-29 22:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords concept unlearningdiffusion modelscross-attentiontext-to-imagemachine unlearningadversarial prompts

0 comments

The pith

Representing concepts via cross-attention activations during denoising enables closed-form unlearning that resists paraphrased prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing closed-form unlearning methods fail on paraphrased prompts because they edit only from the text encoder's response to a few anchor prompts. It argues instead that the target concept should be captured in the cross-attention activation space inside the U-Net, where the model decides what to render. PURE constructs forget and retain bases from activations collected along a short denoising trajectory and applies one linear projector to the key and value weights. On a benchmark with ten concepts spanning styles, IP, celebrities, and NSFW content, this reduces leakage under paraphrased and adversarial prompts while keeping retain concepts close to the original model.

Core claim

PURE builds forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights, yielding the best overall forget-retain trade-off among evaluated methods by significantly reducing target leakage under paraphrased and adversarial prompts.

What carries the argument

The single linear projector applied to cross-attention key and value weights, with bases derived from per-layer activations along a short denoising trajectory rather than text-encoder embeddings.

If this is right

The edit generalizes to prompts that paraphrase or avoid naming the target concept.
Retain concepts stay close in performance to the unedited model across the tested categories.
The method adds no inference-time cost because it is a one-time closed-form weight change.
The approach covers artistic style, intellectual property, celebrity, and NSFW targets in one framework.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Activation-space editing may apply to other attention-based generative models beyond diffusion.
Collecting activations at more points along the trajectory could further improve robustness.
The same projection idea might help identify and edit unintended concepts discovered after training.

Load-bearing premise

The target concept is more reliably represented by cross-attention activations captured along a short denoising trajectory than by the text encoder's response to a few short anchor prompts and their paraphrases.

What would settle it

An experiment showing that paraphrased or adversarial prompts still produce the target concept at rates comparable to text-encoder-based methods on the same benchmark would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2605.25765 by Dongwoo Kim, Saemi Moon, Seoyeon Lee, Suhyeon Jun.

**Figure 2.** Figure 2: Binary-probing recall on natural prompts across categories (↑). For each forget concept we construct two candidate bases that differ only in the features fed to SVD. The text basis uses the text-encoder embedding of each anchor in Af ; the activation basis uses the spatial mean-pooled crossattention activation. In both cases we apply SVD to the resulting feature matrix, keep the top right singular vector… view at source ↗

**Figure 3.** Figure 3: Qualitative comparison on HUB forget and retain prompts. Each pair of consecutive rows [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Target proportion (↓) and within-category retention (↑) on Pikachu as the forget and retain anchor set sizes vary. Dashed lines denote the SD reference. Text basis Forget Pikachu Retain Mario SD Activation basis [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative comparison as the forget anchor set size [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparison as the retain anchor set size [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Additional qualitative examples of Style. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Additional qualitative examples of IP. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Additional qualitative examples of Celebrity. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

read the original abstract

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrase the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper moves concept unlearning into cross-attention activation space collected along the denoising path and reports stronger paraphrase robustness than text-embedding baselines, but the short-trajectory sampling choice remains lightly justified.

read the letter

The main point is that PURE represents the forget concept through per-layer cross-attention activations sampled along a short denoising trajectory rather than through text-encoder outputs to a few anchor prompts. This produces a linear projector applied once to the key and value weights, and the abstract claims it yields the best forget-retain trade-off on a ten-concept benchmark that includes styles, IP, celebrities, and NSFW material, with reduced leakage on paraphrased and adversarial prompts.

The shift to activation space is the actual novelty. Text embeddings describe the input prompt; activations describe what the model is about to draw, so they can capture rendering behavior that paraphrases do not trigger in the text encoder. Collecting them during denoising is a direct way to get that signal without retraining.

The results look useful on the surface because prior closed-form methods are known to leak under prompt variation, and this one appears to close more of that gap while leaving retain performance close to the original model.

The soft spot is the sampling assumption. A short trajectory covers only a narrow slice of the denoising path, and nothing in the abstract shows that the collected activations span the variations that appear at other timesteps or under prompts that steer the path differently. Without an ablation on trajectory length or timestep selection, it is unclear whether the reported gains are stable or tied to the particular short window they chose. The lack of statistical detail on the benchmark numbers also makes it harder to judge how reliable the ranking is.

This is for groups working on practical safety edits to diffusion models. A reader who needs a closed-form method that handles paraphrase leakage better than existing baselines will find the core move worth examining. It deserves a serious referee because the technical framing is new enough and the empirical claim is concrete enough to test, even if the sampling justification needs more work.

Referee Report

3 major / 2 minor

Summary. The paper proposes PURE, a closed-form method for erasing target concepts from pretrained text-to-image diffusion models. It constructs forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory (rather than text-encoder embeddings from anchor prompts), then applies a single linear projector to the cross-attention K/V weights. On a holistic benchmark with ten concepts spanning artistic style, IP, celebrity, and NSFW categories, the method is claimed to reduce target leakage under paraphrased and adversarial prompts while preserving retain concepts near the unedited baseline, yielding the best overall forget-retain trade-off among compared approaches.

Significance. If the central empirical claim holds after addressing sampling and reporting gaps, the work supplies a practical, inference-free edit that improves paraphrase robustness over prior closed-form unlearning techniques. The shift from text-embedding to activation-space representation is a clear conceptual contribution for safety-oriented editing of generative models, and the multi-category benchmark evaluation provides a useful stress test of generalization.

major comments (3)

[§3] §3 (Method, trajectory description): the central claim that activations along a short denoising trajectory suffice to span a paraphrase-robust forget basis is load-bearing, yet the manuscript provides no analysis, ablation, or justification of trajectory length, timestep selection, or coverage of the activation distribution; this directly engages the sampling concern that residual leakage may remain when concept signatures appear outside the sampled slice.
[Experiments / Results] Experiments / Results section (benchmark reporting): the abstract and results claim statistically significant improvements and best overall trade-off, but supply no quantitative details on statistical significance testing, exact baseline re-implementations, variance across runs, or ablation of the activation-capture procedure itself; without these, the strength of the forget-retain superiority cannot be assessed.
[§4] §4 (Evaluation protocol): the paper does not report whether the short trajectory was held fixed across all ten concepts or tuned per concept; if the latter, the method is no longer parameter-free in the sense advertised and the cross-concept comparison is weakened.

minor comments (2)

[Method] Notation: the precise definition of the per-layer activation matrix (e.g., concatenation over heads or timesteps) is not stated explicitly enough to allow immediate reproduction from the equations alone.
[Figure 2] Figure clarity: the activation-projection diagram would benefit from an explicit indication of which timesteps are sampled and how the forget/retain bases are formed from them.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and outline the revisions we will make to strengthen the paper.

read point-by-point responses

Referee: [§3] §3 (Method, trajectory description): the central claim that activations along a short denoising trajectory suffice to span a paraphrase-robust forget basis is load-bearing, yet the manuscript provides no analysis, ablation, or justification of trajectory length, timestep selection, or coverage of the activation distribution; this directly engages the sampling concern that residual leakage may remain when concept signatures appear outside the sampled slice.

Authors: We agree that the manuscript would benefit from additional analysis on the trajectory. The choice of a short denoising trajectory (typically 5 timesteps) is motivated by the observation that cross-attention activations in early denoising steps capture the core concept rendering, as later steps focus on details. However, to address the concern, we will include an ablation study varying the trajectory length and timestep selection in the revised version, demonstrating that performance stabilizes beyond a certain short length. This will also include discussion on coverage of the activation distribution. revision: yes
Referee: Experiments / Results section (benchmark reporting): the abstract and results claim statistically significant improvements and best overall trade-off, but supply no quantitative details on statistical significance testing, exact baseline re-implementations, variance across runs, or ablation of the activation-capture procedure itself; without these, the strength of the forget-retain superiority cannot be assessed.

Authors: We acknowledge that the current manuscript lacks detailed statistical reporting. In the revision, we will add: (1) results of statistical significance tests (e.g., paired t-tests) comparing PURE to baselines, (2) exact details on how baselines were re-implemented, (3) variance (standard deviation) across 3-5 independent runs, and (4) an ablation on the activation-capture procedure (e.g., number of steps, layers). These additions will allow better assessment of the superiority claim. revision: yes
Referee: [§4] §4 (Evaluation protocol): the paper does not report whether the short trajectory was held fixed across all ten concepts or tuned per concept; if the latter, the method is no longer parameter-free in the sense advertised and the cross-concept comparison is weakened.

Authors: The short trajectory parameters (timesteps and length) were held fixed across all ten concepts to ensure the method remains parameter-free and comparisons are fair. We will explicitly report this fixed setting in the revised §4, along with the specific values used (e.g., timesteps 0-4 or similar), to clarify the evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: direct linear projection from observed activations

full rationale

The paper presents PURE as a closed-form linear projector computed directly from per-layer cross-attention activations collected along a short denoising trajectory. No equation reduces the forget/retain bases or the resulting performance to a quantity fitted inside the same experiment, nor does any load-bearing step rely on self-citation chains or imported uniqueness theorems. The derivation is self-contained against external benchmarks, consistent with the reader's assessment of score 2.0 (minor at most).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that cross-attention activations generalize to paraphrases better than text embeddings; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Cross-attention activations captured along a short denoising trajectory represent the target concept more robustly than text-encoder outputs to anchor prompts.
This premise is stated as the key motivation and is required for the projection to succeed on paraphrased prompts.

pith-pipeline@v0.9.1-grok · 5776 in / 1197 out tokens · 30142 ms · 2026-06-29T22:41:57.391198+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models

Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, and Yang Zhang. Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 3403–3417,

2023
[2]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with CLIP latents.arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Patrick Schramowski, Christopher Tauchmann, and Kristian Kersting. Can machines help us answer- ing question 16 in datasheets, and in turn reflecting on inappropriate content? InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1350–1361,

2022
[4]

{c}”, “painting by {c}

From these categories, we select ten forget concepts for evaluation: Van Gogh, Picasso, and Frida Kahlo from Style; Mickey Mouse, Pikachu, and Buzz Lightyear from IP; Emma Watson, Elon Musk, and Taylor Swift fromCelebrity; and Nudity fromNSFW. Table 3: Concepts included in each HUB category. Category Concepts Style Andy Warhol, Auguste Renoir, Claude Mone...

2023

[1] [1]

Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models

Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, and Yang Zhang. Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 3403–3417,

2023

[2] [2]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text- conditional image generation with CLIP latents.arXiv preprint arXiv:2204.06125,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Patrick Schramowski, Christopher Tauchmann, and Kristian Kersting. Can machines help us answer- ing question 16 in datasheets, and in turn reflecting on inappropriate content? InProceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1350–1361,

2022

[4] [4]

{c}”, “painting by {c}

From these categories, we select ten forget concepts for evaluation: Van Gogh, Picasso, and Frida Kahlo from Style; Mickey Mouse, Pikachu, and Buzz Lightyear from IP; Emma Watson, Elon Musk, and Taylor Swift fromCelebrity; and Nudity fromNSFW. Table 3: Concepts included in each HUB category. Category Concepts Style Andy Warhol, Auguste Renoir, Claude Mone...

2023