AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

Jian Zhang; Zhijun Zhang

arxiv: 2605.26460 · v1 · pith:BSOFX3A6new · submitted 2026-05-26 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 18:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords concept grounding · MM-DiTs · training-free grounding · concept leakage · attention graph propagation · self-attention refinement · semantic segmentation · multi-concept confusion

The pith

AnchorDiff grounds concepts in MM-DiTs by selecting one high-confidence attention anchor and propagating it over a self-attention graph with cross-object suppression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets concept leakage in training-free grounding for multi-modal diffusion transformers, where attention responses spill from target objects onto visually similar non-targets. AnchorDiff separates initial localization from later refinement by picking a single reliable anchor from the concept-to-image attention map and diffusing it as a seed across a graph built from image self-attention. Within-object connections are strengthened by output-space similarity while a row-wise gate blocks edges between different objects. The method delivers competitive results on ImageNet-Segmentation and PascalVOC while cutting leakage on a new dataset built to expose confusion between similar concepts. Readers should care because it shows how a lightweight graph step can make existing attention maps more selective without any retraining.

Core claim

AnchorDiff decouples semantic localization from structural refinement by selecting a high-confidence anchor from the concept-to-image attention map and propagating it as a one-hot seed over a hybrid graph derived from image-to-image self-attention; output-space similarity drives dense within-object propagation while a row-wise attention gate suppresses cross-object connections, yielding strong grounding performance on ImageNet-Segmentation and PascalVOC together with substantially reduced concept leakage on the Multi-Concept Confusion Dataset.
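
As a minimal sketch of the anchor step in this claim, the snippet below picks the single strongest patch in a concept-to-image attention map and turns it into a one-hot seed. The abstract only says the anchor is "high-confidence"; using a plain argmax, and the name `select_anchor`, are assumptions made here for illustration rather than the authors' stated rule.

```python
import numpy as np

def select_anchor(concept_attn):
    """Return (flat anchor index, one-hot seed) for one concept.

    concept_attn: concept-to-image attention over image patches, any shape.
    Assumption: the anchor is simply the patch with the largest response.
    """
    flat = np.asarray(concept_attn, dtype=np.float64).ravel()
    anchor = int(np.argmax(flat))   # hypothetical "high-confidence" criterion
    seed = np.zeros_like(flat)
    seed[anchor] = 1.0              # one-hot seed located at the anchor patch
    return anchor, seed

# Toy 2x2 attention map; the strongest response is at row 1, column 0.
attn_map = np.array([[0.05, 0.10],
                     [0.70, 0.15]])
anchor, seed = select_anchor(attn_map)
```

Because the seed carries no attention magnitudes beyond the anchor position, any leaked responses elsewhere in the map are discarded before refinement starts, which is the point of decoupling localization from refinement.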

What carries the argument

Anchor-based graph propagation that uses self-attention to build a hybrid graph with similarity-based within-object edges and a row-wise gate to block cross-object edges.
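
The propagation idea can be sketched as follows, under stated assumptions. The abstract does not give the graph formulas, so everything concrete here is hypothetical: cosine similarity of per-patch features stands in for "output-space similarity", a per-row threshold relative to the row maximum stands in for the "row-wise gate", and the names `propagate_from_anchor`, `tau`, and `steps` are invented for the example.

```python
import numpy as np

def propagate_from_anchor(feats, attn, anchor, tau=0.5, steps=3):
    """Spread a one-hot seed over a gated affinity graph (illustrative only).

    feats : (N, D) per-patch features (assumed stand-in for output-space features)
    attn  : (N, N) image-to-image self-attention, one row per query patch
    tau   : assumed gate strength; an edge i->j survives if
            attn[i, j] >= tau * max_j attn[i, j]
    """
    feats = np.asarray(feats, dtype=np.float64)
    attn = np.asarray(attn, dtype=np.float64)

    # Dense affinity from feature similarity (non-negative cosine).
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    affinity = np.clip(unit @ unit.T, 0.0, None)

    # Row-wise gate derived from raw attention; blocks weakly attended edges.
    gate = attn >= tau * attn.max(axis=1, keepdims=True)

    # Hybrid graph: similarity where the gate is open, zero elsewhere.
    weights = affinity * gate
    trans = weights / (weights.sum(axis=1, keepdims=True) + 1e-12)

    scores = np.zeros(len(feats))
    scores[anchor] = 1.0                      # one-hot seed at the anchor
    for _ in range(steps):
        scores = trans @ scores               # each patch averages its neighbors
    return scores / (scores.max() + 1e-12)

# Toy scene: patches 0-2 are the target, 3-5 a visually similar distractor.
feats = np.array([[1.0, 0.0]] * 3 + [[0.9, 0.1]] * 3)
attn = np.full((6, 6), 0.1 / 3)
attn[:3, :3] = 0.3
attn[3:, 3:] = 0.3

gated = propagate_from_anchor(feats, attn, anchor=0)             # gate on
ungated = propagate_from_anchor(feats, attn, anchor=0, tau=0.0)  # gate off
```

In this toy setup the two objects have nearly identical features, so similarity alone spreads the seed onto the distractor, while the gated graph keeps the response on patches 0 to 2. That mirrors the leakage argument only qualitatively; it says nothing about the paper's actual numbers.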

If this is right

The method produces strong object masks on ImageNet-Segmentation and PascalVOC without training.
Concept leakage drops substantially on images containing multiple visually similar objects.
The new Multi-Concept Confusion Dataset provides explicit masks for measuring leakage between confusable concepts.
The approach remains fully training-free and works directly on existing MM-DiT attention maps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchor-plus-graph pattern could be tested on other transformer-based generators to see whether leakage reduction holds beyond DiTs.
If the gate reliably separates objects, it might be combined with existing editing techniques to let users control individual concepts more cleanly during generation.
The dataset construction itself suggests a template for creating harder test cases whenever new models begin to handle multi-object scenes.

Load-bearing premise

A single high-confidence anchor extracted from the concept-to-image attention map is enough to seed accurate separation once it is propagated across the self-attention graph, and the row-wise gate blocks harmful cross-object links without losing needed within-object connections.

What would settle it

Running AnchorDiff on the Multi-Concept Confusion Dataset and finding no measurable reduction in concept leakage and no improvement in segmentation accuracy relative to baseline attention methods.
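
One way such a check could be scored is sketched below. The abstract does not define the paper's leakage metric, so `leakage` here is an assumed definition (the fraction of the predicted target mask that lands on the distractor object's ground-truth mask), and both function names are hypothetical.

```python
import numpy as np

def iou(pred, target):
    """Intersection over union of two binary masks."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    return float(np.logical_and(pred, target).sum() / union) if union else 0.0

def leakage(pred, distractor):
    """Assumed metric: share of predicted pixels that fall on the distractor."""
    pred, distractor = np.asarray(pred, dtype=bool), np.asarray(distractor, dtype=bool)
    total = pred.sum()
    return float(np.logical_and(pred, distractor).sum() / total) if total else 0.0

# Toy 1x6 scene: target occupies the left half, distractor the right half.
target     = np.array([1, 1, 1, 0, 0, 0])
distractor = np.array([0, 0, 0, 1, 1, 1])
leaky_pred = np.array([1, 1, 1, 1, 0, 0])   # spills one pixel onto the distractor
clean_pred = np.array([1, 1, 0, 0, 0, 0])   # stays on the target, misses one pixel

leaky_scores = (iou(leaky_pred, target), leakage(leaky_pred, distractor))  # (0.75, 0.25)
clean_scores = (iou(clean_pred, target), leakage(clean_pred, distractor))  # IoU 2/3, leakage 0
```

Reporting both numbers matters because a mask can trade one for the other: the leaky prediction above has the higher IoU yet is the one that confuses the two objects.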

Figures

Figures reproduced from arXiv: 2605.26460 by Jian Zhang, Zhijun Zhang.

**Figure 1.** Concept leakage in semantic grounding. On visually similar concepts, existing methods suffer from concept leakage, where target responses spill over to non-target objects. AnchorDiff mitigates this by anchoring the target response and propagating it over an object-aware affinity graph.

**Figure 2.** Overview of AnchorDiff. AnchorDiff extracts concept-to-image attention for anchor selection and image-to-image self-attention for graph construction. A one-hot seed at the semantic anchor is propagated over a hybrid graph combining output-space affinity with a row-wise attention gate, producing object-confined masks with reduced concept leakage.

**Figure 3.** Raw image-to-image attention is spatially local. Mean raw attention weights are measured over patch-pair distance bins and decay rapidly on the 64 × 64 grid. (a) Input (b) PCA (c) Clustering

**Figure 5.** Output-space propagation with structural gating. Output-space affinity yields dense same-object responses but also activates similar non-target objects. After gating with W_attn, propagation from a one-hot anchor recovers the target object while reducing cross-object leakage.

**Figure 6.** Representative samples from our Multi-Concept Confusion Dataset. Each image contains …

**Figure 7.** Qualitative results on representative samples from the Multi-Concept Confusion Dataset.

**Figure 8.** Propagation step sensitivity on SD3 and SD3.5

**Figure 9.** Layer-wise sensitivity analysis on the Multi-Concept Confusion Dataset. We evaluate …

**Figure 10.** Additional samples from the Multi-Concept Confusion Dataset. Each image contains two …

**Figure 11.** Additional qualitative comparisons on two-concept samples from the Multi-Concept …

**Figure 12.** Additional qualitative comparisons on two-concept samples from the Multi-Concept …

**Figure 13.** Additional qualitative comparisons on two-concept samples from the Multi-Concept …

**Figure 14.** Additional qualitative comparisons on two-concept samples from the Multi-Concept …

**Figure 15.** Additional qualitative comparisons on three-concept samples from the Multi-Concept …

**Figure 16.** Additional qualitative comparisons on three-concept samples from the Multi-Concept …

**Figure 17.** Additional qualitative comparisons on three-concept samples from the Multi-Concept …

**Figure 18.** Additional qualitative comparisons on three-concept samples from the Multi-Concept …

**Figure 19.** Visualization of graph propagation with and without …

**Figure 20.** Visualization of graph propagation with and without …

**Figure 21.** Visualization of graph propagation with and without …
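
The locality measurement described in the Figure 3 caption (mean raw attention per patch-pair distance bin) can be sketched as follows. The helper names, the equal-width binning, and the synthetic attention matrix are all assumptions for illustration; the toy uses an 8 × 8 grid and made-up attention that decays with distance, not attention extracted from SD3 or any real model.

```python
import numpy as np

def patch_distances(grid):
    """Pairwise Euclidean distances between patch centers on a grid x grid layout."""
    ys, xs = np.divmod(np.arange(grid * grid), grid)
    coords = np.stack([ys, xs], axis=1).astype(np.float64)
    return np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

def mean_attention_by_distance(attn, grid, num_bins=8):
    """Average attention weight within equal-width distance bins (assumed binning)."""
    dist = patch_distances(grid)
    edges = np.linspace(0.0, dist.max(), num_bins + 1)
    # Map each patch pair to a bin; the largest distance goes into the last bin.
    bin_idx = np.clip(np.digitize(dist, edges) - 1, 0, num_bins - 1)
    sums = np.bincount(bin_idx.ravel(), weights=np.asarray(attn).ravel(),
                       minlength=num_bins)
    counts = np.bincount(bin_idx.ravel(), minlength=num_bins)
    return edges, sums / np.maximum(counts, 1)

# Synthetic, row-normalized attention that decays with distance (not model data).
grid = 8
attn = np.exp(-patch_distances(grid))
attn /= attn.sum(axis=1, keepdims=True)

edges, means = mean_attention_by_distance(attn, grid)
```

On real attention maps, a profile that falls off quickly across bins would support the caption's claim that raw self-attention links mostly nearby patches, which is why it is plausible as a gate but too sparse to fill in a whole object by itself.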

Original abstract

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AnchorDiff is a straightforward training-free tweak for cutting concept leakage in MM-DiT attention maps by anchoring and graph propagation, backed by a new confusion dataset.

AnchorDiff picks a high-confidence anchor from the concept-to-image attention map and spreads it as a seed over a hybrid graph built from image-to-image self-attention. The graph uses output-space similarity to fill in within-object regions and a row-wise gate to block cross-object edges.

The approach targets a real, observable failure mode where attention leaks between visually similar objects. Separating the initial localization step from the refinement step is a sensible split, and the new Multi-Concept Confusion Dataset gives a direct way to measure leakage that standard segmentation benchmarks do not provide.

The paper does not supply numbers, ablations, or implementation details in the abstract, so the size of the gains on ImageNet-Segmentation, PascalVOC, and the new dataset cannot be checked. The single-anchor assumption could be brittle if the initial attention map is noisy or if objects have internal structure that the gate might cut.

This is aimed at researchers already working with attention-based grounding or editing in diffusion transformers. Someone who needs a quick, training-free patch for multi-concept scenes would get the most out of the dataset and the propagation idea.

The core claim is internally consistent and the dataset is a concrete addition, so the paper deserves a serious referee who can examine the full experiments and code.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes AnchorDiff, a training-free method for concept grounding in Multi-Modal Diffusion Transformers (MM-DiTs). It decouples semantic localization from structural refinement by selecting a high-confidence anchor from the concept-to-image attention map and propagating it as a one-hot seed over a hybrid graph derived from image-to-image self-attention, using output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. The authors introduce the Multi-Concept Confusion Dataset for explicit evaluation of concept leakage and claim strong grounding performance on ImageNet-Segmentation and PascalVOC while substantially reducing leakage on the new dataset.

Significance. If the empirical claims hold with proper quantification and ablations, the method offers a practical training-free improvement to attention-based grounding in diffusion models by mitigating concept leakage, a common failure mode. The introduction of a dedicated multi-concept dataset is a useful contribution for future benchmarking.

major comments (1)

Abstract: The abstract states performance improvements but supplies no quantitative numbers, error bars, ablation details, or description of how the graph is constructed or how the gate is implemented; only the abstract is available so the central claim cannot be verified.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the referee's review. We address the major comment point-by-point below.

Referee: [—] Abstract: The abstract states performance improvements but supplies no quantitative numbers, error bars, ablation details, or description of how the graph is constructed or how the gate is implemented; only the abstract is available so the central claim cannot be verified.

Authors: We agree that the abstract would be strengthened by including key quantitative results and a brief description of the core components. In the revised manuscript we will update the abstract to report specific metrics (e.g., mIoU on ImageNet-Segmentation and PascalVOC together with the leakage reduction on the Multi-Concept Confusion Dataset) and to concisely describe the anchor selection, hybrid graph construction from self-attention, output-space similarity, and row-wise attention gate. The full paper already contains the detailed numbers, error bars, ablations, and implementation details in Sections 3 and 4; the abstract revision will make the central claims verifiable without requiring the reader to consult the body.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes a training-free algorithmic method that operates directly on attention maps produced by an unmodified MM-DiT backbone. No equations, parameters, or performance quantities are defined in terms of themselves, fitted to a subset and then re-predicted, or justified solely by self-citation chains. The central claims rest on empirical evaluation against external benchmarks (ImageNet-Segmentation, PascalVOC, and the introduced Multi-Concept Confusion Dataset) rather than any derivation that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on the existence of useful structure in the model's attention maps.

pith-pipeline@v0.9.1-grok · 5700 in / 1178 out tokens · 23677 ms · 2026-06-29T18:45:58.093332+00:00 · methodology
