arXiv: 2601.06338 · v2 · submitted 2026-01-09 · cs.AI · cs.CV · cs.LG

Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

Binxu Wang, Jingxuan Fan, Xu Pan

Pith reviewed 2026-05-16 15:30 UTC · model grok-4.3

classification cs.AI · cs.CV · cs.LG

keywords diffusion transformers · mechanistic interpretability · spatial relations · text-to-image generation · cross-attention · text encoders · circuit analysis

The pith

Diffusion transformers use different internal circuits for spatial relations depending on the text encoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains small DiTs from scratch to generate images of two objects with specified attributes and positions. It shows that models reach near-perfect accuracy in all cases, yet the flow of spatial information from text to image tokens takes radically different paths. With random text embeddings the circuit splits the work across two cross-attention heads that read relation and attributes separately. With a pretrained T5 encoder the model instead fuses both pieces of information inside a single text token and reads them together. These mechanistic differences produce similar in-domain performance but different robustness to prompt changes.

Core claim

Although all models learn the task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, the spatial-relation information is passed to image tokens through a two-stage circuit involving two cross-attention heads that separately read the spatial relation and single-object attributes. When using a pretrained T5 encoder, the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token.

What carries the argument

Two distinct circuits that route spatial-relation information from text tokens to image tokens: a two-stage cross-attention pathway versus a fused representation inside a single text token.

Load-bearing premise

The circuits identified in small models trained from scratch accurately reflect or inform the mechanisms inside large-scale pretrained diffusion transformers used in applications.

What would settle it

Probing a large pretrained DiT on the same two-object spatial task and checking whether the two-stage cross-attention pattern or the single-token fusion pattern appears.

Figures

Figures reproduced from arXiv: 2601.06338 by Binxu Wang, Jingxuan Fan, Xu Pan.

**Figure 1.** Schematics of the model and task. Our T2I model architecture adopted the design of PixArt [5]. There are three main components: the text encoder that processes tokenized natural language prompts into text embeddings, the VAE that processes image inputs into image tokens, and the Diffusion Transformer (DiT) which is the backbone of the denoising diffusion process. The text information routes through the cr…

**Figure 2.** Training dynamics of the T2I models (DiT-B). A. and B. Both models trained with random token embedding (RTE) and T5 can achieve good accuracy on the task. Solid lines show the results of models using exponential moving averaged (EMA) weights, while dashed lines show the non-averaged weights. C. The task is learned in distinct stages. In both models, they first learn to generate objects but with wrong attri…

**Figure 3.** Illustration of our methods to find relevant heads. A. Attention synopsis: The giant attention tensor is first reduced to the attention between two groups of tokens of interest (e.g. the relation token regardless of the specific words, or an object token regardless of where or what it is). Then the reduced attention tensor is averaged over diffusion time steps, resulting in a layer × head map which we use to pinpoint rel…

**Figure 4.** The spatial relation heads in random-embedding-based DiT. A. We find specialized cross-attention heads that contribute to the object image tokens (top: object1 in the text; bottom: object2 in the text) attending to the relation text tokens. B. We show the activation of this head across image tokens and sampling steps. The map for the composite relation “below and right” decomposes cleanly as the …

**Figure 5.** The object generation heads in random-embedding-based DiT. A. Specialized heads emerge in cross-attention synopses, by summarizing strength from image tokens of each object to its own shape token. B. Activation of this head (L4H3) across image tokens and sampling steps for the prompt “red square is below and to the right of the blue circle”: tokens at the eventual square location attend to “square,” while…

**Figure 6.** Schematics of the object relation circuit in DiT trained with random embedding tokens with matching tags attend to shape token of the corresponding object. The VO circuit maps these text features

**Figure 7.** Mechanism for relational generation in T5-DiT. A. T5-based DiT is robust to attention ablation of relation word, but most sensitive to shape2 and EOS. B. Weight space screening for spatial relation heads via projection score, and its corresponding spatial gradients (L3H7). C. Vector arithmetic on factorized word embedding causally affects generated object relation. spatial relationships in generated images…

**Figure 8.** Benchmark scores of spatial relationship and object feature attributes of open- and closed-source models. Dot color denotes the text encoder

**Figure 9.** Comparison of training dynamics of DiT models with different text encoding and scale. Specific evaluation prompt used was “blue triangle is to the upper left of red square”, sampled with 14 steps at cfg 4.5, sampled from the same noise seed. Further, T5 models immediately learn to achieve object attribute binding after learning attributes themselves, while the random embedding model (RTE) gradually learns the c…

**Figure 10.** Observation on sampling dynamics. Specific evaluation prompt used was “the red square is above and to the right of the blue circle”, sampled with 14 steps at cfg 4.5. Model used is RTE x DiT-B. A transition can be seen at steps 4-6, where the two objects at their final positions can be clearly seen from the expected outcome $G(\hat{x}_0(x_t))$

**Figure 11.** Cross attention energy during sampling dynamics. Highlighting the smoothly varying attention strength, and the salient contribution of the L2H8 head from the first object to relation words in RTE x DiT-B

**Figure 12.** Cross attention energy summary (max over time) for the specific prompt above

**Figure 13.** Attention Synopsis for Shape to Relation word for RTE x DiT-mini, showing it’s invariant to the specific shape of object1

**Figure 14.** Attention Synopsis for Shape to Relation word for RTE x DiT-mini, showing it’s invariant to the specific spatial relation and phrasing

**Figure 15.** Attention Synopsis for Shape to Relation word for RTE x DiT-micro, showing it’s invariant to the specific shape of object1

**Figure 16.** Attention Synopsis for Shape to Relation word for RTE x DiT-micro, showing it’s invariant to the specific spatial relation and phrasing

**Figure 17.** Attention Synopsis for Shape to corresponding shape word for RTE x DiT-mini

**Figure 18.** Attention Synopsis for Shape to corresponding Shape word for RTE x DiT-micro

**Figure 19.** Attention Synopsis for Shape to Relation word for T5 x DiT-B. The pattern is much less clear than RTE

**Figure 20.** Attention Synopsis for Shape to Relation word for T5 x DiT-B. The pattern is much less clear than RTE

**Figure 21.** Dimension reduction visualization of shape2 token representation (PCA, tSNE, UMAP). Top row: T5 contextual embedding (4096d), Bottom row: Caption projection (784d) using MLP from T5 x DiT-B

**Figure 22.** Weight-space relation head screening for RTE-DiT (B, mini, micro, nano). Each column shows a different alignment metric (cosine, projection, energy); each row shows a different model

**Figure 23.** Weight-space relation head screening for T5-DiT (shape2 token) (B, mini, with/without weight decay). Each column shows a different alignment metric (cosine, projection, energy); each row shows a different model. Note this one uses the spatial feature vectors from variance partitioning on shape2 token

**Figure 24.** Weight-space relation head screening for T5-DiT (shape1 token) (B, mini, with/without weight decay). Each column shows a different alignment metric (cosine, projection, energy); each row shows a different model. Note this one uses the spatial feature vectors from variance partitioning on shape1 token

**Figure 25.** Weight-space relation head screening for CLIP-DiT (B, mini). Each column shows a different alignment metric (cosine, projection, energy); each row shows a different model

**Figure 26.** Evaluation of model performance on trained and generalized prompt template

**Figure 27.** T5-DiT Robustness and Circuit Analysis. (A) The change of contextual embedding of the shape2 token when the word “the” is added interferes with the 8 relation feature directions, esp. positively with lower left and upper right feature. (B) Unnormalized attention maps of the relation head (layer 3, head 7) before and after the prompt perturbation. (C) The generated spatial relationship, i.e. the displacement…

**Figure 28.** Prompt generalization behavior of RTE-DiT and T5-DiT. Each row shows a different model, each column shows a different prompt template. In each panel, we plot the displacement between the two objects (Dx, Dy) parsed from the generated images; prompts are represented by the color (relationship), size (color1) and marker type (shape1). One can see the addition of “the” at different positions (before first or…

**Figure 29.** Circuit dissection of PixArt with CLIP text encoder. (A) General evaluation of CLIP-DiT-mini model on training and generalized prompt templates, showing it is relatively robust to prompt perturbation, compared to T5-DiT. (B) Attention masking analysis showing CLIP-DiT is robust to attention masking of most words (including the relation word), but is most sensitive to shape2 and slightly to shape1, similar…

**Figure 30.** Pre-trained T2I models. (A) The prompt set construction and evaluation pipeline. (B) Object and relation accuracy across various object pairs for the PixArt-Sigma model. (C) A text token ablation analysis demonstrating how masking specific tokens affects object and relation accuracy. (D) Projection scores used to identify salient spatial relation heads within the model’s layers. To evaluate pre-trained mode…
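
The attention synopsis described in the Figure 3 caption (restrict the attention tensor to two token groups, then average over diffusion steps to get a layer × head map) can be illustrated with a small sketch. This is a minimal pure-Python illustration under assumptions, not the authors' implementation: the tensor layout `attn[step][layer][head][query][key]`, the choice to sum over the key group and average over the query group, and the names `attention_synopsis` and `top_head` are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed layout: attn[step][layer][head][query][key] holds the cross-attention
# weight from image token `query` to text token `key` at one sampling step.
AttentionTensor = List[List[List[List[List[float]]]]]


def attention_synopsis(
    attn: AttentionTensor,
    query_idx: Sequence[int],
    key_idx: Sequence[int],
) -> List[List[float]]:
    """Return a layer x head map of attention from a query group to a key group.

    For each step, layer, and head, the attention mass on the key group is
    summed, averaged over the query group, and then averaged over steps.
    """
    if not attn or not query_idx or not key_idx:
        raise ValueError("attention tensor and token groups must be non-empty")
    n_steps = len(attn)
    n_layers = len(attn[0])
    n_heads = len(attn[0][0])
    synopsis = [[0.0] * n_heads for _ in range(n_layers)]
    for step in attn:
        for layer in range(n_layers):
            for head in range(n_heads):
                weights = step[layer][head]
                mass = sum(weights[q][k] for q in query_idx for k in key_idx)
                synopsis[layer][head] += mass / len(query_idx) / n_steps
    return synopsis


def top_head(synopsis: List[List[float]]) -> Tuple[int, int]:
    """Return the (layer, head) index with the largest synopsis value."""
    _, layer, head = max(
        (value, layer, head)
        for layer, row in enumerate(synopsis)
        for head, value in enumerate(row)
    )
    return layer, head
```

Calling `attention_synopsis` with the image-token indices of one object as `query_idx` and the relation-word positions as `key_idx` would give one such map; `top_head` then picks the candidate head to inspect further.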

Original abstract

Diffusion Transformers (DiTs) have greatly advanced text-to-image generation, but models still struggle to generate the correct spatial relations between objects as specified in the text prompt. In this study, we adopt a mechanistic interpretability approach to investigate how a DiT can generate correct spatial relations between objects. We train, from scratch, DiTs of different sizes with different text encoders to learn to generate images containing two objects whose attributes and spatial relations are specified in the text prompt. We find that, although all the models can learn this task to near-perfect accuracy, the underlying mechanisms differ drastically depending on the choice of text encoder. When using random text embeddings, we find that the spatial-relation information is passed to image tokens through a two-stage circuit, involving two cross-attention heads that separately read the spatial relation and single-object attributes in the text prompt. When using a pretrained text encoder (T5), we find that the DiT uses a different circuit that leverages information fusion in the text tokens, reading spatial-relation and single-object information together from a single text token. We further show that, although the in-domain performance is similar for the two settings, their robustness to out-of-domain perturbations differs, potentially suggesting the difficulty of generating correct relations in real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Small scratch-trained DiTs use different circuits for spatial relations depending on the text encoder, but the link to real models is missing.

The paper's main finding is that when you train small Diffusion Transformers from scratch to generate images with two objects and specified spatial relations, the internal mechanisms for processing the text prompt depend heavily on the text encoder used. With random embeddings, the model routes spatial info through two separate cross-attention heads in sequence. With a pretrained T5 encoder, it instead fuses the relation and object details inside the text token representations before passing to the image tokens.

This comparison is the new part. Prior work has noted spatial failures in text-to-image models, but this study isolates how the conditioning pathway changes with encoder type and shows the downstream effect on robustness. Both versions hit near-perfect accuracy on the in-distribution task, which lets the difference in circuits stand out clearly. The out-of-domain perturbation tests add a practical angle by showing one circuit holds up better than the other.

The main soft spot is the jump to explaining real-world DiT behavior. The models here are small and trained on a narrow synthetic dataset. There's no follow-up work applying the same circuit analysis to large pretrained models or even fine-tuned versions. That leaves the claim about why production models struggle with spatial relations as an untested hypothesis rather than a demonstrated result. I also want to see more on how the circuits were identified. Things like which heads were ablated, what the activation patterns looked like, and whether there were controls for other possible pathways would strengthen the mechanistic claims.

Readers working on interpretability of generative transformers or on fixing specific failure modes in diffusion models will find this useful as a case study. It is not a broad theoretical advance but a targeted dissection that could inform targeted interventions.

Overall, this deserves peer review. The controlled experiments are clean and the mechanism difference is worth getting on record with referee feedback on the methods and the scope of the conclusions.

Referee Report

3 major / 1 minor

Summary. The paper trains small Diffusion Transformers (DiTs) from scratch on a synthetic two-object spatial-relation image generation task using either random text embeddings or a pretrained T5 encoder. It claims that all models reach near-perfect in-domain accuracy, yet the underlying circuits differ sharply: random embeddings induce a two-stage cross-attention pathway in which two heads separately read spatial relations and object attributes, while T5 induces information fusion within text tokens so that a single token supplies both relation and attribute information. The work further reports differing robustness to out-of-domain perturbations and suggests these mechanisms may help explain spatial-relation failures in large pretrained DiTs.

Significance. If the circuit identifications prove reproducible, the controlled comparison supplies concrete mechanistic evidence that text-encoder choice alters how spatial information is routed inside DiTs. This is a useful existence proof that distinct circuits can solve the same task and that out-of-domain robustness can diverge even when in-domain accuracy is matched. The absence of any bridging experiments to large-scale or pretrained models, however, confines the result to the small-scratch regime and weakens its immediate relevance to deployed text-to-image systems.

major comments (3)

[Methods] Methods section: the manuscript provides no description of the circuit-discovery procedure (activation patching, attention-head ablation, causal interventions, or quantitative metrics) used to identify the two-stage cross-attention pathway or the single-token fusion mechanism. Without these details the central claim that the circuits are distinct cannot be evaluated.
[Results] Results and §4: no error bars, run-to-run variance, or statistical controls accompany the “near-perfect accuracy” figures or the reported circuit differences. This omission is load-bearing because the paper’s strongest claim is that mechanisms differ despite matched performance.
[Discussion] Discussion: the suggestion that the observed circuits “may indicate why real-world DiTs struggle with spatial relations” is unsupported; all experiments use small models trained from scratch, and no ablation, head-role comparison, or circuit analysis on any pretrained or large-scale DiT checkpoint is reported.
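
The text-token ablation referred to in these comments, and in the attention-masking analyses of the Figure 7 and Figure 29 captions, can be illustrated on a single query. The sketch below is a generic illustration under assumptions, not the procedure used in the paper: it removes selected text-token positions from one image token's cross-attention by setting their logits to negative infinity before the softmax. The function names and the toy logits in the usage note are hypothetical.

```python
import math
from typing import Collection, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one query's attention logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(
    logits: Sequence[float], masked_keys: Collection[int]
) -> List[float]:
    """Attention weights for one query after ablating the given key positions.

    Masked positions get a logit of -inf, so they receive exactly zero weight
    and the remaining weights are renormalized to sum to one.
    """
    if all(k in masked_keys for k in range(len(logits))):
        raise ValueError("at least one key position must stay unmasked")
    masked = [
        -math.inf if k in masked_keys else x for k, x in enumerate(logits)
    ]
    return softmax(masked)
```

For example, `masked_attention([1.0, 2.0, 3.0], {2})` zeroes the weight on position 2 and redistributes it over positions 0 and 1. Comparing generated images with and without such a mask is one way to test whether a given word is read directly by the image tokens.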

minor comments (1)

[Figures] Figure captions and axis labels should explicitly state the number of training runs and random seeds underlying each plotted accuracy or attention value.
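
The run-to-run reporting requested in the comments above amounts to a mean and a sample standard deviation over independent training seeds. A minimal sketch, assuming accuracies are stored per seed in a dictionary; the function name `summarize_runs` is hypothetical and no numbers from the paper are used.

```python
from statistics import mean, stdev
from typing import Dict, Tuple


def summarize_runs(accuracy_by_seed: Dict[int, float]) -> Tuple[float, float, int]:
    """Return (mean, sample standard deviation, number of seeds).

    At least two seeds are needed for a sample standard deviation.
    """
    values = list(accuracy_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to report variance")
    return mean(values), stdev(values), len(values)


def format_summary(accuracy_by_seed: Dict[int, float]) -> str:
    """Format the summary as 'mean ± std (n = seeds)' for a caption or table."""
    avg, spread, n = summarize_runs(accuracy_by_seed)
    return f"{avg:.3f} ± {spread:.3f} (n = {n} seeds)"
```

Stating the seed count alongside the spread, as `format_summary` does, also addresses the request that captions report how many runs underlie each plotted value.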

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback. We address each major comment below with clarifications and proposed revisions to strengthen the manuscript while remaining within the scope of our controlled small-scale experiments.

Referee: [Methods] Methods section: the manuscript provides no description of the circuit-discovery procedure (activation patching, attention-head ablation, causal interventions, or quantitative metrics) used to identify the two-stage cross-attention pathway or the single-token fusion mechanism. Without these details the central claim that the circuits are distinct cannot be evaluated.

Authors: We agree that the Methods section requires a more explicit description of the circuit identification techniques. In the revised manuscript we will add a dedicated subsection detailing the activation patching protocol, attention-head ablation experiments, causal interventions, and quantitative metrics (including logit-difference scores and attention-map consistency measures) used to establish the two-stage cross-attention circuit for random embeddings versus the single-token fusion circuit for T5. This addition will include procedural steps and representative examples to allow full evaluation of the claims.
revision: yes

Referee: [Results] Results and §4: no error bars, run-to-run variance, or statistical controls accompany the “near-perfect accuracy” figures or the reported circuit differences. This omission is load-bearing because the paper’s strongest claim is that mechanisms differ despite matched performance.

Authors: We accept that variance reporting is necessary to support the claim of matched in-domain performance with distinct mechanisms. The revised Results section and §4 will include error bars (standard deviation across at least three independent training seeds per configuration) for accuracy metrics and circuit-identification outcomes. We will also add a brief statistical summary confirming that circuit differences remain consistent across runs.
revision: yes

Referee: [Discussion] Discussion: the suggestion that the observed circuits “may indicate why real-world DiTs struggle with spatial relations” is unsupported; all experiments use small models trained from scratch, and no ablation, head-role comparison, or circuit analysis on any pretrained or large-scale DiT checkpoint is reported.

Authors: We recognize that our experiments are limited to small models trained from scratch and that direct evidence on large pretrained DiTs is absent. The original phrasing presented the link as a hypothesis motivated by the observed out-of-domain robustness differences. In revision we will reword the Discussion to frame this explicitly as a speculative direction for future work, add a limitations paragraph stating the small-model scope, and remove any implication of direct applicability to deployed systems.
revision: partial

standing simulated objections not resolved

We cannot perform circuit analysis on large-scale pretrained DiT checkpoints, as this would require access to model weights and computational resources beyond the controlled small-scale setting of the current study.

Circularity Check

0 steps flagged

No circularity in empirical circuit discovery

The paper trains DiTs from scratch on a synthetic two-object spatial-relation task and identifies circuits via direct mechanistic analysis of attention heads and information flow. All claims rest on observed model behavior after training, head ablations, and comparisons between random embeddings versus T5 encoders; no parameters are fitted to the target result, no self-citations justify core premises, and no equations or definitions reduce the findings to their own inputs by construction. The derivation chain is therefore self-contained empirical observation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that mechanistic interpretability correctly isolates the functional circuits and that small from-scratch models reveal principles relevant to the broader field.

axioms (1)

domain assumption: Mechanistic interpretability techniques can accurately identify the functional circuits in the trained DiT models responsible for spatial relations. The paper relies on this to claim specific circuits like two-stage cross-attention or text-token fusion.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 5 internal anchors

[1] Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic Interpolants: A Unifying Framework for Flows and Diffusions, 2023.

[2] Gary Bradski. The OpenCV library. Dr. Dobb’s Journal of Software Tools, 2000.

[3] Agneet Chatterjee, Gabriela Ben Melech Stan, Estelle Aflalo, Sayak Paul, Dhruba Ghosh, Tejas Gokhale, Ludwig Schmidt, Hannaneh Hajishirzi, Vasudev Lal, Chitta Baral, and Yezhou Yang. Getting it right: Improving spatial consistency in text-to-image models, 2024.

[4] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models, 2023.

[5] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis, 2023.

[6] Colin Conwell and Tomer Ullman. Testing Relational Understanding in Text-Guided Image Generation, 2022.

[7] Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. https://arxiv.org/abs/2105.05233v4, 2021.

[8] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. … 2021.

[9] Thomas Fel, Binxu Wang, Michael A Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S Lubana, Talia Konkle, Demba Ba, et al. Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry. arXiv preprint arXiv:2510.08638.

[10] Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment, 2023.

[11] Xu Han, Linghao Jin, Xiaofeng Liu, and Paul Pu Liang. Progressive compositionality in text-to-image generative models. In The Thirteenth International Conference on Learning Representations, 2025.

[12] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.

[13] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[15] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. In Advances in Neural Information Processing Systems, pages 8633–8646. Curran Associates, Inc., 2022.

[16] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.

[17] Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, (01):1–17, 2025.

[18] Raphi Kang, Yue Song, Georgia Gkioxari, and Pietro Perona. Is CLIP ideal? No. Can we fix it? Yes!, 2025.

[19] Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and Improving the Training Dynamics of Diffusion Models, 2024.

[20] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation, 2023.

[21] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling, 2023.

[22] Bingyan Liu, Chengyu Wang, Tingfeng Cao, Kui Jia, and Jun Huang. Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing, 2024. arXiv:2403.03431 [cs].

[23] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.

[24] Maya Okawa, Ekdeep Singh Lubana, Robert P. Dick, and Hidenori Tanaka. Compositional Abilities Emerge Multiplicatively: Exploring Diffusion Models on a Synthetic Task.

[25] Core Francisco Park, Maya Okawa, Andrew Lee, Hidenori Tanaka, and Ekdeep Singh Lubana. Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space, 2024.

[26] Quynh Phung, Songwei Ge, and Jia-Bin Huang. Grounded text-to-image synthesis with attention refocusing, 2023.

[27] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2023.

[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, 2022.

[29] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[30] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, pages 2256–2265, Lille, France, 2015. PMLR.

[31] Raphael Tang, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Jimmy Lin, and Ferhan Ture. What the DAAM: Interpreting stable diffusion using cross attention. arXiv preprint arXiv:2210.04885, 2022.

[32] Martin Wattenberg and Fernanda B. Viégas. Relational Composition in Neural Networks: A Survey and Call to Action.

[33] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, and Song Han. SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers, 2024.

[34] Gaoyang Zhang, Bingtao Fu, Qingnan Fan, Qi Zhang, Runxing Liu, Hong Gu, Huaqi Zhang, and Xinguo Liu. Compass: Enhancing spatial understanding in text-to-image diffusion models, 2024.

A. Extended Results

A.1. Evaluation and Benchmark

[Figure residue: scatter plot with axes Mean(single: color, shape, texture) and Mean(spatial: 2D, 3D), each from 0.0 to 1.0; labeled points include SD v1-4, SD v2, …]

[35] Shuffle the labels of factor $f$: $\tilde{y}^{(f)}_i = y^{(f)}_{\pi(i)}$.

[36] Reconstruct the permuted design $Z^{(f)}_{\pi}$ and the full design $Z_{\pi} = [Z^{(f)}_{\pi}, Z_{-f}]$.

[37] Form the permuted projector $P_{\mathrm{all},\pi}$ and compute

$$SS^{\mathrm{part}}_{f,\pi} = \operatorname{tr}\big(A\,(P_{\mathrm{all},\pi} - P_{-f})\big). \tag{19}$$

Given $N_{\mathrm{perm}}$ permutations, we define the permutation $p$-value as

$$p_f = \frac{1 + \#\{\pi : SS^{\mathrm{part}}_{f,\pi} \ge SS^{\mathrm{part}}_{f}\}}{1 + N_{\mathrm{perm}}}. \tag{20}$$

Table 4. Example variance partitioning results for representational factors of T5 contextual word vector. The model achieves a total explained variance of $R^2$...
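
The permutation p-value of Eq. (20) can be sketched in a few lines. This is a simplified stand-in under stated assumptions, not the paper's procedure: instead of the projector-based partial sum of squares of Eq. (19), it uses the between-group sum of squares of a single factor on scalar data, and it shuffles that factor's labels as in the first step above. The function names, the default `n_perm`, and the seeding are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def between_group_ss(values: Sequence[float], labels: Sequence[Hashable]) -> float:
    """Between-group sum of squares of `values` grouped by `labels`.

    Simplified stand-in for the partial sum of squares of one factor.
    """
    grand_mean = sum(values) / len(values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups.values()
    )


def permutation_p_value(
    values: Sequence[float],
    labels: Sequence[Hashable],
    n_perm: int = 999,
    seed: int = 0,
) -> float:
    """Permutation p-value: (1 + #{permuted stat >= observed}) / (1 + n_perm)."""
    rng = random.Random(seed)
    observed = between_group_ss(values, labels)
    shuffled: List[Hashable] = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffle the labels of the tested factor
        if between_group_ss(values, shuffled) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + n_perm)
```

The `1 +` terms in the numerator and denominator count the observed labeling as one of the permutations, so the smallest attainable p-value is `1 / (1 + n_perm)` rather than zero.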