
arxiv: 2605.00273 · v2 · pith:4VGU4GT2new · submitted 2026-04-30 · 💻 cs.CV · cs.AI

When Do Diffusion Models learn to Generate Multiple Objects?

Pith reviewed 2026-07-01 08:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords diffusion models · multi-object generation · compositional generalization · scene complexity · concept imbalance · counting · text-to-image · mosaic framework

The pith

Diffusion models fail at multi-object scenes mainly because of scene complexity and held-out combinations, not concept imbalance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks why text-to-image diffusion models struggle to generate reliable images containing multiple objects. It introduces the mosaic framework to create synthetic datasets that separately control concept frequencies and which combinations appear during training. Experiments across dataset sizes show that performance drops more with higher numbers of objects per scene than with uneven concept counts. Counting objects stands out as especially hard when data is limited, while the ability to combine concepts in unseen ways drops sharply once more combinations are withheld. These controlled results isolate data effects as a core source of the observed failures.

Core claim

Training diffusion models on mosaic datasets, which vary the number of objects, concept balance, and held-out combinations, shows that scene complexity exerts the strongest influence on generation failures, counting accuracy is uniquely sensitive to low data volume, and compositional generalization degrades as the fraction of withheld concept combinations increases.

What carries the argument

The mosaic framework, a controlled synthetic data generator that separately varies multi-object spatial relations, attribute assignments, and object counts while keeping concept frequencies and combination coverage under explicit, independent control.
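
To make "controlling concept frequencies" concrete, here is a minimal sketch of how a generator could split a training budget across concept classes under a uniform or a skewed distribution. The function name `allocate_samples`, the `skew` parameter, and the power-law form of the skew are assumptions made for illustration; this summary does not state how the paper's skewed distribution is actually constructed.

```python
def allocate_samples(n_samples: int, n_classes: int, skew: float = 0.0) -> list[int]:
    """Split n_samples across n_classes (hypothetical helper).

    skew = 0.0 gives a uniform split; larger values give a power-law
    (Zipf-like) split in which class 0 is the most frequent.
    """
    weights = [(rank + 1) ** -skew for rank in range(n_classes)]
    total = sum(weights)
    counts = [int(n_samples * w / total) for w in weights]
    # Hand out the samples lost to rounding, most frequent classes first.
    for i in range(n_samples - sum(counts)):
        counts[i % n_classes] += 1
    return counts


uniform = allocate_samples(10_000, 10)            # same number per class
skewed = allocate_samples(10_000, 10, skew=1.0)   # head classes dominate
```

Keeping the total fixed while changing only `skew` is what lets an experiment attribute a performance change to imbalance rather than to dataset size.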

If this is right

  • Scene complexity affects multi-object generation success more strongly than uneven concept frequencies.
  • Counting performance degrades more than other tasks when training data is scarce.
  • Compositional generalization to new concept pairs collapses as the number of withheld combinations grows.
  • The identified failure patterns appear consistently across the tested range of dataset sizes.
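
The held-out regime behind the third point can be sketched in a few lines. The figure captions refer to withholding "diagonals" of a composition grid (for example, 5 diagonals amounting to half of the compositions). One construction consistent with that description, assuming two factors with ten values each and wrap-around diagonals, is shown below. The function name and the wrap-around rule are assumptions, not the paper's confirmed procedure.

```python
def split_compositions(n_values: int, n_heldout_diagonals: int):
    """Split an n_values x n_values grid of (a, b) concept pairs.

    A pair (a, b) lies on wrap-around diagonal (b - a) % n_values.
    The first n_heldout_diagonals diagonals are withheld from training.
    Hypothetical construction for illustration.
    """
    seen, unseen = [], []
    for a in range(n_values):
        for b in range(n_values):
            diagonal = (b - a) % n_values
            (unseen if diagonal < n_heldout_diagonals else seen).append((a, b))
    return seen, unseen


# e.g. 10 colors x 10 counts with 5 diagonals withheld
seen, unseen = split_compositions(10, 5)
```

A useful property of this construction is that every individual value of each factor still appears in some seen pair, so failures on unseen pairs reflect the missing combination rather than a missing concept.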

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-world image collections may contain similar hidden compositional gaps even when overall concept counts look balanced.
  • Architectures that add explicit mechanisms for tracking object counts or relations could bypass some of the observed limits.
  • Data curation that deliberately increases exposure to high-complexity scenes might yield larger gains than further balancing of single-concept frequencies.

Load-bearing premise

The synthetic scenes created by the mosaic framework produce the same kinds of failures that appear in real training data without adding new artifacts from the data-generation process itself.

What would settle it

Train identical diffusion models on a real photograph dataset whose concept frequencies and pairwise combinations have been matched to the mosaic controls and check whether the same ranking of failure modes by scene complexity and counting holds.

Figures

Figures reproduced from arXiv: 2605.00273 by Anna Rohrbach, Arnas Uselis, Iro Laina, Seong Joon Oh, Yujin Jeong.

Figure 1
Figure 1: Diffusion models struggle with multi-object compositional generation. (a) Diffusion models generate single object reliably, but struggle with multiple objects. We study two regimes: (b) Concept generalization: the model has seen each concept at least once, but may still fail to learn it reliably (e.g., under data imbalance). Generation accuracy is evaluated using Geneval (Ghosh et al., 2023). (c) Composi…
Figure 2
Figure 2: Our controlled dataset MOSAIC is designed for analyzing multi-object generation. Each subset isolates a specific reasoning dimension by varying one factor while randomizing others. (i) Attribution: varies object colors while keeping positions randomized, enabling control over color–object associations (e.g., “black sphere and blue cube”). (ii) Spatial Relations: varies the angular placement between two obj…
Figure 3
Figure 3: Complex settings for Attribution and Spatial Relations. Scene complexity is increased by introducing additional objects: for Attribution, objects are duplicated, while for Spatial Relations, additional objects are added as distractors. Spatial grid layout variant. We additionally introduce the Grid setting, which reduces scene complexity for the Counting task. Although Counting inherently involves a large…
Figure 5
Figure 5: Composition setting for Spatial relation and Counting. We add Color as an additional conditioning factor, forming compositional pairs of color × spatial relation and color × count. 4. Experimental setup In this section, we first describe the experimental designs (Sec. 4.1) for our two research questions, and then describe our training and evaluation setup (Sec. 4.2). 4.1. Experimental designs We define tw…
Figure 4
Figure 4: Grid setting for Counting. Objects are constrained to predefined radial cells with small positional jitter, reducing positional variability compared to the default setting where objects can appear anywhere in the image.
Figure 6
Figure 6: Example of concept imbalance settings. Uniform: all categories have the same number of samples. Skewed: the frequency of categories varies, with some categories appearing more often than others. ond corresponds to compositional generalization (RQ2). We vary dataset size, concept (im)balance, and the number of unseen compositions independently under each setting, enabling a comprehensive controlled analys…
Figure 8
Figure 8: Training and evaluation pipeline. During training (left), a one-hot condition vector (e.g., “count = 10”) is embedded by a condition encoder and integrated into the diffusion model, such as a U-Net or a Diffusion Transformer (DiT), via attention with the VAE-encoded latent representation. The condition encoder and the diffusion model are trained jointly. During evaluation (right), the trained diffusion mod…
Figure 10
Figure 10: Scene complexity becomes critical in low-data regimes. We increase the number of objects for Attribution and introduce additional blue spheres as distractors for Spatial Relations. (Bottom) At a dataset size of 10k, accuracy drops for both Attribution and Spatial Relations, although the degradation is less severe than for Counting. sharply at intermediate scales (10k and 50k), and only gradually recover…
Figure 11
Figure 11: Accuracy trajectories across training steps. Attribution and Spatial Relations use the Complex setting, while Counting uses the Base setting. (Top) With a U-Net backbone, Counting exhibits early peaking and subsequent degradation at dataset sizes of 10k and 50k, while Attribution and Spatial Relations remain stable. (Bottom) With a DiT backbone, a similar early-peaking behavior is observed for Counting …
Figure 12
Figure 12: Introducing a spatial prior stabilizes counting performance. Using a grid layout dramatically improves counting accuracy across all dataset sizes and distributions. learning objectives, exhibits improved robustness compared to U-Net under limited data. Counting exhibits distinct learning dynamics. To investigate why Counting degrades more severely in low-data regimes compared to other concepts, we analy…
Figure 15
Figure 15: Fine-tuning behavior of SD3-medium on SPEC for spatial relations and counting. (Left) Training dynamics for each subset, where the top row shows training loss and the bottom row shows evaluation accuracy. (Right) Qualitative generation examples: the top row shows spatial relation samples, and the bottom row shows counting samples. While training loss consistently decreases for both tasks, counting accura…
Figure 14
Figure 14: Confusion matrix on unseen diagonals when half of the compositions (5 diagonals) are unseen. Compared to Attribution and Counting, Spatial relations show no clear error pattern. generated images and confusion matrices when half of the compositions are held out (…
Figure 16
Figure 16: Compositional generalization under realistic object co-occurrence settings. (Top) Example scenes from the less controlled MOSAIC OBJECTS variant, with object co-occurrence. (Bottom) Accuracy on seen and unseen object compositions as the number of held-out diagonal compositions. While performance remains high on seen compositions, accuracy on unseen compositions degrades rapidly as more combinations are …
Figure 17
Figure 17 shows that some spatial relation terms appear far more frequently than others, indicating a substantial imbalance in the Laion2B caption. This confirms that our concept imbalance design is not only relevant for analyzing counting, but also extends more broadly to spatial relations. [Bar plot: Frequency by spatial relation term. Terms in plotted order: in front of, above, behind, below, right of, left of. Values in plotted order: 348K, 250K, 142K, 49K, 40K, 32K.]
Figure 20
Figure 20 shows the corresponding results under increased scene complexity. At the smallest dataset sizes (e.g., 2k), all three tasks exhibit near-100% memorization, indicating strong overfitting. As the dataset scale increases, memorization gradually decreases across all tasks. Notably, all tasks exhibit an intermediate regime in which memorization is no longer feasible.
Figure 21
Figure 21: Training loss across dataset sizes and architectures. Training loss decreases consistently during training for all settings, suggesting stable optimization. distinct object instances is harder than generating a few, even in a uniform setting. The right-side plots show how per-class accuracy evolves during training. At 10k dataset, we observe a progressive collapse toward under-counting as training continu…
Figure 19
Figure 19: Memorization rate under the default setting. The x-axis denotes dataset size and the y-axis denotes memorization rate. Memorization is near 100% at small dataset sizes but gradually decreases as the dataset scale increases. Training loss. As shown in…
Figure 22
Figure 22: Per-class accuracy for counting. (top) 10k dataset: Higher object counts suffer the most. Accuracy peaks early during training but later degrades, with partial recovery driven by memorization rather than genuine generalization. (bottom) 100k dataset: Higher counts are still more difficult but remain much more stable. All count classes converge to high accuracy, and memorization stays near zero throughout…
Figure 24
Figure 24 (top) compares confusion matrices at the best validation step (2,000) and the final training step (20,000). At early steps, predictions are largely diagonal, indicating correct counts across most classes. However, by the end of training, predictions skew heavily toward lower count classes, demonstrating a systematic bias toward generating fewer instances over time. Importantly, this collapse is not accomp…
Figure 23
Figure 23: Pixel-space distance to nearest training samples by count label. Pixel-distance histograms to nearest training samples for 10k (Top) and 50k (Bottom) datasets. Lower count labels exhibit smaller distances, indicating initial memorization, while higher counts remain far from training samples. Confusion matrix at 10k dataset size. As discussed in Section 5 of the main paper, the 10k dataset exhibits a progr…
Figure 25
Figure 25: Validation loss and accuracy for Counting across dataset sizes. (Left) Validation loss increases for all dataset sizes except 100k, indicating overfitting in smaller datasets (2k, 10k, 50k). (Right) Unlike 2k and 100k, validation accuracy at the 10k and 50k dataset sizes peaks early and then deteriorates. a cross-entropy classification loss (green), and (ii) a contrastive InfoNCE loss (Oord et al., 2018) (orange). W…
Figure 26
Figure 26: Condition embedding collapse under small data. (Top) PCA visualization of count-conditioned embeddings at the final training step for datasets of size 10k, 50k, and 100k. The 10k setting shows collapse across count classes, whereas 50k and 100k maintain clear separation. (Bottom) Validation accuracy across training steps for the 10k dataset shows that collapse persists even when additional classification …
Figure 29
Figure 29: Compositional generalization on dataset size and the number of unseen compositions on U-Net. (Top) For seen compositions, Attribution and Spatial relations remain stable across all dataset sizes, while Counting improves noticeably as the dataset size increases. (Bottom) For unseen compositions, performance drops rapidly as the dataset size decreases or the number of held-out compositions increases.
Figure 27
Figure 27: Effect of lowering spatial complexity on compositional settings. Columns report Joint accuracy, task-specific accuracy, and Color accuracy on unseen compositions (diagonals). Reducing scene complexity leads to modest improvements in Spatial relation and Counting accuracy, but simultaneously degrades Color accuracy, resulting in overall performance comparable to the non-grid setting. LoRA ablation study …
Figure 28
Figure 28: LoRA ablation across ranks and learning rates. Results are consistent across hyperparameter settings: spatial relation accuracy improves with fine-tuning, while counting accuracy degrades. Compositional generalization with U-Net backbone. In addition to the analysis on compositional generalization with DiT, where models fail to generalize to unseen compo…
Figure 31
Figure 31: Effect of lowering spatial complexity on compositional settings. Columns report Joint accuracy, task-specific accuracy, and Color accuracy on unseen compositions (diagonals). Reducing scene complexity leads to modest improvements in Spatial relation and Counting accuracy, but simultaneously degrades Color accuracy, resulting in overall performance comparable to the non-grid setting.
Figure 32
Figure 32: Effect of condition encoder disentanglement on compositional generalization. Dashed lines correspond to results obtained using a frozen, disentangled condition encoder, while solid lines correspond to the baseline, where the encoder is jointly trained with the diffusion model. The performance gap remains small, indicating limited benefit from disentangling the condition embeddings alone. adopts a cross-at…
Figure 33
Figure 33: Validation accuracy dynamics across the number of unseen diagonals. (Top) Accuracy on seen compositions. (Bottom) Accuracy on unseen compositions. Both curves plateau, indicating that longer training does not improve compositional generalization.
Figure 34
Figure 34: Training samples from MOSAIC (non-grid, RQ1). Examples used for in-distribution evaluation across counting, spatial relations, and attribute-binding tasks.
Figure 35
Figure 36
Figure 36: Training samples from MOSAIC under grid layout (RQ1 and RQ2). Explicit spatial priors simplify scene structure, leading to improved counting and relational stability but reduced texture variation.
Figure 37
Figure 37 shows example images from the SPEC benchmark (Peng et al., 2024), illustrating the two subsets: Relative Spatial Relations and Counting. [Panels: Relative spatial; Counting. Example prompts: “a photograph capturing one potted plant.” “the traffic light is positioned on the left of the bed.” “the snowboard is situated to the right of the wine glass.” “the cup is on top of the chair.” “the donut is above the teddy bear.” “the stop sign is situat…”]
Figure 38
Figure 38: Generated samples for concept generalization (non-grid, RQ1). Examples from the best-performing checkpoints trained on the 100k uniform dataset. Rows show (top) Attribution, (middle) Spatial Relations, and (bottom) Counting. Samples are shown without filtering for correctness to illustrate raw generative behavior.
Figure 39
Figure 39: Generation trajectory across diffusion timesteps. Global spatial structure is established early in the denoising process, while fine-grained shapes are refined in later steps.
Figure 40
Figure 40: Generated samples for compositional generalization (non-grid, RQ2). Results on unseen compositions when five diagonals are removed (100k dataset, best-validation checkpoint). Samples are shown without filtering for correctness to illustrate raw generation behavior under strong compositional shift.
Original abstract

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that text-to-image diffusion models' failures in multi-object generation are driven primarily by scene complexity rather than concept imbalance. Using a new controlled 'mosaic' dataset generator, it studies concept generalization (under imbalance) versus compositional generalization (with held-out combinations) across dataset sizes, concluding that object count dominates, counting is uniquely hard in low-data regimes, and compositional performance collapses as more combinations are withheld.

Significance. If the claimed disentanglement holds, the work supplies useful empirical guidance on data effects in diffusion models for multi-object scenes and introduces a reusable controlled generator (mosaic) that could support further studies. The emphasis on counting and compositional hold-outs identifies concrete failure modes worth targeting with stronger inductive biases or data design.

major comments (1)
  1. [Mosaic framework] Mosaic framework (dataset generation procedure): increasing object count necessarily expands the space of possible attribute/relation tuples via combinatorial growth. The manuscript does not describe an explicit control that holds the number of distinct combinations fixed across complexity levels (e.g., via post-hoc matching or constrained sampling). Without this, the performance drop attributed to scene complexity could instead reflect reduced effective coverage of combinations, undermining the central claim that complexity dominates imbalance.
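
The referee's combinatorial argument can be checked with a short calculation. The sketch below counts distinct unordered scenes of `n_objects` objects drawn with repetition from a vocabulary of concepts, ignoring positions, and bounds the fraction of them that a training set of a given size could cover. The vocabulary size of 20 and both function names are hypothetical; the paper's actual vocabulary is not given in this summary.

```python
from math import comb


def n_distinct_scenes(vocab_size: int, n_objects: int) -> int:
    """Number of unordered multisets of n_objects concepts
    drawn with repetition from vocab_size concepts."""
    return comb(vocab_size + n_objects - 1, n_objects)


def max_coverage(n_train: int, vocab_size: int, n_objects: int) -> float:
    """Upper bound on the fraction of distinct scenes that
    n_train training images can cover."""
    return min(1.0, n_train / n_distinct_scenes(vocab_size, n_objects))


for k in (1, 2, 3, 5):
    print(k, n_distinct_scenes(20, k), round(max_coverage(10_000, 20, k), 3))
```

Under these assumed numbers, a 10k training set could in principle cover every 2-object scene (210 combinations) but less than a quarter of the 5-object scenes (42,504 combinations), which is the coverage confound the referee describes.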

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment below.

Point-by-point responses
  1. Referee: [Mosaic framework] Mosaic framework (dataset generation procedure): increasing object count necessarily expands the space of possible attribute/relation tuples via combinatorial growth. The manuscript does not describe an explicit control that holds the number of distinct combinations fixed across complexity levels (e.g., via post-hoc matching or constrained sampling). Without this, the performance drop attributed to scene complexity could instead reflect reduced effective coverage of combinations, undermining the central claim that complexity dominates imbalance.

    Authors: The referee correctly identifies a potential confound. The Mosaic framework fixes the underlying vocabulary of object types, attributes, and relations, but we did not enforce an identical number of distinct tuples across object-count levels. We will revise the manuscript to (1) explicitly report the number of unique combinations observed at each complexity level and (2) add controlled experiments that subsample training data to hold the number of observed combinations fixed while varying object count. This will allow a cleaner test of whether scene complexity remains the dominant factor.

    Revision: yes
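
One way the proposed subsampling control could look is sketched below: pick a fixed-size subset of distinct combinations at each complexity level and keep only the scenes that fall inside it. This is an illustrative assumption about how such a control might be implemented, not the authors' stated method; the function name, the scene representation (a tuple of concept ids per image), and the order-insensitive notion of "combination" are all hypothetical.

```python
import random


def match_combination_coverage(scenes, n_combinations, seed=0):
    """Keep only scenes whose concept combination lies in a fixed-size subset.

    scenes: list of tuples of concept ids, one tuple per training image.
    Returns the filtered scenes, which contain at most n_combinations
    distinct (order-insensitive) combinations.
    """
    rng = random.Random(seed)
    # Canonicalize so that (1, 0) and (0, 1) count as the same combination.
    distinct = sorted({tuple(sorted(s)) for s in scenes})
    keep = set(rng.sample(distinct, min(n_combinations, len(distinct))))
    return [s for s in scenes if tuple(sorted(s)) in keep]


# Two complexity levels over a 5-concept vocabulary, matched to 10 combinations.
two_obj = [(a, b) for a in range(5) for b in range(5)]
three_obj = [(a, b, c) for a in range(5) for b in range(5) for c in range(5)]
matched_2 = match_combination_coverage(two_obj, 10)
matched_3 = match_combination_coverage(three_obj, 10)
```

Filtering in this way also changes the number of images per level, so a real experiment would additionally need to equalize dataset size after matching.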

Circularity Check

0 steps flagged

No circularity: purely empirical study with no derivation chain

full rationale

The paper is an empirical investigation that introduces a synthetic dataset generator (mosaic) and reports observations from training diffusion models under controlled regimes of concept imbalance and compositional hold-outs. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes are present in the provided text or abstract. Central claims rest on experimental results rather than any reduction to inputs by construction. Self-citations, if present, are not load-bearing for any derivation. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the mosaic dataset construction cleanly separates the two generalization regimes without confounding artifacts from synthetic generation.

axioms (1)
  • Domain assumption: The mosaic framework allows clean separation of concept generalization and compositional generalization regimes.
    Invoked to attribute performance differences specifically to data distribution effects rather than other factors.

pith-pipeline@v0.9.1-grok · 5715 in / 1245 out tokens · 36712 ms · 2026-07-01T08:04:47.032270+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 29 canonical work pages · 11 internal anchors

  1. [1] Bonnaire, T., Urfin, R., Biroli, G., and Mézard, M. Why diffusion models don’t memorize: The role of implicit dynamical regularization in training. arXiv preprint arXiv:2505.17638, 2025.

  2. [2] Boo, H., Kim, H., Lee, M., Lee, S., Lee, J., Choi, J.-H., and Cho, H. Countsteer: Steering attention for object counting in diffusion models. arXiv preprint arXiv:2511.11253.

  3. [3] Bradley, A. Local mechanisms of compositional generalization in conditional diffusion. arXiv preprint arXiv:2509.16447.

  4. [4] Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426.

  5. [5] Daunhawer, I., Bizeul, A., Palumbo, E., Marx, A., and Vogt, J. E. Identifiability results for multimodal contrastive learning. arXiv preprint arXiv:2303.09166.

  6. [6] Elmaaroufi, K., Lai, L., Svegliato, J., Bai, Y., Seshia, S. A., and Zaharia, M. Graid: Enhancing spatial reasoning of vlms through high-fidelity data generation. arXiv preprint arXiv:2510.22118.

  7. [7] Farid, K., Sahay, R., Alnaggar, Y. A., Schrodi, S., Fischer, V., Schmid, C., and Brox, T. What drives compositional generalization in visual generative models? arXiv preprint arXiv:2510.03075.

  8. [8] Garnier-Brun, J., Biggio, L., Mezard, M., and Saglietti, L. Early-stopping too late? traces of memorization before overfitting in generative diffusion. In The Impact of Memorization on Trustworthy Foundation Models: ICML 2025 Workshop.

  9. [9] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

  10. [10] Jeong, Y., Uselis, A., Oh, S. J., and Rohrbach, A. Diffusion classifiers understand compositionality, but conditions apply. arXiv preprint arXiv:2505.17955.

  11. [11] Kamb, M. and Ganguli, S. An analytic theory of creativity in convolutional diffusion models. arXiv preprint arXiv:2412.20292.

  12. [12] Kang, B., Yue, Y., Lu, R., Lin, Z., Zhao, Y., Wang, K., Huang, G., and Feng, J. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385.

  13. [13] Kang, S., Han, W., Ju, D., and Hwang, S. J. Rare text semantics were always there in your diffusion transformer. arXiv preprint arXiv:2510.03886, 2025a.
      Kang, W., Galim, K., Koo, H. I., and Cho, N. I. Counting guidance for high fidelity text-to-image synthesis. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 899–908. IEEE...

  14. [14] Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

  15. [15] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

  16. [16] Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

  17. [17] O’Shea, K. and Nash, R. An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458.

  18. [18] Pham, B., Raya, G., Negri, M., Zaki, M. J., Ambrogioni, L., and Krotov, D. Memorization to generalization: Emergence of diffusion models from associative memory. arXiv preprint arXiv:2505.21777.

  19. [19] Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952.

  20. [20] … URL https://arxiv.org/abs/2507.10768.
      Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3.

  21. [21] Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp. 234–241.

  22. [22] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

  23. [23] … Qwen3 Technical Report. URL https://arxiv.org/abs/2505.09388.
      Thrush, T., Jiang, R., Bartolo, M., Singh, A., Williams, A., Kiela, D., and Ross, C. Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5238–5248.

  24. [24] Toker, M., Orgad, H., Ventura, M., Arad, D., and Belinkov, Y. Diffusion lens: Interpreting text encoders in text-to-image pipelines. arXiv preprint arXiv:2403.05846.

  25. [25] Uselis, A., Dittadi, A., and Oh, S. J. Does data scaling lead to visual compositional generalization? arXiv preprint arXiv:2507.07102.

  26. [26] Wewer, C., Pogodzinski, B., Schiele, B., and Lenssen, J. E. Spatial reasoning with denoising models. arXiv preprint arXiv:2502.21075.

  27. [27] Wiedemer, T., Sharma, Y., Prabhu, A., Bethge, M., and Brendel, W. Pretraining frequency predicts compositional generalization of clip on real-world tasks. arXiv preprint arXiv:2502.18326.

  28. [28] Xiao, S., Wang, Y., Zhou, J., Yuan, H., Xing, X., Yan, R., Li, C., Wang, S., Huang, T., and Liu, Z. Omnigen: Unified image generation. arXiv preprint arXiv:2409.11340.

  29. [29] Yang, C., Liu, C., Deng, X., Kim, D., Mei, X., Shen, X., and Chen, L.-C. 1.58-bit flux. arXiv preprint arXiv:2412.18653, 2024a.
      Yang, Y., Park, C. F., Lubana, E. S., Okawa, M., Hu, W., and Tanaka, H. Swing-by dynamics in concept learning and compositional generalization. arXiv preprint arXiv:2410.08309, 2024b.
      Yoo, N., Russakovsky, O., and Zhu, Y. D2d: ...

  30. [30] Zarei, A., Rezaei, K., Basu, S., Saberi, M., Moayeri, M., Kattakinda, P., and Feizi, S. Mitigating compositional issues in text-to-image generative models via enhanced text embeddings.
      Zhang, Z., Hu, F., Lee, J., Shi, F., Kordjamshidi, P., Chai, J., and Ma, Z. Do vision-language models represent space and how? evaluating spatial frame of reference under a...

  31. [31]

    1” or “one

    A. Appendix

    This supplemental material provides the detailed experimental setup (Section A.1), extended experimental results (Section A.2), and qualitative examples (Section A.3). First, we describe experimental details, including those for Figure 1 (b) in the main paper, MOSAIC, design choices, ...

  32. [32]

    with carefully designed prompts.

    Prompt: Find number words (one, two, three, four, five, six, seven, eight, nine, ten) that appear next to or very close to nouns describing countable physical things of any size. ONLY count when: - The number word is adjacent to or within 1-2 words of a concrete noun (like: two dogs, one red car, three small boxes, four tal...

  33. [33]

    on counting using prompts derived from CompBench (Huang et al., 2023). Following their object list, we uniformly generate 830 prompts for each target count. Evaluation follows the CompBench protocol, which uses UniDet (Zhou et al.,

  34. [34]

    in front of

    If none found, all counts should be 0

    Figure 17 shows that some spatial relation terms appear far more frequently than others, indicating a substantial imbalance in the Laion2B captions. This confirms that our concept imbalance design is relevant not only for analyzing counting but also, more broadly, for spatial relations.

    in front of above behind b...

  35. [35]

    Class imbalance details. For the skewed setting, we construct datasets with controlled degrees of class imbalance while keeping the skewness pattern consistent across different dataset sizes. For example, in a 100k dataset, the skewed distribution allocates (22,550, 17,950, 14,350, 11,450, 9,150, 7,300, 5,850, 4,650, 3,750, 3,000) samples across the ten classes ...
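
    As a rough illustration of keeping the skewness pattern fixed across dataset sizes, the sketch below rescales the 100k allocation quoted above to a smaller total. Proportional rescaling with largest-remainder rounding is an assumption made here for illustration; the text does not state how the smaller datasets were allocated, and the names `SKEWED_100K` and `scale_skewed` are hypothetical.

```python
# Per-class sample counts for the skewed 100k dataset, as quoted in the text.
SKEWED_100K = (22550, 17950, 14350, 11450, 9150, 7300, 5850, 4650, 3750, 3000)


def scale_skewed(total, reference=SKEWED_100K):
    """Rescale the reference allocation to `total` samples.

    Assumption (not from the paper): counts are scaled proportionally and any
    leftover samples go to the classes with the largest remainders.
    """
    ref_total = sum(reference)
    # Integer division avoids floating-point rounding surprises.
    quot_rem = [divmod(count * total, ref_total) for count in reference]
    counts = [quotient for quotient, _ in quot_rem]
    leftover = total - sum(counts)
    # Hand out the leftover samples, largest remainder first.
    order = sorted(range(len(reference)), key=lambda i: quot_rem[i][1], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    for size in (2_000, 10_000, 20_000, 50_000, 100_000):
        print(size, scale_skewed(size))
```

    Under this assumption every dataset size keeps the same class proportions, so the ratio between the most and least frequent class stays near 7.5 to 1 as in the 100k example.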

  36. [36]

    for the optimizer.

    Table 4. Compositional configuration for Count × Color (Counting). Diagonal indices specify the order in which concept pairs are removed to create unseen composition settings. Higher unseen-diagonal counts correspond to harder compositional generalization.

    RED GREEN BLUE Y...

  37. [37]

    trained with rectified flow objectives (Lipman et al., 2023). We adopt the DiT architecture from SD3 (Esser et al.,

  38. [38]

    and reduce the model size to approximately 90M parameters to closely match the capacity of our UNet-based baseline. To ensure comparable image quality across architectures, we fix the VAE to the one used in SD2, which is also used for all UNet-based experiments in this work. In the original SD3 design, text embeddings are injected through two pathways: (...

  39. [39]

    [Figure 18 plot. Panels: Uniform, Skewed. Axes: Accuracy and Memorization versus Dataset Size (2k, 10k, 20k, 50k, 100k). Legend: Model Size 40M, 90M (Ours), 200M.]

    Figure 18. Influence of model size on Counting accuracy. Larger or smaller models do not mitigate the failure at...

  40. [40]

    (orange). We also evaluate a frozen condition encoder (red) that is pretrained with cross-entropy loss and not jointly optimized with the diffusion model. However, as shown in Figure 26 (bottom), the embedding collapse persists even with these objectives, indicating that the issue does not stem solely from inadequate supervision of the condition encoder...

  41. [41]

    provides only limited improvement and does not recover compositional generalization. Overall, the failure mode remains unchanged: increasing dataset scale alone is insufficient, and compositional generalization does not reliably emerge even with a stronger architecture and improved training objectives.

    Is a compositionally broken text encoder responsibl...

  42. [42]

    A.3.2

    [Figure 36 samples. Panel labels: 6 (green, angle6), 7 (yellow, angle7), 8 (purple, angle8), 9 (white, angle9), 10 (black, angle10).]

    Figure 36. Training samples from MOSAIC under grid layout (RQ1 and RQ2). Explicit spatial priors simplify scene structure, leading to improved counting and relational stability but reduced texture variation.