pith. machine review for the scientific record. sign in

arxiv: 2403.05135 · v1 · submitted 2024-03-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Authors on Pith no claims yet

Pith reviewed 2026-05-11 19:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion modelstext-to-image generationlarge language modelssemantic alignmentadaptersdense promptsprompt followingdenoising timesteps
0
0 comments X

The pith

ELLA connects pre-trained LLMs to diffusion models via a timestep-aware connector to improve following of dense, multi-object prompts without any fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ELLA, an adapter that lets text-to-image diffusion models draw on the semantic strengths of large language models for prompts describing multiple objects, attributes, and relationships. Standard CLIP encoders limit current models on long or intricate text, while LLMs excel at such understanding but cannot be plugged in directly. ELLA solves the gap with a Timestep-Aware Semantic Connector that pulls different LLM features at each denoising step to guide image formation. The approach requires no changes to the U-Net or LLM weights and integrates with existing community models. Tests on a new benchmark of 1,000 dense prompts show gains over prior methods, especially for complex compositions.

Core claim

ELLA equips diffusion models with LLMs through a Timestep-Aware Semantic Connector that dynamically extracts and adapts timestep-dependent semantic conditions from the LLM, enabling stronger alignment with dense prompts that include multiple objects, detailed attributes, and complex relationships without fine-tuning either the U-Net or the LLM.

What carries the argument

The Timestep-Aware Semantic Connector (TSC), a module that extracts and supplies timestep-specific semantic features from the LLM to condition the diffusion denoising process at each stage.

Load-bearing premise

The Timestep-Aware Semantic Connector can dynamically extract and adapt useful semantic features from the LLM across denoising timesteps without any fine-tuning of the U-Net or the LLM itself.

What would settle it

On the DPG-Bench of 1K dense prompts, if ELLA-equipped models generate images that match multiple-object compositions, attributes, and relationships no better than standard CLIP-based baselines, the claimed superiority would be falsified.

read the original abstract

Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ELLA, an adapter module that integrates frozen LLMs into pre-trained text-to-image diffusion models (via a novel Timestep-Aware Semantic Connector, TSC) to improve handling of dense, complex prompts without any fine-tuning of the U-Net or LLM. It introduces the DPG-Bench benchmark of 1K dense prompts and reports that ELLA outperforms prior methods on prompt following, especially for multi-object scenes with attributes and relations.

Significance. If the empirical gains hold under proper controls, ELLA would offer a practical, low-cost way to upgrade existing diffusion models with stronger language understanding, addressing a known limitation of CLIP-based encoders. The DPG-Bench benchmark itself could become a useful standard for evaluating dense prompt alignment.

major comments (2)
  1. [§4.2 and §5] §4.2 (TSC design) and §5 (experiments): The central claim that TSC 'dynamically extracts timestep-dependent conditions' from the frozen LLM is not supported by any ablation that removes or freezes the timestep embedding input while holding all other components fixed. The reported comparisons are only against external baselines; without this control it is impossible to determine whether the timestep conditioning is necessary or whether a simpler static connector would produce equivalent gains on DPG-Bench.
  2. [§5.1 and Table 2] §5.1 and Table 2: The superiority claims on DPG-Bench are presented without reported standard deviations, number of runs, or statistical significance tests. Given that the benchmark is newly introduced, this makes it difficult to assess whether the observed margins are robust or could be affected by prompt sampling or evaluation variance.
minor comments (2)
  1. [Abstract] The abstract states quantitative superiority but does not report any numerical scores or baseline names; moving at least the headline DPG-Bench numbers into the abstract would improve readability.
  2. [§4.2] Notation for the TSC output (e.g., how the adapted condition is injected into the U-Net cross-attention layers) is described only in prose; an explicit equation or diagram would clarify the interface.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [§4.2 and §5] §4.2 (TSC design) and §5 (experiments): The central claim that TSC 'dynamically extracts timestep-dependent conditions' from the frozen LLM is not supported by any ablation that removes or freezes the timestep embedding input while holding all other components fixed. The reported comparisons are only against external baselines; without this control it is impossible to determine whether the timestep conditioning is necessary or whether a simpler static connector would produce equivalent gains on DPG-Bench.

    Authors: We agree that an internal ablation isolating the contribution of the timestep embedding is needed to rigorously support the claim. In the revised manuscript, we will add a controlled ablation in Section 5 that compares the full TSC against a static variant (with timestep embedding removed or replaced by a constant vector) while keeping all other components fixed. Results on DPG-Bench will be reported to quantify whether the dynamic conditioning provides measurable gains over a simpler static connector. revision: yes

  2. Referee: [§5.1 and Table 2] §5.1 and Table 2: The superiority claims on DPG-Bench are presented without reported standard deviations, number of runs, or statistical significance tests. Given that the benchmark is newly introduced, this makes it difficult to assess whether the observed margins are robust or could be affected by prompt sampling or evaluation variance.

    Authors: We acknowledge that reporting variability metrics is essential for a newly introduced benchmark. In the revision, we will re-run the evaluations using multiple random seeds (at least three) and report mean values with standard deviations for all methods in Table 2 and the corresponding text in §5.1. We will also add a brief discussion of result stability with respect to prompt sampling. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architectural proposal with independent experimental validation.

full rationale

The paper proposes ELLA as a new adapter architecture (TSC) connecting frozen LLM and U-Net, selected after investigating connector designs. No derivation chain, equations, or first-principles predictions are claimed; results are empirical comparisons on DPG-Bench and other benchmarks against external baselines. No self-citation is load-bearing for the core method, no fitted parameters are renamed as predictions, and no ansatz or uniqueness theorem is invoked. The contribution is self-contained as an engineering and benchmarking effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only: no explicit free parameters, axioms, or invented physical entities are described; the TSC is a new architectural module whose internal parameters are presumably learned but not detailed here.

pith-pipeline@v0.9.0 · 5539 in / 1063 out tokens · 57583 ms · 2026-05-11T19:37:27.293147+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Continuous-Time Distribution Matching for Few-Step Diffusion Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.

  2. Asymmetric Flow Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...

  3. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

  4. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

  5. Normalizing Trajectory Models

    cs.CV 2026-05 unverdicted novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  6. LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling

    cs.CV 2026-05 unverdicted novelty 7.0

    LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.

  7. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  8. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  9. Transfer between Modalities with MetaQueries

    cs.CV 2025-04 unverdicted novelty 7.0

    MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.

  10. InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 6.0

    InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

  11. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  12. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  13. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

    cs.AI 2026-05 unverdicted novelty 6.0

    Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...

  14. STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.

  15. DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

    cs.CV 2026-05 unverdicted novelty 6.0

    DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.

  16. Taming Outlier Tokens in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

  17. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  18. Linearizing Vision Transformer with Test-Time Training

    cs.CV 2026-05 unverdicted novelty 6.0

    Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...

  19. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  20. Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...

  21. The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

    cs.CV 2026-04 unverdicted novelty 6.0

    A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.

  22. ViPO: Visual Preference Optimization at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.

  23. LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.

  24. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  25. Self-Adversarial One Step Generation via Condition Shifting

    cs.CV 2026-04 unverdicted novelty 6.0

    APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.

  26. Nucleus-Image: Sparse MoE for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.

  27. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  28. SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    cs.CV 2024-10 unverdicted novelty 6.0

    Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

  29. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  30. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  31. Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

    cs.CV 2026-05 unverdicted novelty 5.0

    Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

  32. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  33. Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

  34. Context Unrolling in Omni Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

  35. Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    UniRect-CoT is a training-free rectification chain-of-thought framework that treats diffusion denoising as visual reasoning and uses the model's inherent understanding to align and correct intermediate generation results.

  36. Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

    cs.CV 2025-11 unverdicted novelty 5.0

    Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...

  37. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    cs.CV 2025-05 conditional novelty 5.0

    BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.

  38. Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

    cs.GR 2026-05 unverdicted novelty 4.0

    JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

  39. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  40. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI 2026-04 unverdicted novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

  41. Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

    cs.CV 2026-04 unverdicted novelty 4.0

    A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.

  42. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 41 Pith papers · 13 internal anchors

  1. [1]

    https://civitai.com/ 11, 12

    Civitai. https://civitai.com/ 11, 12

  2. [2]

    https://github.com/christophschuhmann/ improved-aesthetic-predictor 6

    Clip+mlp aesthetic score predictor. https://github.com/christophschuhmann/ improved-aesthetic-predictor 6

  3. [3]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 5

  4. [4]

    Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3, 5

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3, 5

  5. [5]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022) 3, 4

  6. [6]

    Multidiffusion: Fusing diffusion paths for controlled image generation,

    Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113 (2023) 4

  7. [7]

    https://cdn

    Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee,J.,Guo,Y.,etal.:Improvingimagegenerationwithbettercaptions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf2(3), 8 (2023) 1, 2, 4

  8. [8]

    O’Reilly Media, Inc

    Bird, S., Klein, E., Loper, E.: Natural language processing with Python: analyzing text with the natural language toolkit. " O’Reilly Media, Inc." (2009) 6

  9. [9]

    https://github.com/kakaobrain/coyo-dataset (2022) 6

    Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset (2022) 6

  10. [10]

    arXiv preprint arXiv:2304.08465 (2023) 5

    Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465 (2023) 5

  11. [11]

    Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: Attention-basedsemanticguidancefortext-to-imagediffusionmodels.ACMTrans- actions on Graphics (TOG)42(4), 1–10 (2023) 4, 9

  12. [12]

    PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

    Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023) 2, 4, 9

  13. [13]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

    Chen, M., Laina, I., Vedaldi, A.: Training-free layout control with cross-attention guidance. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5343–5353 (2024) 4, 5

  14. [14]

    In: ICLR (2024) 3, 7

    Cho, J., Hu, Y., Baldridge, J., Garg, R., Anderson, P., Krishna, R., Bansal, M., Pont-Tuset, J., Wang, S.: Davidsonian scene graph: Improving reliability in fine- grained evaluation for text-to-image generation. In: ICLR (2024) 3, 7

  15. [15]

    arXiv preprint arXiv:2305.15328 (2023) 5

    Cho, J., Zala, A., Bansal, M.: Visual programming for text-to-image generation and evaluation. arXiv preprint arXiv:2305.15328 (2023) 5

  16. [16]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., Yoon, S.: Perception prioritized training of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11472–11481 (2022) 3

  17. [17]

    Emu: Enhanc- ing image generation models using photogenic needles in a haystack

    Dai, X., Hou, J., Ma, C.Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., Dubey, A., et al.: Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807 (2023) 4

  18. [18]

    arXiv preprint arXiv:2305.19599 (2023) 5 ELLA 17

    Fang, G., Jiang, Z., Han, J., Lu, G., Xu, H., Liang, X.: Boosting text-to-image dif- fusion models with fine-grained semantic rewards. arXiv preprint arXiv:2305.19599 (2023) 5 ELLA 17

  19. [19]

    E., and Wang, W

    Feng, W., He, X., Fu, T.J., Jampani, V., Akula, A., Narayana, P., Basu, S., Wang, X.E., Wang, W.Y.: Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032 (2022) 9

  20. [20]

    Feng, W., He, X., Fu, T.J., Jampani, V., Akula, A.R., Narayana, P., Basu, S., Wang, X.E., Wang, W.Y.: Training-free structured diffusion guidance for composi- tionaltext-to-imagesynthesis.In:TheEleventhInternationalConferenceonLearn- ing Representations (2023),https://openreview.net/forum?id=PUIqjT4rzq7 4, 5

  21. [21]

    Advances in Neural Information Processing Systems36 (2024) 5

    Feng, W., Zhu, W., Fu, T.j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., Wang, W.Y.: Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems36 (2024) 5

  22. [22]

    arXiv preprint arXiv:2311.17002 (2023) 5

    Feng, Y., Gong, B., Chen, D., Shen, Y., Liu, Y., Zhou, J.: Ranni: Taming text-to- image diffusion for accurate instruction following. arXiv preprint arXiv:2311.17002 (2023) 5

  23. [23]

    Advances in Neural Information Processing Systems36 (2024) 5

    Hao, Y., Chi, Z., Dong, L., Wei, F.: Optimizing prompts for text-to-image gener- ation. Advances in Neural Information Processing Systems36 (2024) 5

  24. [24]

    Diffit: Diffusion vision transformers for image generation,

    Hatamizadeh, A., Song, J., Liu, G., Kautz, J., Vahdat, A.: Diffit: Diffusion vision transformers for image generation. arXiv preprint arXiv:2312.02139 (2023) 3, 6

  25. [25]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022) 5

  26. [26]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021) 3, 11

  27. [27]

    Advances in Neural Information Processing Systems36 (2024) 3, 5, 6, 9, 12

    Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems36 (2024) 3, 5, 6, 9, 12

  28. [28]

    In: ICCV (2023) 4

    Kim, Y., Lee, J., Kim, J.H., Ha, J.W., Zhu, J.Y.: Dense text-to-image generation with attention modulation. In: ICCV (2023) 4

  29. [29]

    mplug: Effective and efficient vision-language learning by cross-modal skip-connections

    Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B., Ye, J., Chen, H., Xu, G., Cao, Z., et al.: mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005 (2022) 8, 10

  30. [30]

    https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic 8

    Li, D., Kamko, A., Sabet, A., Akhgari, E., Xu, L., Doshi, S.: Playground v2. https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic 8

  31. [31]

    arXiv preprint arXiv:2307.10864 (2023) 4

    Li, Y., Keuper, M., Zhang, D., Khoreva, A.: Divide & bind your attention for improved generative semantic nursing. arXiv preprint arXiv:2307.10864 (2023) 4

  32. [32]

    LLM -grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

    Lian, L., Li, B., Yala, A., Darrell, T.: Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655 (2023) 5

  33. [33]

    In: Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014) 7

  34. [34]

    Advances in neural information processing systems36 (2024) 5

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36 (2024) 5

  35. [35]

    In: European Conference on Computer Vision

    Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual gen- eration with composable diffusion models. In: European Conference on Computer Vision. pp. 423–439. Springer (2022) 4, 9

  36. [36]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 9 18 X. Hu, R. Wang, Y. Fang, B. Fu, P. Cheng, and G. Yu

  37. [37]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021) 4

  38. [38]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023) 3, 6

  39. [39]

    In: Proceedings of the AAAI conference on artificial intelligence

    Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual rea- soning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018) 3

  40. [40]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 1, 2, 4, 8, 9

  41. [41]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021) 2, 4

  42. [42]

    The Journal of Machine Learning Research21(1), 5485–5551 (2020) 2, 4, 5, 8

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research21(1), 5485–5551 (2020) 2, 4, 5, 8

  43. [43]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text- conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022) 1, 2, 4

  44. [44]

    Advances in Neural Information Processing Systems36 (2024) 4

    Rassin, R., Hirsch, E., Glickman, D., Ravfogel, S., Goldberg, Y., Chechik, G.: Lin- guistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Advances in Neural Information Processing Systems36 (2024) 4

  45. [45]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 1, 2, 4, 8, 9

  46. [46]

    In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, Oc- tober 5-9, 2015, Proceedings, Part III 18

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed- ical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, Oc- tober 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015) 3

  47. [47]

    Advances in Neural Information Processing Systems35, 36479–36494 (2022) 1, 2, 4

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems35, 36479–36494 (2022) 1, 2, 4

  48. [48]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes,T.,Jitsev,J.,Komatsuzaki,A.:Laion-400m:Opendatasetofclip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021) 6

  49. [49]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8430–8439 (2019) 7

  50. [50]

    arXiv preprint arXiv:2311.17946 (2023) 5 ELLA 19

    Sun, J., Fu, D., Hu, Y., Wang, S., Rassin, R., Juan, D.C., Alon, D., Herrmann, C., van Steenkiste, S., Krishna, R., et al.: Dreamsync: Aligning text-to-image genera- tion with image understanding feedback. arXiv preprint arXiv:2311.17946 (2023) 5 ELLA 19

  51. [51]

    Advances in Neural Information Processing Systems36 (2024) 6

    Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems36 (2024) 6

  52. [52]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bash- lykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023) 2, 4, 5

  53. [53]

    Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M., Tang, J.: Cogvlm: Visual expert for pretrained language models (2023) 6

  54. [54]

    arXiv preprint arXiv:2401.15688 (2024) 5

    Wang, Z., Xie, E., Li, A., Wang, Z., Liu, X., Li, Z.: Divide and conquer: Language models can plan and self-correct for compositional text-to-image generation. arXiv preprint arXiv:2401.15688 (2024) 5

  55. [55]

    Wu, W., Li, Z., He, Y., Shou, M.Z., Shen, C., Cheng, L., Li, Y., Gao, T., Zhang, D., Wang, Z.: Paragraph-to-image generation with information-enriched diffusion model (2023) 2, 3, 4

  56. [56]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Xie,J.,Li,Y.,Huang,Y.,Liu,H.,Zhang,W.,Zheng,Y.,Shou,M.Z.:Boxdiff:Text- to-image synthesis with training-free box-constrained diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7452–7461 (2023) 4

  57. [57]

    Advances in Neural Information Processing Systems36 (2024) 5

    Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagere- ward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems36 (2024) 5

  58. [58]

    arXiv preprint arXiv:2401.11708 (2024) 5

    Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Cui, B.: Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. arXiv preprint arXiv:2401.11708 (2024) 5

  59. [59]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compati- ble image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023) 3

  60. [60]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.107892(3), 5 (2022) 2, 3, 6, 7

  61. [61]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023) 3, 11

  62. [62]

    Zhang, P., Zeng, G., Wang, T., Lu, W.: Tinyllama: An open-source small language model (2024) 4, 5

  63. [63]

    In: Proceedings of the 31st ACM International Conference on Multimedia

    Zhong, S., Huang, Z., Wen, W., Qin, J., Lin, L.: Sur-adapter: Enhancing text-to- image pre-trained diffusion models with large language models. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 567–578 (2023) 5