arxiv: 2403.05135 · v1 · submitted 2024-03-08 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu , Rui Wang , Yixiao Fang , Bin Fu , Pei Cheng , Gang Yu

Authors on Pith no claims yet

Pith reviewed 2026-05-11 19:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion modelstext-to-image generationlarge language modelssemantic alignmentadaptersdense promptsprompt followingdenoising timesteps

0 comments

The pith

ELLA connects pre-trained LLMs to diffusion models via a timestep-aware connector to improve following of dense, multi-object prompts without any fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ELLA, an adapter that lets text-to-image diffusion models draw on the semantic strengths of large language models for prompts describing multiple objects, attributes, and relationships. Standard CLIP encoders limit current models on long or intricate text, while LLMs excel at such understanding but cannot be plugged in directly. ELLA solves the gap with a Timestep-Aware Semantic Connector that pulls different LLM features at each denoising step to guide image formation. The approach requires no changes to the U-Net or LLM weights and integrates with existing community models. Tests on a new benchmark of 1,000 dense prompts show gains over prior methods, especially for complex compositions.

Core claim

ELLA equips diffusion models with LLMs through a Timestep-Aware Semantic Connector that dynamically extracts and adapts timestep-dependent semantic conditions from the LLM, enabling stronger alignment with dense prompts that include multiple objects, detailed attributes, and complex relationships without fine-tuning either the U-Net or the LLM.

What carries the argument

The Timestep-Aware Semantic Connector (TSC), a module that extracts and supplies timestep-specific semantic features from the LLM to condition the diffusion denoising process at each stage.

Load-bearing premise

The Timestep-Aware Semantic Connector can dynamically extract and adapt useful semantic features from the LLM across denoising timesteps without any fine-tuning of the U-Net or the LLM itself.

What would settle it

On the DPG-Bench of 1K dense prompts, if ELLA-equipped models generate images that match multiple-object compositions, attributes, and relationships no better than standard CLIP-based baselines, the claimed superiority would be falsified.

read the original abstract

Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ELLA gives diffusion models an LLM adapter via a timestep-aware connector and a new dense-prompt benchmark, but the experiments skip the ablation needed to show the timestep part actually drives the gains.

read the letter

The core of this paper is a plug-in adapter that swaps CLIP for an LLM in text-to-image diffusion without retraining the U-Net or the language model. The Timestep-Aware Semantic Connector pulls features from the LLM at each denoising step using the current timestep embedding, and they back it with DPG-Bench, a 1K-prompt set focused on multi-object scenes with attributes and relations. The results claim better prompt following than prior methods, and the adapter works with existing community models, which is the practical angle that matters most here.

Referee Report

2 major / 2 minor

Summary. The paper proposes ELLA, an adapter module that integrates frozen LLMs into pre-trained text-to-image diffusion models (via a novel Timestep-Aware Semantic Connector, TSC) to improve handling of dense, complex prompts without any fine-tuning of the U-Net or LLM. It introduces the DPG-Bench benchmark of 1K dense prompts and reports that ELLA outperforms prior methods on prompt following, especially for multi-object scenes with attributes and relations.

Significance. If the empirical gains hold under proper controls, ELLA would offer a practical, low-cost way to upgrade existing diffusion models with stronger language understanding, addressing a known limitation of CLIP-based encoders. The DPG-Bench benchmark itself could become a useful standard for evaluating dense prompt alignment.

major comments (2)

[§4.2 and §5] §4.2 (TSC design) and §5 (experiments): The central claim that TSC 'dynamically extracts timestep-dependent conditions' from the frozen LLM is not supported by any ablation that removes or freezes the timestep embedding input while holding all other components fixed. The reported comparisons are only against external baselines; without this control it is impossible to determine whether the timestep conditioning is necessary or whether a simpler static connector would produce equivalent gains on DPG-Bench.
[§5.1 and Table 2] §5.1 and Table 2: The superiority claims on DPG-Bench are presented without reported standard deviations, number of runs, or statistical significance tests. Given that the benchmark is newly introduced, this makes it difficult to assess whether the observed margins are robust or could be affected by prompt sampling or evaluation variance.

minor comments (2)

[Abstract] The abstract states quantitative superiority but does not report any numerical scores or baseline names; moving at least the headline DPG-Bench numbers into the abstract would improve readability.
[§4.2] Notation for the TSC output (e.g., how the adapted condition is injected into the U-Net cross-attention layers) is described only in prose; an explicit equation or diagram would clarify the interface.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [§4.2 and §5] §4.2 (TSC design) and §5 (experiments): The central claim that TSC 'dynamically extracts timestep-dependent conditions' from the frozen LLM is not supported by any ablation that removes or freezes the timestep embedding input while holding all other components fixed. The reported comparisons are only against external baselines; without this control it is impossible to determine whether the timestep conditioning is necessary or whether a simpler static connector would produce equivalent gains on DPG-Bench.

Authors: We agree that an internal ablation isolating the contribution of the timestep embedding is needed to rigorously support the claim. In the revised manuscript, we will add a controlled ablation in Section 5 that compares the full TSC against a static variant (with timestep embedding removed or replaced by a constant vector) while keeping all other components fixed. Results on DPG-Bench will be reported to quantify whether the dynamic conditioning provides measurable gains over a simpler static connector. revision: yes
Referee: [§5.1 and Table 2] §5.1 and Table 2: The superiority claims on DPG-Bench are presented without reported standard deviations, number of runs, or statistical significance tests. Given that the benchmark is newly introduced, this makes it difficult to assess whether the observed margins are robust or could be affected by prompt sampling or evaluation variance.

Authors: We acknowledge that reporting variability metrics is essential for a newly introduced benchmark. In the revision, we will re-run the evaluations using multiple random seeds (at least three) and report mean values with standard deviations for all methods in Table 2 and the corresponding text in §5.1. We will also add a brief discussion of result stability with respect to prompt sampling. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architectural proposal with independent experimental validation.

full rationale

The paper proposes ELLA as a new adapter architecture (TSC) connecting frozen LLM and U-Net, selected after investigating connector designs. No derivation chain, equations, or first-principles predictions are claimed; results are empirical comparisons on DPG-Bench and other benchmarks against external baselines. No self-citation is load-bearing for the core method, no fitted parameters are renamed as predictions, and no ansatz or uniqueness theorem is invoked. The contribution is self-contained as an engineering and benchmarking effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only: no explicit free parameters, axioms, or invented physical entities are described; the TSC is a new architectural module whose internal parameters are presumably learned but not detailed here.

pith-pipeline@v0.9.0 · 5539 in / 1063 out tokens · 57583 ms · 2026-05-11T19:37:27.293147+00:00 · methodology

discussion (0)

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
cs.CV 2026-05 unverdicted novelty 8.0

CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
Asymmetric Flow Models
cs.CV 2026-05 unverdicted novelty 7.0

Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Normalizing Trajectory Models
cs.CV 2026-05 unverdicted novelty 7.0

NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
Normalizing Trajectory Models
cs.CV 2026-05 unverdicted novelty 7.0

NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
cs.CV 2026-05 unverdicted novelty 7.0

LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
Long-Text-to-Image Generation via Compositional Prompt Decomposition
cs.CV 2026-04 unverdicted novelty 7.0

PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
cs.CV 2026-04 conditional novelty 7.0

1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
Transfer between Modalities with MetaQueries
cs.CV 2025-04 unverdicted novelty 7.0

MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
cs.CV 2026-05 conditional novelty 6.0

InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
L2P: Unlocking Latent Potential for Pixel Generation
cs.CV 2026-05 unverdicted novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
cs.CV 2026-05 unverdicted novelty 6.0

A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
cs.AI 2026-05 unverdicted novelty 6.0

Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
cs.CV 2026-05 unverdicted novelty 6.0

STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
cs.CV 2026-05 unverdicted novelty 6.0

DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
Taming Outlier Tokens in Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 6.0

Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 6.0

D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Linearizing Vision Transformer with Test-Time Training
cs.CV 2026-05 unverdicted novelty 6.0

Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
cs.CV 2026-04 unverdicted novelty 6.0

SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
cs.CV 2026-04 unverdicted novelty 6.0

A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
ViPO: Visual Preference Optimization at Scale
cs.CV 2026-04 unverdicted novelty 6.0

Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
cs.CV 2026-04 unverdicted novelty 6.0

LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
cs.CV 2026-04 unverdicted novelty 6.0

By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
Self-Adversarial One Step Generation via Condition Shifting
cs.CV 2026-04 unverdicted novelty 6.0

APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
Nucleus-Image: Sparse MoE for Image Generation
cs.CV 2026-04 unverdicted novelty 6.0

A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
Continuous Adversarial Flow Models
cs.LG 2026-04 unverdicted novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
cs.CV 2024-10 unverdicted novelty 6.0

Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
cs.CV 2026-05 unverdicted novelty 5.0

Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0

Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
Context Unrolling in Omni Models
cs.CV 2026-04 unverdicted novelty 5.0

Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
cs.CV 2026-04 unverdicted novelty 5.0

UniRect-CoT is a training-free rectification chain-of-thought framework that treats diffusion denoising as visual reasoning and uses the model's inherent understanding to align and correct intermediate generation results.
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
cs.CV 2025-11 unverdicted novelty 5.0

Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
cs.CV 2025-05 conditional novelty 5.0

BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0

JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
cs.AI 2026-04 unverdicted novelty 4.0

TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
cs.CV 2026-04 unverdicted novelty 4.0

A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
cs.AI 2025-01 conditional novelty 3.0

Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 41 Pith papers · 13 internal anchors

[1]

https://civitai.com/ 11, 12

Civitai. https://civitai.com/ 11, 12

work page
[2]

https://github.com/christophschuhmann/ improved-aesthetic-predictor 6

Clip+mlp aesthetic score predictor. https://github.com/christophschuhmann/ improved-aesthetic-predictor 6

work page
[3]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3, 5

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems35, 23716– 23736 (2022) 3, 5

work page 2022
[5]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022) 3, 4

work page internal anchor Pith review arXiv 2022
[6]

Multidiffusion: Fusing diffusion paths for controlled image generation,

Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113 (2023) 4

work page arXiv 2023
[7]

https://cdn

Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee,J.,Guo,Y.,etal.:Improvingimagegenerationwithbettercaptions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf2(3), 8 (2023) 1, 2, 4

work page 2023
[8]

O’Reilly Media, Inc

Bird, S., Klein, E., Loper, E.: Natural language processing with Python: analyzing text with the natural language toolkit. " O’Reilly Media, Inc." (2009) 6

work page 2009
[9]

https://github.com/kakaobrain/coyo-dataset (2022) 6

Byeon, M., Park, B., Kim, H., Lee, S., Baek, W., Kim, S.: Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset (2022) 6

work page 2022
[10]

arXiv preprint arXiv:2304.08465 (2023) 5

Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mu- tual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465 (2023) 5

work page arXiv 2023
[11]

Chefer, H., Alaluf, Y., Vinker, Y., Wolf, L., Cohen-Or, D.: Attend-and-excite: Attention-basedsemanticguidancefortext-to-imagediffusionmodels.ACMTrans- actions on Graphics (TOG)42(4), 1–10 (2023) 4, 9

work page 2023
[12]

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Chen, J., Yu, J., Ge, C., Yao, L., Xie, E., Wu, Y., Wang, Z., Kwok, J., Luo, P., Lu, H., et al.: Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426 (2023) 2, 4, 9

work page internal anchor Pith review arXiv 2023
[13]

In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

Chen, M., Laina, I., Vedaldi, A.: Training-free layout control with cross-attention guidance. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5343–5353 (2024) 4, 5

work page 2024
[14]

In: ICLR (2024) 3, 7

Cho, J., Hu, Y., Baldridge, J., Garg, R., Anderson, P., Krishna, R., Bansal, M., Pont-Tuset, J., Wang, S.: Davidsonian scene graph: Improving reliability in fine- grained evaluation for text-to-image generation. In: ICLR (2024) 3, 7

work page 2024
[15]

arXiv preprint arXiv:2305.15328 (2023) 5

Cho, J., Zala, A., Bansal, M.: Visual programming for text-to-image generation and evaluation. arXiv preprint arXiv:2305.15328 (2023) 5

work page arXiv 2023
[16]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Choi, J., Lee, J., Shin, C., Kim, S., Kim, H., Yoon, S.: Perception prioritized training of diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11472–11481 (2022) 3

work page 2022
[17]

Emu: Enhanc- ing image generation models using photogenic needles in a haystack

Dai, X., Hou, J., Ma, C.Y., Tsai, S., Wang, J., Wang, R., Zhang, P., Vandenhende, S., Wang, X., Dubey, A., et al.: Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807 (2023) 4

work page arXiv 2023
[18]

arXiv preprint arXiv:2305.19599 (2023) 5 ELLA 17

Fang, G., Jiang, Z., Han, J., Lu, G., Xu, H., Liang, X.: Boosting text-to-image dif- fusion models with fine-grained semantic rewards. arXiv preprint arXiv:2305.19599 (2023) 5 ELLA 17

work page arXiv 2023
[19]

E., and Wang, W

Feng, W., He, X., Fu, T.J., Jampani, V., Akula, A., Narayana, P., Basu, S., Wang, X.E., Wang, W.Y.: Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032 (2022) 9

work page arXiv 2022
[20]

Feng, W., He, X., Fu, T.J., Jampani, V., Akula, A.R., Narayana, P., Basu, S., Wang, X.E., Wang, W.Y.: Training-free structured diffusion guidance for composi- tionaltext-to-imagesynthesis.In:TheEleventhInternationalConferenceonLearn- ing Representations (2023),https://openreview.net/forum?id=PUIqjT4rzq7 4, 5

work page 2023
[21]

Advances in Neural Information Processing Systems36 (2024) 5

Feng, W., Zhu, W., Fu, T.j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., Wang, W.Y.: Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems36 (2024) 5

work page 2024
[22]

arXiv preprint arXiv:2311.17002 (2023) 5

Feng, Y., Gong, B., Chen, D., Shen, Y., Liu, Y., Zhou, J.: Ranni: Taming text-to- image diffusion for accurate instruction following. arXiv preprint arXiv:2311.17002 (2023) 5

work page arXiv 2023
[23]

Advances in Neural Information Processing Systems36 (2024) 5

Hao, Y., Chi, Z., Dong, L., Wei, F.: Optimizing prompts for text-to-image gener- ation. Advances in Neural Information Processing Systems36 (2024) 5

work page 2024
[24]

Diffit: Diffusion vision transformers for image generation,

Hatamizadeh, A., Song, J., Liu, G., Kautz, J., Vahdat, A.: Diffit: Diffusion vision transformers for image generation. arXiv preprint arXiv:2312.02139 (2023) 3, 6

work page arXiv 2023
[25]

Prompt-to-Prompt Image Editing with Cross Attention Control

Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022) 5

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

LoRA: Low-Rank Adaptation of Large Language Models

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021) 3, 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[27]

Advances in Neural Information Processing Systems36 (2024) 3, 5, 6, 9, 12

Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems36 (2024) 3, 5, 6, 9, 12

work page 2024
[28]

In: ICCV (2023) 4

Kim, Y., Lee, J., Kim, J.H., Ha, J.W., Zhu, J.Y.: Dense text-to-image generation with attention modulation. In: ICCV (2023) 4

work page 2023
[29]

mplug: Effective and efficient vision-language learning by cross-modal skip-connections

Li, C., Xu, H., Tian, J., Wang, W., Yan, M., Bi, B., Ye, J., Chen, H., Xu, G., Cao, Z., et al.: mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005 (2022) 8, 10

work page arXiv 2022
[30]

https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic 8

Li, D., Kamko, A., Sabet, A., Akhgari, E., Xu, L., Doshi, S.: Playground v2. https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic 8

work page
[31]

arXiv preprint arXiv:2307.10864 (2023) 4

Li, Y., Keuper, M., Zhang, D., Khoreva, A.: Divide & bind your attention for improved generative semantic nursing. arXiv preprint arXiv:2307.10864 (2023) 4

work page arXiv 2023
[32]

LLM -grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models

Lian, L., Li, B., Yala, A., Darrell, T.: Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655 (2023) 5

work page arXiv 2023
[33]

In: Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision– ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014) 7

work page 2014
[34]

Advances in neural information processing systems36 (2024) 5

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36 (2024) 5

work page 2024
[35]

In: European Conference on Computer Vision

Liu, N., Li, S., Du, Y., Torralba, A., Tenenbaum, J.B.: Compositional visual gen- eration with composable diffusion models. In: European Conference on Computer Vision. pp. 423–439. Springer (2022) 4, 9

work page 2022
[36]

Decoupled Weight Decay Regularization

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 9 18 X. Hu, R. Wang, Y. Fang, B. Fu, P. Cheng, and G. Yu

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021) 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023) 3, 6

work page 2023
[39]

In: Proceedings of the AAAI conference on artificial intelligence

Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual rea- soning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018) 3

work page 2018
[40]

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023) 1, 2, 4, 8, 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021) 2, 4

work page 2021
[42]

The Journal of Machine Learning Research21(1), 5485–5551 (2020) 2, 4, 5, 8

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research21(1), 5485–5551 (2020) 2, 4, 5, 8

work page 2020
[43]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text- conditional image generation with clip latents. arXiv preprint arXiv:2204.06125 1(2), 3 (2022) 1, 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Advances in Neural Information Processing Systems36 (2024) 4

Rassin, R., Hirsch, E., Glickman, D., Ravfogel, S., Goldberg, Y., Chechik, G.: Lin- guistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Advances in Neural Information Processing Systems36 (2024) 4

work page 2024
[45]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022) 1, 2, 4, 8, 9

work page 2022
[46]

In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, Oc- tober 5-9, 2015, Proceedings, Part III 18

Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed- ical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, Oc- tober 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015) 3

work page 2015
[47]

Advances in Neural Information Processing Systems35, 36479–36494 (2022) 1, 2, 4

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text- to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems35, 36479–36494 (2022) 1, 2, 4

work page 2022
[48]

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes,T.,Jitsev,J.,Komatsuzaki,A.:Laion-400m:Opendatasetofclip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021) 6

work page internal anchor Pith review arXiv 2021
[49]

In: Proceedings of the IEEE/CVF international conference on computer vision

Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A large-scale, high-quality dataset for object detection. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8430–8439 (2019) 7

work page 2019
[50]

arXiv preprint arXiv:2311.17946 (2023) 5 ELLA 19

Sun, J., Fu, D., Hu, Y., Wang, S., Rassin, R., Juan, D.C., Alon, D., Herrmann, C., van Steenkiste, S., Krishna, R., et al.: Dreamsync: Aligning text-to-image genera- tion with image understanding feedback. arXiv preprint arXiv:2311.17946 (2023) 5 ELLA 19

work page arXiv 2023
[51]

Advances in Neural Information Processing Systems36 (2024) 6

Sun, K., Pan, J., Ge, Y., Li, H., Duan, H., Wu, X., Zhang, R., Zhou, A., Qin, Z., Wang, Y., et al.: Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems36 (2024) 6

work page 2024
[52]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bash- lykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023) 2, 4, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M., Tang, J.: Cogvlm: Visual expert for pretrained language models (2023) 6

work page 2023
[54]

arXiv preprint arXiv:2401.15688 (2024) 5

Wang, Z., Xie, E., Li, A., Wang, Z., Liu, X., Li, Z.: Divide and conquer: Language models can plan and self-correct for compositional text-to-image generation. arXiv preprint arXiv:2401.15688 (2024) 5

work page arXiv 2024
[55]

Wu, W., Li, Z., He, Y., Shou, M.Z., Shen, C., Cheng, L., Li, Y., Gao, T., Zhang, D., Wang, Z.: Paragraph-to-image generation with information-enriched diffusion model (2023) 2, 3, 4

work page 2023
[56]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Xie,J.,Li,Y.,Huang,Y.,Liu,H.,Zhang,W.,Zheng,Y.,Shou,M.Z.:Boxdiff:Text- to-image synthesis with training-free box-constrained diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7452–7461 (2023) 4

work page 2023
[57]

Advances in Neural Information Processing Systems36 (2024) 5

Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagere- ward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems36 (2024) 5

work page 2024
[58]

arXiv preprint arXiv:2401.11708 (2024) 5

Yang, L., Yu, Z., Meng, C., Xu, M., Ermon, S., Cui, B.: Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. arXiv preprint arXiv:2401.11708 (2024) 5

work page arXiv 2024
[59]

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compati- ble image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023) 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[60]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Yu, J., Xu, Y., Koh, J.Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B.K., et al.: Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.107892(3), 5 (2022) 2, 3, 6, 7

work page internal anchor Pith review arXiv 2022
[61]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023) 3, 11

work page 2023
[62]

Zhang, P., Zeng, G., Wang, T., Lu, W.: Tinyllama: An open-source small language model (2024) 4, 5

work page 2024
[63]

In: Proceedings of the 31st ACM International Conference on Multimedia

Zhong, S., Huang, Z., Wen, W., Qin, J., Lin, L.: Sur-adapter: Enhancing text-to- image pre-trained diffusion models with large language models. In: Proceedings of the 31st ACM International Conference on Multimedia. pp. 567–578 (2023) 5

work page 2023