ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Pith reviewed 2026-05-11 19:37 UTC · model grok-4.3
The pith
ELLA connects pre-trained LLMs to diffusion models via a timestep-aware connector to improve prompt following on dense, multi-object prompts without fine-tuning either the U-Net or the LLM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ELLA equips diffusion models with LLMs through a Timestep-Aware Semantic Connector that dynamically extracts and adapts timestep-dependent semantic conditions from the LLM, enabling stronger alignment with dense prompts that include multiple objects, detailed attributes, and complex relationships, all without fine-tuning either the U-Net or the LLM.
What carries the argument
The Timestep-Aware Semantic Connector (TSC), a module that extracts and supplies timestep-specific semantic features from the LLM to condition the diffusion denoising process at each stage.
Load-bearing premise
The Timestep-Aware Semantic Connector can dynamically extract and adapt useful semantic features from the LLM across denoising timesteps without any fine-tuning of the U-Net or the LLM itself.
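To make the premise concrete, here is a minimal sketch of what such a connector could look like, assuming a perceiver-style resampler whose learnable queries cross-attend to frozen LLM token features and whose normalization is modulated by a timestep embedding (AdaLN-style). All module names, dimensions, and the single-block structure are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TimestepAwareConnector(nn.Module):
    """Illustrative timestep-aware connector (not the authors' code).

    Learnable queries cross-attend to frozen LLM token features; the
    diffusion timestep modulates the query normalization (AdaLN-style),
    so the extracted condition can vary across denoising steps. A real
    connector would likely stack several such blocks.
    """

    def __init__(self, llm_dim=2048, dim=768, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.proj_llm = nn.Linear(llm_dim, dim)          # frozen-LLM features -> connector width
        self.time_mlp = nn.Sequential(                   # timestep embedding -> AdaLN scale/shift
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 2 * dim)
        )
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, llm_tokens, t_emb):
        # llm_tokens: (B, L, llm_dim) from the frozen LLM; t_emb: (B, dim)
        kv = self.proj_llm(llm_tokens)
        scale, shift = self.time_mlp(t_emb).chunk(2, dim=-1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        q = self.norm(q) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # timestep-conditioned queries
        out, _ = self.attn(q, kv, kv)                    # queries read the prompt features
        return out + self.ff(out)                        # (B, num_queries, dim): condition for the U-Net
```

Because the same frozen prompt features are re-read at every denoising step with a different t_emb, the condition handed to the U-Net can differ across timesteps even though both the LLM and the U-Net stay frozen; only the connector is trained.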
What would settle it
On DPG-Bench, the paper's benchmark of 1K dense prompts: if ELLA-equipped models match multi-object compositions, attributes, and relationships no better than standard CLIP-based baselines, the claimed superiority would be falsified.
Original abstract
Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ELLA, an adapter module that integrates frozen LLMs into pre-trained text-to-image diffusion models (via a novel Timestep-Aware Semantic Connector, TSC) to improve handling of dense, complex prompts without any fine-tuning of the U-Net or LLM. It introduces the DPG-Bench benchmark of 1K dense prompts and reports that ELLA outperforms prior methods on prompt following, especially for multi-object scenes with attributes and relations.
Significance. If the empirical gains hold under proper controls, ELLA would offer a practical, low-cost way to upgrade existing diffusion models with stronger language understanding, addressing a known limitation of CLIP-based encoders. The DPG-Bench benchmark itself could become a useful standard for evaluating dense prompt alignment.
Major comments (2)
- [§4.2, §5] The central claim that the TSC 'dynamically extracts timestep-dependent conditions' from the frozen LLM is not supported by any ablation that removes or freezes the timestep-embedding input while holding all other components fixed. The reported comparisons are only against external baselines; without this control it is impossible to tell whether timestep conditioning is necessary or whether a simpler static connector would produce equivalent gains on DPG-Bench.
- [§5.1, Table 2] The superiority claims on DPG-Bench are presented without standard deviations, number of runs, or statistical significance tests. Because the benchmark is newly introduced, it is difficult to assess whether the observed margins are robust to prompt sampling or evaluation variance.
Minor comments (2)
- [Abstract] The abstract asserts quantitative superiority but reports neither numerical scores nor baseline names; moving at least the headline DPG-Bench numbers into the abstract would improve readability.
- [§4.2] Notation for the TSC output (e.g., how the adapted condition is injected into the U-Net cross-attention layers) is described only in prose; an explicit equation or diagram would clarify the interface.
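For concreteness, one plausible rendering of the interface the second minor comment asks for, assuming the TSC output at timestep t replaces the text-encoder tokens as cross-attention keys and values (the notation here is ours, not the paper's):

```latex
% c_t: TSC output at timestep t; p: prompt; z: U-Net features; d: head dimension
\[
\begin{aligned}
  c_t &= \mathrm{TSC}\big(f_{\mathrm{LLM}}(p),\, t\big), \\
  \mathrm{CrossAttn}(z, c_t)
      &= \mathrm{softmax}\!\left(\frac{(W_Q z)(W_K c_t)^{\top}}{\sqrt{d}}\right) W_V c_t .
\end{aligned}
\]
```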
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: [§4.2, §5] The central claim that the TSC 'dynamically extracts timestep-dependent conditions' from the frozen LLM is not supported by any ablation that removes or freezes the timestep-embedding input while holding all other components fixed. The reported comparisons are only against external baselines; without this control it is impossible to tell whether timestep conditioning is necessary or whether a simpler static connector would produce equivalent gains on DPG-Bench.
  Authors: We agree that an internal ablation isolating the contribution of the timestep embedding is needed to rigorously support the claim. In the revised manuscript we will add a controlled ablation in Section 5 comparing the full TSC against a static variant (timestep embedding removed or replaced by a constant vector) while keeping all other components fixed, and we will report DPG-Bench results to quantify whether dynamic conditioning provides measurable gains over a simpler static connector. [Revision: yes]
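A minimal sketch of the promised control, assuming a connector with the signature sketched earlier under the load-bearing premise: the static variant feeds a constant vector in place of the timestep embedding and leaves everything else untouched. The wiring is illustrative, not the authors' code.

```python
import torch

def tsc_condition(connector, llm_tokens, t_emb, static=False):
    """Full TSC vs. the static ablation variant.

    With static=True the timestep embedding is replaced by a constant
    zero vector, so any AdaLN-style modulation collapses to a fixed
    transform and the extracted condition can no longer vary across
    denoising steps. Everything else is held fixed.
    """
    if static:
        t_emb = torch.zeros_like(t_emb)
    return connector(llm_tokens, t_emb)
```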
- Referee: [§5.1, Table 2] The superiority claims on DPG-Bench are presented without standard deviations, number of runs, or statistical significance tests. Because the benchmark is newly introduced, it is difficult to assess whether the observed margins are robust to prompt sampling or evaluation variance.
  Authors: We acknowledge that reporting variability metrics is essential for a newly introduced benchmark. In the revision we will re-run the evaluations with multiple random seeds (at least three) and report means with standard deviations for all methods in Table 2 and the corresponding text in §5.1, together with a brief discussion of result stability with respect to prompt sampling. [Revision: yes]
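A sketch of the kind of reporting the referee asks for, assuming per-prompt DPG-Bench scores are available for each seed. The paired bootstrap over the shared 1K prompts is one reasonable robustness check, not the authors' protocol.

```python
import numpy as np

def mean_and_std(scores):
    """scores: (num_seeds, num_prompts) array of per-prompt benchmark scores."""
    per_seed = scores.mean(axis=1)                    # one aggregate score per seed
    return per_seed.mean(), per_seed.std(ddof=1)      # report as mean +/- std in Table 2

def paired_bootstrap(a, b, iters=10_000, seed=0):
    """Estimate P(method A beats method B) under prompt resampling.

    a, b: (num_prompts,) per-prompt scores for two methods on the same
    prompts. Values near 1.0 suggest the margin survives prompt-sampling
    variance; values near 0.5 suggest it is indistinguishable from noise.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(a), size=(iters, len(a)))  # resample prompts with replacement
    return float((a[idx].mean(axis=1) > b[idx].mean(axis=1)).mean())
```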
Circularity Check
No circularity: empirical architectural proposal with independent experimental validation.
Full rationale
The paper proposes ELLA as a new adapter architecture (TSC) connecting frozen LLM and U-Net, selected after investigating connector designs. No derivation chain, equations, or first-principles predictions are claimed; results are empirical comparisons on DPG-Bench and other benchmarks against external baselines. No self-citation is load-bearing for the core method, no fitted parameters are renamed as predictions, and no ansatz or uniqueness theorem is invoked. The contribution is self-contained as an engineering and benchmarking effort.
Forward citations
Cited by 42 Pith papers
- Continuous-Time Distribution Matching for Few-Step Diffusion Distillation
  CDM migrates distribution matching distillation to continuous time via dynamic random-length schedules and active off-trajectory latent alignment, yielding competitive few-step image fidelity on SD3 and Longcat-Image.
- Asymmetric Flow Models
  Asymmetric Flow Modeling restricts noise prediction to a low-rank subspace for high-dimensional flow generation, reaching 1.57 FID on ImageNet 256x256 and new state-of-the-art pixel text-to-image performance via finet...
- ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
  ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
- Normalizing Trajectory Models
  NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.
- Normalizing Trajectory Models
  NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.
- LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling
  LENS shapes low-frequency eigen noise with a lightweight network to enable efficient, high-quality sampling in distilled diffusion models.
- Long-Text-to-Image Generation via Compositional Prompt Decomposition
  PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
- 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
  1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
- Transfer between Modalities with MetaQueries
  MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
- InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
  InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
- L2P: Unlocking Latent Potential for Pixel Generation
  L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
- HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
  A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
- Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
  Auto-Rubric as Reward externalizes VLM preferences into structured rubrics and applies Rubric Policy Optimization to create more reliable binary rewards for multimodal generation, outperforming pairwise models on text...
- STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation
  STARFlow2 presents an autoregressive flow-based architecture for unified multimodal text-image generation by interleaving a VLM stream with a TarFlow stream via residual skips and a unified latent space.
- DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models
  DynT2I-Eval creates fresh prompts via dimension decomposition and dynamic sampling to evaluate text-to-image models on text alignment, quality, and aesthetics while maintaining a stable leaderboard.
- Taming Outlier Tokens in Diffusion Transformers
  Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
- D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
  D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
- Linearizing Vision Transformer with Test-Time Training
  Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...
- SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
  SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
- Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
  Refinement via Regeneration (RvR) reformulates image refinement in unified multimodal models as conditional regeneration using prompt and semantic tokens from the initial image, yielding higher alignment scores than e...
- The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
  A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
- ViPO: Visual Preference Optimization at Scale
  Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
- LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
  LLaDA2.0-Uni unifies multimodal understanding and generation inside one discrete diffusion large language model with a semantic tokenizer, MoE backbone, and diffusion decoder.
- Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
  By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
- Self-Adversarial One Step Generation via Condition Shifting
  APEX derives self-adversarial gradients from condition-shifted velocity fields in flow models to achieve high-fidelity one-step generation, outperforming much larger models and multi-step teachers.
- Nucleus-Image: Sparse MoE for Image Generation
  A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
- Continuous Adversarial Flow Models
  Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
  Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
  SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Steering Visual Generation in Unified Multimodal Models with Understanding Supervision
  Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.
- Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
  Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
- Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
  Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.
- Context Unrolling in Omni Models
  Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
- Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding
  UniRect-CoT is a training-free rectification chain-of-thought framework that treats diffusion denoising as visual reasoning and uses the model's inherent understanding to align and correct intermediate generation results.
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
  Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
- BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
  BLIP3-o uses a diffusion transformer to generate CLIP image features and a sequential pretraining strategy to build open models that perform strongly on both image understanding and generation benchmarks.
- Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
  JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.
- Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
  Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
- TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
  TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
- Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
  A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.
- Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
  Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.