arxiv: 2406.11831 · v3 · pith:D27DI6EWnew · submitted 2024-06-17 · 💻 cs.CV

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

classification 💻 cs.CV
keywords: models, llms, diffusion, prompt, framework, model, capabilities, language

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between the next token prediction training in LLM and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through the carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate its inherent positional bias. This allows us to integrate state-of-the-art LLMs into the text-to-image generation model flexibly. Furthermore, we also provide an effective manner to fuse multiple LLMs into our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model size and data size. Benefiting from the inherent ability of the LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The LLM-Infused Diffuser framework is also one of the core technologies powering SenseMirage, a highly advanced text-to-image model.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

    cs.CV 2026-05 unverdicted novelty 6.0

    ElasticDiT introduces an elastic DiT architecture with adjustable spatial compression and block depth plus Shift Sparse Block Attention and a distilled VAE to enable a single model to cover multiple fidelity-latency p...

  2. Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Patch Forcing enables diffusion models to denoise image patches at varying rates based on predicted difficulty, advancing easier regions first to improve context and achieve better generation quality on ImageNet while...

  3. Premier: Personalized Preference Modulation with Learnable User Embedding in Text-to-Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Premier learns user-specific embeddings to modulate text-to-image generation, outperforming prior methods on preference alignment, text consistency, and expert ratings even with limited history.

  4. HunyuanVideo: A Systematic Framework For Large Video Generative Models

    cs.CV 2024-12 unverdicted novelty 5.0

    HunyuanVideo presents a 13B-parameter open-source video generative model with integrated data, architecture, training, and inference systems whose professional evaluations show it outperforming prior SOTA models inclu...