Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers

Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng · 2024 · arXiv 2405.05945

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

representative citing papers

Your Pre-trained Diffusion Model Secretly Knows Restoration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on models like WAN and FLUX without fine-tuning.

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

LTX-2: Efficient Joint Audio-Visual Foundation Model

cs.CV · 2026-01-06 · conditional · novelty 5.0

LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

cs.CV · 2025-11-27 · unverdicted · novelty 5.0

Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.

citing papers explorer

Showing 4 of 4 citing papers.

Your Pre-trained Diffusion Model Secretly Knows Restoration cs.CV · 2026-04-06 · unverdicted · none · ref 14
Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on models like WAN and FLUX without fine-tuning.
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer cs.CV · 2026-04-07 · unverdicted · none · ref 24
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
LTX-2: Efficient Joint Audio-Visual Foundation Model cs.CV · 2026-01-06 · conditional · none · ref 6
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer cs.CV · 2025-11-27 · unverdicted · none · ref 20
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and streamlined training.

Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers

fields

years

verdicts

representative citing papers

citing papers explorer