Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on models like WAN and FLUX without fine-tuning.
Lumina-t2x: Transforming text into any modality, resolution, and duration via flow-based large diffusion transformers
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 8roles
background 3polarities
background 3representative citing papers
Introduces SciIR-82k dataset and SciIR-Bench for scientific image reasoning generation organized by Peirce's semiotic triad, with fine-tuning raising model score from 35% to 43%.
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
HunyuanVideo presents a 13B-parameter open-source video generative model with integrated data, architecture, training, and inference systems whose professional evaluations show it outperforming prior SOTA models including Runway Gen-3 and Luma 1.6.
citing papers explorer
-
Your Pre-trained Diffusion Model Secretly Knows Restoration
Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on models like WAN and FLUX without fine-tuning.
-
SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation
Introduces SciIR-82k dataset and SciIR-Bench for scientific image reasoning generation organized by Peirce's semiotic triad, with fine-tuning raising model score from 35% to 43%.
-
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
-
LTX-2: Efficient Joint Audio-Visual Foundation Model
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
-
HunyuanVideo: A Systematic Framework For Large Video Generative Models
HunyuanVideo presents a 13B-parameter open-source video generative model with integrated data, architecture, training, and inference systems whose professional evaluations show it outperforming prior SOTA models including Runway Gen-3 and Luma 1.6.
- iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
- Rethinking Cross-Layer Information Routing in Diffusion Transformers
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer