UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Lei Zhu; Song Fei; Tian Ye

arxiv: 2511.18050 · v1 · pith:4W2MSN5Rnew · submitted 2025-11-22 · 💻 cs.CV · cs.AI

UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios

Tian Ye , Song Fei , Lei Zhu This is my paper

classification 💻 cs.CV cs.AI

keywords acrossultrafluxaestheticar-awareaspectco-designdata-modeldiverse

0 comments

read the original abstract

Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and-with a LLM prompt refiner-matches or surpasses the proprietary Seedream 4.0.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression
cs.CV 2026-05 unverdicted novelty 7.0

OP4KSR enables efficient one-step 4K super-resolution without patches by adapting Flux with RoPE rescaling and periodicity loss to suppress artifacts.
SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
cs.CV 2026-05 unverdicted novelty 6.0

SANA-Streaming delivers 1280x704 streaming video editing at 24 FPS end-to-end on an RTX 5090 using hybrid DiT blocks, cycle-reverse training, and mixed-precision quantization.
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claime...
PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution
cs.CV 2026-05 unverdicted novelty 5.0

PixelWizard decouples global structure from fine details via a spatiotemporal anchor and introduces Noise-Span Aligned Shortcut Training with biased sampling to achieve over 10x faster sampling for high-fidelity 2K/4K...
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
cs.CV 2026-05 unverdicted novelty 5.0

PixVerve introduces a 95K ultra-high-resolution image-text dataset and training strategies that enable native 100-megapixel text-to-image generation together with a new evaluation benchmark.