High-resolution image synthesis with latent diffusion models
3 Pith papers cite this work.
citing papers explorer
- OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
  OmniSonic introduces a TriAttn-DiT architecture with MoE gating to jointly generate on-screen, off-screen, and speech audio from video and text, outperforming prior models on a new UniHAGen-Bench.
- DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
  DeCo decouples high- and low-frequency generation in pixel diffusion via a DiT plus lightweight decoder and a frequency-aware flow-matching loss, reaching FID 1.62 at 256x256 and 2.22 at 512x512 on ImageNet while closing the gap to latent diffusion methods.
- ProGIC: Progressive and Lightweight Generative Image Compression with Residual Vector Quantization
  ProGIC applies residual vector quantization with a lightweight CNN-attention backbone to deliver progressive generative image compression with claimed perceptual gains and over 10x faster encoding/decoding versus MS-ILLM.