Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Jiayan Teng; Jie Tang; Shizhan Liu; Xiaotao Gu; Xinran Deng; Zhuoyi Yang

arxiv: 2512.05394 · v2 · pith:FTGRJWGWnew · submitted 2025-12-05 · 💻 cs.CV

Shizhan Liu, Xinran Deng, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Jie Tang

classification 💻 cs.CV

keywords latent · diffusion · vaes · video · properties · reconstruction · spectral · ssvae

Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3× speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at https://github.com/zai-org/SSVAE.
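
The two spectral properties named in the abstract can be made concrete with a small diagnostic. The sketch below is not taken from the paper or the SSVAE repository. It assumes a latent tensor laid out as (channels, time, height, width), and the function names, the 0.25 cycles-per-sample cutoff, and the toy inputs are all illustrative choices. It computes the share of spatio-temporal spectral energy at low frequencies and the normalized eigenvalues of the channel covariance.

```python
import numpy as np


def low_frequency_energy_fraction(latent, cutoff=0.25):
    """Share of spectral energy at radial frequency <= cutoff (cycles/sample).

    latent: array of shape (C, T, H, W). The layout and the cutoff are
    assumptions made for this sketch, not values from the paper.
    """
    _, t, h, w = latent.shape
    # Remove each channel's mean so a constant offset does not dominate.
    centered = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    # Power spectrum over (T, H, W), summed across channels.
    power = (np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2).sum(axis=0)
    # Radial frequency of every FFT bin.
    ft = np.fft.fftfreq(t)[:, None, None]
    fh = np.fft.fftfreq(h)[None, :, None]
    fw = np.fft.fftfreq(w)[None, None, :]
    radius = np.sqrt(ft ** 2 + fh ** 2 + fw ** 2)
    return power[radius <= cutoff].sum() / power.sum()


def channel_eigenspectrum(latent):
    """Eigenvalues of the channel covariance, descending, normalized to sum to 1."""
    c = latent.shape[0]
    flat = latent.reshape(c, -1)          # rows are channels, columns are positions
    cov = np.cov(flat)                    # (C, C) covariance across channels
    eig = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eig = np.clip(eig, 0.0, None)         # guard against tiny negative round-off
    return eig / eig.sum()


# Toy inputs (hypothetical, for illustration only).
rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 8, 16, 16))  # unstructured latent

# One slow spatial wave shared by all channels with different gains:
# low-frequency in (T, H, W) and rank one across channels.
wave = np.sin(2 * np.pi * np.arange(16) / 16)
base = np.broadcast_to(wave[None, None, :], (8, 16, 16))
structured = np.arange(1, 9, dtype=float)[:, None, None, None] * base

for name, z in [("noise", noise), ("structured", structured)]:
    print(name,
          "low-freq share:", round(float(low_frequency_energy_fraction(z)), 3),
          "top eigenvalue share:", round(float(channel_eigenspectrum(z)[0]), 3))
```

On the toy inputs, the structured latent concentrates nearly all of its energy below the cutoff and in a single channel mode, while white noise spreads energy roughly evenly across frequencies and channels. How the paper itself measures these statistics may differ from this sketch.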

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability
cs.CV 2026-06 unverdicted novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
cs.CV 2026-05 unverdicted novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Video Generation with Predictive Latents
cs.CV 2026-05 unverdicted novelty 5.0

PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.