
arxiv: 2602.15287 · v2 · pith:SATOL7MInew · submitted 2026-02-17 · 💻 cs.CV

Consistency-Preserving Diverse Video Generation

classification 💻 cs.CV
keywords diversity · video · consistency · generation · temporal · backpropagation · batch · decoder

Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity close to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Our code is available at https://github.com/XinshuangL/Diverse-Video.
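The abstract's key step, applying a diversity-driven update and then removing only the components that would decrease a temporal-consistency objective, can be illustrated with a simple gradient projection. The sketch below is one plausible first-order reading, not the authors' implementation: the function name `remove_conflicting_component`, the flattened-latent representation, and the projection rule itself are assumptions made for illustration. It treats `consistency_grad` as the gradient of a consistency score to be maximized, so an update with a negative inner product against it would lower that score to first order.

```python
from typing import List, Sequence


def remove_conflicting_component(
    update: Sequence[float],
    consistency_grad: Sequence[float],
    eps: float = 1e-12,
) -> List[float]:
    """Drop the part of `update` that points against `consistency_grad`.

    Hypothetical helper: both arguments are flattened latent-space vectors
    for one video. If the update would not decrease the consistency
    objective (non-negative inner product), it is returned unchanged.
    """
    if len(update) != len(consistency_grad):
        raise ValueError("update and consistency_grad must have equal length")

    dot = sum(u * g for u, g in zip(update, consistency_grad))
    grad_sq_norm = sum(g * g for g in consistency_grad)

    # No conflict, or a (near) zero gradient: keep the diversity update as is.
    if dot >= 0.0 or grad_sq_norm <= eps:
        return list(update)

    # Subtract the projection onto the gradient; the result is orthogonal
    # to `consistency_grad`, so consistency is unchanged to first order.
    coef = dot / grad_sq_norm
    return [u - coef * g for u, g in zip(update, consistency_grad)]


# The second coordinate of the update opposes the gradient and is removed.
projected = remove_conflicting_component([1.0, -1.0], [0.0, 1.0])
# An update that already agrees with the gradient is left untouched.
unchanged = remove_conflicting_component([1.0, 1.0], [0.0, 1.0])
```

In this reading the projection keeps every part of the diversity update that is neutral or beneficial for consistency, which matches the abstract's claim that only the harmful components are removed. The actual rule, step sizes, and latent-space models are in the linked repository.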
