Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

Fu-Yun Wang; Guanglu Song; Han-Jia Ye; Hongsheng Li; Wenshuo Chen; Yu Liu

arxiv: 2305.18264 · v1 · pith:3I7PCH6Hnew · submitted 2023-05-29 · 💻 cs.CV

Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, Hongsheng Li

classification 💻 cs.CV

keywords: editing, generation, video, videos, models, diffusion, gen-l-video, segments

Leveraging large-scale image-text datasets and advances in diffusion models, text-driven generative models have made remarkable strides in image generation and editing. This study explores extending that text-driven ability to the generation and editing of long videos conditioned on multiple texts. Current methods for video generation and editing, while innovative, are often confined to extremely short videos (typically fewer than 24 frames) and limited to a single text condition. These constraints significantly limit their applications, given that real-world videos usually consist of multiple segments, each bearing different semantic information. To address this challenge, we introduce a novel paradigm dubbed Gen-L-Video, which extends off-the-shelf short video diffusion models to generate and edit videos comprising hundreds of frames with diverse semantic segments, without additional training and while preserving content consistency. We have implemented three mainstream text-driven video generation and editing methods and extended them with the proposed paradigm to accommodate longer videos with a variety of semantic segments. Our experiments show that the approach significantly broadens the generative and editing capabilities of video diffusion models, offering new possibilities for future research and applications. The code is available at https://github.com/G-U-N/Gen-L-Video.

Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning
cs.RO 2026-05 unverdicted novelty 7.0

CoFi is a two-stage coarse-to-fine sampler that enforces global coherence via scaffold alignment before restoring local structure with a pretrained prior, yielding better quality and 2-8x fewer evaluations across plan...

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
cs.CV 2026-05 unverdicted novelty 7.0

MBench is a new benchmark that quantifies long-term memory in video world models via three hierarchical consistency dimensions evaluated on curated real videos.

TIE: Time Interval Encoding for Video Generation over Events
cs.CV 2026-05 unverdicted novelty 7.0

TIE derives a sinc-based interval encoding from temporal integrability and duration invariance principles, raising temporal constraint satisfaction from 77% to 96% on the OmniEvents dataset while preserving visual quality.

TIE: Time Interval Encoding for Video Generation over Events
cs.CV 2026-05 unverdicted novelty 7.0

TIE derives a sinc-based interval encoding from Temporal Integrability and Duration Invariance principles, raising human-verified temporal constraint satisfaction from 77.34% to 96.03% while preserving visual quality ...

DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
cs.CV 2026-05 unverdicted novelty 7.0

DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

TunerDiT adds event-partitioned masking and cross-event prompt fusion to diffusion transformers for training-free multi-event video generation, with gains scaling by event count on a new Meve benchmark.

DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
cs.GR 2026-05 unverdicted novelty 6.0

DrawVideo is a sketch-guided framework that decomposes long videos into controllable shots using keyframe sketches, appearance prompts, and motion prompts, supported by a new SketchLongVideo dataset.

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
cs.CV 2026-05 unverdicted novelty 6.0

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on...

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
cs.CV 2025-10 unverdicted novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

Long-Context Autoregressive Video Modeling with Next-Frame Prediction
cs.CV 2025-03 unverdicted novelty 6.0

FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.

Character-Centered Dialogue Generation from Scene-Level Prompts
cs.CV 2025-05 unverdicted novelty 4.0

A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling
cs.CV 2025-03 unverdicted novelty 3.0

A prompt fusion approach combines bidirectional time-weighted latent blending, dynamics-informed prompt weighting via CLIP, and semantic action representations to produce temporally consistent long videos from text wi...