LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

· 2026 · cs.CV · arXiv 2605.18739

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

We present LongLive-2.0, an NVFP4-based parallel infrastructure throughout the full training and inference workflow of long video generation, addressing speed and memory bottlenecks. For training, we introduce sequence-parallel autoregressive (AR) training, instantiated as Balanced SP, which co-designs the efficient teacher-forcing layout with SP execution by pairing clean-history and noisy-target temporal chunks on each rank, enabling a natural teacher-forcing mask with SP-aware chunked VAE encoding. Combined with NVFP4 precision, it reduces GPU memory cost and accelerates GEMM computation during training, the proportion of which increases as video length grows. Moreover, we show that a high-quality infrastructure and dataset enable a remarkably clean training pipeline. Unlike existing Self-Forcing series methods that rely on ODE initialization and subsequent distribution matching distillation (DMD), LongLive-2.0 directly tunes a diffusion model into a long, multi-shot, interactive auto-regressive (AR) diffusion model. It can be further converted to real-time generation (4 to 2 denoising steps) with standalone LoRA weights. For inference on Blackwell GPUs, we enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding. On non-Blackwell GPU architectures, we deploy SP inference to match the speed on Blackwell GPUs, while the quantized KV cache can lower inter-GPU communication of SP. Experiments show up to 2.15x speedup in training, and 1.84x in inference. LongLive-2.0-5B achieves 45.7 FPS inference while attaining strong performance on benchmarks. To our knowledge, LongLive-2.0 is the first NVFP4 training and inference system for long video generation.
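The abstract leans on NVFP4 precision for weights, activations, and the KV cache. As a rough illustration of what block-scaled 4-bit quantization does to a tensor, here is a minimal NumPy sketch. It assumes the commonly described NVFP4 layout (4-bit E2M1 values sharing one scale per block of 16 elements) and simplifies it: the per-block scale is kept in float32 rather than FP8, and no per-tensor scale is applied. The function name `fake_quantize_fp4` is hypothetical, and this is not the paper's implementation.

```python
import numpy as np

# Non-negative magnitudes representable by a 4-bit E2M1 value.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK = 16  # elements sharing one scale (assumed block size)


def fake_quantize_fp4(x: np.ndarray) -> np.ndarray:
    """Round x to a block-scaled FP4 grid and return the dequantized values."""
    x = np.asarray(x, dtype=np.float32)
    if x.size % BLOCK != 0:
        raise ValueError("size must be a multiple of the block size")
    blocks = x.reshape(-1, BLOCK)
    # One scale per block maps the block's largest magnitude onto 6.0.
    # Simplification: the scale stays in float32 instead of being stored as FP8.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / E2M1_GRID[-1], 1.0).astype(np.float32)
    scaled = np.abs(blocks) / scale
    # Snap each scaled magnitude to the nearest representable grid point.
    idx = np.abs(scaled[..., None] - E2M1_GRID).argmin(axis=-1)
    deq = np.sign(blocks) * E2M1_GRID[idx] * scale
    return deq.reshape(x.shape)
```

Calling `fake_quantize_fp4` on a float tensor returns a tensor of the same shape whose values lie on the per-block grid, which makes it easy to inspect the rounding error that 4-bit storage would introduce before committing to real low-precision kernels.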

representative citing papers

Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

cs.CV · 2026-06-21 · unverdicted · novelty 5.0

Sol Video Inference Engine uses parallel skill agents to optimize cache, sparse attention, token pruning, quantization, and kernel fusion, delivering over 2x end-to-end acceleration with near-lossless quality on three video models.

DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

cs.LG · 2026-06-29 · unverdicted · novelty 3.0

A preview system demonstrates real-time controllable world modeling at 14-15 FPS on RTX 4090 by adapting open video backbones with action pathways for keyboard/mouse control and multimodal features.

citing papers explorer


Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

cs.CV · 2026-06-21 · unverdicted · none · ref 26 · internal anchor

DreamForge-World 0.1 Preview: A Low-Compute Real-Time Controllable World Model

cs.LG · 2026-06-29 · unverdicted · none · ref 26 · internal anchor
