pith. sign in

hub Mixed citations

Latte: Latent Diffusion Transformer for Video Generation

Mixed citation behavior. Most common role is background (69%).

49 Pith papers citing it
Background 69% of classified citations
abstract

We propose Latte, a novel Latent Diffusion Transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results that are competitive with recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

hub tools

citation-role summary

background 9 method 3 other 1

citation-polarity summary

representative citing papers

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA

MultiAnimate: Pose-Guided Image Animation Made Extensible

cs.CV · 2026-02-25 · unverdicted · novelty 7.0

MultiAnimate adds Identifier Assigner and Identifier Adapter modules to diffusion video models so they can handle multiple characters without identity mix-ups, generalizing from two-character training data to more characters.

History-Guided Video Diffusion

cs.LG · 2025-02-10 · unverdicted · novelty 7.0

DFoT enables flexible history conditioning in video diffusion, with history guidance methods that boost temporal consistency and support long rollouts.

ReactiveGWM: Steering NPC in Reactive Game World Models

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.

DiffATS: Diffusion in Aligned Tensor Space

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.

citing papers explorer

Showing 49 of 49 citing papers.