hub

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li · 2024 · cs.CV · arXiv 2401.03048

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

open full Pith review browse 22 citing papers arXiv PDF

abstract

We propose Latte, a novel Latent Diffusion Transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results that are competitive with recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space

cs.LG · 2026-04-30 · unverdicted · novelty 7.0

ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

cs.CV · 2024-07-02 · unverdicted · novelty 7.0

OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.

FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.

DiffATS: Diffusion in Aligned Tensor Space

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.

TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.

AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.

VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.

MAGI-1: Autoregressive Video Generation at Scale

cs.CV · 2025-05-19 · unverdicted · novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

cs.CV · 2024-10-09 · unverdicted · novelty 6.0

Aligning noisy hidden states in diffusion transformers to clean features from pretrained visual encoders speeds up training over 17x and reaches FID 1.42.

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

cs.RO · 2024-10-08 · unverdicted · novelty 6.0

GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

cs.CV · 2024-08-12 · unverdicted · novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

cs.CV · 2024-04-02 · unverdicted · novelty 6.0

CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

Video Generation with Predictive Latents

cs.CV · 2026-05-04 · unverdicted · novelty 5.0

PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.

Motion-Aware Caching for Efficient Autoregressive Video Generation

cs.CV · 2026-05-03 · unverdicted · novelty 5.0

MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.

Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

cs.CV · 2026-04-14 · unverdicted · novelty 5.0

PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.

Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity

cs.LG · 2026-04-08 · unverdicted · novelty 5.0

Local optimization on token windows plus a continuity loss lets autoregressive video models train on fewer frames with less error accumulation, cutting training cost in half while matching baseline quality.

Open-Sora: Democratizing Efficient Video Production for All

cs.CV · 2024-12-29 · unverdicted · novelty 5.0

Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.

Movie Gen: A Cast of Media Foundation Models

cs.CV · 2024-10-17 · unverdicted · novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

World Action Models: The Next Frontier in Embodied AI

cs.RO · 2026-05-12 · unverdicted · novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Cosmos World Foundation Model Platform for Physical AI

cs.CV · 2025-01-07 · unverdicted · novelty 3.0

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

cs.CV · 2024-02-27 · unverdicted · novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

citing papers explorer

Showing 22 of 22 citing papers.

ABC: Any-Subset Autoregression via Non-Markovian Diffusion Bridges in Continuous Time and Space cs.LG · 2026-04-30 · unverdicted · none · ref 39 · internal anchor
ABC enables any-subset autoregressive generation of continuous stochastic processes via non-Markovian diffusion bridges that track physical time and allow path-dependent conditioning.
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation cs.CV · 2024-07-02 · unverdicted · none · ref 5 · internal anchor
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity cs.CV · 2026-05-12 · unverdicted · none · ref 29 · internal anchor
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
DiffATS: Diffusion in Aligned Tensor Space cs.LG · 2026-05-10 · unverdicted · none · ref 37 · internal anchor
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with high compression.
TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation cs.CV · 2026-04-21 · unverdicted · none · ref 16 · internal anchor
TS-Attn dynamically separates and rearranges attention in existing text-to-video models to improve temporal consistency and prompt adherence for videos with multiple sequential actions.
AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation cs.CV · 2026-04-20 · unverdicted · none · ref 24 · internal anchor
AdaCluster delivers a training-free adaptive query-key clustering framework for sparse attention in video DiTs, yielding 1.67-4.31x inference speedup with negligible quality loss on CogVideoX-2B, HunyuanVideo, and Wan-2.1.
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation cs.CV · 2026-04-11 · unverdicted · none · ref 24 · internal anchor
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
MAGI-1: Autoregressive Video Generation at Scale cs.CV · 2025-05-19 · unverdicted · none · ref 31 · internal anchor
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think cs.CV · 2024-10-09 · unverdicted · none · ref 171 · internal anchor
Aligning noisy hidden states in diffusion transformers to clean features from pretrained visual encoders speeds up training over 17x and reaches FID 1.42.
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation cs.RO · 2024-10-08 · unverdicted · none · ref 25 · internal anchor
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer cs.CV · 2024-08-12 · unverdicted · none · ref 61 · internal anchor
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
CameraCtrl: Enabling Camera Control for Text-to-Video Generation cs.CV · 2024-04-02 · unverdicted · none · ref 132 · internal anchor
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
Video Generation with Predictive Latents cs.CV · 2026-05-04 · unverdicted · none · ref 29 · internal anchor
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
Motion-Aware Caching for Efficient Autoregressive Video Generation cs.CV · 2026-05-03 · unverdicted · none · ref 24 · internal anchor
MotionCache speeds up autoregressive video generation by 6.28x on SkyReels-V2 and 1.64x on MAGI-1 via motion-weighted cache reuse based on inter-frame differences, with negligible quality loss on VBench.
Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation cs.CV · 2026-04-14 · unverdicted · none · ref 12 · internal anchor
PASA uses curvature-aware dynamic budgeting, grouped approximations, and stochastic attention routing to accelerate video diffusion transformers while eliminating temporal flickering from sparse patterns.
Accelerating Training of Autoregressive Video Generation Models via Local Optimization with Representation Continuity cs.LG · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
Local optimization on token windows plus a continuity loss lets autoregressive video models train on fewer frames with less error accumulation, cutting training cost in half while matching baseline quality.
Open-Sora: Democratizing Efficient Video Production for All cs.CV · 2024-12-29 · unverdicted · none · ref 20 · internal anchor
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tasks with claimed high fidelity.
Movie Gen: A Cast of Media Foundation Models cs.CV · 2024-10-17 · unverdicted · none · ref 46 · internal anchor
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
World Action Models: The Next Frontier in Embodied AI cs.RO · 2026-05-12 · unverdicted · none · ref 35 · internal anchor
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
Cosmos World Foundation Model Platform for Physical AI cs.CV · 2025-01-07 · unverdicted · none · ref 130 · internal anchor
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
Evolution of Video Generative Foundations cs.CV · 2026-04-07 · unverdicted · none · ref 76 · internal anchor
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models cs.CV · 2024-02-27 · unverdicted · none · ref 66 · internal anchor
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Latte: Latent Diffusion Transformer for Video Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer