VDT: General-Purpose Video Diffusion Transformers via Mask Modeling

Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding · 2023 · arXiv 2305.13311

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

cs.MM · 2026-04-22 · unverdicted · novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both an XAI probe and creative tool.

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

cs.DC · 2026-06-11 · unverdicted · novelty 6.0 · 2 refs

GF-DiT dynamically adapts parallelism during DiT serving via trajectory tasks and group-free collectives, reporting up to 6x throughput and 95% latency reduction versus static configurations.

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

RoPeSLR combines 3D RoPE-guided sparse attention with head-wise low-rank parameterization to achieve sub-quadratic complexity in DiTs while preserving distance awareness for efficient ultra-long video synthesis.

Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation

cs.CV · 2024-11-24 · unverdicted · novelty 6.0

LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.

Making Time Editable in Video Diffusion Transformers

cs.CV · 2026-06-08 · unverdicted · novelty 3.0

Introduces a lightweight temporal module to extend pretrained DiT video models with time editing capabilities while preserving the original generative prior.

Evolution of Video Generative Foundations

cs.CV · 2026-04-07 · unverdicted · novelty 2.0

This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

citing papers explorer

Showing 1 of 1 citing paper after filters.

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving cs.DC · 2026-06-11 · unverdicted · none · ref 20 · 2 links
GF-DiT dynamically adapts parallelism during DiT serving via trajectory tasks and group-free collectives, reporting up to 6x throughput and 95% latency reduction versus static configurations.

VDT: General-Purpose Video Diffusion Transformers via Mask Modeling

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer