hub Canonical reference

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang · 2023 · cs.CV · arXiv 2310.19512

Canonical reference. 85% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background 85% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to the technological advancements within the community.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 baseline 2

citation-polarity summary

background 11 baseline 2

representative citing papers

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.

DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

cs.CV · 2026-05-16 · unverdicted · novelty 7.0

DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

cs.RO · 2026-05-15 · unverdicted · novelty 7.0

WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus零

R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

cs.CV · 2026-05-01 · unverdicted · novelty 7.0

CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.

Novel View Synthesis as Video Completion

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos

cs.CV · 2026-03-03 · unverdicted · novelty 7.0

EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.

CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion

cs.CV · 2025-09-24 · unverdicted · novelty 7.0

CamPVG is the first diffusion-based framework for generating geometrically consistent panoramic videos from camera pose inputs using a panoramic Plücker embedding and spherical epipolar attention module.

OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation

cs.CV · 2024-07-02 · unverdicted · novelty 7.0

OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.

FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.

GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

cs.CV · 2026-05-11 · unverdicted · novelty 6.0 · 4 refs

GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.

Detecting AI-Generated Videos with Spiking Neural Networks

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.

CineAGI: Character-Consistent Movie Creation through LLM-Orchestrated Multi-Modal Generation and Cross-Scene Integration

cs.MM · 2026-04-26 · unverdicted · novelty 6.0

CineAGI is a multi-agent LLM framework that generates multi-scene movies with improved character consistency, narrative coherence, and audio-visual alignment.

Generative Refinement Networks for Visual Synthesis

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.

ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity

cs.CV · 2026-04-05 · unverdicted · novelty 6.0

ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.

Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

cs.CV · 2026-02-08 · unverdicted · novelty 6.0

Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.

Splatent: Splatting Diffusion Latents for Novel View Synthesis

cs.CV · 2025-12-10 · conditional · novelty 6.0

Splatent recovers fine details for latent-space 3D Gaussian Splatting by applying multi-view attention in 2D rather than reconstructing in 3D space.

SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation

cs.CV · 2025-11-24 · unverdicted · novelty 6.0

SteadyDancer is an I2V framework using condition reconciliation, synergistic pose modulation, and staged training to achieve robust first-frame preservation and coherent motion control in human image animation.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

cs.CV · 2025-07-10 · unverdicted · novelty 6.0

Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.

citing papers explorer

Showing 37 of 37 citing papers after filters.

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision cs.CV · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
Tiny-Engram uses small n-gram-indexed memory tables to bind trigger phrases to target visual identities in diffusion models while preserving compositional control from the surrounding prompt.
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis cs.CV · 2026-05-16 · unverdicted · none · ref 44 · internal anchor
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4D, iPhone, and DL3DV datasets.
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow cs.CV · 2026-05-13 · unverdicted · none · ref 143 · 2 links · internal anchor
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 26 · internal anchor
FreeSpec uses SVD-based spectral reconstruction to fuse global low-rank and local high-rank features, reducing content drift and preserving temporal dynamics in long video generation.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 222 · internal anchor
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection cs.CV · 2026-05-01 · unverdicted · none · ref 52 · internal anchor
CMTA detects AI-generated videos by capturing unnatural temporal stability in visual-textual semantic alignment via joint embeddings and multi-grained temporal modeling, outperforming prior methods in cross-generator tests.
Novel View Synthesis as Video Completion cs.CV · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
EduVQA: Towards Concept-Aware Assessment of Educational AI-Generated Videos cs.CV · 2026-03-03 · unverdicted · none · ref 2 · internal anchor
EduVQA introduces the first concept-aware benchmark for educational AI-generated video assessment and a S2D-MoE framework that jointly evaluates perceptual quality and fine-grained semantic alignment.
CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion cs.CV · 2025-09-24 · unverdicted · none · ref 1 · internal anchor
CamPVG is the first diffusion-based framework for generating geometrically consistent panoramic videos from camera pose inputs using a panoramic Plücker embedding and spherical epipolar attention module.
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation cs.CV · 2024-07-02 · unverdicted · none · ref 2 · internal anchor
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV · 2026-05-14 · unverdicted · none · ref 6 · internal anchor
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity cs.CV · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
Head Forcing assigns tailored KV cache strategies to local, anchor, and memory attention heads plus head-wise RoPE re-encoding to extend autoregressive video generation from seconds to minutes without training.
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity cs.CV · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth cs.CV · 2026-05-11 · unverdicted · none · ref 4 · 4 links · internal anchor
GemDepth adds explicit camera-pose geometry embeddings and an alternating spatio-temporal transformer to produce sharper, more temporally consistent video depth maps than prior smoothing-based methods.
Detecting AI-Generated Videos with Spiking Neural Networks cs.CV · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic trajectories.
Generative Refinement Networks for Visual Synthesis cs.CV · 2026-04-14 · unverdicted · none · ref 8 · internal anchor
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models cs.CV · 2026-04-09 · unverdicted · none · ref 9 · internal anchor
NUMINA improves counting accuracy in text-to-video diffusion models by up to 7.4% via a training-free identify-then-guide framework on the new CountBench dataset.
ATSS: Detecting AI-Generated Videos via Anomalous Temporal Self-Similarity cs.CV · 2026-04-05 · unverdicted · none · ref 65 · internal anchor
ATSS detects AI-generated videos by measuring unnatural repetitive temporal correlations in triple similarity matrices derived from frame visuals and semantic descriptions.
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion cs.CV · 2026-02-08 · unverdicted · none · ref 14 · internal anchor
Rolling Sink is a training-free cache adjustment technique that maintains visual consistency in autoregressive video diffusion models for ultra-long open-ended generation beyond training horizons.
Splatent: Splatting Diffusion Latents for Novel View Synthesis cs.CV · 2025-12-10 · conditional · none · ref 7 · internal anchor
Splatent recovers fine details for latent-space 3D Gaussian Splatting by applying multi-view attention in 2D rather than reconstructing in 3D space.
SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation cs.CV · 2025-11-24 · unverdicted · none · ref 4 · internal anchor
SteadyDancer is an I2V framework using condition reconciliation, synergistic pose modulation, and staged training to achieve robust first-frame preservation and coherent motion control in human image animation.
Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling cs.CV · 2025-07-10 · unverdicted · none · ref 13 · internal anchor
Geometry Forcing aligns video diffusion representations with geometric foundation model features via angular cosine and scale regression objectives to improve 3D consistency in generated videos.
We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback cs.CV · 2025-04-24 · unverdicted · none · ref 8 · internal anchor
NeuS-E is a post-generation refinement method that uses neuro-symbolic analysis of a formal video representation to detect and correct semantic and temporal inconsistencies in text-to-video outputs, improving prompt alignment by nearly 40%.
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness cs.CV · 2025-03-27 · accept · none · ref 23 · internal anchor
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs, and anomaly detection methods.
Multimodal Diffusion Transformer with Memory Bank for Scalable Long-Duration Talking Video Generation cs.CV · 2024-11-24 · unverdicted · none · ref 32 · internal anchor
LetsTalk combines a multimodal diffusion transformer, noise-regularized memory bank, deep compression autoencoder, and symbiotic/direct fusion schemes to achieve state-of-the-art quality and efficiency in long-duration talking video generation.
CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation cs.CV · 2024-06-04 · unverdicted · none · ref 8 · internal anchor
CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.
VideoPoet: A Large Language Model for Zero-Shot Video Generation cs.CV · 2023-12-21 · unverdicted · none · ref 9 · internal anchor
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation cs.CV · 2025-06-30 · unverdicted · none · ref 12 · internal anchor
SynMotion combines disentangled semantic embeddings, parameter-efficient motion adapters, and alternate subject-motion training on a new SPV dataset to improve motion customization in text-to-video and image-to-video generation.
Movie Gen: A Cast of Media Foundation Models cs.CV · 2024-10-17 · unverdicted · none · ref 9 · internal anchor
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models cs.CV · 2023-11-07 · unverdicted · none · ref 4 · internal anchor
I2VGen-XL applies cascaded diffusion models with a base stage for semantic preservation via hierarchical encoders and a refinement stage for detail and resolution, trained on 35 million text-video and 6 billion text-image pairs.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 15 · internal anchor
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Image-to-Video Diffusion: From Foundations to Open Frontiers cs.CV · 2026-05-17 · unverdicted · none · ref 62 · internal anchor
A survey that organizes diffusion image-to-video methods into a taxonomy, distills core designs in condition encoding, temporal modeling, noise prior, and upsampling, and discusses applications plus challenges.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation cs.CV · 2026-04-13 · unverdicted · none · ref 21 · internal anchor
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm cs.CV · 2026-05-12 · unreviewed · ref 16 · internal anchor
Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey cs.CV · 2026-04-13 · unreviewed · ref 124 · internal anchor
Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation cs.CV · 2026-02-02 · unreviewed · ref 4 · 2 links · internal anchor

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer